
Reimagining the Big Data warehouse: simple tips for increasing speed and agility

The world is changing, and with such paradigm-shifting innovations as Big Data, the Cloud, and the Internet of Things, the opportunities to do the seemingly impossible now appear to be endless.

So, why do age-old questions - such as, "how can we serve our customers answers more quickly?", "how can we get more value out of the data we collect?", "how can we keep our costs low, whilst future-proofing our architectures and designs?" and "how can we increase speed, whilst reducing complexity?" - just never seem to go away?

The tips below assume that pertinent data capture is focused on data journeys and/or life-cycles (e.g., capturing data from the start of an order to the end of its fulfilment), data relationships (e.g., who "likes" what, who their "friends" are, and what those "friends" "like") and data milestones (e.g., the first time we buy an item, the last time, and/or our current situation). On that basis, they aim to contribute to these questions and offer a number of ideas - to help keep things "relatively" simple.

Tip #1 - Skinny Tables - Employ columnar Big Data platforms (such as Amazon Redshift) and keep tables narrow, to reduce data manipulation overheads (e.g., the I/O required to perform updates or deletes).
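As a minimal sketch of the idea (all table and column names below are hypothetical, not taken from any real schema), the Python snippet prints illustrative Redshift-style DDL for a wide "one big table" design refactored into narrow, purpose-specific tables, so that updates and deletes rewrite far fewer column blocks:

```python
# Illustrative only: hypothetical Redshift-style DDL showing a wide table
# refactored into narrow ("skinny") tables sharing the same natural key.

WIDE_TABLE_DDL = """
CREATE TABLE order_events_wide (
    customer_id     VARCHAR(64),
    source_id       VARCHAR(64),
    event_ts        TIMESTAMP,
    order_status    VARCHAR(32),
    shipping_notes  VARCHAR(4096),  -- rarely queried, yet rewritten on every update
    marketing_blob  VARCHAR(8192)   -- likewise
);
"""

# Skinny alternative: the hot, frequently updated columns live in a small table,
# so updates and deletes touch far less data; the bulky text sits elsewhere.
SKINNY_TABLES_DDL = """
CREATE TABLE order_status_slim (
    customer_id  VARCHAR(64),
    source_id    VARCHAR(64),
    event_ts     TIMESTAMP,
    order_status VARCHAR(32)
)
DISTKEY (customer_id)
SORTKEY (event_ts);

CREATE TABLE order_text_cold (
    customer_id    VARCHAR(64),
    source_id      VARCHAR(64),
    event_ts       TIMESTAMP,
    shipping_notes VARCHAR(4096),
    marketing_blob VARCHAR(8192)
);
"""

if __name__ == "__main__":
    print(WIDE_TABLE_DDL)
    print(SKINNY_TABLES_DDL)
```

Both narrow tables carry the same natural key, so they can still be joined back together whenever the full picture is needed.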

Tip #2 - Keep Hot Stuff Separate from Cold - Split Big Data tables horizontally, where appropriate (e.g., keep immutable data for the last financial year in a separate slice from mutable current-year data). This helps reduce data sorting overheads (e.g., the resource-intensive vacuuming of Amazon Redshift tables).
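To make the split concrete - again with purely hypothetical table and column names - the sketch below shows one logical orders table held as two horizontal slices, with a view stitching them back together so consumers are none the wiser:

```python
# Illustrative only: hypothetical DDL splitting one table horizontally into a
# "cold" immutable slice (last financial year) and a "hot" mutable slice
# (current year), then re-presenting both as a single logical view.

HOT_COLD_DDL = """
CREATE TABLE orders_fy_prior (    -- cold: loaded once, never updated, stays sorted
    customer_id VARCHAR(64),
    event_ts    TIMESTAMP,
    amount      DECIMAL(12, 2)
) SORTKEY (event_ts);

CREATE TABLE orders_fy_current (  -- hot: receives the inserts, updates and deletes
    customer_id VARCHAR(64),
    event_ts    TIMESTAMP,
    amount      DECIMAL(12, 2)
) SORTKEY (event_ts);

CREATE VIEW orders AS             -- consumers still query one logical table
SELECT * FROM orders_fy_prior
UNION ALL
SELECT * FROM orders_fy_current;
"""

if __name__ == "__main__":
    print(HOT_COLD_DDL)
```

Only the hot slice then ever needs re-sorting or vacuuming; the cold slice can be left untouched (or archived) once the financial year closes.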

Tip #3 - Embrace Natural "data" Relationships - Avoid synthetic keys (e.g., traditional data-warehouse-generated dimension keys) where possible, and instead consider more flexible and durable natural keys (e.g., Customer ID combined with Source ID and/or Source Event Timestamp) as data link keys (or as primary and foreign keys). This reduces Big Data processing complexity - including significantly reducing dependency management overheads - and thereby also reduces the time taken to deliver data (and/or data latency).
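A minimal Python sketch of the idea follows, using made-up records and identifiers: two independent feeds are linked purely on a natural composite key (Customer ID, Source ID, Source Event Timestamp), so neither pipeline has to wait for a dimension load to hand out surrogate keys first.

```python
from datetime import datetime

# Illustrative only: hypothetical order and shipment records linked on a natural
# composite key (customer_id, source_id, event timestamp) rather than a
# warehouse-generated surrogate key.

orders = [
    {"customer_id": "C-1001", "source_id": "web",
     "event_ts": datetime(2023, 4, 1, 9, 30), "total": 42.00},
]
shipments = [
    {"customer_id": "C-1001", "source_id": "web",
     "event_ts": datetime(2023, 4, 1, 9, 30), "carrier": "DHL"},
]

def natural_key(record):
    """Build the link key directly from source attributes: no lookup table,
    no sequence generator, no load-order dependency between pipelines."""
    return (record["customer_id"], record["source_id"], record["event_ts"])

shipments_by_key = {natural_key(s): s for s in shipments}

for order in orders:
    shipment = shipments_by_key.get(natural_key(order))
    carrier = shipment["carrier"] if shipment else "not yet shipped"
    print(order["customer_id"], order["total"], carrier)
```

Because each pipeline can derive the key from the data it already holds, the usual "wait for the dimension to load, then look up its key" dependency simply disappears.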

Tip #4 - Smoke & Mirrors - Employ logical abstraction (e.g., Big Data virtualisation platforms, views and/or metadata layers), as this rapidly increases the flexibility and adaptability of the underlying physical Big Data models.
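One lightweight way to picture this - the metadata structure and all names here are hypothetical, not a reference to any particular virtualisation product - is a small metadata layer that generates views over the physical tables, so the physical model can be split, renamed or re-shaped without breaking downstream queries:

```python
# Illustrative only: a tiny, hypothetical metadata layer that generates views over
# physical tables. Consumers query the stable logical names; the physical tables
# underneath can be renamed, split or re-modelled without breaking those queries.

LOGICAL_MODEL = {
    "vw_customer_orders": {
        "physical_table": "orders_fy_current",  # swap this when the model changes
        "columns": {                             # logical name -> physical column
            "customer_id": "customer_id",
            "order_timestamp": "event_ts",
            "order_value": "amount",
        },
    },
}

def build_view_ddl(view_name, spec):
    """Generate a CREATE VIEW statement from a metadata entry."""
    select_list = ",\n    ".join(
        f"{physical} AS {logical}" for logical, physical in spec["columns"].items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {spec['physical_table']};"
    )

if __name__ == "__main__":
    for name, spec in LOGICAL_MODEL.items():
        print(build_view_ddl(name, spec))
```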

Tip #5 - Biological "data" Links - Consider implementing slowly changing dimensions via natural keys and/or source event "effective" timestamps (as such timestamps are immutable). This would radically reduce the dependency management overhead, complexity and latency of Big Data pipelines.
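As a small, hedged sketch (the customer IDs, addresses and dates are invented for illustration), the snippet below holds a slowly changing attribute as append-only versions keyed by natural key and effective timestamp, and derives the version in force at any point in time - no surrogate keys and no end-date updates:

```python
from bisect import bisect_right
from datetime import datetime

# Illustrative only: a hypothetical slowly changing dimension held as append-only
# rows keyed by natural key + source "effective" timestamp. The current (or
# historical) version is derived at query time rather than maintained by updates.

customer_address_history = {
    "C-1001": [  # (effective_from, address), kept sorted by effective_from
        (datetime(2022, 1, 1), "12 Old Street"),
        (datetime(2023, 6, 15), "99 New Road"),
    ],
}

def address_as_at(customer_id, as_at):
    """Return the address that was effective for the customer at `as_at`."""
    versions = customer_address_history.get(customer_id, [])
    effective_from_list = [effective_from for effective_from, _ in versions]
    idx = bisect_right(effective_from_list, as_at) - 1
    return versions[idx][1] if idx >= 0 else None

if __name__ == "__main__":
    print(address_as_at("C-1001", datetime(2023, 1, 1)))  # -> 12 Old Street
    print(address_as_at("C-1001", datetime(2024, 1, 1)))  # -> 99 New Road
```

Because new versions are simply appended as they arrive from source, the pipeline never has to update or close off earlier rows, which is where the reduction in dependency management and latency comes from.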

Thoughts, debates, questions and comments are most welcome and would enhance this piece.

Thank you

Edosa would love to support your transformation from Organizational Visions into Quantifiable Data Value...

Let's Talk