“The Database” (2017-08-20)

Once upon a time there was “The Database”. It was all-encompassing and it was good.

It provided a current snapshot of reality. It was “the data”. It was “the state of things”. It was the “source of truth”.

It was spoken of with reverence. A tribe of humans called “DBAs” spent their considerable waking hours (many of which were involuntary) worrying about data integrity and consistency and the status of many backups; maintaining the ability to restore as needed was paramount.

But it was not only the “source of truth”. It was the way in which all data was accessed, which had to be done quickly. Sometimes access patterns were well-anticipated. Sometimes, however, they weren’t, requiring non-trivial work to be performed on live data to bash it into the appropriate shape. Such efforts were rarely fun for the practitioners involved. Typically they took place in the wee hours, somewhere in the hazy stretch between the day’s processing and the nightly offline processing, between the backups and the next morning when it would all start again.

And since it was the source of truth, it held all the answers. Or at least anything that could form part of an answer under some circumstance was put there because, well, where else would you possibly look? And since there was nowhere else, your application could read data left by my application, and my application could read data left by yours. Or change it. Or take some and add more. Or delete some. You could write data to it and read data from it.

It was the integration platform! It was a flying flock of global variables just waiting to be created and destroyed, mutated and associated. There was a context and it was wide.

And devilishly difficult to reason about, at least in any reasonable way.

Now this is not to say (or even imply) that it did not serve its purpose. Serving the database was the organizing principle of most of the code. The state of the database was a reflection of all that had occurred in the past. Note that I say “a reflection” because we only rarely had all the information; if we had the value 5 we rarely had the 3 and the 2 that had been added together to make that 5. Not a big deal. Well, not a big deal unless we later decide that that 2 should have been a 6, but more on that later. The point, though, is that we were typically concerned with the current value; you want to know whether widget #12345 is available now, not the whole history of how that came to be.
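
To make that concrete, here’s a tiny sketch in Python (the numbers and names are invented for illustration, not taken from any real system) of the difference between keeping only the answer and keeping the events that produced it:

    # Snapshot style: we keep only the current value.
    balance = 5                # was it 3 + 2? 4 + 1? We can no longer tell.

    # Event style: we keep what happened and derive the value from it.
    events = [("credit", 3), ("credit", 2)]
    balance = sum(amount for _, amount in events)   # still 5, history intact

    # If that 2 should have been a 6 all along, the event log can be
    # corrected and the value recomputed; the bare snapshot cannot.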

There were good reasons for this approach. Reliable random access storage ain’t free. (For a graphical display of the history, current through about three years ago, see: http://www.mkomo.com/cost-per-gigabyte-update. Be sure to notice that the graph is logarithmic.) When a gigabyte cost tens of thousands of dollars, or even tens of dollars, the idea of only keeping a current snapshot, and considering that snapshot to be the durable source of truth worthy of care and feeding, made perfect sense. Now, perhaps, not so much. Strangely (well, actually not), when storage costs came down, instead of moving to a model that traced the development of entities through time, what typically happened was that we began to use the database as a place to record events as they happened. After all, why not? It was a hunk of persistent storage that was well secured, well cared for, and available. And if anything happened to it, there was someone who’d come running to fix it.

The problem, though, is that this was a completely different type of data. Individual events might have specific significance for a period of time, but that individual importance receded quickly; the data was primarily significant in the aggregate. Further, its relationship to normalization was different: the state of any external references at the time of the event was likely to be more significant than their state when the data was read. Even worse was the fact that these collections tended to grow. And grow. And grow. And ultimately outgrow the “nicely curated, nicely normalized snapshot data”, which tended to make operational tasks a nightmare. And, because locating individual pieces was often necessary in the first few hours or days of the data’s life, it was well-indexed. Forever. At a non-trivial cost.
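
A hypothetical example of that normalization point: a snapshot row points at a price that can change out from under it, while an event freezes the price as it was when the thing actually happened. (All the names and values below are made up.)

    # Snapshot style: a foreign key; read it later and you get today's price.
    reservation = {"widget_id": 12345, "price_id": 77}

    # Event style: the values that mattered are copied into the event itself.
    event = {
        "type": "reservation_made",
        "widget_id": 12345,
        "price_at_booking": 49.95,          # survives later price changes
        "occurred_at": "2017-08-20T10:15:00Z",
    }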

But it was, after all, “The Database”.

Times have changed. What was “The Database” has become a virtual collection of different data repositories: databases both relational and not, object stores, streams, caches, and all manner of other things. Taken together with whatever applications are used for inquiry, update, and maintenance, they constitute “The Database”. It’s not quite as neat (well, not yet) and isn’t tied up quite so well (yet), but it is hugely more flexible. The major challenge will be to keep it coherent.

Let’s consider what “the database” might consist of right now:

  • A few mysql instances
  • Some number of Kinesis streams
  • A bunch of S3 buckets
  • An ElasticSearch cluster
  • A Redshift cluster
  • A Tableau cluster

Is it too much to deal with? Well, perhaps. But let’s turn things “on their side”, as it were. If you squint just a little and look at it right, “The Database” really comes down to two things: the mysql instances (mostly, and hopefully increasingly, used for the “store” data: inventory we can sell, reservations, move-ins, the stuff of the day-to-day operations of the business) and the bunch of S3 buckets (the ultimate destination for everything that concerns the behavior of customers and potential customers, all the information from which we can make future predictions and educated guesses about what might, or will, work in the future).
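
As a sketch of that split (the hostnames, credentials, bucket, and table below are all invented, and the pymysql/boto3 calls are one plausible way to do it, not necessarily ours):

    import json
    import boto3
    import pymysql

    # Operational "store" data lives in mysql: small, mutable, current.
    conn = pymysql.connect(host="mysql.internal", user="app",
                           password="...", db="store")
    with conn.cursor() as cur:
        cur.execute("UPDATE widgets SET available = available - 1 WHERE id = %s",
                    (12345,))
    conn.commit()

    # Behavioral data lands in S3: append-only, immutable, keyed by time.
    s3 = boto3.client("s3")
    event = {"type": "widget_reserved", "widget_id": 12345,
             "at": "2017-08-20T10:15:00Z"}
    s3.put_object(
        Bucket="example-events",    # hypothetical bucket
        Key="events/2017/08/20/widget_reserved-0001.json",
        Body=json.dumps(event).encode(),
    )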

Kinesis is a durable transport mechanism, allowing for realtime subscribing. (It’s The Stream!)
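
In boto3 terms, producing to The Stream and subscribing to it might look something like this (the stream name and shard id are placeholders):

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Produce: anyone with something to say writes to The Stream.
    kinesis.put_record(
        StreamName="events",                  # hypothetical stream name
        PartitionKey="widget-12345",
        Data=json.dumps({"type": "widget_reserved",
                         "widget_id": 12345}).encode(),
    )

    # Subscribe: each consumer reads at its own pace, from its own position.
    shard_it = kinesis.get_shard_iterator(
        StreamName="events",
        ShardId="shardId-000000000000",
        ShardIteratorType="TRIM_HORIZON",     # start from the oldest record
    )["ShardIterator"]
    for record in kinesis.get_records(ShardIterator=shard_it)["Records"]:
        print(json.loads(record["Data"]))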

Redshift is an analytics tool that can derive its information either by subscribing to Kinesis and grabbing data as it flows gently down the stream, or by waiting until it settles out and “lands” in S3.
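
The “landed in S3” path is a plain Redshift COPY; something like the following, with the connection details, table, and IAM role invented for illustration:

    import psycopg2

    # Redshift speaks the postgres wire protocol; COPY pulls landed
    # data in from S3 in bulk.
    conn = psycopg2.connect(host="redshift.internal", port=5439,
                            dbname="analytics", user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY events
            FROM 's3://example-events/events/2017/08/20/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
            FORMAT AS JSON 'auto';
        """)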

Tableau is a further subscriber.

But the source of truth is mysql and S3. Everything else can be regenerated as needed. And the ability to regenerate contents sure takes a lot of the weight off of developing a system!
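
For instance, if the ElasticSearch cluster were lost tomorrow, the rebuild is (conceptually, at least) just a replay of what’s in S3. A minimal sketch, again with invented names:

    import json
    import boto3
    from elasticsearch import Elasticsearch

    s3 = boto3.client("s3")
    es = Elasticsearch(["http://elasticsearch.internal:9200"])

    # Walk the event bucket and re-index everything we find.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="example-events", Prefix="events/"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket="example-events",
                                 Key=obj["Key"])["Body"].read()
            es.index(index="events", body=json.loads(body))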

Yes, there are more pieces. It’s not nearly so simple, and it will undoubtedly evolve over the course of time. But at its root, it’s still “The Database”.
