“The Database” (2017-08-20)

Once upon a time there was “The Database”. It was all-encompassing and it was good.

It provided a current snapshot of reality. It was “the data”. It was “the state of things”. It was the “source of truth”.

It was spoken of with reverence. A tribe of humans called “DBAs” spent their considerable waking hours (many of which were involuntary) worrying about data integrity and consistency and the status of many backups; maintaining the ability to restore as needed was paramount.

But it was not only the “source of truth”. It was the way in which all data was accessed, which had to be done quickly. Sometimes access patterns were well-anticipated. Sometimes, however, they weren’t, requiring non-trivial work to be performed on live data to bash it into the appropriate shape. Such efforts were rarely fun for the practitioners involved. Typically they took place during the wee hours, somewhere in the hazy hours between the day’s processing and the nightly offline processing, between the backups and the next morning when it would all start again.

And since it was the source of truth, it held all the answers. Or at least anything that could form part of an answer under some circumstance was put there because, well, where else would you possibly look? And since there was nowhere else, your application could read data left by my application and my application could read data left by yours OR change it. Or take some and add more. Or delete some. You could write data to it and read data from it.

It was the integration platform! It was a flying flock of global variables just waiting to be created and destroyed, mutated and associated. There was a context and it was wide.

And devilishly difficult to reason about, at least in any reasonable way.

Now this is not to say – or even imply – that it did not serve its purpose. Serving the database was the organizing principle of most of the code. The state of the database was a reflection of all that had occurred in the past. Note that I say “a reflection” because we (only rarely) had all the information; if we had the value 5 we rarely had the 3 and the 2 that had been added together to make that 5. Not a big deal. Well, not a big deal if we don’t decide that that 2 should have been a 6 upon further reflection, but later for that. The point, though, is that we were typically concerned with the current value; you want to know if widget #12345 is available now, not the whole history of how that came to be.
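To make the 5-versus-3-and-2 point concrete, here is a minimal sketch in Python. The names (Adjustment, current_value) are hypothetical, made up for illustration rather than taken from any system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adjustment:
    """One recorded change to a quantity (hypothetical event type)."""
    amount: int


def current_value(history):
    """Derive the snapshot by summing every recorded adjustment."""
    return sum(event.amount for event in history)


# The snapshot approach keeps only this number.
snapshot = 5

# The history approach keeps the parts that produced it.
history = [Adjustment(3), Adjustment(2)]
assert current_value(history) == snapshot

# Deciding later that the 2 should have been a 6 is a replay, not a guess.
corrected = [Adjustment(3), Adjustment(6)]
print(current_value(corrected))  # 9
```

With only the snapshot in hand, there is no way to know which part of the 5 to correct.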

There were good reasons for this approach. Reliable random access storage ain’t free. (For a historical – well up to three years ago – graphical display, see: http://www.mkomo.com/cost-per-gigabyte-update. Be sure to notice that the graph is logarithmic.) When a gigabyte cost tens of thousands of dollars or even tens of dollars the idea of only keeping a current snapshot – and considering that snapshot to be the durable source of truth worthy of care and feeding – made perfect sense. Now, perhaps, not so much. Strangely (well, actually not) when costs came down for storage instead of going to a model that traced the development of entities through time, what typically happened was that we began to use the database as a place to record events as they happened. After all, why not? It was a hunk of persistent storage that was well secured, was well cared for and available. And if anything happened to it, there was someone who’d come running to fix it.

The problem, though, is that this was a completely different type of data. Individual events might have specific significance for a period of time, but this individual importance receded quickly; it was primarily significant in the aggregate. Further, its relationship to normalization was different. The state of any external references at the time of the event was likely to be more significant than their state when the data was read. Even worse, though, was the fact that these collections tended to grow. And grow. And grow. And ultimately outgrow the “nicely curated, nicely normalized snapshot data” – which tended to make operational tasks a nightmare. And, because locating individual pieces was often necessary in the first few hours or days of the data’s life, it was well-indexed. Forever. At a non-trivial cost.
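The normalization point is easier to see in code. In this rough Python sketch (the names current_price and record_sale are assumptions for illustration), the event copies the referenced value as it was when the event happened instead of pointing at the snapshot.

```python
# Hypothetical "snapshot" table: the current price of each widget.
current_price = {12345: 10}


def record_sale(events, widget_id):
    """Append an event that copies the referenced state as it is right now."""
    events.append(
        {"widget_id": widget_id, "price_at_sale": current_price[widget_id]}
    )


events = []
record_sale(events, 12345)

# The snapshot moves on...
current_price[12345] = 15

# ...but the event still says what was true when it happened.
print(events[0]["price_at_sale"])  # 10
```

Had the event stored only the widget id, reading it later would report 15, which is true of the present but not of the sale.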

But it was, after all, “The Database”.

Times have changed. What was “The Database” has become a virtual collection of different data repositories: databases both relational and not, object stores, streams, caches and all manner of things. Taken together with whatever applications are used for inquiry, update and maintenance, they constitute “The Database”. It’s not quite as neat (well, yet) and isn’t tied up quite so well (yet) but is hugely more flexible. The major challenge will be to keep it coherent.

Let’s consider what “the database” might consist of right now:

  • A few MySQL instances
  • Some number of Kafka streams
  • A bunch of S3 buckets
  • An ElasticSearch cluster
  • A Redshift cluster
  • A Tableau cluster

Is it too much to deal with? Well, perhaps. But let’s turn things “on their side”, as it were: If you squint just a little and look at it right, “The Database” really comes down to two things: The MySQL instances (mostly – and hopefully increasingly – used for the “store” data, inventory we can sell, reservations, move-ins, stuff related to the day-to-day operations of the business) and the bunch of S3 buckets (which are the ultimate destination for everything that concerns the behavior of customers and potential customers, all the information from which we can make future predictions and educated guesses about what might/will work in the future).

Kinesis is a durable transport mechanism, allowing for realtime subscribing. (It’s The Stream!)

Redshift is an analytics tool that can derive its information either by subscribing to Kinesis and grabbing it as it flows gently down the stream or wait until it settles out and “lands” in S3.

Tableau is a further subscriber.

But the source of truth is MySQL and S3. Everything else can be regenerated as needed. And the ability for contents to be regenerated sure takes a lot of the weight off developing a system!
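What “regenerated as needed” might look like, as a small Python sketch. The records and the build_availability_index function are hypothetical stand-ins; nothing here is the actual pipeline.

```python
from collections import defaultdict

# Hypothetical stand-in for the durable record (the "source of truth").
source_of_truth = [
    {"widget_id": 12345, "facility": "north", "available": True},
    {"widget_id": 12346, "facility": "north", "available": False},
    {"widget_id": 12347, "facility": "south", "available": True},
]


def build_availability_index(records):
    """Rebuild a derived, query-friendly view from scratch."""
    index = defaultdict(list)
    for record in records:
        if record["available"]:
            index[record["facility"]].append(record["widget_id"])
    return dict(index)


# The index can be thrown away and rebuilt at any time without losing anything.
index = build_availability_index(source_of_truth)
print(index)  # {'north': [12345], 'south': [12347]}
```

Because the derived view holds nothing that the source does not, losing it costs time but not data.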

Yes, there are more pieces. It’s not nearly so simple, and it will undoubtedly evolve over the course of time. But at its root, it’s still “The Database”.

Immutability, Purity and All Kinds of Stuff FTW (2017-07-14)

Immutability

I’m going to show you a little bit of code (don’t worry about the language, it might not exist)… Then I’m going to ask you a question, OK? Cool!

    x = 1
    y = 2
    ...
    ...
    ...

(Yes, I’m not going to show you what’s there. This is my game and I can do what I want.)

    if (x != 1) {
        destroyWorld();
    }

And yes, it’s all within the same scope. There are no branches or returns or exceptions thrown; there’s nothing that will keep the conditional from being executed.

So I ask a question made famous by an ex-cowboy actor on the streets of a fictional San Francisco a bit over forty years ago:

“Do ya feel lucky, punk?”

Well, not so much. Certainly not in the languages we use or in the way we use them. At this point in the code:

    x

could be anything at all.

And that’s the thing: In an immutable or, at least, a single-assignment language you wouldn’t have to feel lucky. You’d know that the ‘destroyWorld’ function was not about to be called. You’d know that the value 1 was bound to the name ‘x’ and that was that.

Hell, I feel safer already.

“But,” you ask, “If I can’t change anything, how do I, like, change anything?”

Well, you don’t. When you need one that’s different, you make a new one.

“Wouldn’t that be wasteful?”

Well, maybe, but not necessarily. When you make a new one, you don’t have to make a whole new one. You take the changes, say ‘and everything else is like this’ and point to the old one.

“But what if the old one ch…. OH…..”

Yup. You’re starting to get it.
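For the curious, the “take the changes and point to the old one” trick can be sketched in a few lines of Python. The Version class is hypothetical and far simpler than what real persistent data structures do; it only shows the shape of the idea.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class Version:
    """An immutable record: some changes plus a pointer to the older version."""
    changes: Mapping[str, object]
    parent: Optional["Version"] = None

    def get(self, key):
        # Look in our own changes first, then defer to the older versions.
        node = self
        while node is not None:
            if key in node.changes:
                return node.changes[key]
            node = node.parent
        raise KeyError(key)

    def with_changes(self, **changes):
        # "These changes, and everything else is like this."
        return Version(MappingProxyType(dict(changes)), parent=self)


v1 = Version(MappingProxyType({"x": 1, "y": 2}))
v2 = v1.with_changes(y=3)

print(v1.get("y"), v2.get("y"))  # 2 3
print(v2.get("x"))               # 1, read through from v1
```

Sharing v1 is only safe because nobody can change it; the read-only mapping and the frozen dataclass are what make the pointer trustworthy.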

Purity

Functions, functions, functions. Everywhere we look, there’s a function. Well, kinda.

When we think of functions from mathematics, they have this nice property: Every time you call the same function on the same arguments, you get the same answer. So 1 + 1 is always 2. That’s the way functions are defined; they depend only upon their arguments. It’s the property of being referentially transparent. Same input? Same output. Always.

In programming languages, however, things are not that simple. The value returned by a function might not be completely determined by its input. Class methods, for example, have an additional scope that might contribute to the value returned by a function, the values of class variables. Instance methods have the values of class variables and instance variables. And closures have whatever free variables are available in an enclosing scope, either implicitly (in, say, JavaScript or Python) or explicitly (in, say, PHP). And sometimes the source is even more external – like the state of the universe!

    time()

anyone?

Of course, it gets even more complicated than that. Functions can have side effects. As opposed to just returning a value like nice tidy mathematical functions, they can go behind your back and change the state of the universe or at least some more limited enclosing scope. So you call

    grabMyPhoneFromTheOtherRoom()

and you go back in there later and discover that My Mother the Car, dubbed into Basque, is playing on the TV, the guitar has been retuned to an open D# and all the books on the shelves have been rearranged, sorted by the second letter in the author’s names.

When a function behaves like a mathematical function, taking arguments and returning a value and not messing with anything else, we call it a pure function. Pure functions are cool! They’re easy to reason about! They’re easy to test! You throw arguments at them and they give you an answer – and you can check that answer! And if you ask them again, you’ll get the same answer again. And they won’t leave a mess. No muss, no fuss.
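A tiny Python sketch of the difference, using the clock as the “state of the universe”. Both function names are made up for illustration.

```python
import time


def seconds_until_impure(deadline):
    # Impure: the answer depends on the clock, not just on the argument.
    return deadline - time.time()


def seconds_until(deadline, now):
    # Pure: the same arguments always give the same answer.
    return deadline - now


print(seconds_until(deadline=100.0, now=40.0))  # 60.0
```

The pure version is trivial to check because the test supplies “now”; the impure one gives a different answer every time it is asked.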

“So we don’t want side effects, right?”

Well, it’s not quite that simple. We want side effects a lot of the time – you know, output! Persistence. Stuff like that. We like programs that do things. But it’s exactly that doing of things that makes code harder to reason about, harder to test. Why? Because we often have to do a lot of work to build up an environment in which those side effects can take place. How many times have we written the “create a bunch of objects…call some methods on the objects to get them in the right starting state…then call the method we want to test…then call a bunch of methods and collect their results to see if what we wanted to happen to the state of the objects involved actually did happen to the state of the objects involved” test? And then, we throw it all away and start over!

The worst part is that we do all this by writing (expressly non-trivial) code – WHICH WE KIND OF DON’T FULLY TRUST IN THE FIRST PLACE!!!

So, while we can’t get rid of side effects (well, we can but it’s a considerable undertaking), we can limit them and segregate them. We don’t have to smear that uncertainty about a program’s overall state throughout all the code.
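One way to limit and segregate, sketched in Python under the assumption that the work is “compute some lines, then write them somewhere”. The names report_lines and write_lines are hypothetical.

```python
import io


def report_lines(orders):
    """Pure: turn data into the lines we want to write."""
    total = sum(order["amount"] for order in orders)
    lines = [f"{order['id']}: {order['amount']}" for order in orders]
    lines.append(f"total: {total}")
    return lines


def write_lines(lines, out):
    """Impure: the only place that touches the outside world."""
    for line in lines:
        out.write(line + "\n")


orders = [{"id": "a", "amount": 2}, {"id": "b", "amount": 3}]
buffer = io.StringIO()  # stands in for a file or a socket
write_lines(report_lines(orders), buffer)
print(buffer.getvalue(), end="")
```

All the decisions live in report_lines, which can be tested with plain values; write_lines is too dull to get wrong very often.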

Most side effecting code is really quite reasonable. Aside from the aforementioned changes to the outside world, we often induce state changes in objects because:

  • We can’t just create a new one because the object is in scope to other code.
  • The change is dependent upon a lot of properties that make up the object’s current state – and who wants a dozen variables in a parameter list?

Even some of these cases can be resolved with design. Wide and flat can be your worst enemy.

In general though, one of the great advantages of pure functions is that they’re easier to test. And the tests themselves are much more likely to be testing the right things the right way.

Types anyone?

Types are cool. Typing is cool, too.

Wait? What?

“We prefer dynamically-typed languages because we were damaged by our experience with Java.” (This effect is considerably more acute among those who used Java before the introduction of generics.) “Types restrict you too much.” “I don’t like programming in B&D languages.” “Statically-typed languages are not as expressive.”

But whether we’re looking at going in a statically-typed direction (unlikely) or a gradually-typed direction (could be) or a predicate-based system (see http://learnyousomeerlang.com/dialyzer or https://clojure.org/guides/spec) it’s all about being able to provide more information about a program within a program. It also gives us a way to reject programs that are not coherent at a point before run time (which is usually a better time for it to fail). And it saves us from having to write a whole tier of tests we’d otherwise have to write.

And with a good type system, you might even get more information about the pieces of data you’re working on.
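As a rough illustration of the gradually-typed direction, here is a Python sketch using type hints. WidgetId, FacilityId and is_available are hypothetical names, and the claim about the checker assumes a tool such as mypy is actually being run.

```python
from typing import Dict, NewType

WidgetId = NewType("WidgetId", int)
FacilityId = NewType("FacilityId", int)


def is_available(widget: WidgetId, stock: Dict[WidgetId, int]) -> bool:
    """True when at least one unit of the widget is in stock."""
    return stock.get(widget, 0) > 0


stock = {WidgetId(12345): 2}
print(is_available(WidgetId(12345), stock))  # True

# A static checker such as mypy is expected to reject the next call, because a
# FacilityId is not a WidgetId. At run time both are plain ints, so nothing
# would stop it.
# is_available(FacilityId(7), stock)
```

The annotations cost a few characters and tell both the reader and the tooling which kind of integer is meant.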

So…

All right. We’re not going to change our stacks radically any time soon – nor should we (necessarily). But we could use some techniques that are available, no matter what the stack looks like, to make things a little bit more reliable, a little bit easier (and hopefully quicker) to work on, and certainly easier to reason about. We can do it a lot or do it a little, anything should make us better. And that’s what matters.

Name Things! (2017-06-23)

One of the best Computer Science jokes (okay, go along with it for now…) goes like this:

There are only two hard problems in CS:

  • Naming things
  • Cache invalidation
  • Off by one errors

We’ve gone to great lengths to mitigate the effect of the last of these. Rarely do we use index-based, perhaps nested, loops to traverse the elements of an array. More often we’ll use a foreach construct or a comprehension or a map or whatever. It’s an improvement, it’s less error-prone and it reflects the semantics of what we’re trying to accomplish.
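The two styles side by side, as a small Python sketch with made-up data:

```python
prices = [3, 5, 8]

# Index-based: the bounds are ours to get wrong.
doubled_by_index = []
for i in range(len(prices)):
    doubled_by_index.append(prices[i] * 2)

# Comprehension: no index, no bounds, and it says what we mean.
doubled = [price * 2 for price in prices]

print(doubled)  # [6, 10, 16]
```

Both produce the same list, but only the first offers a place to write `len(prices) + 1` by mistake.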

Cache invalidation in our increasingly massively concurrent distributed age remains a problem – and a significant one. We tend to either err on the side of synchronization (old-style) or we just accept the notion of eventual consistency, accepting the idea that it can be worthwhile to trade the pure assurance of linearized operations for a little bit of timeline uncertainty and gain much speed (and, at least sometimes, simplicity) in return.

And then, perhaps the most fundamental scourge of them all: Naming things!

Yes, naming things is hard. And yes, there exist approaches like point-free style and the use of long sequences of anonymous callbacks to avoid doing it. There are a lot of instances where that approach makes sense: intermediate values, classic continuation-passing-style sequences and so on. And we’ve all (well, most of us) lived through various naming conventions that encode all sorts of information about an entity in its name; Hungarian without any particular Magyar influence. And don’t get me started about pattern-based naming conventions; ‘FixtureEntityFactoryManagerFactoryFactory’ may provide a lot of information but it’s unreadable – especially if it shows up more than once within a single field of vision.

But it’s still important to name things. Both values and functions. (No, not every single one. Of course not. But often more than we typically do.)

But how should we name things? Ah. It’s time to be controversial. While using a consistent naming format (‘getThingFromSomewhere’, ‘setThingToSomewhere’) is good from one perspective (leveraging an IDE, for example) it can be hard to read from the standpoint of having everything look alike. It helps when the semantics stand out. Objects have behavior associated with them; they are not just bags into which we put data bits (well sometimes they are, but bear with me). Naming behaviorally can greatly aid in understanding the intentions of a piece of code; one is typically much less likely to get lost in a landscape with distinguishable features! Consistency is more likely to be a hindrance than a help!! Exclamation points can be annoying!!!

(This is not to say that consistency with regard to how a piece of data functions is a bad thing; calling the primary key of a table called whatever “whatever_id” is valuable to be sure. But calling the data items in the table ‘whatever_this’, ‘whatever_that’ and ‘whatever_the_other_thing’ is probably visual pollution.)

So how should we name things? Name things for what they are. Name pure functions/methods for what they return. Name impure functions/methods for the side effect they have on the environment. Named things are worthwhile because they encapsulate semantics. Named things can exist in the (more limited) solution domain as opposed to the programming domain (which, by being more general, is more complex). It’s the power of abstraction – which really makes it all work. There’s a reason we don’t write so much assembler these days.
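A short Python sketch of that naming rule. The reservation data and both function names are invented for illustration.

```python
from datetime import date


def overdue_reservations(reservations, today):
    """Pure: named for what it returns."""
    return [r for r in reservations if r["due"] < today]


def send_reminders(reservations, outbox):
    """Impure: named for its effect; it appends messages to the outbox."""
    for r in reservations:
        outbox.append(f"Reservation {r['id']} is overdue")


reservations = [
    {"id": 1, "due": date(2017, 6, 1)},
    {"id": 2, "due": date(2017, 7, 1)},
]
outbox = []
send_reminders(overdue_reservations(reservations, date(2017, 6, 23)), outbox)
print(outbox)  # ['Reservation 1 is overdue']
```

Read aloud, the last call is nearly a sentence in the solution domain: send reminders for the overdue reservations.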

So names are important. Make them descriptive. Make them convey actual information about the domain they’re serving. Make them distinguishable.

And use them.

Cleanliness is next to … (2017-06-14)

Well, I wouldn’t go that far.

BUT – Warnings in code are not good things. Some are inevitable. Most are not.

On inevitable ones, there are two choices: Either shut ’em up in the IDE or comment why they’re there. The latter is very often preferable. (YMMV)

We do the vast majority of our work in dynamic languages without distinguished compile and link steps. Our builds aren’t really “builds” at all; they are resource aggregation/artifact creation/test activities. And, by default, they never even really look at the code – at least not carefully.

This is where static analysis comes in, whether it’s added as a separate step in the “build” or just turned on in the IDE.

To a great extent, the use of static analysis facilities merely replaces the kinds of type-checking you would get for free in a statically-typed context. But, statically-typed contexts present their own problems (insert Java rant). Even statically-typed contexts with enlightened type systems can present problems – either you get into situations where there are types you cannot represent (but type inference is decidable) or types can be represented (with type inference undecidable). Esoteric stuff, perhaps, but not terribly surprising. It’s just the knowledge of the twentieth century (that you can’t know everything, no matter how hard you try) coming home to roost.

The inherent difficulty of typing systems has made dynamic typing more attractive in many cases – but at a cost: Without some kind of annotation, you don’t necessarily know much about the kinds of entities you’re operating upon in any given context, so you don’t know what they can do… Although we write code for computers to execute, we write code for humans to read. It is they, after all, who need to understand it well enough to change it, to make it do different things as required. We also want to be able to efficiently navigate the code, trace what the call stack looks (or will look) like. Being certain of what’s actually being called is an incredible boon to grokking code – especially when you have to <gasp> fix it … more especially when you have to <gasp> <gasp> fix it under pressure.

There’s nothing like that little green check to tell you that at least one class of (unruly) surprises has been eliminated from your code grovel.

Cleanin’ up the campsite == A GOOD THING!!

Tests and Testability and Debugging and Maintainability (2017-05-18)

Tests. Tests are good. They enable us to make sure that code works correctly – or, to be more precise, that it works according to the restatement of the requirement contained in the test.

Having to write tests is worthwhile all by itself. There is nothing like having to write tests to make you write code that’s inherently testable. And what’s most testable? Functions with well defined inputs and outputs that neither affect nor depend upon any external state. Pure functions, for example.

But, as the huckster said, there’s more…

Any time you can limit the ‘universe’ that any piece of code can see, you also limit what you have to look at when things go wrong. And that can ease both debugging and future changes. Locality matters. Nearby is always better than far away. Unfortunately, it appears the predominant methodology of building web applications only serves to encourage spreading the semantics of code all over the codebase, segregating it by how it is used structurally by the program rather than by what it models. As a result adding functionality can become a chore. Rather than being able to just add all the behavior of a modeled type in one place, it must be added in several; the persistent data aspect is added in one place, the behavior of the modeled data in another, verification (i.e. “Can a coherent instance be built from these components?”) in another, and how it is to be displayed to or received from external sources in one or several more.

Now I don’t know about you, but in my case the more assets I need to touch, the more I can break. (I’m not even sure we can say it’s limited to being a linear relationship.)

So I tend not to be thrilled about having to make changes or additions all across a codebase.

Let’s take a step back: When a test fails, what does that mean? Obviously it means that some expectation of the code is not met; it also means “we have work to do!” But what is that work? Where is it? What has gone wrong? Too often, because of the diffuse way we tend to structure things, debugging the problem requires taking on a rather significant cognitive load; you have to know too much and dig too deep to debug. Painful? Sure. Inefficient? Definitely.

Brief interlude:

Why do we like to use garbage-collected languages?

Because having to allocate and deallocate memory is hard and annoying? Well, sure. But what’s the real reason?

Because debugging things like memory leaks or, especially, segfaults SUCKS!!!

And why does it suck? Because the segfault itself is typically separated from its cause both in code and in time. As a result you have to work backwards through a richly constructed state to find the problem.

Sound familiar? You’re working on a codebase that’s new to you, adding a feature. You write some code, tests are in place and it all seems good. Suddenly (insert ominous music here) there’s a problem: Test #74, confirmFacilityDataIsUpdated fails with the message “0 is not greater than 0”. And, as a result, your change is not going out. At least not yet. Naturally, the first thing you do is look at the test itself. (Actually the second; the invective, whether private or shared, likely comes first.) Looking at the test’s code you see state being built up either in the form of mocks or, perhaps, by dragging data out of some database or another (and just assuming it’s there). Then some code is called to somehow mutate that state. Then, finally, some more code (the assertion itself) checking to see if the state has been mutated in some particular way. Wow.

Armed with all this knowledge, what’s the first thing you do? RERUN THE TESTS, THAT’S WHAT YOU DO! (And maybe, just maybe, it goes green, you issue the PR for your own – nicely tested – new feature and all is well.)

More likely, though, you start digging through the codebase like an ever increasing wavefront, eventually find a Heisenbug, think just a little bit less of your colleagues and move on. Though, yes, it’s part of what we do, it tends to be more painful than it needs to be. And pain, as a developer, reduces your bandwidth and makes you less effective. And often, we tend to test at a feature level – where a lot of state has already been munged – because imperative code has a nasty habit of swallowing up state changes, not producing intermediate, more fundamental results that can be tested more directly. And, perhaps, that’s the whole point.

Tests, at their best, provide several benefits; the first two are obvious, telling you that your code is:

  • ‘Correct’ (i.e. conforming to the specification expressed by the test itself)
  • ‘Still correct’ (i.e. nothing has broken that contract since the last time the code was run)

But don’t stop there. You can write the tests first, at which point they become very much akin to being a formal specification. And the tests will be simple. And the code you write against them will be more value oriented (when you write a test first, you’re going to write it as simply as you can; it’s always preferable to look at a value as opposed to collecting a bunch of state).

This post has gotten a bit long – but one more thing before we get out of here: In a TDD world, a failing test is a signal that code needs to be written or changed. And it’s the same if you’re debugging or enhancing existing code. So why not use the test suite as a place to hold information about what’s expected of the code? (Which is really what it does anyway.) Yes, zero may not be greater than zero – but knowing why that’s meaningful would sure be helpful.
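Here is one way a test could carry that information, as a Python sketch. The function updated_facility_count and the data are hypothetical; the test name only echoes the failing test mentioned earlier.

```python
def updated_facility_count(before, after):
    """Pure: how many facility records differ between two snapshots."""
    return sum(1 for key in after if before.get(key) != after[key])


def test_facility_data_is_updated():
    before = {"north": "closed", "south": "open"}
    after = {"north": "open", "south": "open"}
    changed = updated_facility_count(before, after)
    # The message says why the number matters, not just that it is wrong.
    assert changed > 0, (
        "expected at least one facility record to change after the update; "
        f"got {changed}, which would mean the update was never applied"
    )


test_facility_data_is_updated()
```

When this one fails, whoever is looking at it learns what was supposed to happen without having to dig through the setup first.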

[Addendum: On a podcast, I recently ran across the idea of BDD (Behavior-Driven-Development) vs. TDD (Test-Driven-Development), corresponding, roughly, to unit tests and somewhere between ‘feature’ and integration tests, where some kind of end-to-end behavior is tested. There’s clearly some value here. Some of the effort devoted to the most trivial of unit tests could be utilized in testing the larger behaviors; any failure of behavior tests without corresponding failure of underlying unit tests would pretty well confine the error to the higher level behavior-oriented logic.]