Just an intro…

I’ve written code on these and these and these and these. I wrote some SL/I (a dialect of PL/I), where the semicolon was spelled “.,”. Made a living in Algol. Even wrote some APL code.

I have a copy of “The C Programming Language” (not the ANSI C EDITION).

Procedural? Check. OO? Check. Homoiconic? Check. Immutable? Check. Functional? Check, check.

I like this stuff and I sometimes write about it.

The previous entries here are items originally written for Sparefoot’s Engineering Blog, which I use here with permission. Thank y’all for that.

Today’s Aphorisms (2017-10-04)

Create value or create language by which your colleagues can create value. It’s all value.

“People > Process” means just that. It’s unambiguous.

Narrow contexts compose a hell of a lot better.

Saying “we’ll simplify later” is the worst kind of technical debt.

Complication eats effort. And yields nothing.

Dealing with complication is like paying protection money; all it does is cost.

TEAM is a powerful concept. Multiple sets of eyes is a benefit, not a cost.

Locality Matters (2017-09-07)

One of the things that can make our codebase somewhat difficult to work with (making defects much too easy to create) is that we often lack locality of concept.

(Hmmm. Sounds like something Gold would say. But what the hell does it mean?)

Code both does things and decides how things are to be done. Is this valid? No? Kick it out! Is it red? Do this. Is it blue? Do this. Is it purple? Ah, that’s a special case, do this, this and this.

Despite the possible complexity of these decisions, it’s not too hard to keep all the business logic contained therein straight – at least so long as all of that business logic lives in the same place. Often, for what are good, pragmatic reasons at the time of writing, some of this logic ends up getting smeared about the codebase. And before long, some of the behavior associated with various objects becomes somewhat less clear, especially when new functionality is introduced.

So keep that in mind. No one really wants to have to do a deep code grovel to understand how an instance of a class behaves; besides, having to do such a grovel greatly enhances the chances of getting things wrong.

So keep it together, because locality matters!
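Here’s a minimal sketch, in Python, of what keeping it together might look like. The colors come from the example above; the function name and the step names are made up for illustration.

```python
# Hypothetical example: every color decision lives in this one function,
# so nobody has to grovel through the codebase to learn how "purple" behaves.
VALID_COLORS = {"red", "blue", "purple"}

def steps_for(color: str) -> list[str]:
    """Return the steps to perform for an item of the given color."""
    if color not in VALID_COLORS:
        raise ValueError(f"invalid color: {color!r}")  # Kick it out!
    if color == "red":
        return ["red step"]
    if color == "blue":
        return ["blue step"]
    # Purple is the special case: it takes several steps.
    return ["purple step 1", "purple step 2", "purple step 3"]
```

If a new color shows up, there is exactly one place to look and one place to change.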

Getting It Write the First Time (2017-08-20)

We write code. That’s what we do.

Well, that’s not entirely correct. We solve problems. We provide solutions. When we can do it without code, we declare it a win.

But most of the time we write code.

How do we write code? We don’t chisel it into stone. We don’t have to pull individual pieces of type from various drawers and physically typeset it page by page. We don’t carve it into wood. We don’t spend the time and effort of bending pipe-cleaners into letters and numbers (and punctuation) and glue it to a large board.

No. We open an editor or an IDE or whatever and just start hitting the keys on the keyboard. Some are faster than others. (And most are faster than me.)

But is it hard to write code? Well, no. Certainly no harder than writing about code. Or writing about the design of code. So what we do is easy, right?

Well, sometimes. But overall not so much. Yes, writing code is easy. Writing good code, on the other hand, code that scales well, code that is robust, code that does what we think it does and what it should do, code we can read, code that can be changed reasonably when such changes in requirements occur, code that runs fast enough for our requirements, well THAT’S HARD.

Frequently we nerd wrestle over system design; we ask whether this functionality should go here or go there, whether we should take this approach or that approach. We stand at whiteboards and draw an endless array of pictures and diagrams, boxes, arrows, dotted lines, all that kind of stuff. Now I am, by no means, suggesting that these are not valuable activities. Thrashing out our hypotheses on things we don’t know yet, on systems yet to be created is our opportunity to learn, our opportunity to form a sufficient model of a system to actually implement it.

Sometimes, though, we can get a little stuck. Boxes and pictures and arrows can be somewhat ambiguous. Different people can come away from such a session with different understandings of what’s been proposed. What if (he asked) we had some kind of notation, some kind of consistent abstraction in which to express our system design decisions? Hmmm. What could something like that possibly look like?

“The Database” (2017-08-20)

Once upon a time there was “The Database”. It was all-encompassing and it was good.

It provided a current snapshot of reality. It was “the data”. It was “the state of things”. It was the “source of truth”.

It was spoken of with reverence. A tribe of humans called “DBAs” spent their considerable waking hours (many of which were involuntary) worrying about data integrity and consistency and the status of many backups; maintaining the ability to restore as needed was paramount.

But it was not only the “source of truth”. It was the way in which all data was accessed, which had to be done quickly. Sometimes access patterns were well-anticipated. Sometimes, however, they weren’t, requiring non-trivial work to be performed on live data to bash it into the appropriate shape. Such efforts were rarely fun for the practitioners involved. Typically they took place during the wee hours, somewhere in the hazy hours between the day’s processing and the nightly offline processing, between the backups and the next morning when it would all start again.

And since it was the source of truth, it held all the answers. Or at least anything that could form part of an answer under some circumstance was put there because, well, where else would you possibly look? And since there was nowhere else, your application could read data left by my application and my application could read data left by yours OR change it. Or take some and add more. Or delete some. You could write data to it and read data from it.

It was the integration platform! It was a flying flock of global variables just waiting to be created and destroyed, mutated and associated. There was a context and it was wide.

And devilishly difficult to reason about, at least in any reasonable way.

Now this is not to say – or even imply – that it did not serve its purpose. Serving the database was the organizing principle of most of the code. The state of the database was a reflection of all that had occurred in the past. Note that I say “a reflection” because we (only rarely) had all the information; if we had the value 5 we rarely had the 3 and the 2 that had been added together to make that 5. Not a big deal. Well, not a big deal if we don’t decide that that 2 should have been a 6 upon further reflection, but later for that. The point, though, is that we were typically concerned with the current value; you want to know if widget #12345 is available now, not the whole history of how that came to be.
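A tiny sketch of the 5-versus-3-and-2 point, in Python. The numbers are the ones from the paragraph above; the variable names are just for illustration.

```python
# A snapshot keeps only the 5; an event log keeps the 3 and the 2 behind it.
snapshot = 5

events = [3, 2]                   # what actually happened, in order
assert sum(events) == snapshot    # the snapshot can be derived from the log

# "That 2 should have been a 6": with the log we can correct and re-derive.
corrected = [3, 6]
new_snapshot = sum(corrected)

# With only the snapshot there is no way to tell which part of the 5 was wrong.
```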

There were good reasons for this approach. Reliable random access storage ain’t free. (For a historical – well, up to three years ago – graphical display, see: http://www.mkomo.com/cost-per-gigabyte-update. Be sure to notice that the graph is logarithmic.) When a gigabyte cost tens of thousands of dollars or even tens of dollars, the idea of only keeping a current snapshot – and considering that snapshot to be the durable source of truth worthy of care and feeding – made perfect sense. Now, perhaps, not so much. Strangely (well, actually not), when costs came down for storage, instead of going to a model that traced the development of entities through time, what typically happened was that we began to use the database as a place to record events as they happened. After all, why not? It was a hunk of persistent storage that was well secured, was well cared for and available. And if anything happened to it, there was someone who’d come running to fix it.

The problem, though, is that this was a completely different type of data. Individual events might have specific significance for a period of time, but this individual importance receded quickly; it was primarily significant in the aggregate. Further, its relationship to normalization was different. The state of any external references at the time of the event was likely to be more significant than their state when the data was read. Even worse, though, was the fact that these collections tended to grow. And grow. And grow. And ultimately outgrow the “nicely curated, nicely normalized snapshot data” – which tended to make operational tasks a nightmare. And, because locating individual pieces was often necessary in the first few hours or days of the data’s life, it was well-indexed. Forever. At a non-trivial cost.

But it was, after all, “The Database”.

Times have changed. What was “The Database” has become a virtual collection of different data repositories: databases both relational and not, object stores, streams, caches and all manner of things. Taken together with whatever applications are used for inquiry, update and maintenance, they constitute “The Database”. It’s not quite as neat (well, yet) and isn’t tied up quite so well (yet) but is hugely more flexible. The major challenge will be to keep it coherent.

Let’s consider what “the database” might consist of right now:

  • A few mysql instances
  • Some number of Kafka streams
  • A bunch of S3 buckets
  • An ElasticSearch cluster
  • A Redshift cluster
  • A Tableau cluster

Is it too much to deal with? Well, perhaps. But let’s turn things “on their side”, as it were: If you squint just a little and look at it right, “The Database” really comes down to two things: The mysql instances (mostly – and hopefully increasingly – used for the “store” data, inventory we can sell, reservations, move ins, stuff related to the day-to-day operations of the business) and the bunch of S3 buckets (which are the ultimate destination for everything that concerns the behavior of customers and potential customers, all the information from which we can make future predictions and educated guesses about what might/will work in the future).

Kinesis is a durable transport mechanism, allowing for realtime subscribing. (It’s The Stream!)

Redshift is an analytics tool that can derive its information either by subscribing to Kinesis and grabbing it as it flows gently down the stream or by waiting until it settles out and “lands” in S3.

Tableau is a further subscriber.

But the source of truth is mysql and S3. Everything else can be regenerated as needed. And the ability for contents to be regenerated sure takes a lot of the weight off developing a system!
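To make the “regenerate as needed” idea a bit more concrete, here’s a toy sketch in Python. The list stands in for the source of truth and the dictionary stands in for any derived store; the record fields and the function name are hypothetical.

```python
# Hypothetical records standing in for the source of truth (mysql and S3).
SOURCE_OF_TRUTH = [
    {"id": 1, "city": "Austin"},
    {"id": 2, "city": "Dallas"},
    {"id": 3, "city": "Austin"},
]

def build_city_index(records):
    """Derive a lookup of record ids by city; nothing in it is authoritative."""
    index = {}
    for record in records:
        index.setdefault(record["city"], []).append(record["id"])
    return index

index = build_city_index(SOURCE_OF_TRUTH)
index.clear()                                # lose the derived store...
index = build_city_index(SOURCE_OF_TRUTH)    # ...and just regenerate it
```

Losing the index is an inconvenience, not a disaster, because it was never the truth in the first place.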

Yes, there are more pieces. It’s not nearly so simple, and it will undoubtedly evolve over the course of time. But at its root, it’s still “The Database”.

Immutability, Purity and All Kinds of Stuff FTW (2017-07-14)

Immutability

I’m going to show you a little bit of code (don’t worry about the language, it might not exist)… Then I’m going to ask you a question, OK? Cool!

    x = 1
    y = 2
    ...
    ...
    ...

(Yes, I’m not going to show you what’s there. This is my game and I can do what I want.)

    if (x != 1) {
        destroyWorld();
    }

And yes, it’s all within the same scope. There are no branches or returns or exceptions thrown; there’s nothing that will keep the conditional from being executed.

So I ask a question made famous by an ex-cowboy actor on the streets of a fictional San Francisco a bit over forty years ago:

“Do ya feel lucky, punk?”

Well, not so much. Certainly not in the languages we use or in the way we use them. At this point in the code:

    x

could be anything at all.

And that’s the thing: In an immutable or, at least, a single-assignment language you wouldn’t have to feel lucky. You’d know that the ‘destroyWorld’ function was not about to be called. You’d know that the value 1 was bound to the name ‘x’ and that was that.

Hell, I feel safer already.
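Python doesn’t give you single assignment for plain names, but a frozen dataclass gets reasonably close for fields. The sketch below is only an approximation of the idea; the `Bindings` class is made up for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Bindings:
    x: int
    y: int

b = Bindings(x=1, y=2)

try:
    b.x = 99                  # whatever hid in the "..." cannot do this
except FrozenInstanceError:
    rebind_rejected = True
else:
    rebind_rejected = False

world_destroyed = b.x != 1    # False: x is still 1
```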

“But,” you ask, “If I can’t change anything, how do I, like, change anything?”

Well, you don’t. When you need one that’s different, you make a new one.

“Wouldn’t that be wasteful?”

Well, maybe, but not necessarily. When you make a new one, you don’t have to make a whole new one. You take the changes, say ‘and everything else is like this’ and point to the old one.

“But what if the old one ch…. OH…..”

Yup. You’re starting to get it.
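A rough sketch of “take the changes and point to the old one”, using Python’s standard library. `MappingProxyType` gives a read-only view of the old one, and `ChainMap` looks in the changes first and falls back to the old one. The field names are invented, and `ChainMap` itself is still a mutable wrapper, so treat this as a picture of the idea rather than a true persistent data structure.

```python
from collections import ChainMap
from types import MappingProxyType

# A read-only view of the "old one" (field names are made up).
old = MappingProxyType({"name": "widget", "color": "red", "size": 3})

# The "new one": just the changes, plus a pointer to the old one.
new = ChainMap({"color": "blue"}, old)

new_color = new["color"]    # found in the changes
new_name = new["name"]      # falls through to the old one
old_color = old["color"]    # the old one is untouched
```

Since the old one can’t change, it’s safe for any number of new ones to point at it.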

Purity

Functions, functions, functions. Everywhere we look, there’s a function. Well, kinda.

When we think of functions from mathematics, they have this nice property: Every time you call the same function on the same arguments, you get the same answer. So 1 + 1 is always 2. That’s the way functions are defined; they depend only upon their arguments. It’s the property of being referentially transparent. Same input? Same output. Always.

In programming languages, however, things are not that simple. The value returned by a function might not be completely determined by its input. Class methods, for example, have an additional scope that might contribute to the value returned by a function, the values of class variables. Instance methods have the values of class variables and instance variables. And closures have whatever free variables are available in an enclosing scope, either implicitly (in, say, JavaScript or Python) or explicitly (in, say, PHP). And sometimes the source is even more external – like the state of the universe!

time()

anyone?

Of course, it gets even more complicated than that. Functions can have side effects. As opposed to just returning a value like nice tidy mathematical functions, they can go behind your back and change the state of the universe or at least some more limited enclosing scope. So you call

grabMyPhoneFromTheOtherRoom()

and you go back in there later and discover that My Mother the Car, dubbed into Basque, is playing on the TV, the guitar has been retuned to an open D# and all the books on the shelves have been rearranged, sorted by the second letter in the author’s names.

When a function behaves like a mathematical function, taking arguments and returning a value and not messing with anything else, we call it a pure function. Pure functions are cool! They’re easy to reason about! They’re easy to test! You throw arguments at them and they give you an answer – and you can check that answer! And if you ask them again, you’ll get the same answer again. And they won’t leave a mess. No muss, no fuss.
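Here’s the difference in miniature, sketched in Python. Both function names are made up; the only real library call is `time.time()`.

```python
import time

def seconds_until_impure(deadline: float) -> float:
    """Impure: the answer also depends on the clock, not just the argument."""
    return deadline - time.time()

def seconds_until(deadline: float, now: float) -> float:
    """Pure: same arguments, same answer, every time."""
    return deadline - now
```

The pure version can be checked with nothing but arguments: `seconds_until(100.0, 40.0)` is `60.0` today, tomorrow and next year. The impure one needs a clock you control before you can say anything about it.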

“So we don’t want side effects, right?”

Well, it’s not quite that simple. We want side effects a lot of the time – you know, output! Persistence. Stuff like that. We like programs that do things. But it’s exactly that doing of things that makes code harder to reason about, harder to test. Why? Because we often have to do a lot of work to build up an environment in which those side effects can take place. How many times have we written the “create a bunch of objects…call some methods on the objects to get them in the right starting state…then call the method we want to test…then call a bunch of methods and collect their results to see if what we wanted to happen to the state of the objects involved actually did happen to the state of the objects involved” test? And then, we throw it all away and start over!

The worst part is that we do all this by writing (expressly non-trivial) code – WHICH WE KIND OF DON’T FULLY TRUST IN THE FIRST PLACE!!!

So, while we can’t get rid of side effects (well, we can but it’s a considerable undertaking), we can limit them and segregate them. We don’t have to smear that uncertainty about a program’s overall state throughout all the code.
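One way to picture “limit them and segregate them”, as a Python sketch with invented names: the decisions live in a pure function, and a thin impure wrapper is the only part that touches the outside world.

```python
def format_receipt(items: list[tuple[str, int]]) -> str:
    """Pure: builds the text of a receipt from (name, cents) pairs."""
    lines = [f"{name}: {cents / 100:.2f}" for name, cents in items]
    total = sum(cents for _, cents in items)
    lines.append(f"TOTAL: {total / 100:.2f}")
    return "\n".join(lines)

def print_receipt(items: list[tuple[str, int]]) -> None:
    """Impure: the only place that touches the outside world."""
    print(format_receipt(items))
```

Nearly everything worth testing sits in `format_receipt`, which needs no setup and leaves no mess. The wrapper is small enough to check by eye.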

Most side effecting code is really quite reasonable. Aside from the aforementioned changes to the outside world, we often induce state changes in objects because:

  • We can’t just create a new one because the object is in scope to other code.
  • The change is dependent upon a lot of properties that make up the object’s current state – and who wants a dozen variables in a parameter list?

Even some of these cases can be resolved with design. Wide and flat can be your worst enemy.

In general though, one of the great advantages of pure functions is that they’re easier to test. And the tests themselves are much more likely to be testing the right things the right way.

Types anyone?

Types are cool. Typing is cool, too.

Wait? What?

“We prefer dynamically-typed languages because we were damaged by our experience with Java.” (This effect is considerably more acute among those who used Java before the introduction of generics.) “Types restrict you too much.” “I don’t like programming in B&D languages.” “Statically-typed languages are not as expressive.”

But whether we’re looking at going in a statically-typed direction (unlikely) or a gradually-typed direction (could be) or a predicate-based system (see http://learnyousomeerlang.com/dialyzer or https://clojure.org/guides/spec) it’s all about being able to provide more information about a program within a program. It also gives us a way to reject programs that are not coherent at a point before run time (which is usually a better time for it to fail). And it saves us from having to write a whole tier of tests we’d otherwise have to write.

And with a good type system, you might even get more information about the pieces of data you’re working on.
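As a small illustration in the gradually-typed direction, here’s a Python sketch. The `Percentage` type and `apply_discount` function are hypothetical. Python does not enforce the annotations at run time; a separate checker such as mypy can use them before the program runs, and the check in `__post_init__` is a predicate-style guard that fires when the value is constructed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Percentage:
    """A value that says more about itself than a bare float does."""
    value: float

    def __post_init__(self) -> None:
        # Predicate-style check: reject incoherent values at construction.
        if not 0.0 <= self.value <= 100.0:
            raise ValueError(f"not a percentage: {self.value}")

def apply_discount(price_cents: int, discount: Percentage) -> int:
    """The annotations tell readers (and a type checker) what goes where."""
    return round(price_cents * (100.0 - discount.value) / 100.0)
```

A whole family of “what if someone passes 150 here?” tests goes away, because a `Percentage` of 150 can’t exist.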

So…

All right. We’re not going to change our stacks radically any time soon – nor should we (necessarily). But we could use some techniques that are available, no matter what the stack looks like, to make things a little bit more reliable, a little bit easier (and hopefully quicker) to work on, and certainly easier to reason about. We can do it a lot or do it a little, anything should make us better. And that’s what matters.

Name Things! (2017-06-23)

One of the best Computer Science jokes (okay, go along with it for now…) goes like this:

There are only two hard problems in CS:

  • Naming things
  • Cache invalidation
  • Off by one errors

We’ve gone to great lengths to mitigate the effect of the last of these. Rarely do we use index-based, perhaps nested, loops to traverse the elements of an array. More often we’ll use a foreach construct or a comprehension or a map or whatever. It’s an improvement, it’s less error-prone and it reflects the semantics of what we’re trying to accomplish.
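For the record, here are the two styles side by side in Python, with made-up data.

```python
prices = [250, 199, 1200]

# Index-based: the bounds are ours to get wrong.
doubled_by_index = []
for i in range(len(prices)):
    doubled_by_index.append(prices[i] * 2)

# Comprehension: says "double each price" and has no index to be off by one.
doubled = [price * 2 for price in prices]
```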

Cache invalidation in our increasingly massively concurrent distributed age remains a problem – and a significant one. We tend to either err on the side of synchronization (old-style) or we just accept the notion of eventual consistency, accepting the idea that it can be worthwhile to trade the pure assurance of linearized operations for a little bit of timeline uncertainty and gain much speed (and, at least sometimes, simplicity) in return.

And then, perhaps the most fundamental scourge of them all: Naming things!

Yes, naming things is hard. And yes, there exist approaches like point-free style and the use of long sequences of anonymous callbacks to avoid doing it. There are a lot of instances where that approach makes sense: intermediate values, classic continuation-passing-style sequences and so on. And we’ve all (well, most of us) lived through various naming conventions that encode all sorts of information about an entity in its name; Hungarian without any particular Magyar influence. And don’t get me started about pattern-based naming conventions; ‘FixtureEntityFactoryManagerFactoryFactory’ may provide a lot of information but it’s unreadable – especially if it shows up more than once within a single field of vision.

But it’s still important to name things. Both values and functions. (No, not every single one. Of course not. But often more than we typically do.)

But how should we name things? Ah. It’s time to be controversial. While using a consistent naming format (‘getThingFromSomewhere’, ‘setThingToSomewhere’) is good from one perspective (leveraging an IDE, for example) it can be hard to read from the standpoint of having everything look alike. It helps when the semantics stand out. Objects have behavior associated with them; they are not just bags into which we put data bits (well sometimes they are, but bear with me). Naming behaviorally can greatly aid in understanding the intentions of a piece of code; one is typically much less likely to get lost in a landscape with distinguishable features! Consistency is more likely to be a hindrance than a help!! Exclamation points can be annoying!!!

(This is not to say that consistency with regard to how a piece of data functions is a bad thing; calling the primary key of a table called whatever “whatever_id” is valuable to be sure. But calling the data items in the table ‘whatever_this’, ‘whatever_that’ and ‘whatever_the_other_thing’ is probably visual pollution.)

So how should we name things? Name things for what they are. Name pure functions/methods for what they return. Name impure functions/methods for the side effect they have on the environment. Named things are worthwhile because they encapsulate semantics. Named things can exist in the (more limited) solution domain as opposed to the programming domain (which, by being more general, is more complex). It’s the power of abstraction – which really makes it all work. There’s a reason we don’t write so much assembler these days.
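A short Python sketch of that rule of thumb. The domain (storage units with a `reserved` flag) and both function names are hypothetical.

```python
# Hypothetical domain: storage units, with made-up field names.

def available_units(units: list[dict]) -> list[dict]:
    """Pure, so it's named for what it returns."""
    return [unit for unit in units if not unit["reserved"]]

def reserve_unit(unit: dict) -> None:
    """Impure, so it's named for the side effect it has."""
    unit["reserved"] = True
```

Reading a call site, `available_units(units)` tells you what you’re holding and `reserve_unit(unit)` tells you what just happened to the world.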

So names are important. Make them descriptive. Make them convey actual information about the domain they’re serving. Make them distinguishable.

And use them.