Monday, June 2, 2014

Compression Driven Development

Good Code is Hard to Measure

And all too often, even good programmers get into a battle of the-least-code-wins. And not to the point of code golf, I mean real production code that's tightened as much as possible.

Codebases all eventually get bloated and buggy and hard to navigate. We all know that less code means less code to trudge through, and fewer headaches. And of course, where bloated code is annoying, duplicated code is an inexcusable bug waiting to happen. This would seem to indicate that excess code causes all of our woes.

A recent blogpost touts these facts as the reasons for "Compression-oriented Programming," telling programmers to develop in this process:
Begin by just typing out exactly what [needs] to happen in each specific case, without any regard to “correctness” or “abstraction” ...when doing the same thing a second time somewhere else, ...pull out the reusable portion and share it.
But before we slip back into focusing on the battle of the least code, let's explore the consequences of taking this methodology too far.

What Does Bad Compression Look Like?

Let's start with some repetitive code, and let's see how badly this advice could be followed.
cairo_set_color(cairo, 0, 20, 20, 30);
cairo_line_to(cairo, 7, 8);
cairo_set_color(cairo, 0, 30, 30, 20);
cairo_line_to(cairo, 8, 9);
cairo_set_color(cairo, 0, 20, 20, 30);
cairo_line_to(cairo, 9, 10);
cairo_set_color(cairo, 0, 30, 30, 20);
cairo_line_to(cairo, 10, 11);

This code draws from 7,8 to 8,9 to 9,10 to 10,11 in alternating colors. It shouldn't be hard to figure that out, and it shouldn't be that hard to maintain.

But let's follow the compression-oriented programming guide and say that these eight lines are certainly doing something more than once. Let's see what we are doing repeatedly that we can streamline.

We are
  • setting the color to 0, 20, 20, 30 twice
  • setting the color to 0, 30, 20, 20 twice
  • drawing from x, y to x+1, y+1 three times
  • using 'cairo_set_color(cairo' four times
  • using 'cairo_line_to(cairo' four times.
So let's "clean this code up."
#define my_cairo_set_color(a, r, g, b) (cairo_set_color(cairo, a, r, g, b))
#define my_cairo_line_to(x, y) (cairo_line_to(cairo, x, y))
#define cairo_move_to_next_xy my_cairo_line_to(x, y); x++; y++;

void cairo_set_subtle_blue_grey() {
  my_cairo_set_color(0, 20, 20, 30);
}

void cairo_set_subtle_red_grey() {
  my_cairo_set_color(0, 30, 30, 20);
}

run() {
  int x = 7; int y = 8;
  cairo_set_subtle_blue_grey();
  cairo_move_to_next_xy;
  cairo_set_subtle_red_grey();
  cairo_move_to_next_xy;
  cairo_set_subtle_blue_grey();
  cairo_move_to_next_xy;
  cairo_set_subtle_red_grey();
  cairo_move_to_next_xy;
}

Once you understand the domain language, the 'cleaned up' code is certainly more literate. But at a huge cost! I have deliberately chosen bad abstractions, such as macros using invisible variables, and hidden alteration of state.

But our code in run() is still so redundant! And after all that work! Lets see if we can fix it in another pass of our compressor.
run() {
  int x = 7; int y = 8;
  for(int i = 0; i < 4; i++) {
    i % 2 ? cairo_set_subtle_red_grey() : cairo_set_subtle_blue_grey();
    cairo_move_to_next_xy;
  }
}

Now this is what we wanted! It might invoke runtime costs but hopefully our optimizing compiler will see through it. It might take a split second to determine whether red or blue will come first, and there's still hidden variables and hidden alteration of state, but overall it's much more compact.

I'm going to 'clean up' some duplication in our macros and helper functions, and then here is our final result.
#define call_cairo(fn) fn(cairo,
#define my_cairo_set_color(a, r, g, b) (call_cairo cairo_set_color a, r, g, b))
#define my_cairo_line_to(x, y) (call_cairo cairo_line_to x, y))
#define cairo_move_to_next_xy my_cairo_line_to(x, y); x++; y++;
#define select_hue(expected) (hue == expected ? 30 : 20)
#define HUE_RED 0
#define HUE_BLUE 1
#define HUE_GREEN 2

void cairo_set_subtle_grey_hue(int hue) {
  my_cairo_set_color(0, select_hue(HUE_RED), select_hue(HUE, GREEN), select_hue(HUE, BLUE)); 

void cairo_set_subtle_blue_grey() {
  my_cairo_set_subtle_grey_hue(HUE_BLUE);
}

void cairo_set_subtle_red_grey() {
  my_cairo_set_subtle_grey_hue(HUE_RED);
}

run() {
  int x = 7; int y = 8;
  for(int i = 0; i < 4; i++) {
    i % 2 ? cairo_set_subtle_red_grey() : cairo_set_subtle_blue_grey();
    cairo_move_to_next_xy;
  }
}

Hey! I Can See Through Your Tricks

The problem that I forced in my example compression is that I chose the wrong abstractions in the wrong order, and focused on the wrong things.

I compressed code paths that are incidental to the program, instead of paths that are inherent to the problem domain.

And, as a cherry on top, I compressed in ways that I knew would simply lead to another round of compression.

How extreme is this example? I don't think it's that extreme. Once the code has been split out to this level abstraction, I don't think any programmer is going to fix it any time soon. I've made a pretty tangled hairball in just 27 lines of code, who would risk fixing it if it were spread out across a hundred lines? A thousand? Five thousand?

That novice developer focused on removing duplicate code went too far, and nobody stopped him, and nobody will undo his mistake.

How it Should Have Ended, in Six Lines

int x = 7; int y = 8;
for(int i = 0; i < 4; i++) {
  i % 2 ? cairo_set_color(cairo, 0, 20, 20, 30) : cairo_set_color(cairo, 0, 30, 20, 20);
  cairo_move_to(x, y);
  x++; y++;
}

This code is more maintainable than the eight-line version, but it is also harder to read. Given how this repetitive code is rare, we've maybe turned an 8k simple-to-follow codebase into a 7k harder-to-follow codebase. Its not exactly worth a raise.

Abstraction Trees

Code is hierarchical, and that hierarchy often (though not always) needs to be understood in order to solve bugs, add new features, and refactor code. Any time you consolidate code you heighten the abstraction tree.

Following the abstraction tree of the good example, we must
  1. understand that we'll run the code four times
  2. understand the state change each time
  3. understand the color swap each time.
In our bad example, we must additionally
  1. understand that my_cairo_move_to uses & modifies state
  2. understand call_cairo's syntax trickery
  3. understand how hues in RGB work
  4. understand that select_hue chooses a high/low value to match

We've doubled what we need to know in order to follow the code from top to bottom, and that is a huge price to pay.

So when is compression-driven development worth pursuing? When is it an antipattern?

Compression-driven Development is an Antipattern

But not always.

You are turning yourself into a compression algorithm with the restriction of syntactically valid results. Don't do it.

Approach the problem of duplicated code as a domain-specific problem. Don't solve it with the first abstraction that comes to mind, solve it in the way that the problem-space is necessarily tied to.

Abstraction should be used when the behavior you are compressing is clearly part of the problem domain, and when the difficulty of traversing the abstraction tree from top to bottom is clearly lower than the cost of modifying duplicated code.

Compression-oriented programming can give you insights on good vs bad abstractions. Compression-driven development (where compression is the means as well as the end) is only ever incidentally good to a codebase at best, and actually encourages the creation of monolithic, bloated codebases at worst.

4 comments:

  1. I think you are slightly misinterpreting Casey's point of view, in his article Casey says:

    "And just to be clear, I mean semantically smaller, as in less duplicated or similar code, not physically smaller, as in less text, although the two often go hand-in-hand."

    ReplyDelete
  2. I like how you use already "compressed" code (those are already shared functions!), and then attempt to abstract it to a ridiculous degree, in order to prove that compression oriented programming is a bad idea.

    Realistically, what Casey would have done is almost certainly either left the code alone if that's all that was necessary, or if he needed to do those operations a significant number of times he likely would have done something closer to the for loop you made, though probably without the modulo, since modulo is pretty slow compared to other operations. You could do a pair of if statements in the loop to do the job in a way that runs faster.

    I can't speak to whether or not Casey's "Compression Oriented Programming" is worth it or not, I'm not a very experienced programmer (I don't write code for a living), but even I can smell the bullshit you tried to sling here.

    ReplyDelete
    Replies
    1. Firstly let me say that my example is contrived and you are correct to note that. But there is a problem -- you claim that my starting example is already "compressed." This is because it is an example. However, it would be quite easy to come up with a similar example that is far far longer and can be "compressed" in the same ways. Here's my try at it. Instead of cairo_set_color and cairo_line_to, what if the code were this:

      check_err_occured(err);
      IFERR fprintf(stderr, "Step %d failed!", step_number);
      IFERR return err;
      cairo_method_call_t method_call;
      method_call.name = CAIRO_SET_COLOR;
      method_call.arg1 = 7;
      method_call.arg2 = 8;
      method_call.errptr = &err;
      err = cairo_can_run(cairo, method_call);
      IFERR fprintf(stderr, "Step %d is malformed!", step_number++);
      IFERR return err;
      err = cairo_run_method(method_call);

      I count that as 12 lines per set_color or line_to. Now my original example is 8 * 12 = 96 lines of code. Lets then say that we do it forty times instead of just 4. That is now 960 lines of code, and begging to be compressed.

      I can then follow these same poor steps I detailed here. The end result has 26 lines of abstraction code which would be unchanged, and one line that contains the "repeated" part, which would now have to be essentially those 12 lines. Doing the math, we would have saved ourselves 922 lines of code.

      But it would still be crappy code...in fact, aside from the fact that the 960 line version is a pain to modify and being terribly bug prone, it is still way way way way easier to follow than the now 38 line of code version.

      The amount that the code is compressed or not is not the problem (otherwise the 36 line of code version would be indisputably better than the 960 line of code version). The problem is that the abstraction tree is deep and highly incidental. This has a cost. At some point, it is worth having an increased risk of copy-paste errors to avoid a cost of having your code being incredibly hard to follow. Focusing on "compression" is fundamentally the wrong goal, we should be focusing on making good abstractions.

      Delete
  3. I stand with Jeffrey Wells on that one... Casey isn't another automaton pushing a new paradigm down our throats and one which we should conform to and apply systematically. I do think he is arrogant at times having been fed up with the industry's standards, but when I watch Casey code, I truly get the impression of an artist working with clay. That is something refreshing!

    When you love to code and get bogged down by OOP paradigms, that although do seem to better organize thoughts and code bases, you get to a point where programming becomes less and less of an art form and more and more of a mechanical process devoid of the ability to mutate and adapt to the constantly evolving demands of a project. That is, IMO, a MAJOR hindrance to any productivity.

    I'm not yet sold on the idea of compression oriented design, and still have a strong tendency to write abstract interfaces and organize code in hierarchies. But to me, poor productivity and losing focus and creativity (and don't get me started on AP) by taking all these detours to organize code and ideas is far more damaging and should be the number one evidence something is wrong with the mainstream idea of what development is.

    Again, in my opinion, programming is a creative process very much organic in its nature and should be allowed to remain that way and not be restrained by over analysing. On the flip side, in no way should this be an excuse for lousy coding either! I think what makes a great programmer is one that can break the rules and at the same time still produce readable and maintainable code others can learn from and build upon. That, just like an artist, is true mastery of the craft!

    ReplyDelete