The Programmer's Paradox: 2007

Wednesday, December 26, 2007

A Silly Little Storm

This posting looks into a small and common problem, but one that frequently causes larger issues. Even the tiniest of problems, which starts as a small rock in a stream disrupts behind it an increasingly larger segment of water. This type of turbulence exceeds the sum of itself as more rocks combine together to cause larger and larger ripples. With enough rocks you get rapids; the whole being much larger than its pieces.

Sometimes when I am commenting on other writer's blog entries I just want to end my comment with a little bit of shameless self-promotion. Usually my name and then a link back to my blog to make it easy for readers to stop by and visit me:

Paul.
http://theprogrammersparadox.blogspot.com

These are my favorite last two lines in most comments.

Of course, to avoid mistyping the URL I generally go back to my blog's root page, highlight the URL, copy it and then paste that into the comment box a couple of times. Pasting decreases the likelihood of typos; usually a good thing.

Someone pointed out that in Blogger I should wrap the URL in an anchor tag so that it is easier for any readers to click on the link. No problem. That seems easy enough. The simplest way to accomplish this is to type in the starting tag '<a href=', paste in the URL, close it with '>'; then paste in the URL again -- to make it the visible text as well -- and finally add in the closing '</a>'. How hard is that?

In Blogger, when I go to the root page for my blog, I get an extra forward slash '/' character because it is at the 'top' level, and not pointing to any specific entry. If I selected a blog entry then the URL ends with the '.html' extension because it is directly referencing a file. But normally if I want the top level URL for my blog, I'll go to the main page and copy the URL with its extra forward slash at the end.

Funny enough, in Blogger there is this stupid little 'issue' where an anchor tag in the form of "<a href=XX />link</a>" that has an extra '/' at the end of the open tag works perfectly in the preview, but is slightly modified when actually posted. In the preview it appears as if the forward slash is ignored -- everything looks fine, which is no doubt a browser dependent behavior -- and FireFox interprets the tag as expected.

But once you post the comment:

<a href=http://theprogrammersparadox.blogspot.com/>
http://theprogrammersparadox.blogspot.com/</a>

it gets turned into:

<A HREF="http://theprogrammersparadox.blogspot.com" REL="nofollow"/>

Visually the hyperlink text disappears and any of the following text becomes the link instead. As well, the newlines get turned into breaks and there are paragraph markers inserted at the beginning and end of each paragraph.

I am guessing, but when you actually post the comment it is sent to the database for storage. On the way in or out, but usually in, the comment text is modified by some well-meaning, but obviously destructive code. That 'function' is looking to perform some modifications on the comment text; from above we see it added an REL option to the anchor tag; it also converted the tag to uppercase. Because it sees the anchor tag as unary, it seems to strip the orphaned closing '</a>' tag. It probably does lots of other stuff as well. The parsing for that function saw the close forward slash and assumed that it was part of the HTML tag syntax, unlike the browser that sees it as part of the earlier directory path syntax. The browser's parsing behavior is described as 'greedy' because it tries for the longest possible token, which in this case is the full directory path with the appended forward slash. The internal function on the other hand is probably ignoring the directory path syntax and is just looking directly for '<', '>' and any '/'s to modify them. Since it isn't fully parsing the HTML it doesn't have to understand it as well as the browser does.

So, wham, bam, I copy the URL into the text a couple of times, check it in the preview and kapow, I post it. "Oops, what happened there?" Suddenly the text and the link looks very different from the preview. If you combine both of the above issues that simple little extra '/' picked up in the URL causes quite an annoying problem, but not one that is detectable in preview. Everything looks great until you actually post. A very embarrassing error.

You may have noticed that I didn't wrap the href argument in double quotes. That certainly would have fixed the problem, but why toss in an extra set of double quotes when the preview has already shown that not having them will work properly. It is just extra stuff that doesn't seem necessary. Dropping the trailing '/' works too. But the 'problem' isn't what is actually typed, the problem is the preview. It is incorrectly verifying that stuff is working, when clearly it is not.

The cause here beyond my own sloppiness starts with two different points in history where at least two different programmers assigned two different meanings to the same underlying forward slash character: one as the separator between directory names in a path string, and the other as a modifier to the tag syntax to indicate a unary tag or an ending tag depending on placement. These two distinct meanings 'overload' the meaning of the character so that when combined in this specific circumstance the true meaning is 'nearly' ambiguous; one block of code sees it one way, while the other sees it differently. When we 'paste' together different 'idioms' that have been crafted over time, we often get impedance mismatches. Little blots of dangerous complexity.

Added to that, two different pieces of code are parsing the same HTML fragment very differently. However close Blogger's HTML parsing routine is to FireFox, it is different enough to cause problems. I could try testing this in Internet Explorer, but it doesn't really matter to me who is right and who is not following the standard. FireFox is being greedy, Blogger is not. More likely is that to greedily parse the directory path, Blogger's code would have to go to a huge amount of extra work, most of which was an unnecessary level of complexity beyond what they actually needed to manipulate the comments. They won't ever parse the HTML fragment correctly, because they do not need to go to that level of understanding.

What is important is that the same basic 'data' is being interpreted in multiple ways by multiple different programs and that while it is not ambiguous, it might as well be, since any two coders cannot and will not parse it identically. Partial parsing will always produce ambiguities. Overloading will too. The meaning of our data is so often relative to how we handle it in our code. We fail to separate the two; implicitly buring knowledge into some piece of code.

Topping it all off, some programmer implemented the preview function without fully understanding what he or she was doing. If there are 'translations' done in other parts of the system then it is exactly these 'translations' that need to exist in the preview function. Without them preview is useless. Incomplete. A 'mostly' preview function is a sloppy dangerous featurette that mainly misleads people and diminishes the tool. In the end had this been implemented correctly, none of the other issues would matter or be of any consequence. I would have seen the broken link in the preview and fixed it. Although earlier problems set up the circumstance, the 'buck' stops at the final and most obvious place.

We might quickly dismiss this type of issue as trivial -- perhaps I should just learn to quote my arguments correctly -- if it wasn't for the fact that this type of thing is standard all over our software. It is one hell of a turbulent stream of never-ending little issues that together keep this stuff from being trustworthy. I have the greatest mental tool known to mankind on my desktop, but I can't trust it to cut-and-paste properly? We've so many old legacy problems acting as rocks in the stream that our users are constantly drowning in rapids of our own making. And so often, all we ever do is tell them that they shouldn't have added that extra forward slash; "everybody knows you can't have extra forward slashes"; and we wonder sometimes why so many people hate computers?

A related theory is that you can judge the overall complexity of a 'machine' by the size of the effects caused by changes to its small details. In a simple system, small changes have only a small impact, but as the overall complexity grows chaos becomes more prominent so that small issues have progressively larger and larger impacts. The inherent complexity of the problem leads us to a base change effect of a specific size, but with every increasing dose of artificial complexity those effects become magnified. In a big system if you can draw some conclusions about the size of the initial problems vs. the size of their effects, over time you might get some gauge on how quickly the artificial complexity is accelerating within the system. This is important if you are interested in how many more changes you could make to the existing system without applying some type of major cleanup before it becomes likely that the changes will become impossible to make. Once it has passed some threshold, no significant change can be made to an overly complex system without it causing sever and damaging side-effects.

UPDATE: I'm not even going to try to put words to the amount of frustration I just experienced trying to move this piece between Google Docs and Blogger. The formatting, anchor tags, are not happy; consider yourself lucky if you can even read this...

Wednesday, December 19, 2007

The Nature of Simple

The phone rang; it was so long ago that I hardly remember it. It all comes back in that hazy historic flashback where you're never really sure if those were the original facts or just ones that got inserted later.

It was another support call. I was the co-op student back then; one of my duties was to handle the very low volume of incoming calls. For some software that might not be hard, but in this case, handling support for a symbolic algebra calculator was anything but easy.

This latest call was interesting. A Ph.D. candidate working on an economics degree was asking questions about simplifying complex equations. All in all, they were not hard questions to understand, not like the ones where I end up struggling to answer questions on how to relate some obscure branch of mathematics back into something that could be computed. Both ends of the question, the mathematics, and the computer science were often fraught with details and complexities of which I had no idea. An average question would send me back towards the university searching for anyone who could at least inform me of the basics. In those days, we didn't have Wikipedia, and certainly no World Wide Web. The Internet existed, but mostly as universities connected together with newsgroups and FTP sites. If you wanted to find out about something, that meant hitting the library and scouring through books in the old fashion way. So much easier today. But it does take some of the fun out of it.

The core of this specific set of economics-related questions had to do with simplifying several complex microeconomic models so that the results could be used in a textbook. The idea was to use the symbolic algebra calculator to nicely arrange the formulas to be readable. The alternative was a huge amount of manual calculations that opened up the possibility that they might be handled incorrectly. Mistakes could creep into the manipulations. Automating it was preferable.

Once the equations were inputted into the calculator, the user could apply a simplified function. It was fairly easy. The problem, however that prompted the phone call was that the simplified results were not the results that were expected.

In the software, there were lots of parameters that the user could control and some nifty ways to get into the depths to alter the overall behavior, but time has washed all of those details from my mind. What I do remember is that after struggling with it for an extended period of time, both myself and the grad student came to realize that what the computer intrinsically meant by 'simplifying' was not really the same as what any 'human' meant by it. We -- through some less-than-perfect process in our thinking -- prefer things to not be rigorously simplified. Instead, we like something that is close, but not exactly simple. Our rules are a little arbitrary, the algorithm in the software wasn't.

Some things hit you right away, while others take years of slowly grinding away in the background before you really understand their significance. This one was slow; very slow. Simplification, you see is just a set of rules to reduce something, but there is more to it than that. You simplify with respect to one and only one variable. There is a reason for that.

SIMPLIFICATION MADE EASY

In any comparison of multi-dimensional things, we cannot easily pin one up against the other. Our standard way of dealing with this for two dimensions -- for example -- is to take the "magnitude" of the vector representing some point on a Cartesian graph. That is a fancy way of saying we take the two dimensions and using some formula -- in this case magnitude -- apply it to both variables and then 'project' the results onto a single dimension; one variable. In two dimensions we like to use the length of the vector, but that choice is somewhat arbitrary. Core to this understanding is that we are projecting our multi-variables down to one single measure because that is the only way we can make an objective comparison between any two independent things. Only when we reached two discreet numbers, a and b, x and y, 34 and 23, can we actually have a fully objective way of saying that one is larger than the other.

Without a measure, you cannot be objective so you really don't know if you've simplified it or made it worse. In any multi-variable system of equations, you might use something like linear programming to find some 'optimal' points in the solution space, but you cannot apply a fixed simplification, at least not one that is mathematically rigorous. Finding some number of local minima is not the same as finding a single exact minimum value.

Simplification then is an operation that must always be done with 'respect' to a specific variable, even if that 'variable' is actually a 'projection' of multiple variables onto a single dimension. For everything that we simplify, it must be done with respect to a single value. For any multi-variable set, there must be a projection function.

Falling back outwards -- way away from mathematics -- we can then start to analyze things like that fact that I want to "simplify my life". On its own, the statement is meaningless because we do not know in regards to which aspect in my life I would like to make easier. If I buy more appliances, to reduce my working time -- say a food processor, for example -- I save time in chopping. But it comes at the expense of having to buy the machine, washing it afterward, and occasionally performing some type of maintenance to keep it in working order. It simplifies my life with respect to 'chopping time', but if you've ever tried to wash a food processor you may know that it takes nearly as long to get it clean as it does to actually chop something. For a 'single chop' it's not worth bothering about, only when there is a great deal of cutting, does it really make a difference. If you used it every time you cooked, it would hardly be simpler.

If we look at other things, such as a computer program, underneath there are plenty of variables to which we can simplify with respect to them. Overall, the nature of our simplification depends on how we project some subset of these variables down onto a single measure. This leads to some interesting issues. For instance, the acronym KISS, meaning "keep it simple stupid" is used frequently as dogma by programmers to justify some of their coding techniques. One finds in practice that because the number of variables and the projection function are often undefined, it becomes very easy to claim that some 'reduction' is simpler -- which it is in respect to a very limited subset of variables -- when in fact it is not simpler in respect to some other projection based on a larger number of variables. We do this all of the time without realizing it.

FILES VS. FUNCTIONS

My favorite example of this comes from an older work experience while working on a VMS machine. Back in those days, we could place as many C functions as we wanted into individual files. One developer, while working, came up with the philosophy of sticking one, and only one function into each file. This, he figured was the 'simplest' way to handle the issue. A nice side effect was you could list out all of the files in a directory and because the file name is the same as the function name, this produced a catalog of all of the available functions. And so off he went.

At first, this worked well. In a small or even medium system, this type of 'simplified' approach can work, but it truly is simplified with respect to so few variables that we need to understand how quickly it goes wrong. At some point, the tide turned, probably when the system had passed a hundred or so files, but by the time it got to 300, it was getting really ugly. On the older boxes, 300 things in a directory scrolled hideously across the screen; you needed to start and stop the paging on the terminal in VMS to see what was in the directory. That was a pain. Although many of the functions were related, the order in the 'dir' was not. Thus similar functions were hard to place together. All of this was originally for a common library, but as it got larger it became hard to actually know what was in the library. It became less and less likely that the other coders were using the routines in common. Thus the 'mechanics' of accessing the source became the barrier to prevent people from utilizing the source, which was rapidly diminishing the usefulness and work already put into it.

Oddly, the common library really didn't have that many things in common, not with respect to any library that we often see today. 300 functions that actually fit into about twelve packages was fairly small by most standards. But the 'simplification' of arranging the files enhanced the complexity of other variables, causing the library to be useless because of its overall extreme amount of complexity.

By now, most developers would easily suggest just collapsing the 300 files into 12 that were ordered by related functions. With each file containing between 10 and 30 functions -- named for the general type of functions in the file, such as date routines -- navigating the directory and paging through a few files until you find the right function is easy. 12 files vs. 300 is a huge difference.

Clearly, the solution of combining the functions together into a smaller number of files is fairly obvious, but the original programmer refused to see it. He had "zoomed" himself into believing that one-function-per-file was the 'simplest' answer. He couldn't break out of that perspective. In fact, he stuck with that awkward arrangement right up to the end, even when he was clearly having doubts. He just couldn't bring himself to 'complicate' the code.

SIMPLE DRIVING

This post was driven by my seeing several similar examples lately. It is epidemic in software development that we believe we are simplifying something when we are actually not. Not that we should confuse how humans like simplifications that are not as rigorous as mathematics. It is, however, that people tend towards simplifying with respect to far too few variables, and then they willfully ignore the side-effects of that choice.

I frequently come across documentation where someone is bragging about how simple their solution is when clearly it is not. That files vs. functions problem pops up its ugly head all over Computer Science in many different forms. Ultimately, lots of people need to get a better understanding of what it actually means to really simplify something. That way their code, their architectures, their documents, and all that they present will actually be better; not just simplified with respect to some artificially small set of variables; causing all of the other variables in the solution to become massively more complex.

When you go to simplify something, it is important to understand a) the variables and b) the projection function. If you don't, then you are simplifying essentially randomly. The expected results of that will be: random, of course. If you've spent time and effort to clean up your code, then having that come crashing down on you because you didn't see the whole problem is rather painful. Something to be avoided.

SIMPLICITY AND COMPLEXITY

To simplify something is actually quite a complex operation. To do it with many many variables takes a great deal of processing. Given a universe where the possible variables are essentially infinite, it actually becomes an incredibly challenging problem. However many of us can do it easily in our minds; it is one of the mysteries of human intelligence. With that in mind I saw this quote:

"Don’t EVER make the mistake that you can design something better than what you get from ruthless massively parallel trial-and-error with a feedback cycle. That's giving your intelligence much too much credit." -- Linus Torvalds.

To be fair this quote was preceded by a discussion of how the human body was the most complex thing we know. The trial-and-error feedback that is referenced is the millions of years that it took nature to gradually buildup the mechanics of our bodies. We should note that the feedback cycle has left its mark in terms of artificial complexity, particularly in funny things like our tail bones. We did have tails, so it is not hard to understand why we have tail bones, but we don't have them anymore. Also, the parameters of the feedback loop itself were subject to huge variations as many environmental and geological factors constantly changed over the millions of years. Those disruptions have left significant artificial complexity in the design. We are far from simple.

But it is not unlikely that as we gain more of an understanding of the underlying details, we are able to visualize abstract models of the human body and all of its inner workings in our minds. After all, we do that for many things currently. We work with lots of things at a specific higher-level of abstraction so that we are not impaired by the overall full complexity of their essence.

Eventually, one day we will reach a point where we understand enough of the biomechanics that a single human could draft a completely new design for the human body. For that design, I fully expect that they would be able to eliminate most of the unnecessary buildup of complexity that nature had previously installed. I give our ability to think abstractly a great deal of credit. But more importantly, I give any system of variable changing adaptations, such as evolution, even more, credit in its ability to create artificial complexity.

Applying this type of thinking towards things like software development can be interesting. Many people believe that a huge number of software developers working together on a project in a grand scale will -- through iteration if done long enough -- arrive at a stage in their development that was superior to that of just one single guiding developer. Their underlying thinking is that some large scale trial-and-error cycle could iterate down to a simplified form. If enough programmers work on it, long enough, then gradually over time they will simplify it to become the ultimate solution. A solution that will be better (simpler) than one that an individual could accomplish on their own. Leave it to a large enough group of people for long enough and they will produce the perfect answer. This I think is incorrect.

We know that with some large group of people, each individual will be 'simplifying' their portion of the overall system with respect to a different set of variables and a different projection function. Individual simplifications will collide and often undo each other. The interaction of many simplifications with respect to different variables is the buildup of complexity at the overlapping intersections. In short, the 'system', this trial-and-error feedback loop, will develop a large degree of artificial complexity deriving directly from every attempt at simplifying it.

It gets worse too. Unless the original was fairly close to the final simplified version, the group itself will drive the overall work randomly around the solution space. It is like stuffing hay into a burlap sack with lots of holes: the more you stick it in one side, the more it falls out the other holes. You will never win, and it will never get done. Such is the true nature of simplification. We've certainly seen this enough in practice, there are lots of examples out there in large corporations. In evolution, for instance, huge numbers of the earlier less optimal models get left lying around. Nature isn't moving towards a simpler model, its just creating lots of them that are very adaptive.

Oddly, only one human -- who can squeeze into their brain -- a large enough number of variables and process them, is the only real way into which we can find something that is close enough to the simplest version. But just one person. I once had a boss who adamantly claimed that "nothing good ever came from a committee", and in many ways because of the 'blendedness' of the various concepts of each of the members his statement rang very true. However, it just might be more appropriate to say "nothing simple ever came from a committee". If we want it to be simple in our human space or the mathematical one, it needs the internal consistency of having just sprung from one mind and preferably at one time. Once you get too big for a single human, it needs to be abstracted at multiple levels in order to fit properly. Strangely our knowledge itself gets built up into many layers of abstractions in a never-ending feedback cycle, but one in which only human intelligence can take advantage.

FINAL SUMMARY

Ok, time to wrap it up. We talk a good talk about simplifying things, but fairly often we are actually unsure of what that really means underneath. It is easy to wear blinders or get one's head fixed on a specific subset simplification, but it is very hard to see it across the whole. In that way, the term itself: 'simplify' is very dangerous. We use it carelessly and incorrectly when we do not affix it to a specific variable. We do not simplify things, we simplify them with respect to a specific measure. Once that is understood, you see why the road to complexity is paved with so many attempts to simplify problems; it is an important key to understanding how to build reliable computer software programs.

My final thought is that if the number of variables for use in simplifying something is infinite -- you can use one of those Turning inspired proofs about creating meta-variables to prove this -- this means that from a universal perspective, you could never get a 100% projection together. So this implies that ultimately you cannot absolutely simplify anything. There is no universal simplification. It just doesn't exist, possibly proving the statement "nothing is simple". Neat huh?

Friday, December 14, 2007

Fixing your Development Environment

In any work environment there are always circumstances that are a little less than perfect. However, if your job is not completely awful, then there is still potential. You just have to alter your environment enough to make it closer to your desires.

Some people have the pessimistic view that they are simply peons; it is all beyond their control. They must live with things the way they are. The truth however, is that each one of us has a huge influence over our own working environment, whether or not we really understand it. For this entry I want to examine some of the things that software developers can and should effect in their work places. We get out of our work what we put into it. Positive changes make work easier and make you feel better about the time you spend working.

Obviously, if you not the boss you can't just leap out and make huge sweeping changes. That doesn't mean you won't have any effect, it just happens slowly. Patience and perseverance are necessary. Continually making gradual small changes is the key to success. We always need to pick our battles very carefully, fixing each problem one by one. Nothing changes if you give up, accept the problems and then live with them. All your potential goes into reinforcing the status quo instead of improving it.

Wherever I go, I always start with trying to correct those problems that are closest to me. I like to start nearby and work my way out; getting the simple things fixed quickly, then gradually tackle larger and larger issues. This keeps up a positive momentum of successful changes.

For any rogue software development environments, the first big problem usually encountered is missing or poor source code control. Given the number of good software packages available for handling this, it is still a surprise to see development shops that don't use anything. Honestly, if you are not using any source code control, you are not seriously programming; you're just playing around. It's not an optional tool. It's not a pain. It always pays for itself.

This issue is much closer to home than actually designing or coding. Chaos with handling source code is far worse than bad source code. Fix this first or pay heavily for it later.

Set up a working scheme that protects the code, makes it easy to build and handles any secondary issues like sharing one code base for multiple instances. For complex projects expect a complex setup. In some cases the source code control software already exists but is being used poorly. Clean it up; fix any of the control problems. For very large teams using several hierarchical pools at different levels may be the best solution. You should be getting stability, reliability and traceability from your setup, anything less should be fixed.

The next most significant problem with most development is automating the build, release and update procedures. Depending on the length of each iteration in the project, manual procedures can be very expensive. Too much time is wasted and the results are inconsistent. Generally these days an IDE like NetBeans or Eclipse or a tool like Ant or Maven make automating the build procedure simple and painless, but few developers move on to the other two major automations.

Packaging for releases and upgrading are big issues, which should be fully automated to make sure there are no silly problems slipping in at the last moment. Inconsistent releases are embarrassing. For clients updating their systems, nothing looks worse than a long series of hard to follow instructions; more than one step is too many.

One-button releases that give the users one-button upgrades are the gold-standard of a development project. They are always possible, but few people invest enough effort to get there, choosing instead to waste more of their time in dealing with manual inconsistencies. You sleep far better at night if there is less to worry about, and messing up the final bits in a manual release is a frequent enough problem. You just need to get beyond your fear of spending too much time on this type of automation, it does prove itself to be worth it in the end. At minimum, for each release, allocate a fix amount of time -- say two weeks -- to trying to automate the big three processes. Gradually, you'll get it working, but even if its only at 80% it still pays off nicely.

Moving out a little bit, we often find in most development positions that we get responsibility for a previously existing part of the system. This 'legacy' problem is a pain, but it doesn't have to be horrible. A great way to learn the code and fix it at the same time is to apply a series of non-destructive refactorings to it. Generally, one wants to compress the code somewhat, but also handle the functionality in the most consistent manner. If you find that there are three ways in the code to handle something, harmonize on one. Fixing the consistency is necessary, breeds familiarity and will save lots of pain later. Updating the comments is another way to get through the code, making positive but non-destructive changes.

The worse part of any development job is flailing away at bad code; no matter who authored it. So it is important to make sure that all of the code you can change is simple, clean, consistent and readable and that any notes on understanding, fixing or extending it are are self-contained. The time we put into code formatting is minimal compared to the time we spend agonizing over problems. The two balance each other out; you'll always do less debugging if you start with better initial code. This is always true; for any system; for any type of code. Debugging is a very important skill to have, but it is something to be avoided wherever possible because it is very time consuming with little payoff.

If you've managed to get your environment under control, your code is clean and it is working reasonable well, then it is time to move on to the system as a whole.

If your project is large enough, there are several developers, some of whom are having problems with their own development pieces. Assisting another developer without getting in their way is a very tricky skill, but another one worth learning. Most development problems stem from sloppiness, so if they are having trouble with their logic the likely culprit is messy code. The most basic help is to gently convince them that a little more discipline goes a long way. Clean it up first, then debug it. Generally this is more of a problem for younger less experienced developers and high-stress environments.

Sometimes the problems come from trying to build something sophisticated without fully understanding the underlying model enough. We often have to jump into development years before we fully grasp the problem, but once the code is set -- even if it is obviously broken -- it becomes hard for most developers to admit their initial approach was flawed. They would rather struggle through with bad code than toss it and start over again. In a complex problem, if you don't have a simple clean abstraction, any amount of artificial structural fiddling will obscure the base structure, making it very hard to fix. Most developers will err to the side of making their implementations more complex than necessary. All of the extra 'if' and 'for' statements cloud the code and make it unstable. It is like stuffing hay into a sack with lots of holes; as you stuff it in one side, it falls out the other. It never ends. Refactoring could work, but only if it goes deep enough to really change the model. A clean rewrite of only the 'core' code is usually the best answer.

When sloppiness or overcomplexity aren't the main problems, overloading variables with several independent values is also very common. Splitting the information into two or more separate variables generally leads to a working solution. Its a relativity painless refactoring, which should be a standard skill for all programmers because overloading the meaning of their variables is such a common problem.

For any of these programming issues, the best assistance comes from getting the other programmer to use you as a sounding board for their code. As they try to explain how it works to you, they often see the error of their ways and how they should correct it. Expressing the solution verbally is a powerful tool. That, consistency, debugging and refactoring are the absolutely required skills for any good developer.

Looking further outwards at the bigger picture, we draw 'lines' in the code by breaking off the dependencies between the pieces; separating them into independent sections of code. If there is a clearly defined 'interface' that one piece uses to access the other piece, then there is a line between them. A system with few or no lines at all is a huge mess, usually referred to as a big ball of mud.

Balls of mud are a common reason that most large corporate in-house systems fail to work properly. 250,000 lines of code with no structure is a nightmare waiting to be deleted. Not to worry, one can add architecture after-the-fact, although it is more work than getting it right initially. The general idea is to draw enough lines in the system, partitioning the code into horizontal and vertical pieces. Breaking the program down into processes, threads, modules, libraries, components, sections, layers, levels, etc. draws important features into the architecture. The more pieces you have -- if they are not too small, and encapsulated within reason -- the better the architecture. When you can easily explain how the system works to another developer in a simple white board diagram in 5 minutes with a couple of diagrams, then you've probably drawn enough lines into the code to be useful.

Another reason to draw lines in your code is to swap out bad algorithms. If you know the code is doing something poorly, refactor it so that the whole algorithm is contained together in a single component or layer. Once you've isolated it, you can upgrade to a newer version with a better algorithm. Beyond the time for the work that is involved, it is boring yet not hard to do. But you need to pick your target selectively or you'll spend too much time refactoring code without enough effect. As an example, switching from coarse-grained locking to fine-grain locking in a web system, means burying all of the lock calls behind an API, then harmonizing the API arguments, then just swapping it out with a newer locking version. All of the related code needs to take and release locks, what happens underneath should be encapsulated away from the caller.

As a complete produce or tool, most systems are seriously in need of consistency and good graphic design. Consistency you can do yourself, and its always worth the effort. Your users will definitely notice and usually if you have contact with them, they will thank you. Hiring a graphic designer and an editor always pay for themselves in terms of the final appearance of the system. Too many software developers think their own poorly designed systems actually look good; users don't complain because it is not worth the effort and they doubt it will ever get fixed; so why waste the time.

If you've gotten this far in changing your working environment, the system is pretty good, so we can move on to bigger issues. This time we go to the users.

We build tools for our users. Our overall direction of the development is to build better tools. This means increasing your 'real' knowledge of the problem domain. Programmers always think they know more than they actually do, so being wary of this will lead you to start spending more time with the end-users to understand what they are actually trying to accomplish with their tools. Keep in mind that we may understand the type and frequency of the data, but that doesn't mean we really understand the problem. Complains are especially important because they highlight some part of the system that needs to be fixed. Rarely are the users actually wrong, it is the system that is stiff and hard to us. It is likely that you could find some tool that better fits the user's needs if you search harder enough for it. Just don't fall into the trap of completely trusting the user's own analysis of their own needs. They know what they want to do -- they are the experts on that -- but they don't really don't know which tools would best suit them, that should be your expertize.

If we come to understand how to build our system for the user, the next step is to branch out into creating more tools that solve their problems. There is an endless degree of trying to compact more and more advanced functionality into an underlying system without actually denigrating it. Too often the professional products 'arc' in their features, reaching a point where successive releases are actually worse than the their predecessors. It is a very complex problem to combine several tools together under the same 'roof' in a way that actually improves the functionality; just smashing them together is weak. Given that so many of our tools are stuck in silos, with poor ways to interact with each other, this is a huge problem in the industry that needs to be solved. Too much time is wasted by importing and exporting data through various different silos, when the computer as a tool is powerful enough to handle all of this for us transparently.

Building your own code is fun when you start, but as you've been developing longer and longer the real challenge comes from trying to direct larger and larger groups to build bigger systems. Bigger tools demand bigger teams. Three guys building a system is a very different arrangement problem than trying to do it with two hundred. Honestly, I know of no large shops of programmers that actually produce great or good works. Small teams craft the initial versions, then large teams slowly grind the code base into uselessness. Even if they have good people, a large team bends towards fugly code. We don't currently have any other methodologies or processes to get around this; it is another huge open problem in software development.

If there are many development efforts on-going, then trying to change them from giant brute-force exercises in re-writing the same code hundreds of times, into a leveragable abstract foundation for building a large series of complex tools is both a technical problem and a personal management problem. Technological issues are fun because they are isolated, but the really challenging problems come from mixing humans and technology; whether it is running the system or just building it. Master this and you've really mastered software development; just knowing how to whack out a couple of 'for' loops doesn't count.

That's enough of a general direction to get anyone from a simple development project into a grand plan to re-develop all of the companies software projects. As you can see, there is an awful lot of simple things that we can do to effect our environment and make it better. Patience is necessary, but we can all do it. You can make enough small changes so that work is is interesting, challenging, but not frustrating. Although it seems contradictory to what many of those evil management consultants are implying, work can be fun, rewarding and you'll actually get more done for your company. A positive workplace is always more effective.

Thursday, December 6, 2007

Pedantically Speaking

Spending your days programming a computer -- if you do it long enough -- will start to have a noticeable effect on your personality. Not that it is a big surprise, one's profession has always -- no matter how complex or simple -- gradually morphed their personality. If you think that all of the accountants you've met are basically the same, you're not that far off base.

Programming is a discipline where we bury ourselves in the tiniest of details. Perfection is necessary, at least to the degree that your work won't run on the computer if it is too sloppy, and it is exceptionally difficult to manage if it is even a bit sloppy. This drives in us, the need to be pedantic. We get so caught up in the littlest of things that we have trouble seeing the whole.

Most often this is a negative, but for this post I wanted to let it loose into its full glory. It is my hope that from the tiniest of things, we may be able to draw the greatest of conclusions.

Originally, the thoughts for this blog entry were a part of a comment I was going to make for another entry, but on reflection I realized that there was way too much of it to fit into a simple comment field. This post may be a bit long winded with lots of detail, but it is the only real way to present this odd bit of knowledge. You get some kinda award if you make it to the end in one piece.

I was going to be rigorous in my use of a programming language for any examples. Then I thought I might use pseudo code instead. But after I had decided on the opening paragraph, I figured that the last thing I actually needed was to be more rigorous. If I am going to be pedantic then the examples should be wantonly sloppy. Hopefully it all balances out. The best I can do for that is to mix and match my favorite idioms from various different programming languages. The examples might be messy -- they certainly won't run -- but hopefully beyond their usefulness there should be minimal extra artificial complexity. Straight to the point, and little else.

For me the classic example of encapsulation is the following function:

string getBookTitle() {
    return "The Programmer's Paradox";
}

This is an example of cutting out a part of the code into a function that acts as an indirect reference to the title of a book. The function getBookTitle is the only thing another programmer needs to add into their code in order to access its contents. Inside, encapsulated away from the rest of the world is a specific instance of 'book title' information that references a specific title of a book. It may also happen to be the title of a blog, but that is rather incidental to the information itself. What it has in common with other pieces of similar information out there is possibly interesting, but not relevant.

The title in this case both explains the underlying action for the function -- based on the verb: get -- but it also expresses a categorization of the encapsulated information: a book title. It may have been more generalized, as in getTitle, or less generalized, as in getMyBookTitle, but that is more about our degree of abstraction we are using then it is about the way we are encapsulating the information.

In a library, if the code were compiled and the source wasn't available, the only way other programmers could access this information is by directly running this function. This 'information hiding', while important is just one attribute of encapsulation. In this case my book title is hidden, but it is also abstracted to one degree as a generic book title. Used throughout a system, there is no reference to the author or any necessity to know about the author.

Now this "slice and dice" of the code is successful if and only if all of the other locations in the code access the getBookTitle function not the actual string itself. If all of the programmers agree on using this function, and they are consistent in applying that agreement, then we get a special attribute from this. If you consider that this is more work than having the programmers just type in the string each time they needed it, then that extra work should be offset by getting something special in return. In this case, the actual title string itself exists in one and only one place in the code, so it is a) easy to look up, b) easy to change consistently for the whole program and c) easy to verify correctness visually. All strong points that make this more than just a novel approach for coding.

So far, so simple. Now consider the following modification to the function:

string getBookTitle(string override) {
    if(override is NULL)
        return "The Programmer's Paradox";
    else
        return override;
}

This is a classic example of what I often refer to as partial encapsulation. While the base function is identical, the programmer has given the option to any other programmers to override it with their own individual title. Not a realistic example, you say? Well, if you consider how many libraries out there allow you to override core pieces of their information, you'll see that this little piece of code is actually a very common pattern in our software. In a little example like this, it seems silly to allow an override, but it is done so often that it always makes me surprised that people can't see that what they are doing is really isomorphic to this example. It happens frequently.

The beauty of encapsulation is that we can assume that if all is good in coding-land and everyone followed the rules, that we can easily change the book title to something else. If we rebuild then the impact of that change will be obvious and consistent. More great attributes of the work we did. The problem with partially encapsulating it however, is that these great attributes are no longer true. We no longer know were in the code programmers used the original string and where they chose to override it with one of their own. While we could use a search program like grep to list out all of the instances of the call and check each one manually, that is a lot of extra time required in order to fully understand the impact of a change to the code. At 4am in the morning, that makes partially encapsulating something a huge pain in the ass. We lose the certainty of knowing the impact of the change.

With the exception of still having the ability to at least search for all of the calls for the getBookTitle function, the partially encapsulated version is barely better than just having all of the programmers brute force the code by typing in the same string each time, over and over again. This is still one advantage left, but consider how many great attributes we lost by opening up the call. If it was just to be nice to the calling programmers and give them a 'special' feature then the costs do not justify the work.

We can move on to the next iteration of the function:

string getBookTitle() {
    file = openFile("BookInformation.cfg");
    hash = readPropertiesFromFile(file);
    closeFile(file);
    return hash{"BookTitle"};
}

Now in this example, we moved our key information out of the code into a location that is far more configurable. In a file it is easy for programmers, system administrators and sometimes the users themselves to access the file and change the data. I ignored any issues of missing files or corrupt data, they aren't relevant to this example, other than to say that the caller high up the stack is concerned about them. The BookTitle functionality is really only concerned with returning the right answer. Presumably the processing stops if it is unavailable.

When we push out the information to a file, we open up an new interface for potential users. This includes anyone who can or will access the data. A configuration file is an interface in the same way that a GUI is one. It has all of the same problems requiring us to check all of the incoming data in a bullet proof manner to make sure we are protecting ourselves from garbage.

More to the point, we've traded away our BookTitle information, which we've encapsulated into the file definition in exchange for the location of the file itself and the ability to correctly parse it. We've extended our core information.

In the outside world, the other programmers don't know about the file and if we are properly encapsulating this information, they never will. If we open up the ability to pass in an overriding config file, we are back to partially encapsulating the solution, but this time it is the config file access info itself, not the original title info that has been opened up.

Now the BookTitle is an accessible file that even if we set read only it is hard to actually determine if it was edited or not, so in essence it is no longer encapsulated at all. It should be considered to be publicly available. But, even though its actual value can no longer be explicitly controlled, its usage in the system is still fully encapsulated. The information is no longer hidden, but most of the encapsulation properties still hold. We have the freedom to handle any value, but the encapsulation on how it is used within the system. The best of both worlds.

Now we could consider what happens when we are no longer looking at a single piece of information, for example:

(title, font, author, date) = getBookInformation();

In this case, the one function is the source for multiple pieces of related information for a book. This has the properly of 'clumping' together several important pieces of information into the same functionality that in itself tends towards ensuring that the information is all used together and consistently. If we open it up:

(title, font, author, date) = getBookInformation(book_identification);

Then a programmer could combine the output of multiple different calls together and get them mixed up, but given that that would take extra work it is unlikely to end up being that way in production code. Sometimes the language can enforce good behavior, but sometimes all you can do is lead the programmer to the right direction and make it hard, but not impossible for them to be bad.

In this last case, the passed in key acts as a unique way of identifying each and every book. It is itself an indirect reference in the same way that the function name was in the very first example. Assuming that no real information was 'overloaded' into the definition of the key itself, then the last example is still just as encapsulated as the very first example in this post.

Passing around a key does no real harm. Particularly if it is opaque. If the origin of the key comes from navigating directly around some large data source, then the program itself never knows either the key or the book information.

It turns out that sticking information in a relational database is identical to sticking it in a file. If we get the key from a database, along with the rest of the information then it is all encapsulated in the calling program. There are still multiple ways to manipulate it in the database, but if at 4am we see the value in the database we should be safe to fully assume that wherever that key is used in the code it is that specific value, and that property is also true of all other values. We have enough encapsulation to know that the value of the data itself is not significant.

That is, if I see a funny value on the screen then it should match what is in the database, if not then I can make some correct assumptions about how the code disrupted it along the way.

There are many more examples that are similar, but to some degree or another they are all similar to encapsulation or partial encapsulation. We just need to look at whether or not the information is accessible in some manner and how much of it is actually accessible. Is all comes down to whether or not we are making our jobs harder or easier.

So far this has been fairly long piece, so I'll need to wrap it up for now. Going back over it slightly, we slice and dice our code to make it easier to build, fix and extend. Encapsulation is a way in which we can take some of the information in the system and put it out of harm's reach so that our development problems are simpler and easier to manage. Partial encapsulation undoes most of that effort, making our work far less productive. We seek to add as many positive attributes to our code as possible so that we can achieve high quality without having to strain at the effort.

We can construct programs that have little knowledge of the underlying data beyond its elementary structure. For that little extra in complexity, we can add many other great attributes to the the program, making it more dynamic and encapsulating the underlying information. The best code possible is always extremely readable, simple and easy to fix. This elegance is not accidental, it must be built into the code line by line. With the correct focus and a bit of practice programmers can easily build less complex and more robust systems. PS. I was only kidding about the reward, all you'll get from me is knowledge. Try the shopping channel for a real reward...

Wednesday, November 28, 2007

From Our Inner Depths

The biggest problem with the World Wide Web is that it is just too hard to find anything "unique" anymore. In an instant, you can gather together at least fifty hits for just about anything. Are there any two word phrases left that don't at least get one hit? It is hard to imagine.

Despite the existence of many other published uses, I'm still going to proposed a new definition for a pair of old words. Sometimes you can't let prior history hold you back, particularly when the terminology is so appropriate.

If you've been using software for a while, you've probably been there. It is version X of the program; some function you relied upon, while still in the system no longer works correctly. Sometimes it a bug, sometimes it is a legacy problem, but often it is just because the new programmers maintaining the code were clueless. They had no idea how the system is really used in the real world, so they removed or broke something vital. They get so busy adding new crappy features and ignoring the existing bugs that they don't realize how much they are diminishing the code base.

Software -- we know -- starts out needing a lot of features. In its life, often the big programs have many, perhaps hundreds of developers pouring over the code and enhancing it. As the development grows, the quality and functionality get better and better, until at some point it reaches its maximum. Never a good thing. It is an inverted curve, going up often quite quickly, but after the high there is no where else to go but down. Down, down and down.

Now you'd think in this day and age we'd know when not to mess with a good thing, but the horrible little side-effect of capitalism is that nothing can ever remain static. Ever. If it ain't growing, then it must be dying. Or at least not living up to its full revenue generating potential; often the same thing in business school.

And so, all of our favorite software packages 'arc'. Reaching their highs in some early release, only to be followed by a long slow steady decline into the abyss. If you've hung around long enough, you know this to be true. What goes up; as they like to say...

Back to my career in creative terminology.

You should always honor those that paved the way, it is only fitting. As we take in consideration the above issue about the lifespan of software we need to fall back a bit and consider the user's perspective. Each of us who have aged a bit, has had the misfortune of being bitten by some new dis-functionality in the software. While that may be interesting -- often depending on how pressed we are for time -- our reaction to finding this out is less than amiable. Like automotive drivers completely losing it on a busy highway, I've seen many instances of software users succumbing to "Word Rage" in my day. When the user shakes their fist at the screen, curses the maker of the product and vows to set forth as much negative karma as possible on those responsible for the atrocity, you know you are seeing a genuine instance.

Word Rage. Hits when that #$%$% auto-formating stuff re-arranges a perfectly good document. It hits when you try to turn off those brain-dead auto typing crapifiers. When the menus are truncated. When that hyperlink nonsense doesn't work. Any of it. When the backup copies are no longer saved, or the save files are no longer readable. When the documentation tells you nothing, and there are fifty million stupid meaningless options that are choking up the dialog; obscuring that one stupid thing that needs to be unset; again. Whenever that program carelessly wastes hours and hours of your time, while you were only trying to do something both obvious and practical that any decent minded programmer should have though necessary. Word Rage!

But alas, while I coined the term after the masters of inciting it, many many other software packages produce equally dramatic effects. Just recently, I found that the plugin I used in a well know photo manipulation program wouldn't work with the built-in batch capabilities. I couldn't create an action for changing either the height or width that depended on the orientation and that the plugin I used for saving wouldn't work in batch either. Of the four simple things I needed to automate, three of them concealed bugs, or at very least serious weaknesses. The coders allowed the necessary capabilities to be divested into the plugin facility, but failed to integrate that into the batch capabilities; making batch useless. At least we get nice spiffy plugins; that are only partly useless. Way to go guys!

I could spend all day ranting about this sort of stuff; there is enough for perhaps years and years worth of blogging. I just doubt anyone would read it; it gets old, fast.

Our software is full of inconsistencies, bad ideas, stupid mistakes and poorly thought out designs that make it all rather irritating to try and get anything done. I find I don't have any Word Rage if and only if I'm not actually trying to get anything accomplished. The stupid machines work fine, just so long as you don't touch em. In them good ole days we didn't have a lot of functionality, but at least if it existed you could probably trust it. Now we have the world, but we waste more time fiddling with silly problems. The tools often bite their masters.

Lately I've been swearing off technology; insisting that I'm going to switch to woodworking instead. I need to rely on something that works properly. When you get to the point were you think being a Luddite would actually help make "better" tools, you know you've been enraged once too often.

Wednesday, November 21, 2007

Mind the Gap

Sometimes to get a clearer perspective on a problem we need to leave its proximity and venture a little deeper into the woods.

For this particular post, we need to go way out into the forest before we can come back and make sense of what we are seeing. With a little patience and time, I think you'll find that the rewards from exploring are well worth the stumbling around in the bushes and trees.

As an abstraction, pure mathematics is nothing short of absolute perfection. It exists in its own idealistic dimension with no interference from our world in any way.

As such, arithmetic for example is always consistent no matter where you are in the universe. It never changes, it is never wrong, it cannot be reduced to anything lower and it cannot be refactored into anything simpler. Quite an accomplishment and a model of perfection. The same is true of many other branches of pure mathematics.

As more of the real world gets involved in the equations -- such as physics -- the details become messier, but with respect to what they subscribe to, it is still close to perfect. The details cause complexity, but it is inherent in the problem domain, and thus acceptable.

Going out even further to some mathematically based soft science such as Economics we see even more real world detail, but also another type of complexity that is derived directly from humanity. I'd call it "errors", but that's not really fair. It is better described as the "not perfectness" that is driven by human personalities, history, interaction and thinking. It is working knowledge that is full of artificial complexity; where some reduction always exists that could simplify it. Unlike inherent complexity, it is not necessary.

So, we have this point of perfection that we often drift way from because of the complexity of the underlying details and also because of our own introduced complexity. These points are all important.

There is nothing we can do about real world complexity. It is as complex as it needs to be in order to function. That complexity comes directly from the problem domain and is unassailable.

The complexity we have created ourselves, however, is more interesting. Underneath it lies some minimal real world complexity, but more because of how we got there the complexity we are dealing with is significantly larger. This leads to a "gap" between the perfect solution and the one that we currently have. It is an important gap, one that bears more digging and understanding.

Looking at software, Computer Science revolves around building tools to manipulate data. In both our understanding of the data and our understanding of the tools themselves we have significant room for introducing problems.

With a bit of practice, it is not hard to imagine the perfect tool manipulating the perfect data representation. We are always driven to complete tasks, so on a task by task basis we can close our eyes and picture the shortest path to getting the task finished. We can compare this perfect version of what we are trying to accomplish to the tools and data representations that we actually have intuitively measuring the gap between the two. And it is a surprisingly large gap.

I am often very aware of just how big of a gap. I was lucky to work on some very complex but extremely well written systems, so I have a good sense of how close we can really come to a perfection solution. I can guess at the size of the minimum gap, and while some gap must always exist the minimum can be way smaller than most people realize. There is a huge latitude for our software to improve, without actually changing any of the obvious functionality.

We don't need fancy new features, we just things that work properly.

Over time the trend that I have seen is for our software to get larger and more sophisticated. It is clearing improving, but at the same time however, the gap has also been steadily growing; wiping away many of the improvements.

Given our current level of technological sophistication, today's software should be far better than it actually is. We have the foundations now to do some amazing things but we don't have the understanding of the process to implement them. We have a huge gap. Our software is much farther way from its potential than it was twenty years ago. A trend that is only getting worse with time.

The size and growth rate of the gap are important. It not only gives us an indication of how well we have done with our science, but also what our future directions will be. Given that the gap is growing faster than ever, my virus-laden PC that is choked by bad software and malicious attempts at crime and profit comes as a real embarrassment to the software industry. In the past we have been lucky and the expectation of the users has diminished, but one should really brace themselves for a backlash. We can only make so many excuses for the poor behavior of our systems before people get wise.

For instance, a virus obviously isn't necessary to make a computer function properly. As such, it is clearly unnecessary complexity that comes from the wrong side of the gap. If things had been written better, creating a virus would be impossible. The gap widened to allow spyware and viruses to became reality; then it widened again to make anti-software a necessity. Now we consume mass amounts of human effort on problems that shouldn't exist.

Computers are deterministic machines. Specifically what I mean is that they behave in completely predictable ways and with some digging all of the visible actions of the machine should be explainable. It is surprising then to encounter experienced professionals who tell stories of their machines "just acting weird". If it were one or two people we might write it off as a lack of training or understanding, but who amongst us has not experienced at least one recent unexplained behavior with their PC?

More to the point, for professionals working for years, hasn't this been getting more and more common? All of these instabilities come from our side of the gap. Being deterministic, our computer systems could be entirely built to be explainable. They should just function. Weird behavior is just another aspect of the gap.

The gap is big and getting bigger, but is it really important? When we build tools, we need our foundations to be stable so that we understand the effects of the tool. Building on a shaky foundation makes for a bad implementation. The gap at the base only gets wider as you build on top of it. Nothing you build can shrink it. If it is too wide, you can't even bridge it. The usefulness of a computer and the quality of our tools is dependent on the size of the gap.

So, standing back here in the forest, we see where we are currently with our latest and greatest technologies. Somewhere over the next hill, down into the valley and up the other side is the place where we could be if we could find a way to minimize the existing gap. We can see it from here, but we can never get there until we admit that 'here' is not where we want to be. If we stay were we are the things we build only get farther away from our potential. We amuse ourselves with clever bits of visual trickery, but progress doesn't mean more dancing baloney, it really means closing the gap. Something we should do as soon as possible.

Wednesday, November 14, 2007

Acts of Madness

Just a quick posting:

What kinda madness would convince people to base Open Source projects on proprietary technology, particularly when the technology belongs to a company best-known for riping off other people's technologies?

How much pain do people self-inflict because they fail to learn lessons from history?

Apropos for this: "if you lie down with dogs, you get up with fleas".

Wednesday, November 7, 2007

Is Open Source hurting Consulting?

In an earlier blog entry entitled Is Open Source hurting Computer Science? I looked into whether the glut of Open Source code was causing problems for software developers. In the comments, Alastair Revell correctly pointed out that my arguments were biased towards shrink-wrapped software. While that represents a significant chunk of the software market, it is by no means the majority of development.

As I was pondering how to respond, I realized that the circumstances for in-house development and consulting were far more complex than for packaged software. It was a dense enough subject to make for a good solid blog entry; a chance that shouldn't be wasted.

We can start by noting that there is a trend away from in-house development. That's not a surprise. For decades the software industry has been selling companies on developing their own custom solutions, only to have the majority of the projects fail; this has left many organizations weary when it comes to development. It is expensive and prone to failure.

I believe that it was Frederick P. Brooks who predicted in the Mythical Man Month that shrink-wrapped software would replace in-house development, arguing that it would bring the costs down. While he was right about moving away from internal projects, it does seem as if consulting has been the big winner, not commercial packages.

This shouldn't come as a surprise; most companies still feel that a unique process is a competitive advantage, and since software implements that process it too must be unique. The extra costs are worth it.

With the diminishing of in-house development, the key focus for this entry is on how significant volumes of free Open Source software are effecting the consulting industry. For that we need to first go back in time a bit.

Consulting has seen significant changes in the last couple of decades. At the end of the dot-com boom, staff was so hard to find that many programmers cashed in by becoming independent consultants. Staffing big projects was extremely difficult, so the rates were incredibly high. With the bust, the jobs disappeared and left a lot of people scrambling for work, often forced to finding it outside of software. In the turmoil, most of the independents and small companies disappeared; leaving an industry dominated only by very large companies.

It is at this point that we need to examine the rise of Open Source code. In a market of only big companies, what is the effect of lots of free code?

For a typical consulting project, access to large amounts of code would appear to be a good thing. Projects take forever, so it vital to get a quick version of the system up and running. This calms the fears and makes it easy to extend the project. A glut of code means that it is far easier to whack together a fast prototype and extend it to meet the requirements.

Quality problems with the code you might assume would be bad, but in fact extending the project beyond its original tenure is the life blood of consulting. An endless need for small fixes is the perfect situation. A smart lawyer and a polite client manager are all that is needed to turn a short-term implementation contract in to a long-term cash cow, milking it for all it is worth. Low quality means more work.

All this might be fine, if it were that individuals could startup consulting companies and gradually morph them into full fledge software companies. Fresh blood entering into the market helps keep the prices under control and drives innovation.

With all of the freely available code you'd think it should be easy for a small consultant to compete? However, the industry contains many significant obstacles.

Everybody wants the latest technology, a target that is changing too fast for most programmers to keep pace with. As well, many of the newer technologies are so eccentric that you need to have experts that specialize only in them in order to get them to work. A basic development project may cover ten or twenty such technologies; too much for any one person.

This means that the glut of code lends itself to being assembled by groups of developers not individuals. Given the basic instabilities of the software, projects made up entirely of independent consultants can be quite volatile. They need significant and patient management to keep them working together.

Past experience with software management drives many firms away from that, choosing instead, the big consulting companies that can design, manage, implement and often operate their projects for them. Of course big firms always have a preference for big "suppliers", making it hard for the little guy to get an opportunity. Little firms go for shrink-wrapped systems.

You'd think companies would have learned, after all many of them are stuck with these new, fancy, sexy but highly defective systems. These have been added to their roster along side of the fragile client/server systems of the nineties, and their ever reliable, but slow and hugely expensive mainframes; the real workhorses of most corporate IT. A trifecta of defective technologies.

CIOs in big companies are drawn toward the big consulting firms for their big new shiny projects. These consulting companies promise quick development and fast turn-around, but sadly the working problems linger on forever. Dragging out the contract is standard. It is just "another" period in the software industry where the big money users -- usually large corporations -- are getting shafted. Another of many that have already occurred.

With so much money to be made, consulting firms no doubt love Open Source. It and any quality problems it has only help the bottom line. Bugs are profitable for those that make money fixing them.

It would be no surprise then that consulting companies don't want to change the circumstances. Innovation does not help. It is good right now. With no new competition, this is unlikely to change until companies wise up.

In an off way this could also explain the increasing popularity of the newer lightweight methodologies. If you want fast results and low quality, then you certainly don't want a whole lot of process slowing things down and accidentally making it better do you? A methodology designed around slapping together quick and dirty prototypes with the least amount of planning fits well into the model of "built it quick and fix it forever". It is a scary thought.

In my previous blog entry I had reached a point where I felt that the glut of code was killing innovation and making it hard for small companies to get into the market. In this entry, I feel that the same glut also makes it hard for individual consultants and small consulting companies to get into the market, thus killing any competition or innovation. Also, related or not, in-house development is dying; it is gradually being replaced by external projects. As things are profitable right now, innovation is an unnecessary obstacle. Not unsurprisingly this implies that too much free code is causing significant problems in all corners of the the software industry. Too much of a good thing is never a good thing, it seems.

Monday, November 5, 2007

The Art of Encapsulation

In all software projects we quickly find ourselves awash in an endless sea of annoying little details.

Complexity run amok is the leading cause of project drownings; controlling it is vital. If you become swamped, it usually gets worse before you can manage to get it under control. It starts as a downwards spiral; picking up speed as it grows. If it is not explicitly brought under control, it swallows the entire project.

The strongest concept we have to control complexity is "encapsulation". The strength behind this idea is the ability to isolate chunks of complexity and essentially make them disappear. Well, to be fair they are not really gone, just buried in a black box somewhere; safely encapsulated away from everything else. In this way we can scale down the problem from being a monstrous wave about to capsize us, into a manageable series of smaller waves that can be easily surmounted.

Most of us have been up against our own personal threshold for keeping track of the project details, beyond which things become to complex to understand. We known that increasing the size of the team can push up the ability to handle bigger problems, but it also increases the amount of overhead. Big teams require bigger management, with each new member becoming increasingly less effective. Without a reasonable means to partition the overall complexity, the management cross-talk for any significant sized project would consume the majority of the resources. Progress quickly sinks below the waves. Just adding more bodies is not the answer, we need something stronger.

Something that is really encapsulated, hides the details from the outside world. By definition, it has become simpler. All of the little details are on the inside, and none of them need to be known on the outside to be effective. In this way, that part of the problem has been solved and is really to go. You are free to concentrate on the other unresolved issues, hopefully removing them one-by-one until the project comes down to just a set of controlled work that can be completed. At each stage, the complexity needs to be found, dealt with and contained.

The complexities of design, implementation and operation are very different from each other. I've seen a few large projects that have managed to contain the implementation complexity, only to lose it all by allowing the operational complexity to get out of hand. Each part in the process requires its own understanding to control its own special complexity. Each part has its own way of encapsulating the details.

Successful projects get there because at every step, the effort and details are well understood. It is not that they are simpler, just that they are compartmentalized enough that changes do not cause unexpected complexity growth. If you are, for example adding some management capabilities to an existing system, the parts should be encapsulated enough that the new code does not cause a significant number of problems with the original code, but it should also be tied closely enough to it too leverage its capabilities and its overall consistency. Extending the system should be minimal work that builds on the existing code base. This is actually easy to do if the components and layers in the system are properly encapsulated; it is a guaranteed side effect of a good architecture. It is also less work and higher quality.

Given that, it is distressing how often you see architectures, libraries, frameworks and SDKs that should encapsulate the details, but they do not. There is some cultural aspect to Computer Science where we feel we cannot take away the choices of other programmers. As such, people often build development tools to encapsulate some programming aspects, but then they leave lots of the details free to percolate upwards; violating the whole encapsulation notion. Badly done encapsulation is not encapsulation. If you can see the details, then you haven't hidden them, have you?

The best underlying libraries and tools are ones that provide a simplified abstraction over some complicated domain, hiding all of the underlying details as they go. If we wanted to get down and dirty we wouldn't use the library, we would do it directly. If we choose to use the library, we don't want to have to understand it and the underlying details too. That's twice as much work, when in the end, we don't care.

If you are going to encapsulate something, then you really should encapsulate it. For every stupid little detail that you let escape, the programmers above you should be complaining heavily. If you hide some of the details, but the people above still need to understand them in order to call things properly, then it wasn't done correctly. We cannot build on an infrastructure that is unstable because we haven't learned how it works. If we have to understand it to use it, it is always faster in the long run just to do it ourselves. So ultimately building on a bad foundation is just wasting our time.

It is possible to properly encapsulate the details. It is something we need to do if we want to build better software. There is no need to watch your projects slip below the churning seas. It is avoidable.

Monday, October 29, 2007

Repeated Expressions

I'll apologize in advance. This blog entry gets a little weird, but it is the type of conceptual weirdness that comes directly from the underlying problems.

You may have to squint your eyes a bit to get a really fuzzy abstract view of reality in order to understand the basis, but I think that you'll find it is worth the effort.

Jumping right in, we start with the notion that users create piles of information that they beat on with their software tools.

That's not too hard to handle, but if you dig a bit deeper into the idea of how to "express" the tools themselves things can get interesting.

In a very abstract sense, a software design specification for a system defines a bunch of functionality operating on a set of data. Generally it is very imprecise, but it is a way of actually "expressing" a solution in one or more problem domains. The specification, which consists of various design documents with specific syntax represents a "language" capable of defining the capabilities. While it doesn't run directly, and it is generally very sloppy, it does define the absolute boundaries for the problem.

The code itself, once it is created is another way to express the solution to the same problem. It is far more specific and needs to be more correct, although rarely does it actually need to be perfect. Different versions of the system, as they progress over time are different ways of expressing a similar solution. Generally, for each iteration the depth of solution improves, but all of the solutions implemented were acceptable.

Testing -- even though in a philosophical way it is the reverse -- bounds the same problem domain, so it to is yet another way of expressing the same solution. In pencil drawing courses, they often do exercises with "negative space" by drawing the shapes where the subject is not located. Specifying testing is similar; you bounding the problem by the things that it should not do. Testing occurs at many different levels often providing a great degree of overlap in the expressiveness of the solution. Generally layer upon layer of tests are applied, regardless of how effectively they work.

In an ideal world, for a perfectly defined system the specifications, code and tests are all just different, but complete ways of expressing the same solution.

Given that hard to fathom perspective, we essentially solve the same problem at least three times during development, in three different ways. This gives rise to various speculations about what work we really don't need.

For instance, Jack Reeves felt that the code was the design not the specification. That perspective could lead to spending less effort on the initial design; correcting the problems only at the coding level.

Newer practices such as test driven development (TDD) imply that the testing could be used as the design. You layout the functionality by matching it back towards the tests. These are interesting ideas indirectly arising from the repetitive nature of development.

One easily senses that if could we only expressed the design twice instead of three times, we would be saving ourselves labor. If we only expressed it once, that would be amazing. It would seem as if we might be able to save ourselves up to two thirds of the work.

It is an intriguing idea, but highly unlikely for any thing over a medium sized project.

Some individuals can get away with solidifying their designs initially entirely within their consciousness. We know this. There are many programmers out there that for medium to small systems don't need to write it down on paper. Interesting, but we need to remember that such instantiation is just another valid form of expressing the solution, even if the original programmer cannot necessarily share that internal knowledge and understanding.

Even the small systems need at least two distinct versions to exist.

For large projects, the work needs to be distributed over a big group of people so it needs to be documented in a way that is sharable. Without that, all of the developers will go their own way which will always create a mess. As well, all of the testers will make their own assumptions which again will leave huge gaps in the quality of the testing.

Without a common picture chaos will rule and the project will fail.

Even though it sounds like a waste, specifying the same design in three different ways actually helps to insure that the final output is more likely to be correct. In the same way that you could enhance the quality of data entry by having three typists input the same file, iterating out the problem as a design, code and tests gives three distinct answers which can all be used to find the most likely one.

Even though we probably can't reduce the number of times we express the solution, understanding that we are repeating ourselves can lead to more effective ways of handing it.

Ok, so it is not all that weird, but it does put a different perspective on the development tasks.

In the end, we are left with a pile of data, some functionality that needs to be applied to that data and a number of different ways in which we have to express this in order to insure that the final work meets its initial constraints. We put this all together, expressing it in various ways until in the end it becomes a concrete representation of our near idea solution to the specific set of given problems.

Call it what you will, it all comes down to the various ways in which we express our solutions to our problems.

Tuesday, October 23, 2007

Goto The Next Goto

Goto statements raised angst because they were seen as the underlying cause of bad programming. Most believed that they accelerated the degree of 'spaghetti'-ness of the code. They allowed programmers to construct obscenely complicated logic that was nearly impossible for a human to follow, and consequently could never be fixed or expanded.

Getting rid of goto statements, or at lest discouraging their abuse was a very good idea.

Then along came pointers in C. The language was so accommodating, that the programmers were allowed to write all sorts of things that didn't work properly. You'd think we'd learn our lessons about abusing specific aspects of a language, but with each new generation of coders there seems to need to go back towards making a mess again.

So we went though a prolonged period of having hanging pointers and strange crashes. It was, some people felt, a step forward to get away from C and start using some of the newer languages. In time there were excellent tools written to find hanging pointer problems, but mostly the had came along to late to be effective.

The intent in Java was to protect the programmers from themselves. At least that was the original idea. In spite of that, there are at least two great popular messes in Java right now that are the continuing source of bad programming problems.

The first is threads. Few programmers really understand them, so much of the multi-threaded code is exceptionally poor and prone to glitches. When modern applications occasionally act a little funny, chances are your crossing some type of threading bug. By now I've seen so many, in so many well known applications, that it safe to say that multi-threading is probably the single worst language feature since C pointers.

The result is that we end up with lots of people shooting themselves in the foot in ways that are incredibly hard to detect and nearly impossible to reproduce. Bad code that's destined to never get found or fixed.

Another great mess are the anonymous callback functions in Java. Mixing that with some good olde async calls can make for a wonderful pile of spaghetti.

Syntactically, it is really ugly because you end up dropping a stray semi-colon into the middle of a code block. Semantically it is ugly because you stuffing declarations into places where it should be running code. And structurally it is truly hideous because the flow of control is bouncing all over the place into a multitude of little functions. Nothing taste more like spaghetti than being able to fragment the logic into a huge number of different locations. It is the power of a "goto" all over again.

We not too clever in Computer Science. We keep reinventing new ways to make bad code. You'd think by now we would have developed an attitude close to that of Niklaus Wirth, whose aim I think, with languages like Pascal, Module-2 and Oberon is to tightly control the code so that there is less room for badly written software. What is the point of having access to this huge computer tool if you keep opening holes around it to shoot yourself in the foot. Shouldn't we be using that computational power to help us find our problems before we release them? Do we really like pain that much that we have to do it all ourselves and not take advantage of our tools?

Monday, October 22, 2007

First Principles and Beyond

My brain uses an efficient algorithm for garbage collection. Sometimes, it can be a little too efficient. I'll be in the middle of using some fact and then "poof" it is gone.

"Gee, thanx brain! Now I'll have to ask for that phone number again, ... for the third time". Wetware can be such a pain.

Besides the obvious disadvantages of having such quick fact cleanup there are in fact a few bright sides to limited short-term retention. One important side effect is that I am constantly forced to go back to first principles, mostly because I've forgotten everything else. Re-examining the same ground over and over breeds familiarity. You not only come to understand it, but you also come to simplify that understanding; one of the harder things to do well. Because I can remember so little, everything must be packed as tightly as possible or else it is forgotten.

That perspective is interesting if I used it to examine something such as software development. We always build up our knowledge overtime, sometimes to the degree where we make things more complex than necessary. Sorting back through our understanding is an important exercise.

Building software can be a very complicated pursuit, but underneath creating software is actually fairly simple. We are doing nothing more than just building tools to manipulate data. In spit of all of the fancy features and appearances, when you get right down to it, the only things a computer can actually do are view, add, delete and edit data.

Playing music on a portable mp3 player is a good example. The computer copies the data from its internal storage over to some circuitry that generates the currents for the sounds. We hear the currents vibrating through the speakers, but the computer's actual responsibility stopped just after it copied over the information. When it is no longer just a collection of binary bits, it is no longer a computer that is doing the work.

We can go deeper. When you get right down to it, all software can be seen as nothing more than just algorithms and data working together within some framework. At the higher level we perceive this as software applications with various functionality, but that is an illusion. Software is deceptively simple.

We can take advantage of this simplicity when we build systems. Orienting our understanding of what we build based on the data it manipulates simplifies the overall design for a system. The data is the key. There can be a great deal of functionality, but it is always restricted by the underlying range of available data.You have to get the data into the system first, before you can start manipulating it.

One key problem with badly written software code is the amount of repeated data in the system. Big balls of mud and poor architectures are identified by the redundant copying and translating of the same data around the system in a huge varieties of ways. This not only wastes CPU and memory, but it also makes understanding and enhancing the system difficult. A sense of complexity can be measured from the volume of redundant code within a system. If, with enough time you could delete half the code, yet keep the same functionality you have a system in need of some serious help.

Understanding the importance of data not only effects the code, but it translates all of the way up to the architecture itself. The blueprints that define any system need only consist of outlining the data, algorithms and interfaces within the system. Nothing more.

In most cases, the majority of the algorithms in system are well defined and come from text books. Probably less than 20% require anything other than just a brief reference, such as "that list is 'sorted'".

Interfaces are frequently the worst handled part of modern software development, generally because only the key ones are documented. The rest are just slapped together.

Anywhere that people interact with the computer should be considered an interface, which not only includes GUIs and command-lines, but also configuration files, libraries and your database schema. Rampant inconsistency is a huge problem these days. Consistency at all levels for an interface reduces its size by at least an order of magnitude. Most interfaces would be just as expressive, at less than a quarter of their size.

For large projects the most natural lines for subdividing the work are the natural boundaries of the data. Those boundary lines within the system -- either horizontal (components) or vertical (layers) -- themselves should be considered interfaces. Specifying them as part of the architecture ensures that the overall coherency in the system remains intact.

Within parts of the system there may also be boundary lines to delineate different functionally independent components. This helps for iterative development as well as supporting team efforts. If the lines are clear, it is easy to pick a specific set of functionality, such as a resource manager, and replace it with another more advanced one. Growing the system over a series of iterations reduces a large amount of development risk at the cost of a relatively small amount of extra work. When compared to going completely off course and having to start over, iterative development is extremely cost effective.

Orientating the development around the data not only makes the code cleaner and more advanced; it also provides the lines to break down and encapsulate the components. But, even more critically, it makes analyzing the underlying problem domain for the tool much simpler as well.

When you understand that the users need a tool to manipulate only specific subsets of data in very specific ways, that drives the way you analyze the problem. You don't get caught worrying about issues that are outside the boundary of the available data. They are outside of the software tool's scope.

In looking at the users, you see the limited amount of data they have available. They want to do things with their data, but only a limited number of those things makes sense and are practical. Technology problems become so much simpler when seen this way. If you can't acquire the data to support the functionality, you could never implement that code. You don't need to get lost in a sea of things that are impossible.

Getting back to data. The great fear with programmers specifying their systems before they write them is that they feel this will make their jobs trivial. Some think that they can stumble around in the dark, and eventually get to the right system. Other just want to build fast and light in the hopes that over time they find the right tools.

Truthfully, you cannot solve a complex problem without first thinking deeply about it. The fruits of your thinking are the design of the system. If you don't want to waste your time, you need to have worked through all of the critical issues first. If you don't want to waste your thinking, you need to get that information down into a design.

An endlessly large list of requirements and functionality isn't a coherent design.

A decent architecture of data, algorithms and interfaces shouldn't be a massive document. In fact, for simple systems, it might even be just a few pages. You can refer to standards and conventions to save time in re-specifying facts which are well known. Nobody reads the obvious, so why waste time in writing it.

Such a design, if it breaks the system down into clear parts with well-known lines, doesn't remove the joy of programming. Instead it instills confidence in the initial labors, and confidence that the project will be able to succeed with few surprises.

Instead of making programming a drudgery, it actually makes it more fun and less painful. I've learned this the hard way many a time, but thanks to my efficient garbage collection, I've also forgotten it a few times already.

The simplicity underlying software is the key to being able to build increasingly complex systems. Too often we get lost in the little details without being able to see the overall structure. Most developers still oriented their understanding of software around its implemented functionality, but this just illuminates the inner workings in its most complex fashion. The data for most systems is simple and it absolutely bounds the code, so orienting your perspective on it simplifies your issues and broadens your overall understanding. We really don't want to make our lives more complicated than they need to be; that threatens our ability to be successful. I've learned that with my limited memory the only real problem I can afford to face in a huge system is the complexity of the system itself. To keep it manageable I need to continually simplify it. To keep it simple I often need to go back to first principles and find the right perspective. It is amazing how much of the complexity can just fall away with the right vantage point.