Post scarcity one: It's not the size that matters

by Simon Brooke


Auchencairn, Scotland, Feb 27, 2006

For years we've said that our computers were Turing equivalent, equivalent to Turing's machine U. That they could compute any function which could be computed. They aren't, of course, and they can't, for one very important reason. U had infinite store, and our machines don't. We have always been store-poor. We've been mill-poor, too: our processors have been slow, running at hundreds, then a few thousands, of cycles per second. We haven't been able to afford the cycles to do any sophisticated munging of our data. What we stored - in the most store intensive format we had - was what we got, and what we delivered to our users. It was a compromise, but a compromise forced on us by the inadequacy of our machines.

The thing is, we've been programming for sixty years now. When I was learning my trade, I worked with a few people who'd worked on Baby - the Manchester Mark One - and even with two people who remembered Turing personally. They were old then, approaching retirement; great software people with great skills to pass on, the last of the first generation programmers. I'm a second generation programmer, and I'm fifty. Most people in software would reckon me too old now to cut code. The people cutting code in the front line now know the name Turing, of course, because they learned about U in their first year classes; but Turing as a person - as someone with a personality, quirks, foibles - is no more real to them than Christopher Columbus or Noah, and, indeed, much less real than Aragorn of the Dunedain.

In the passing generations we've forgotten things. We've forgotten the compromises we've made; we've forgotten the reasons we've made them. We're no longer poor. The machine on which I'm typing this - my personal machine, on my desk, used by no-one but me - has the processor power of slightly over six thousand DEC VAXes; it has the one hundred and sixty two thousand times as much core store as the ICL 1900 mainframe on which I learned Pascal. Yet both the VAX and the 1900 were powerful machines, capable of supporting dozens of users at the same time. Compared to each individual user of the VAX, of the 1900, I am now incalculably rich. Vastly. Incomprehensibly.

And it's not just me. With the exception of those poor souls writing embedded code for micro-controllers, every programmer now working has processor and store available to him which the designers of the languages and operating systems we still use could not even have dreamed of. UNIX was designed when 32 bit machines were new, when 16,384 bytes was a lot of memory and very expensive. VMS - what we now call 'Windows XP' - is only a few years younger.

The compromises of poverty are built into these operating systems, into our programming languages, into our brains as programmers; so deeply ingrained that we've forgotten that they are compromises, we've forgotten why we chose them. Like misers counting grains on the granary floor while outside the new crop is falling from the stalks for want of harvesting, we sit in the middle of great riches and behave as though we were destitute.

One of the things which has made this worse in recent years is the rise of Java, and, following slavishly after it, C#. Java is a language which was designed to write programs for precisely those embedded micro-controllers which are still both store and mill poor. It is a language in which the mind-set of poverty is consciously ingrained. And yet we have adopted it as a general purpose programming language, something for which it is not at all suitable, and in doing so have taught another generation of programmers the mind-set of poverty. Java was at least designed; decisions were made for reasons, and, from the point of view of embedded micro-controllers, those reasons were good. C# is just a fit of pique as software. Not able to 'embrace and extend' Java, Microsoft aped it as closely as was possible without breaching Sun's copyright. Every mistake, every compromise to poverty ingrained in Java is there in C# for all the world to see.

It's time to stop this. Of course we're not as wealthy as Turing. Of course our machines still do not have infinite store. But we now have so much store - and so many processor cycles - that we should stop treating them as finite. We should program as if we were programming for U .

It's not the size that matters

So let's start with what we store, what we compute on: values. For any given column within a table, for every given instance variable in a class, every record, every object is constrained to have a value with a certain format.

This is, of course, historical. Historically, when storage was expensive we stored textual values in fields of fixed width to economise on storage; we still do so largely because that's what we've always done rather than because there's any longer any rational reason to. Historically, when storage and computation were expensive, we stored numbers in twos-complement binary strings in a fixed number of bytes. That's efficient, both of store and of mill.

But it is no longer necessary, nor is it desirable, and good computer languages such as LISP transparently ignores the difference between the storage format of different numbers. For example:

(defun factorial (n)
  (cond 
    ((eq n 1) 1)
    (t (* n (factorial (- n 1))))))

;; a quick way to generate very big numbers...                            

We can add the value of factorial 100 to an integer, say 2, in just the same way that we can add any other two numbers:

(+ (factorial 100) 2)    

933262154439441526816992388562667004907159682643816214685929638952 175999932299156089414639761565182862536979208272237582511852109168 64000000000000000000000002

We can multiply the value of factorial 100 by a real number, say pi, in just the same way as we can add any other two numbers:

(* (factorial 100) pi)        

2.931929528260332*10^158

The important point to note here is that there's no explicit call to a bignum library or any other special coding. LISP's arithmetic operators don't care what the underlying storage format of a number is, or rather, are able transparently to handle any of the number storage formats - including bignums - known to the system. There's nothing new about this. LISP has been doing this since the late 1960s. Which is as it should be, and, indeed, as it should be in storage as well as in computation.

A variable or a database field (I'll treat the two as interchangeable, because, as you will see, they are) may reasonably have a validation rule which says that a value which represents the longitude of a point on the Earth in degrees should not contain a value which is greater than 360. That validation rule is domain knowledge, which is a good thing; it allows the system to have some vestige of common sense. The system can then throw an exception when it is asked to store 764 as the longitude of a point, and this is a good thing.

Why then should a database not throw an exception when, for example, a number is too big to fit in the internal representation of a field? To answer, here's a story I heard recently, which seems to be apocryphal, but which neatly illustrates the issue just the same.

The US Internal Revenue Service have to use a non-Microsoft computer to process Bill Gate's income tax, because Microsoft computers have too small an integer representation to represent his annual income.

Twos complement binary integers stored in 32 bits can represent plus or minus 2,147,483,648, slightly over two US billion. So it's easily possible that Bill Gates' income exceeds this. Until recently, Microsoft operating systems ran only on computers with a register size of 32 bits. Worryingly, the default integer size of my favourite database, Postgres, is also 32 bits.

This is just wrong. Nothing in the domain of income places any fixed upper bound on the income a person may receive. Indeed, with inflation, the upper limit on incomes as quantity is likely to continue to rise. Should we patch the present problem by upping the size of the integer to eight bytes?

In Hungary after the end of World War II inflation ran at 4.19 ? 1016 percent per month - prices doubled every 15 hours. Suppose Gates' income in US dollars currently exceeds the size of a thirty two bit integer, it would take at most 465 hours - less than twenty days - to exceed US$9,223,372,036,854,775,808. What's scary is how quickly you'd follow him. If your present annual salary is just thirty three thousand of your local currency units, then given that rate of inflation, you would overflow a sixty-four bit integer in just 720 hours, or less than a month.

Lots of things in perfectly ordinary domains are essentially unbounded. They aren't shorts. They aren't longs. They aren't doubles. They're numbers. And a system asked to store a number should store a number. Failure to store a number because it's size violates some constraint derived from domain knowledge is desirable behaviour; failure to store a number because it size violates the internal storage representation of the system is just bad, outdated, obsolete system design. Yes, it's efficient of compute power on thirty-two bit processors to store values in thirty-two bit representations. Equally, it's efficient of disk space for a database to know in advance just how mush disk it has to reserve for each record in a table, so that to skip to the Nth record it merely has to skip forward (N * record-size) bytes.

But we're no longer short of either processor cycles or disk space. For a database to reject a value because it cannot be stored in a particular internal representation is industrial archaeology. It is a primitive and antiquated workaround from days of hardware scarcity. In these days of post-scarcity computing, it's something we should long have forgotten, long have cast aside.

This isn't to say that integers should never be stored in thirty-two bit twos complement binary strings. Of course they should, when it's convenient to do so. It's a very efficient storage representation. Of course, when a number overflows a thirty two bit cell, the runtime system has got to throw an exception, has got to deal with it, and consequently the programmer who writes the runtime system has still got to know about and understand the murky aspects of internal storage formats.

Perhaps the language designer, and the programmer who writes the language compiler should, too, but personally I don't think so. I think that at the layer in the system - the level of abstraction - at which the compiler writer works, the operator 'plus' should just be a primitive. It takes two numbers, and returns a number. That's all. The details of whether that's a float, a double, a rational or a bignum should not be in the least relevant at the level of language. There is a difference which is important between a real number and an integer. The old statistical joke about the average family having 2.4 children is funny precisely because it violates our domain knowledge. No family has 2.4 children. Some things, including children, are discrete, however indiscreet you may think them. They come in integral quantities. But they don't come in short quantities or long quantities. Shorts and longs, floats and doubles are artefacts of scarcity of store. They're obsolete.

From the point of view of the runtime designer, the difference between a quantity that can be stored in two bytes, or four, or eight must matter. From the point of view of the application designer, the language designer, even the operating system designer, they should disappear. An integer should be an integer, whether it represents the number of toes on your left foot (about 5), the number of stars in the galaxy (about 1x1011) or the number of atoms in the universe (about 1x10 79). Similarly, a real number should be just a real number.

This isn't to say we can't do data validation. It isn't to say we can't throw a soft exception - or even a hard one - when a value stored in a variable or field violates some expectation, which may be an expectation about size. But that should be an expectation based on domain knowledge, and domain knowledge alone; it should not be an expectation based on implementation knowledge.

Having ranted now for some time about numbers, do you think I'm finished? I'm not. We store character values in databases in fields of fixed size. How big a field do we allocate for someone's name? Twenty four characters? Thirty-two? We've all done it. And then we've all found a person who violates our previous expectation of the size of a name, and next time we've made the field a little bigger. But by the time we've made a field big enough to store Charles Philip Arthur George Windsor or Sirimavo Ratwatte Dias Bandaranaike we've negated the point of fixed width fields in the first place, which was economy. There is no natural upper bound to the length of a personal name. There is no natural upper bound to the length of a street address. Almost all character data is a representation at some level of things people say, and the human mind doesn't work like that.

Of course, over the past fifty years, we've tried to make the human mind work like that. We've given addresses standardised 'zip codes' and 'postcodes', we've given people standardised 'social security numbers' and 'identity codes'. We've tried to fit natural things into fixed width fields; we've tried to back-port the inadequacies of our technology onto the world. It's stupid, and it's time we stopped.

So how long is a piece of string? How long is a string of characters? It's unbounded. Most names are short, because short names are convenient and memorable. But that does not mean that for any given number of characters, it's impossible that there should be something with a normal name of that length. And names are not the only things we store in character strings. In character strings we store things people say, and people talk a lot.

At this point the C programmers, the Java programmers are looking smug. Our strings, they say, are unbounded. Sorry lads. A C string is a null terminated sequence of bytes. It can in principle be any length. Except that it lives in a malloced lump of heap (how quaint, manually allocating store) and the maximum size of a lump of heap you can malloc is size_t, which may be 231, 232, 263 or 264 depending on the system. Minus one, of course, for the null byte. In Java, similarly, the size of a String is an int, and an int, in Java, means 231.

Interestingly, Paul Graham, in his essay ' The Hundred Year Language', suggests doing away with stings altogether, and representing them as lists of characters. This is powerful because strings become S-expressions and can be handled as S-expressions; but strings are inherently one-dimensional and S-expressions are not. So unless you have some definite collating sequence for a branching 'string' it's meaning may be ambiguous. Nevertheless, in principle and depending on the internal representation of a CONS cell, a list of characters can be of indefinite extent, and, while it isn't efficient of storage, it is efficient of allocation and deallocation; to store a list of N characters does not require us to have a contiguous lump of N bytes available on the heap; nor does it require us to shuffle the heap to make a contiguous lump of that size available.

So; to reprise, briefly.

A value is just a value. The internal representation of a value is uninteresting, except to the designer and author of the runtime system - the virtual machine. For programmers at every other level the internal representation of every value is DKDC: don't know, don't care. This is just as true of things which are fundamentally things people say, things which are lists and things which are pools, as it is of numbers. The representation that the user - including the programmer - deals with is the representation which is convenient and comfortable. It does not necessarily have anything to do with the storage representation; the storage representation is something the runtime system deals with, and that the runtime system effectively hides. Operators exposed by the virtual machine are operators on values. It is a fundamental error, a failure of the runtime designer's most basic skill and craft, for a program ever to fail because a value could not be represented in internal representation - unless the store available to the system is utterly exhausted.

Ends. | [NITF] | Link this story: Del.isio.us | Digg it! | Google bookmarks | Reddit | Stumble Upon!

Rate this story

Respond to this article