Counting cycles

I picked up a little buzz about Google software engineers planning to rework the guts of some major open-source software to make it run faster. Since it wasn’t software I use, I didn’t read enough to remember what software, but it brought up memories…

Walking to school in the snow, 3 miles, uphill, both ways

No, this isn’t going to be one of those posts. I only wish we’d had the kind of raw processing power in my early years (decades?) as a systems analyst/programmer that we take for granted now. Most people today spend more time on what needs to be done, and that’s as it should be.

This is just a little harmless nostalgia, none of it longing for those days.
(If you want my take as of three years ago as to how I think I’d deal with being young again, here’s your post.)

Early on, cycles really didn’t count

As I’ve noted elsewhere, my first systems analysis and programming involved an IBM 188 Collator. (Hmm. 20% of all Google results for the search [IBM ‘188 collator’] are my handiwork. That may be depressing.] In some ways, the 188 was a marvelous machine, particularly in 1961 when it was introduced: IBM’s first punch card equipment using solid-state circuitry and core.

That’s right, core memory–visible devices, just a wee bit larger than today’s RAM bits. I honestly don’t remember how much core the 188 had–maybe 64 bytes, but that’s vague memory. I do remember how you programmed it: with a double-wide board full of holes, into some of which you put jumpers to make circuit connections. Hard-wired programming…
For the circulation system, it wasn’t a question of using too many computing cycles. You got 650 cycles per minute–that is, one cycle for each card feed. Your program did whatever logical comparisons between two cards (one from each reader) as it could, given the limited core and your ingenuity, then either fed both cards into a common bin or one or both cards into other bins.
Sounds primitive. Was primitive. Worked.
(More technologically interesting, in some ways, was IBM’s last card sorter–by far the fastest, and using vacuum feed rather than pushers to move the cards and an optical sensor rather than brush contact, so that a card would last for thousands of sorts without wearing out. Without the speed and gentleness of the IBM 84 [2000 cards per minute, which is fast for a mechanical device processing little pieces of stiff paper], the circ system would never have kept up with Doe Library’s volume of business.)

A bit later, every cycle counted

Comparing computing power of, say, the IBM 360/65 that I did early programming on (indirectly, sending decks of cards over from Berkeley to UCSF) and the Intel Core 2 Duo notebook I’m writing this on is a chump’s game. Looking at some sources, I see “1.25 million calculations a second” for the ’65, which had one megabyte of RAM (rather a lot in those days). How does that compare with two CPUs, each with 1.6 billion processing cycles per second, and 4 gigabytes of RAM? You got me; I’m not sure there is a real answer to that question.

The thing is, doing library processing on a machine with that kind of power required a lot of optimization. The ideal language for the work I wanted to do was clearly PL/I, for its combination of logic and string processing–but the head of the systems office properly wouldn’t let me use PL/I because the early compilers just didn’t produce tight code. Instead, I used assembler (BAL)…
When PL/I (Optimizer) came along (and we’d moved up to a somewhat faster S/360), I could start using the high-level language–but not without paying attention. I remember a classic example: Cases where I needed to do translates to normalize characters for sorting purposes. The classy way to do that would be to include two strings of characters in the TRANSLATE statement, the source and the object. But, after trying that and seeing the results, I moved to using two 256-character strings (not variables), containing the source and object sets.
Why? Because it made a difference of at least 10:1 in the overall running time of the program–changing it from something we couldn’t use to something we could. And once you understood some assembler and learned to read PL/I’s pseudo-assembler output, you could see why:
If you were translating using variables, then the compiler would generate code that built two 256-character strings each time the translate was performed, then do the translate–a big, unwieldy loop of code.
If you were translating using fixed strings, then the compiler generated one assembler statement. One. I think the difference for the translation steps was at least two orders of magnitude, maybe even worse.
That’s just one example. There were many others. In the ’70s and early ’80s, I’d probably spend as much time optimizing code as writing it in the first place, maybe more–and after the first two programs, my first code was already fairly optimal.

Don’t take me back…

With more abstract tools and less need to worry about cycles, I could have (potentially, at least) accomplished a lot more. So could we all. I think it’s great that a modern PC (Mac, Unix or Vista) can devote perhaps 90% of its cycles to system overhead–and still have plenty left for actual computation.

Still, sometimes things really do run slower than you’d like–and there are still lots of programmers who understand code efficiency. (I’d bet Google has hundreds of them!) They may be counting cycles at a more abstract level, but they’re still coming a little closer to the machine side of the man:machine boundary to get the job done.

Comments are closed.