The Laugh Was On Someone
The Question Is: On Whom?
by Ray Borrill
In 1967 I was employed as the manager of computer services with the Computer Systems Group. I was responsible for the maintenance of about ten computer systems which were based on the SDS 910, 920, Sigma 2, Sigma 7 and a few other brands and models such as the DDD DDP116, DEC PDP-8, etc.
I was, and always will be, a firm believer in the value and necessity of Preventive Maintenance, supplemented by, or performed in conjunction with, Scheduled Maintenance. Since all of our systems were real-time, on-line data acquisition and control applications, we developed the policy that we would schedule maintenance only when the experimental facility to which the computer system was interfaced was down for its own service. All preventive maintenance would be done “on-the-fly”, that is, with the system running. To facilitate this we installed “Hobbs Meters” (essentially, electric clocks) on everything that required P.M. This included the vacuum motors on the tape drives, the on-line peripherals and the like. We had Hobbs Meters on anything in the system that could be turned on or off while the system kept on working. These included the cooling blowers and just about everything electromechanical. Then, when the Hobbs Meters told us it was time, we would take the indicated unit off line, do what needed doing, and put it back on line. The cost of the equipment to do this was not trivial, but shutting the system down and giving as many as eight research groups nothing to do for as long as the PM task took would have been far more expensive than the total cost of the equipment required to do it the way we did. What is more significant is that our systems compiled very impressive records for uptime, MTBF and MTTR.
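The Hobbs Meter scheme amounts to counting running hours per unit and servicing each unit when its own count says so. Here is a minimal sketch of that bookkeeping in Python. The class name, the unit names and the service intervals are all made up for illustration; our real intervals came from the manufacturers' recommendations and our own experience.

```python
from dataclasses import dataclass


@dataclass
class HobbsMeter:
    """Running-hours counter for one unit (illustrative model, not real hardware)."""
    unit: str
    pm_interval_hours: float      # hypothetical service interval
    hours_run: float = 0.0
    hours_at_last_pm: float = 0.0

    def log_run(self, hours: float) -> None:
        # The meter only advances while the unit is powered on.
        self.hours_run += hours

    def pm_due(self) -> bool:
        # PM is due once the unit has run a full interval since its last service.
        return self.hours_run - self.hours_at_last_pm >= self.pm_interval_hours

    def record_pm(self) -> None:
        # Called after the unit was taken off line, serviced and put back.
        self.hours_at_last_pm = self.hours_run


def units_due(meters):
    """Names of the units that should be taken off line for PM now."""
    return [m.unit for m in meters if m.pm_due()]


meters = [
    HobbsMeter("tape vacuum motor", pm_interval_hours=200),
    HobbsMeter("cooling blower", pm_interval_hours=500),
]
for m in meters:
    m.log_run(300)
print(units_due(meters))
```

Only the unit whose meter has passed its interval is pulled; everything else, and the system as a whole, keeps running.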
The word of this got around to other institutions, and one day my supervisor, mentor and the Head of our Group, Bob Spinrad, brought Jim Mollenhour of Bell Labs, Murray Hill, in to see me. Jim, whom I already knew, explained that Bell Labs had an SDS 910 or 920 system (I can’t recall which and it doesn’t matter) interfaced to an Emperor Tandem Van de Graaff generator at Rutgers University in New Jersey, and were not happy with the service being provided by the SDS service dept., which was “on call” while Bell Labs awaited delivery of the new SDS 925 which was to replace the old machine. Because the SDS service manager couldn’t schedule service calls, his people would only come at their convenience after Bell Labs called. Jim Mollenhour asked me if I would be willing to come over to Rutgers and service the computer whenever they called. He said they would pay for my expenses and my time, and laughingly promised an unspecified bonus if I found something wrong that should have been caught by the SDS people “long ago!” We got the OK for the deal from my higher-ups and I just went on with the everyday routine and awaited a call from Bell Labs.
It wasn’t a long wait. A few days later I got a call asking me if I could come over to Rutgers the next day. I checked my schedule and found nothing that demanded my immediate attention, and my techs were all going to be at work, so I told the Bell Labs people that I could come. I left sufficiently early the next day to drive the approximately 120 miles and beat the rush hour traffic. I arrived at Rutgers at about 7 AM, hoping I would be done in time to beat the afternoon rush hour traffic as well. The minimum time to safely drive the 240 miles round trip via the most direct route was about five hours, but hitting either or both rush hours would extend it to six and a half to eight hours. Add the time spent working on the computer and I would have a long day if I wasn’t lucky.
Things went well on this and most other trips and I found that Bell Labs paid me “portal to portal”, plus a per diem meal allowance and any other expenses, and BNL never docked me for the day off, so I was well compensated even if I occasionally got tied up in traffic for a few hours.
Generally, the work was routine. The computer suffered from the “if it ain’t broke, don’t fix it” philosophy that most C.E.’s seem to subscribe to, so most of my work was correcting things that had not been kept in correct adjustment.
Let me digress a little and give you a simple example of the problem, and why the “if it ain’t broke, don’t fix it” idea doesn’t work. In routine data processing programs, if a record is written to magnetic tape and the write tape routine reports that an error occurred, the tape is backed up and the record is written again. If an error is reported again, the process is repeated, often up to ten times, and if the error is still reported, the program erases that length of tape, goes to the next unused area and tries again, and this process is repeated up to ten times before any action is taken. Let us assume that the researcher can’t take that much time. The incoming data is placed in a buffer area in memory, and when the buffer gets full, the data must be written to the magnetic tape system while new data is stored in a second buffer, in ping-pong fashion. So he writes his mag-tape routines so that the process described above is limited to three tries. Then, if the system reports an unrecoverable tape error, he calls the service people, who come in and run their diagnostics. These use the most common procedure, and with the ten tries no data is reported unrecoverable, so they report that the machine didn’t fail. (Privately, they almost always conclude that the customer must have a software bug and wish he would learn to use the machine within its specs.) And nothing was checked, calibrated, adjusted or replaced that hadn’t (in their eyes) broken since the last time they did not check, calibrate, adjust or replace it.
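To see why the diagnostic and the researcher can disagree about the same tape drive, here is a small Python sketch of the retry logic. It is a simplification under stated assumptions: the function names are mine, the erase-and-move-on stage is left out, and the “marginal drive” that only writes cleanly on the fifth attempt is a made-up stand-in for a transport that has drifted out of adjustment.

```python
def write_record(attempt_ok, max_tries):
    """Try to write one record, retrying up to max_tries times.

    attempt_ok(n) says whether the n-th attempt (1-based) verifies cleanly.
    Returns the number of tries used, or None if the record is unrecoverable.
    """
    for attempt in range(1, max_tries + 1):
        if attempt_ok(attempt):
            return attempt
    return None


def marginal_drive(attempt):
    # Hypothetical out-of-adjustment transport: it gets the record down
    # eventually, but not until the fifth try.
    return attempt >= 5


# The service diagnostic uses the common ten-try limit and sees no failure.
print("diagnostic:", write_record(marginal_drive, max_tries=10))

# The real-time program can only afford three tries before its second
# buffer fills, so the very same drive gives an unrecoverable error.
print("real-time:", write_record(marginal_drive, max_tries=3))
```

Both parties are reporting honestly. The drive is neither “broken” nor within what the application needs, which is exactly the case that only checking against the manufacturer’s specifications will catch.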
Now, back to where we were. What I would do was assume that the System Programmer knew what he was doing. So I would check the tape drive thoroughly to make sure that everything was within the manufacturer’s specifications. And I would keep this up until I found exactly where in the whole process the error occurred, and fix it.
And that is generally the way it went. The researchers would report any problems they might have (or suspect they were having) and write them up in the console log book for me to read when I got there and, barring something catastrophic, go on until the Van de Graaff generator was shut down for one reason or another, at which time I would be called. I’d go over the next morning and twiddle, tweak, adjust, clean, etc., whatever I thought was causing the problem. Let me make it clear that these problems were often not the type that would cause the computer to be shut down. They were just things that made the users suspect that the computer was in need of some service. For example, a user might notice that the tape writing process was repeatedly causing the rewrite of many records, suggesting that there was a problem with the tape transport.
Occasionally there were no reported problems when the accelerator was shut down. They might then call me and say that if I had nothing pressing in my schedule, would I come over and just check out the computer as thoroughly as I could within the time I could spend there. This gave me the opportunity to run tests that I never had time to do normally.
One day I decided to check the instruction execution one by one, under conditions as near as I could get to those encountered during normal on-line operation, including the possibility of any instruction being interrupted. Everything looked perfectly normal until I got to the “SKIP IF NEGATIVE” instruction ----- AND THE ROOF FELL IN! I could not believe my eyes. There had never been a complaint filed on that instruction. No one thought the data looked “funny” as it was accumulated. No one reported anything suspicious in the results produced by the data analysis programs --- just nothing. But the instruction was failing! It never skipped, regardless of whether the addressed operand word was positive or negative.
I could see it fail as plain as day on my oscilloscope screen. When the OP-CODE for the “SKN” instruction was decoded, a flip-flop was set. A couple of microseconds later, the sign bit of the operand came chugging by, and if it was high, indicating a negative state, the flip-flop was reset and the result was that the next instruction was skipped. But it wasn’t working, because a leaky capacitor in the reset side of the flip-flop caused the pulse generated to have a slow rise time, thus preventing the flip-flop from resetting. Without the flip-flop reset the skip would not occur, regardless of the sign of the operand.
I wondered how long that trouble had been in the machine. I was very sure that it was a long time, because the component that was causing it was not digital, and such components simply do not suddenly start leaking. I decided that the researchers should know about it so they could decide what to do about the possibility of errors appearing in their work. I called Jim Mollenhour and asked him to come right over and told him I had found my “bonus” trouble. When he arrived I told him what I had discovered and, using the scope, showed him the failure in action. He was aghast, and while he watched it on the scope I laughingly suggested that it would be really funny if someone had achieved results with this system which led to a Nobel Prize in physics. Would anyone find it easy to go to the committee and announce that the data was all suspect due to a failure of an instruction that was used a great deal in the data acquisition and pre-processing phase of the experiment? He asked me if I could fix it or, if not, could I send a spare p.c. board over from BNL until they could get a new one from SDS. I told him that if he could wait around another 15 minutes or so I could probably fix it now. I removed the board, replaced the bad capacitor and tried it again, and everything ran like a charm.
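For readers who prefer logic to scope traces, the failure can be written down as a toy model in Python. This is only my description of the behavior restated as code, not the actual SDS circuit; the function name and the reset_pulse_ok flag (standing in for the capacitor being good or leaky) are invented for the illustration.

```python
def skn_skips(operand_negative, reset_pulse_ok=True):
    """Toy model of SKN: return True if the next instruction is skipped."""
    flip_flop_set = True              # set when the SKN op-code is decoded
    if operand_negative and reset_pulse_ok:
        flip_flop_set = False         # sign bit high: a clean pulse resets it
    return not flip_flop_set          # the skip happens only after a reset


# Healthy machine: skips on negative operands only.
print(skn_skips(True), skn_skips(False))

# Leaky capacitor: the slow reset pulse never resets the flip-flop,
# so the instruction never skips, whatever the sign of the operand.
print(skn_skips(True, reset_pulse_ok=False),
      skn_skips(False, reset_pulse_ok=False))
```

Notice that with the fault present the instruction still does the right thing for every positive operand, which goes some way toward explaining why nobody had complained.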
Soon after that they took delivery of their new computer and my services were no longer needed. But a week or so after the event I received a letter from another Bell Labs friend, who used to work at BNL, with a note from him and Jim Mollenhour that simply said: “A thorough study of all research results from the system shows that no Nobel Prizes are at risk, but you surely earned the bonus promised. So, enjoy the enclosed and we demand that you share it with your wife for her patience with all the hours spent here instead of home with her.” And the bonus was a check for a tidy sum, so I took my wife for a weekend in New York.
I have used this story many times when I try to impress service techs with the importance of Preventive and Scheduled Maintenance over the “If it ain’t broke, don’t fix it” school of service. To me, another way of saying that is: “Never examine, adjust, calibrate, clean, or test ANYTHING that hasn’t failed since the last time you didn’t examine, adjust, calibrate, clean, or test it.” The fact that nobody has seen a failure is no excuse to skip maintenance; I’d much rather test it BEFORE it fails!
Software copyright © 2002 by Tripodics
Website -- courtesy of Weþyx