It's the 1990s, and an IT pilot fish at a catalog company hears about this hot new line of servers from the company's preferred vendor.
"Well, of course, we need these," fish says. "So we place two into a cluster with several gigabytes of disk space for high-transaction database queries.
"Shortly after rollout in production, these servers start crashing -- abruptly and without warning of any kind, never the same problem, appears to be really random."
The vendor's engineers come out and replace one of the parts that fails most frequently. Everything tests out fine. The engineers leave, the servers go back into production, and soon the system crashes again.
The vendor's engineers return. Another suspect part is replaced, things test fine, engineers leave, system crashes again. And again. And again.
"This goes on for two months, and things are being escalated to the highest levels within the vendor and our management," says fish. "So the vendor flies in one of its top designers to perform electronic analysis on these servers."
Fish and his cohorts figure the expert will arrive with all kinds of specialized diagnostic equipment, and they prepare to stay the night as he tears the equipment apart.
But when the expert arrives, there's no special equipment, and he doesn't start tearing into the servers. He just settles in to wait for the failure everyone knows is coming.
"Sure enough, about midday one of the clustered servers crashes," fish says. "We proceed to remove the covers and examine the server. After an initial examination, the specialist pulls out a pair of network cable scissors and snips a quarter-inch off a copper ribbon that's grounding the motherboard to the chassis.
"This solves the problem."
Seems the grounding ribbon on these new servers is held in place with the wrong kind of glue. As the server runs, the glue heats up and the copper ribbon curls, shorting out the motherboard at random locations. Which is why sometimes the problem appears to be the CPU, sometimes memory, sometimes the network connection.
The expert packs up his scissors and heads for home.
"After this, they change their manufacturing process for future machines," fish says. "And we never see another problem like this from any of these servers ever again."