The failure of DBS Bank’s electronic banking services reminds me of an incident that is little known to the public but was told to me by a friend who is a UNIX administrator. It was a classic example of outsourcing gone bad. The following is my account of the incident, with the names of the company and the internationally acclaimed vendor changed (for obvious reasons).
This is the story:
It was during one fine night shift that a data centre operator accidentally ran a delete *.* command. According to her own confession, she thought she was deleting files in a folder. Unfortunately for her, she was actually in the root directory when she did so. (For the uninitiated, the root directory is something like the main trunk of a tree. In short, while she thought she was cutting off a useless branch, she had actually chopped the whole tree down.)
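For readers who want to see how easily this can happen, here is a minimal sketch of why the current working directory makes all the difference to a wildcard delete. The directory names are purely hypothetical, and the exact command issued that night is of course unknown to me; do not run the destructive lines yourself.

    # Intended: clear out an old report folder (hypothetical path)
    cd /var/log/old_reports
    rm -rf *        # the wildcard expands only to files inside old_reports

    # What (something like) actually happened: the shell was sitting at /
    cd /
    rm -rf *        # the wildcard now expands to /bin /etc /home /usr ... the whole tree

    # A safer habit: check where you are and preview the expansion first
    pwd
    echo rm -rf *   # prints what would be deleted instead of deleting it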
The result of her action was disastrous. She effectively wiped out the OS, the mount points on the SANs and so on. Almost 4.3 TB (terabytes, i.e. roughly 4 × 10¹² bytes) of files were deleted. Unlike a bad command in a Windows command prompt, which you might still catch with CTRL-C or CTRL-BREAK or kill off in Task Manager, a delete like this on UNIX rips through the files so fast that by the time anyone reacts, the job is essentially done.
So, even if she had realised her mistake and tried to stop it, she couldn’t. The 23 servers across MNC X that were connected to the same SAN volumes (all hit by her erroneous command) went down immediately. Slowly, the other servers were affected too. The final ‘body count’: 168 of MNC X’s servers in that data centre were affected. The end result: nobody from any country, in any outlet or office, could connect back to MNC X. It was simply an IT black day.
Best of all, this happened somewhere around midnight, which gave the operator time to cover up what she did. She quickly modified whatever log files she had access to and deleted all her entries. So when the monitoring system (which was miraculously still functional) sent alerts to the ‘owners’ of each affected system, the tech guys who were woken up discovered to their dismay that they couldn’t log in remotely (since their servers were all down). They were left with no choice but to drag their tired bodies back on site.
It took almost 3 hours for all of them to get back on site, and once there they found no trace of what had happened and could only scratch their heads, since the log files had been manipulated. Even more puzzling was that the redundancy had failed – the backup systems had not kicked in.
To cut a long story short, MNC X was left with no choice but to restore from backups to bring all its systems back online. It was over 15 hours before they finally restored some semblance of order to the entire IT infrastructure. Meanwhile, Vendor Y launched an investigation, and ultimately a guru in UNIX administration discovered the log file manipulations and even found out exactly who did it.
The saddest part of it all was that MNC X never discovered the truth, even though it probably lost millions that day, given that its offices in other parts of the world were still running. According to Vendor Y’s findings, the reason the redundant systems didn’t kick in was that the backup systems were too old and no longer matched the configuration of the primary systems. (Doesn’t that sound almost as ridiculous as the reason DBS gave in its official statement – an upgrade – as the cause of the breakdown of ALL its electronic banking services?)
That was obviously just more droppings from a bull’s behind. After all, any IT technical person worth his salt would have asked what the heck Vendor Y had been doing if it had not implemented the hardware and / or software upgrades needed to keep the redundant systems up to date! They would have taken the vendor to court and sued it for a substantial amount in damages.
The story gets even better from here. The culprit went totally untouched because she had been Vendor Y’s permanent staff for over 20 years. Instead, a contract staff member was made the scapegoat and fired to appease MNC X.
The ultimate irony in this sad story of outsourcing gone bad was that the contract staff member who got fired was one of the three UNIX gurus who discovered who had altered the damned logs. I personally suspect the story doesn’t end here. Vendor Y probably convinced MNC X to buy yet another several million dollars’ worth of hardware to ‘make sure this will never happen again’. (Note: the part about the vendor profiting from this fiasco is just my own speculation, not something that actually happened.)
When I look at the magnitude of the staggering damage caused by Vendor Y in this major fxxk-up, the lack of dedication shown by the staff of the vendor to which my current employer outsourced some of its IT services pales by comparison. After all, the minor delays caused by these morons, who simply didn’t put themselves in the shoes of our business users, are nothing compared to what MNC X suffered in that one single morning.