Monday, November 24, 2003

The Hell of Blackouts



An interim report on the August 14th US-Canada blackout was recently released. The document runs over 130 pages and covers several causes of the blackout, but the most interesting thing is that when it started, no one seemed to know what was happening, thanks to computer malfunctions.

The report starts with an executive-type overview of the way the systems interact, owing to the difficulty of storing and transmitting electricity. One of the inaccuracies in the report states that electricity travels at the speed of light. I myself had been taught 250 mph by one of my Ohio State University physics professors, but it appears that's wrong too, as you can read about here and here. It is interesting, but dry. I don't blame the report writers for not being precise about a scientific fact, given their final target audience, but it makes one wonder what else they 'glossed over' in their 'interim' report.

The main computer system that monitors the electrical grid for FirstEnergy (FE) in Ohio (just a few hours north of where Jack lives, and the start of the blackout) is the GE Harris XA/21 EMS system. According to the documentation, it is a UNIX-based system that uses the TCP/IP network protocols (the same ones you use every day on the Internet) and the ODBC (Open Database Connectivity) standard to talk to a SQL (Structured Query Language) database backend on a POSIX-compliant system. The system is programmed in ANSI C and FORTRAN.

What this essentially means, as indicated in the brochure, is that it uses "Open Systems": industry-standard protocols and programming interfaces that allow other types of systems to connect to it.

It's kind of how the Internet works.

Pretty much everything on the Internet uses "Open Standards", or you'd be downloading a new program every time you visited a new website.
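
For the curious, here's roughly what that "open systems" pitch looks like in practice. This is a minimal sketch, assuming a generic ODBC driver is installed; it is NOT FirstEnergy's or GE's code, and the DSN, credentials, table, and column names are all made up. The point is just that any ANSI C program can pull data out of a SQL backend through the same standard calls, no matter whose database sits behind it.

    /* Hypothetical sketch: reading telemetry out of an EMS database via ODBC.
     * DSN, credentials, table and column names are invented for illustration. */
    #include <stdio.h>
    #include <sql.h>
    #include <sqlext.h>

    int main(void)
    {
        SQLHENV env;
        SQLHDBC dbc;
        SQLHSTMT stmt;
        SQLRETURN rc;
        SQLCHAR name[64];
        double mw;
        SQLLEN len1, len2;

        /* Standard ODBC handle setup -- identical no matter whose database
         * answers on the other end. */
        SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
        SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
        SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);

        rc = SQLConnect(dbc, (SQLCHAR *)"EMS_DSN", SQL_NTS,
                        (SQLCHAR *)"operator", SQL_NTS,
                        (SQLCHAR *)"password", SQL_NTS);
        if (!SQL_SUCCEEDED(rc)) {
            fprintf(stderr, "connect failed\n");
            return 1;
        }

        /* Plain SQL query against a made-up telemetry table. */
        SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
        SQLExecDirect(stmt,
            (SQLCHAR *)"SELECT line_name, megawatts FROM line_flows", SQL_NTS);

        while (SQL_SUCCEEDED(SQLFetch(stmt))) {
            SQLGetData(stmt, 1, SQL_C_CHAR, name, sizeof(name), &len1);
            SQLGetData(stmt, 2, SQL_C_DOUBLE, &mw, 0, &len2);
            printf("%s: %.1f MW\n", name, mw);
        }

        SQLFreeHandle(SQL_HANDLE_STMT, stmt);
        SQLDisconnect(dbc);
        SQLFreeHandle(SQL_HANDLE_DBC, dbc);
        SQLFreeHandle(SQL_HANDLE_ENV, env);
        return 0;
    }

Swap the database vendor underneath and that code doesn't change. That interchangeability is the whole selling point, and it's also why anything speaking the same standards can knock on the door.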

Now, for all of you Conspiracy Theorists, time to get out your foil hats. (I've been harping on the foil hats a lot lately).

James over at Hell In A Handbasket tends to "pooh-pooh" the possible threats of a cyberattack, but I think this is a case that proves one could do a lot of damage if launched against the right targets.

The shit really started to hit the fan at 12:15 PM EDT, about 3 hours before the blackout.

Oh, did I mention that FE's GE XA/21 system software hadn't been updated since 1998? Guess how many Unix-type operating system vulnerabilities have been disclosed in that 5-year period? Lots. Who knows what other modules the system was running? But I digress.

Anyway, just after noon, one of the monitoring systems quit working due to "inaccurate data" (buffer overflow anyone?). However, no one at the main control center knew it. Then a large generation unit in Eastlake shut down around 1:30 PM, and by around 2:15 PM the alarm and logging computer system (that darned XA/21) was completely dead and useless. At 3:05 PM the chain of failures that became the blackout started, and before long millions of people were in the dark.

We're lucky that more people didn't end up hurt during that outage.

Losing the Eastlake plant itself didn't cause the blackout, but because the computer system was FUBAR'd, no one knew what was going on. The report says the two main reasons for the blackout were that the operators were unaware of what was happening because of the computer failure, and that power lines sagged into trees.

OK - it wasn't that no one knew anything was happening. One of FE's employees called around to get some things reconfigured to support the high load that day, but because of the monitoring system failure, he wasn't working with enough information. And someone did figure out that a monitoring device had failed, and turned the system off to correct the error, but then went to lunch and forgot to turn the monitoring system back on. Even though the monitors run every 5 minutes, no one noticed it wasn't working right until an hour and a half later.
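
How does a monitor that runs every 5 minutes go unnoticed for an hour and a half? Easily, if the only thing standing between "running" and "not running" is a switch someone has to remember to flip back. Here's a toy sketch of that pattern, with everything invented for illustration (this is not the actual monitoring code):

    /* Toy sketch of a periodic monitoring cycle with a manual off switch.
     * Everything here is invented -- the real monitoring software is far
     * more elaborate than this. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define DISABLE_FLAG "/tmp/monitor.disabled"   /* hypothetical "off switch" */

    static void run_monitoring_pass(void)
    {
        time_t now = time(NULL);
        printf("monitoring pass ran at %s", ctime(&now));
    }

    int main(void)
    {
        for (;;) {
            /* If someone created the disable flag to clear a bad reading and
             * then went to lunch, every 5-minute cycle silently does nothing,
             * and nothing downstream complains about the missing results. */
            if (access(DISABLE_FLAG, F_OK) != 0)
                run_monitoring_pass();
            sleep(300);   /* the 5-minute cycle */
        }
    }

Nothing yells when the cycle comes up empty, so the gap only shows up when a human goes looking.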

So someone turned it back on.

But by now the data coming across was bad. A systems engineer identified a possible problem with the grid at about 2 PM, but didn't call the main operator until an hour later, and when the operator checked, his screens mistakenly showed that everything was running fine. It took another 20 minutes to get that straightened out, and then another 20 minutes to get the system reporting everything correctly.

That was 2 minutes before it all went to hell.

You see, about 2 hours before that, the alarm and logging system had gone down.

At about 2:14 PM, the system wasn't reporting anything of any use. Within the next 30 minutes, FE lost both the primary and backup servers completely. Both systems died? The report doesn't say conclusively how they failed (though it discusses some theories later).

But guess what? For an hour, no one monitoring the system noticed the servers had crashed.

Guess Homer had too many donuts that day.

AEP had even called FE to report problems, but of course, since the system was down, FE reported no alarms or logged problems. DOH! The backup server had failed 13 minutes after the primary server, but still no one noticed.

Well, no one WORKING noticed.

The system did automatically page the IT staff.
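
The report doesn't spell out how the paging worked, but the usual pattern is a heartbeat watchdog: a small process that expects the server to check in regularly and fires off a page when it doesn't. A minimal sketch, assuming a heartbeat file the alarm server is supposed to touch (the path, timeout, and paging routine are all made up):

    /* Hypothetical heartbeat watchdog -- not the XA/21's actual mechanism.
     * The alarm server is assumed to touch HEARTBEAT_FILE every cycle; if
     * the file goes quiet, page the IT on-call. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    #define HEARTBEAT_FILE   "/var/run/alarm_server.heartbeat"
    #define MAX_SILENCE_SECS 120

    static void page_it_staff(const char *msg)
    {
        /* Stand-in for whatever paging gateway was really in use. */
        fprintf(stderr, "PAGE to IT on-call: %s\n", msg);
    }

    int main(void)
    {
        for (;;) {
            struct stat st;
            if (stat(HEARTBEAT_FILE, &st) != 0 ||
                time(NULL) - st.st_mtime > MAX_SILENCE_SECS) {
                page_it_staff("alarm server heartbeat missing");
                /* Note: nothing here tells the control-room operators. */
            }
            sleep(60);
        }
    }

Note who gets told: IT, and only IT. The control room isn't on the distribution list.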

Everyone who works in a building with an IT staff knows that things can go wrong, but the IT staff doesn't tell anyone much beyond "we've got a system down and we're working on it".

Don't want to look bad, ya know?

The report supposes that data "overflowed the process' input buffers" (see buffer overflow above) in the system, which caused the alarm system failure. This means that neither the server nor the remote terminals spewed out any data about the grid problems. Oops.

Since the data overflow wasn't stopped, when the system failed over to the backups, the backup servers collapsed under the same data load.
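
To be fair, the report's "overflowed the process' input buffers" describes a data-flood failure, which isn't necessarily the exploit-style buffer overflow I was joking about above. Here's a hedged toy model of that failure mode, with invented numbers and nothing to do with GE's actual code: messages arrive faster than the alarm logic can drain them, the fixed-size queue fills, and the process gives up.

    /* Toy model of the reported failure mode: telemetry arrives faster than
     * the alarm process drains it, the fixed-size input queue fills, and the
     * process gives up.  All numbers invented; not GE's implementation. */
    #include <stdio.h>
    #include <stdlib.h>

    #define QUEUE_SLOTS 1000

    int main(void)
    {
        int queued = 0;
        int arrivals_per_tick = 50;   /* messages flooding in each cycle  */
        int drained_per_tick  = 30;   /* what the alarm logic can process */

        for (int tick = 0; ; tick++) {
            queued += arrivals_per_tick;
            if (queued > QUEUE_SLOTS) {
                fprintf(stderr, "tick %d: input buffer overflowed, alarm task wedged\n", tick);
                /* A backup server running this same code against this same
                 * input stream just repeats the sequence a few minutes later. */
                exit(1);
            }
            queued -= (queued < drained_per_tick) ? queued : drained_per_tick;
            /* Net growth of 20 messages per cycle: once load beats the drain
             * rate, the overflow is only a matter of time. */
        }
    }

And since the backup runs the same code against the same flood, failing over just replays the same ending.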

While the overflow was happening, the operators' screens refreshed only about once every minute, compared to the normal 1 to 3 seconds. These screens are also "nested" underneath the top-level screens that the operators view, so digging through them slowed to a crawl.

By now the IT guys had arrived and "warm booted" (rebooted without powering off) the systems. They checked the servers and saw that all was good, but never verified with the control room operators that the alarm system was functioning again.

"Just reboot it, and we can go home guys, no one will notice that anything major was wrong".

What's interesting is that the operators hadn't noticed the real problem. They didn't call about the alarm system problem until about an hour after the IT staff started working on things (and 30 minutes after IT had 'fixed' it).

The alarm system displays had "flat-lined" (they didn't go to zero, they just stayed where they had been at the point of failure, which should have looked odd given the normal voltage fluctuations in the grid) and no one seemed to notice or care.
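
Catching a flat-lined display isn't rocket science, either. The telltale is exactly what the report describes: real grid telemetry always jitters a little, so a value that hasn't budged across several scans is almost certainly stale. A made-up sketch of that check (the point name, threshold, and sample values are all invented):

    /* Hypothetical staleness check -- flag a telemetry point whose value
     * hasn't changed across several consecutive scans.  Real voltages and
     * line flows jitter constantly, so a perfectly flat reading usually
     * means the data has stopped updating. */
    #include <math.h>
    #include <stdio.h>

    #define STALE_SCANS 10      /* invented threshold */
    #define EPSILON     1e-6    /* tolerance for "unchanged" */

    struct point {
        const char *name;
        double last_value;
        int unchanged_scans;
    };

    /* Call once per scan with the freshly read value. */
    static void check_staleness(struct point *p, double new_value)
    {
        if (fabs(new_value - p->last_value) < EPSILON) {
            if (++p->unchanged_scans == STALE_SCANS)
                printf("WARNING: %s flat for %d scans -- data may be stale\n",
                       p->name, STALE_SCANS);
        } else {
            p->unchanged_scans = 0;
            p->last_value = new_value;
        }
    }

    int main(void)
    {
        struct point bus = { "EXAMPLE 345 kV bus voltage", 0.0, 0 };
        double live[] = { 344.8, 345.1, 344.9, 345.0 };

        /* A few live samples, then the feed freezes at its last value,
         * which is what a dead alarm system would show on the display. */
        for (int i = 0; i < 30; i++)
            check_staleness(&bus, i < 4 ? live[i] : 345.0);
        return 0;
    }

Ten frozen scans and the software could have said what the humans didn't: this data is dead.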

Once they did figure out what was wrong, it was too late. The cascade had started, and the operators didn't want the IT staff to "cold-boot" (power off and restart) all the systems, because they were afraid that they wouldn't have any data after that, even though what they had was pretty useless.

The rest is history.

I don't know if these systems are connected in any way to the Internet, but I'd be surprised if they weren't. 100% isolation of a private network is difficult to maintain these days. Someone somewhere always hooks something up to help them get easier access to resources they need. If someone mounted a concerted effort against utility and power systems through these connections, it would be easy to see how it could get many people hurt or killed.

It's all the computers' fault.

Really.
