SPOTLIGHT
Hot and known service issues
Power outage incident in the James Watt North building 04/09/07
As you may already know, IT Services experienced a major power loss on Tuesday 4th September between 08.35 and 10.05 am.
What happened?
In the James Watt North building there is a complex system of servers (more than 100), routers and switches which serve the University (and beyond!) with many central University services including web and email.
Aside from the mains power supply we have a backup generator and an Uninterruptible Power Supply (UPS). The UPS is essentially a large battery: it kicks in if there are any fluctuations in power to ensure our systems remain on. The backup generator takes over in the event the mains power fails.
Best practice UPS design includes a bypass which allows removal of the UPS from service for maintenance or replacement in a safe way.
Our bypass has a circuit breaker built in, and it is this circuit breaker which tripped, isolating the UPS from both the mains and the backup generator (see diagram).
After around 20 minutes the UPS batteries were drained and power was lost to our systems. Mains power was still available in the building, so the backup generator did not start.
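The chain of events above can be sketched as a very simplified model. This is an illustration only, not a description of the actual electrical design: the names `PowerState`, `generator_running`, `ups_input_live` and `systems_powered` are hypothetical, and the rule that the generator starts only when mains power is lost is an assumption taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class PowerState:
    mains_available: bool          # is mains power present in the building?
    bypass_breaker_closed: bool    # False once the bypass circuit breaker has tripped
    battery_minutes_left: float    # remaining UPS battery runtime


def generator_running(state: PowerState) -> bool:
    # Assumption: the backup generator starts only when mains power fails.
    return not state.mains_available


def ups_input_live(state: PowerState) -> bool:
    # Mains or generator power reaches the UPS only through the closed breaker.
    source_live = state.mains_available or generator_running(state)
    return state.bypass_breaker_closed and source_live


def systems_powered(state: PowerState) -> bool:
    # Systems stay up while the UPS is fed, or while its batteries last.
    return ups_input_live(state) or state.battery_minutes_left > 0


# Breaker tripped, mains still present, batteries (about 20 minutes) still charged.
just_after_trip = PowerState(True, False, 20)
# The same situation once the batteries have drained.
batteries_drained = PowerState(True, False, 0)
# An ordinary mains failure with the breaker closed, for comparison.
mains_failure = PowerState(False, True, 20)
```

In this model the systems lose power in the `batteries_drained` case even though mains power is present, and the generator never starts because the mains never failed.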
We quickly knew where the problem was but could not identify the exact cause. It took a University electrician to help diagnose the problem before a safe solution could be implemented.
We were able to return mains power to the systems at 10.05 am.
Disaster Recovery
In such circumstances our Disaster Recovery plans come into effect. These plans are there to allow us to return to full service in the event of various system failures.
As part of our Disaster Recovery we had to decide whether to move to our Disaster Recovery suite in the Boyd Orr building or to bring our primary site back online. As we had already restored power, we decided it would be fastest to bring our primary site back. (Switching over to the Disaster Recovery suite is not just flicking a switch: many services would need to be reconfigured and some would need to be rebuilt from scratch.)
Why didn’t we just switch everything back on?
It doesn’t work like that.
Everything has to be switched on in the correct order, as many services depend on other services. For example, we couldn’t bring back email until the network was there first. Meanwhile, thousands of users of all the services were trying to log on to systems that weren’t there!
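As a rough illustration of what "the correct order" means, the idea can be sketched as a dependency graph that is sorted so each service starts only after the services it relies on. The service names and dependencies below are hypothetical examples, not our actual configuration; only the email-needs-network relationship comes from the text above. The sketch uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be running first.
DEPENDENCIES = {
    "network": [],
    "dns": ["network"],
    "authentication": ["network", "dns"],
    "email": ["network", "dns", "authentication"],
    "web": ["network", "dns"],
}


def startup_order(dependencies):
    """Return a start order in which every service follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = startup_order(DEPENDENCIES)
for position, service in enumerate(order, start=1):
    print(position, service)
```

Whatever order the sort produces, the network always comes before email, which is why email could not simply be switched back on first.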
By 11.30 am most services had been restored, although we continued to experience some instability. This instability is inevitable when carrying out a disaster recovery process on the very complex University systems. Over the course of the day we experienced a number of related problems, but these were fixed almost as quickly as they appeared.
What now?
We are investigating the events that led up to this incident to discover why the circuit breaker tripped. We still don’t know the cause. When we tested the load it was well under the trip threshold, so perhaps there was a spike in the mains power.
Most people at the University are aware of the ongoing electrical upgrade work taking place on the main campus, including the James Watt North building. This work has caused fluctuations in the electrical supply, but our systems have functioned without further electrical problems.
We are reviewing events to see if there is anything we could do better. If there are lessons to be learned we will learn them.
IT Services would like to thank everyone for their support and patience during this incident.