Reliability is an important design criterion. For many facilities, service capacity adequate to supply the load is not, by itself, enough. Hospitals, data centers, and other critical facilities require high availability of electrical power and use a combination of multiple sources and redundant pathways to maintain service in the event of a utility failure or an internal fault.

The boom in construction of mission-critical facilities during the recent dot-com era popularized computer software packages that analyze power system reliability, given the system configuration and information about the failure characteristics of its components. Because switchgear, transformers, and generators are expensive, increasing redundancy to obtain higher levels of reliability carries a price tag, and such software is often used to determine the economically optimal configuration.

We've all heard stories about multimillion-dollar aircraft grounded by the failure of a $0.50 electronic component. Power distribution systems offer similar opportunities for small details to compromise otherwise highly reliable systems, and those details are often overlooked.

Limitations Of Software

Software may use a number of analysis techniques, all with the objective of applying the mathematical laws of probability and statistics to the failure process. Given the system configuration, the expected failure frequency, and the associated outage or repair time for each component, these packages calculate the expected outage frequency and duration at the loads. They are used to compare alternative system configurations or to determine the effect on load-point reliability of improving the performance of a particular component.
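As a rough illustration of the arithmetic such packages automate, the sketch below computes steady-state availability from a component's failure frequency and repair time, then combines components in series and in redundant parallel. The component values are illustrative assumptions, not survey data.

```python
# Minimal sketch of the steady-state availability arithmetic reliability
# packages automate. Figures are illustrative only; real component data come
# from sources such as industry surveys and vary widely between facilities.

HOURS_PER_YEAR = 8760.0

def availability(failures_per_year: float, repair_hours: float) -> float:
    """Steady-state availability from failure frequency and mean repair time."""
    downtime_hours = failures_per_year * repair_hours  # expected annual downtime
    return 1.0 - downtime_hours / HOURS_PER_YEAR

def series(*avail: float) -> float:
    """All components must be in service: availabilities multiply."""
    result = 1.0
    for a in avail:
        result *= a
    return result

def parallel(*avail: float) -> float:
    """Fully redundant components: unavailabilities multiply."""
    unavail = 1.0
    for a in avail:
        unavail *= (1.0 - a)
    return 1.0 - unavail

# One utility feed through a transformer, versus two fully redundant feeds.
feed = series(availability(1.0, 4.0),    # utility circuit: 1 failure/yr, 4 h repair (assumed)
              availability(0.01, 48.0))  # transformer: rare failure, long repair (assumed)
print(f"single feed availability: {feed:.6f}")
print(f"dual redundant feeds:     {parallel(feed, feed):.6f}")
```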

With this power, however, come the drawbacks associated with all uses of software in engineering, the greatest of which is a tendency to mistake precision for accuracy. When presented with an output report of reliability values that run out to eight or nine significant figures, there is a strong tendency to believe it must be so; after all, what could be better for the ego of an engineer raised in the era of Star Trek's Mr. Spock than to be able to pronounce that availability of power at a particular UPS panel in a system with hundreds of components is 99.993457%? Unfortunately, such pronouncements overlook the following important factors.

The GIGO (garbage in, garbage out) principle applies: certainty of results can be no better than certainty of input data. Even in the best databases, component failure rates are estimated with reasonable confidence to no closer than a factor of two, and repair times vary greatly between facilities. Reference component data may also have been developed from older generations of equipment or under different operating conditions.
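A quick way to see what a factor-of-two uncertainty does to those eight or nine significant figures is to resample the input failure rates and watch the computed availability spread. The system, rates, and repair times below are hypothetical.

```python
# Hedged illustration of the GIGO point: if each failure rate is only known to
# within a factor of two, the computed availability spans a range far wider
# than eight significant figures imply. All numbers are hypothetical.
import random

HOURS_PER_YEAR = 8760.0

def availability(fail_per_year, repair_hours):
    return 1.0 - (fail_per_year * repair_hours) / HOURS_PER_YEAR

def system_availability(rates):
    # Simple series system of four components with fixed repair times (hours).
    repairs = [4.0, 48.0, 8.0, 24.0]
    a = 1.0
    for rate, rep in zip(rates, repairs):
        a *= availability(rate, rep)
    return a

nominal = [1.0, 0.01, 0.2, 0.05]   # assumed failures per year
samples = []
for _ in range(10_000):
    # Each rate drawn anywhere between half and twice its nominal value.
    rates = [r * random.uniform(0.5, 2.0) for r in nominal]
    samples.append(system_availability(rates))

samples.sort()
print(f"nominal availability: {system_availability(nominal):.8f}")
print(f"5th-95th percentile:  {samples[500]:.6f} .. {samples[9500]:.6f}")
```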

Software can only model what you tell it to model, which is generally the one-line diagram of the system. Other factors affecting in-service reliability, ranging from environment to conditions of maintenance to human factors, must be accounted for elsewhere.

The Details

For high in-service reliability, designers must look beneath the surface to find the details where the devil lurks. Following are "real world" areas that compromise reliability as surely as a single utility supply with no backup.

Human factors. It is often stated that more than 50% of all loss-of-load incidents involve human activity, whether scheduled maintenance or operator response to alarms or off-normal conditions. In a common scenario, a component fails and the system responds automatically, maintaining service to the load; the operator then errs in the switching necessary to isolate the failed component for repair, and the result is an outage. Multiplying redundant pathways and protecting against every contingency produce high calculated reliability, but if operators cannot intuitively understand the resulting system operation and control, human error is likely to produce far lower attained values. The KISS (keep it simple, stupid) principle should be kept in mind.
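A rough back-of-the-envelope sketch, using assumed rather than measured figures, shows how quickly switching errors can dominate the outage rate of an otherwise highly redundant system.

```python
# Rough, hypothetical arithmetic: even a well-designed redundant system can be
# dominated by operator error if every component failure forces manual
# isolation switching. All figures below are illustrative assumptions.

component_failures_per_year = 2.0    # events requiring isolation switching (assumed)
p_switching_error = 0.02             # assumed chance an operator errs per event
hardware_outages_per_year = 0.005    # calculated outage rate of the redundant hardware (assumed)

human_caused_outages = component_failures_per_year * p_switching_error
total_outages = hardware_outages_per_year + human_caused_outages

print(f"hardware-only outage rate:   {hardware_outages_per_year:.3f} per year")
print(f"human-error outage rate:     {human_caused_outages:.3f} per year")
print(f"share involving human error: {human_caused_outages / total_outages:.0%}")
```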

Environment. These factors present opportunities for common-mode failure (CMF), in which redundant pathways or components fail from the same event. Water and thermal threats are common; if the roof leaks or a sprinkler discharges, it's likely to soak both supply transformers if they share a room. A single cooling unit serving a space that houses redundant UPS modules may be a weaker link than the electrical system. Segregating equipment can also reduce operator error; it's harder to turn the wrong switch if it's in the other room.
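One simple way to account for a shared threat is a beta-factor style adjustment, in which some assumed fraction of each unit's unavailability is treated as common-mode and takes out both redundant units at once. The unit unavailability and beta value below are illustrative assumptions, not measured data.

```python
# Hedged sketch of a beta-factor style adjustment: a fraction 'beta' of each
# unit's unavailability is assumed to be common-mode (roof leak, shared cooling
# unit), striking both redundant units together. All numbers are illustrative.

q_unit = 0.001   # assumed unavailability of one redundant unit
beta = 0.05      # assumed fraction of that unavailability that is common-mode

q_independent = (1.0 - beta) * q_unit
q_common = beta * q_unit

# The pair fails if both units fail independently, or a common-mode event hits.
q_pair_ideal = q_unit ** 2
q_pair_with_cmf = q_independent ** 2 + q_common

print(f"ideal redundant pair unavailability: {q_pair_ideal:.2e}")
print(f"with 5% common-mode fraction:        {q_pair_with_cmf:.2e}")
```

Even a small common-mode fraction, on these assumptions, erases most of the benefit the redundancy was supposed to buy.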

Controls. Frequently overlooked, control systems often span redundant portions of electrical systems and create opportunities for CMF. A recent personal experience involved a system with redundant standby generators arranged to back up a utility source and also to parallel with the utility for peak shaving. Failure of a single control relay, used to switch the engine governors to droop mode when in parallel with the utility, prevented both machines from operating when required.