October 29 was a Monday. It was also the day that Hurricane Sandy made landfall in New York City. Around 7 p.m., Con Edison shut power off in Lower Manhattan, and it would be sometime Saturday before lights would begin to flicker back on.
Three days into Mother Nature’s metaphoric wedgie, the party was just getting started for data center operator Zayo (zColo). So far they had managed to stay online without major incident, but now the generators were acting up, and it was determined that load needed to be shed or they would lose the entire operation. The decision was made to shut down portions of the cooling system … and soon after, thermal runaway was underway in the data halls.
According to reports, “…temperature inside zColo’s data center reached above 100°F. ZColo staff had to scramble — putting fans on the data center floor, opening windows, shutting down some equipment — to keep the temperature from rising.” Around four hours into the ordeal, the electrical issues were being addressed and ambient conditions had been stabilized between 95° and 100°.
To the credit of zColo’s staff, by noon the next day, Patrick Thibodeau reporting for Computerworld was able to state, “The problem was a result of a failed isolation valve in the fuel system supplying a generator. It was fixed without any customer outages, and temperatures [had] returned to 70°.”
ZColo’s story is just one of many from that week. There were multiple utility rooms flooded; equipment damaged by wind, flood, and explosion; and enough design failures to develop a full curriculum for a hard knocks graduate degree in Mission Critical design.
As an HVAC guy, I have wondered what it must have been like for zColo’s mechanical team when they were informed that in order to keep IT online they would have to shed mechanical load. Did they have any idea how the space temperature would react and how fast? Could anyone quantify the near-term and long-term risks to the IT equipment? And last, were their resumés up to date?
The truth is that these questions (sans the resumé query) should be asked and answered during design. The answers will help to determine the ultimate capacity of systems, what mechanical equipment may need to be on UPS, and whether thermal inertia in the form of increased water volume is needed.
Plus, no one wants to be doing the math five minutes before they kill the power to the chiller.
IT REALLY DOES DEPEND
So you ask, what’s the answer? The seemingly lame response is … it depends. But it’s OK to say “it depends” if you know what it depends on, and more importantly, if you know the impact of each variable on the final answer. For our discussion, the answers depend on the following:
- Normal ambient conditions at the server
- Maximum allowable operating temperature at the server
- Data center configuration
- Data center power density
- The amount of time it will take for cooling to be restored
The first three variables aren’t variables. By now, they should be given, based on the industry standards delineated by ASHRAE TC 9.9. In particular, the normal and excursion operating conditions should be the recommended and the allowable specifications indicated in the 2011 Thermal Guidelines for Data Processing Environments.
Therefore, the normal operating condition should be between 64.4° and 80.6°. Per the Green Grid and research conducted by others in the industry, the increase in server fan power consumption begins to adversely affect energy performance above roughly 77°. So that is where we chose to model the steady state. And finally, the maximum allowable operating temperature should be 89.6° (we will round to 90°).
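To make those thresholds concrete, here is a minimal sketch that sorts a server inlet temperature into the bands just described. The limits are the ones quoted above; the function name and band labels are my own illustration, not ASHRAE terminology.

```python
# Inlet temperature limits quoted in the text above (degrees F).
RECOMMENDED_LOW_F = 64.4
RECOMMENDED_HIGH_F = 80.6
ALLOWABLE_HIGH_F = 89.6  # the article rounds this to 90

def classify_inlet(temp_f: float) -> str:
    """Label a server inlet temperature (hypothetical helper for illustration)."""
    if temp_f < RECOMMENDED_LOW_F:
        return "below recommended"
    if temp_f <= RECOMMENDED_HIGH_F:
        return "recommended"
    if temp_f <= ALLOWABLE_HIGH_F:
        return "allowable"
    return "exceeds allowable"

# 77 is the modeled steady state; 100 is the kind of reading reported at zColo.
for reading in (77.0, 85.0, 100.0):
    print(f"{reading:.1f} F -> {classify_inlet(reading)}")
```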
DATA CENTER CONFIGURATION AND POWER DENSITY
Our model was based on a typical legacy design that I wouldn’t recommend for most new construction, but it is, in fact, what you will find in the majority of data centers operating in an enterprise environment. With that said, the CFD model and data presented herein were based on the following:
- 12-ft ceilings and a 3-ft raised floor
- 30 sq ft per rack
- Hot and cold aisles with no separation
- N+25% CRACs in a perimeter configuration
- 20°F air ∆T across the servers
- Return air plenum above the ceiling
- N+1 chilled water plant with 10° CHW ∆T
- Estimated CHW volume in system at 10 gal/ton
- Engine generators for power back-up
- And, no mechanical equipment on UPS
Regarding power densities, our analysis considered 50 and 150 W/sf (nominally 1.5 kW and 4.5 kW per rack, respectively), based on the pitch of our layout.
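The per-rack figures follow directly from the 30 sq ft per rack pitch listed above. A quick sketch of the arithmetic (the function name is mine):

```python
# Convert floor power density (W per sq ft) to nominal rack power (kW per rack).
SQ_FT_PER_RACK = 30  # layout pitch from the model assumptions listed above

def rack_kw(density_w_per_sf: float, sq_ft_per_rack: float = SQ_FT_PER_RACK) -> float:
    """Nominal kW per rack for a given floor power density."""
    return density_w_per_sf * sq_ft_per_rack / 1000.0

for density in (50, 150):
    print(f"{density} W/sf -> {rack_kw(density):.1f} kW per rack")
```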
TIME TO RESTART
We modeled a failure event without any means of continuous cooling (i.e., no thermal storage tanks). We accounted for the time associated with a chiller plant restart, including the time associated with power transfer to the generators and restart times for centrifugal chillers.
Because this is a magazine aimed at HVAC, I won’t delve into all of the electrical data behind the restart metrics. But I will say they were based on historical commissioning data, manufacturers’ specifications, and experience in the field. Also, because this was a failure study, when presented with a choice we always took the most conservative route.
Subsequently, our predictive model was based on the following restart assumptions:
- 0:00 Utility power is lost
- 0:10 The first engine generator is on the bus
- 0:20 Secondary pumps and CRAC fans are re-energized
- 0:30 Secondary chilled water flow and CRAC airflow are re-established
- 0:45 All of the chillers and primary pumps are re-energized
- 1:30 CHW at design return temperature reaches the CRACs
- 3:45 The restart cycles are completed for the previously running chillers
- 4:30 The standby chiller is operating at 100% capacity
- 7:30 All chillers are operating at 100%
- 8:30 CHW at design temperature reaches the CRACs – capacity restored
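For anyone who wants to work with these milestones, the sequence can be kept as plain data. The sketch below assumes the timestamps are minutes:seconds, which is how the rest of the discussion reads; the helper names are mine.

```python
# Restart assumptions from the list above, read as minutes:seconds (an assumption).
RESTART_TIMELINE = [
    ("0:00", "Utility power is lost"),
    ("0:10", "The first engine generator is on the bus"),
    ("0:20", "Secondary pumps and CRAC fans are re-energized"),
    ("0:30", "Secondary chilled water flow and CRAC airflow are re-established"),
    ("0:45", "All of the chillers and primary pumps are re-energized"),
    ("1:30", "CHW at design return temperature reaches the CRACs"),
    ("3:45", "The restart cycles are completed for the previously running chillers"),
    ("4:30", "The standby chiller is operating at 100% capacity"),
    ("7:30", "All chillers are operating at 100%"),
    ("8:30", "CHW at design temperature reaches the CRACs - capacity restored"),
]

def to_seconds(stamp: str) -> int:
    """Convert an 'm:ss' stamp to elapsed seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)

def seconds_until(keyword: str) -> int:
    """Elapsed seconds at the first event whose description contains keyword."""
    for stamp, event in RESTART_TIMELINE:
        if keyword.lower() in event.lower():
            return to_seconds(stamp)
    raise KeyError(keyword)

print("Seconds with no airflow:", seconds_until("airflow"))
print("Seconds until capacity is restored:", seconds_until("capacity restored"))
```

The two numbers that matter most for the rest of the discussion are the dead-air window at the start and the total time until design chilled water is back at the CRACs.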
INITIAL POWER LOSS: A BUSY TWO MINUTES
On a loss of power, a number of things immediately happen. The chillers, pumps, and CRACs that are running drop out. Flow continues for a few seconds on the air and water side as the motors wind down, but very quickly you are at a dead stop and the temperature in the data center will begin to rise.
Now, how quickly it rises is a function of the load density in the space, and to a lesser extent the thermal inertia present in the data center structure itself (cool walls, racks, etc.). Looking at Figure 1, you can see the marked difference in rate of rise, and ultimately maximum temperature, in a medium-density data center versus one with a relatively low power density.
[At this point, it has to be noted that the charts presented are based on collected data but have been smoothed for presentation purposes. However, for the purposes of this article (introduction to concepts vs. a research submission), the simplification shouldn’t become a distraction. Now back to our program.]
In a matter of 30 seconds with no airflow in a higher-density data center, the inlet temperature will quickly spike above 100°. On the other hand, the low-density data center sees a rise in temperature but stays well within the allowable range. Figures 2 and 3 clearly show the relatively small impact on the data center at 50 W/sf. Compare this to Figures 5 and 6, where at 150 W/sf, red becomes the dominant color all too soon.
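A back-of-the-envelope check shows why density drives the rate of rise. The sketch below is not the CFD model. It treats the room air as one well-mixed lump, ignores the thermal mass of walls and racks, and assumes standard air properties (0.075 lb/cu ft, 0.24 Btu/lb-°F), a 12-ft column of air over each square foot of floor, and a 77° starting point. Those simplifications are mine.

```python
# Lumped, well-mixed estimate of average air temperature rise with no cooling.
# Assumptions (not from the CFD study): standard air properties, no thermal
# mass in the structure, and a 12-ft air column above each sq ft of floor.
BTU_PER_HR_PER_WATT = 3.412
AIR_DENSITY_LB_PER_FT3 = 0.075
AIR_CP_BTU_PER_LB_F = 0.24
AIR_COLUMN_FT = 12.0
START_TEMP_F = 77.0  # modeled steady state

def rise_rate_f_per_s(density_w_per_sf: float) -> float:
    """Average air temperature rise in degrees F per second."""
    heat_btu_per_hr = density_w_per_sf * BTU_PER_HR_PER_WATT  # per sq ft of floor
    air_mass_lb = AIR_DENSITY_LB_PER_FT3 * AIR_COLUMN_FT      # per sq ft of floor
    return heat_btu_per_hr / (air_mass_lb * AIR_CP_BTU_PER_LB_F) / 3600.0

def temp_after(density_w_per_sf: float, seconds: float,
               start_f: float = START_TEMP_F) -> float:
    """Average air temperature after the given time with no heat removal."""
    return start_f + rise_rate_f_per_s(density_w_per_sf) * seconds

for density in (50, 150):
    print(f"{density} W/sf: about {temp_after(density, 30):.1f} F after 30 s")
```

Under these assumptions the low-density room gains only a few degrees in 30 seconds, while the 150 W/sf room gains roughly three times as much and passes 90°. Inlet temperatures at individual racks will run hotter than this room average because of recirculation, which is what the CFD captures.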
But an interesting thing happens when the CRACs and pumps kick in at 30 seconds. Cooling capacity at the data center is fully restored, if only for a little while. You see, there is cooling capacity resident in the chilled water supply piping. All of that cool water between the chiller plant and the CRACs is still there and hasn’t degraded. So depending on your system size (physical capacity, not tonnage) and your CHW ∆T (lower ∆T = shorter lag) you will have in the neighborhood of one to five minutes of interim cushion.
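The size of that cushion can be roughed out from the system volume and the ∆T. The sketch below uses the familiar water-side rule of thumb (Btu/h ≈ 500 × gpm × ∆T, so flow is 24/∆T gpm per ton) and assumes the plant is at full design load. The share of system volume sitting in the supply piping is a hypothetical parameter; I picked half purely for illustration.

```python
# Rough estimate of how long the cool water already in the supply piping lasts.
# Assumes full design load and flow; supply_fraction is a hypothetical input.

def gpm_per_ton(chw_delta_t_f: float) -> float:
    """Design flow per ton: 12,000 Btu/h per ton / (500 * delta-T)."""
    return 24.0 / chw_delta_t_f

def cushion_minutes(system_gal_per_ton: float, chw_delta_t_f: float,
                    supply_fraction: float = 0.5) -> float:
    """Minutes of cool water resident between the plant and the CRACs."""
    supply_gal_per_ton = system_gal_per_ton * supply_fraction
    return supply_gal_per_ton / gpm_per_ton(chw_delta_t_f)

# 10 gal/ton and a 10 F delta-T are the model assumptions listed earlier.
print(f"10 F delta-T: {cushion_minutes(10, 10):.1f} min")
# A wider delta-T means less flow per ton, so the same volume lasts longer.
print(f"16 F delta-T: {cushion_minutes(10, 16):.1f} min")
```

With the model's 10 gal/ton and 10° ∆T, this lands at roughly two minutes if half the volume is on the supply side and roughly four if you count the whole loop, which squares with the one-to-five-minute range.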
But don’t let this respite fool you. If you have no chilled water storage in the design (e.g., a tank or some exceptionally oversized element), there is a slug of warm water heading to the CRACs that will be the primary contributor to diminished conditions in the data center until the ride has come to a full and complete stop.
RESTART AND RECOVERY
So at about the two-minute mark, the good news is that power has been restored to all of the mechanical equipment. The bad news is that the data hall won’t see cool water for some time to come. Warm water has replaced cool water in the system, and it will still be a few minutes before a single chiller is operating at capacity. So in this phase, the sensible capacity of the CRACs takes a hard hit and your IT load is still outrunning your capacity.
Looking at Figure 4 (a simplified diagram based on our analysis and industry studies), the story now becomes a matter of minutes instead of seconds. The temperatures continue to increase but not at such a steep rate of rise, due to the fact that you have some cooling, albeit degraded. But that sure beats dead air and still water.
However, the slow rise is little solace because ultimately we will see temperatures similar to the initial spike. And this is due to the start sequences and restart requirements of the centrifugal chillers. Basically, it takes almost five minutes and eight minutes, respectively, for the standby chiller and then the previously running chillers to reach capacity.
And here is where that thermal inertia of the piping system works against you. The same pipe that held the cool water that helped you for a few minutes is now holding all that warm water. So now you have to work through that slug, and that can take up to 10 min, based on your system size and ∆T.
RISK ASSESSMENT AND MITIGATION
But there is some good news here. Even with the negative inertia and the glacial restart times, the low-density data center never really gets that far out of whack. Sure, it spikes, but well within our accepted tolerance. It would appear that a low-density data center at about 50 to 75 W/sq ft can ride through a power loss without needing continuous cooling strategies.
If it were only that simple in the higher-density application. The unfortunate truth is that the temperatures get pretty high pretty fast, easily exceeding our preferred maximum of 90° for 10 min or more.
And while an excursion this short may be an acceptable risk in some operating models, there is no getting around the fact that damage may occur. Representing the conservative end of the spectrum, after Sandy, Vince Renaud of the Uptime Institute stated in part that higher temperatures, “…will produce catastrophic results for computer equipment or, if no immediate affects, they will result in ‘wounded servers’ that will fail later on when you least expect it.”
Of course, this also gets into the whole conversation regarding ASHRAE TC 9.9’s “X-Factor” and a time-weighted assessment of risk and failure. The standard speaks in terms of hours, not minutes, vis-à-vis manageable risk.
So it’s safe to say the designer has some latitude when determining whether additional steps should be taken to provide continuous cooling. But one major takeaway should be that once you get to 150 W/sq ft and above, you have to quantify the risk. And in turn, you must design accordingly, in order to mitigate this risk based on your data center’s particular operation.
So what are some design options to consider when it comes to decreasing risk? Well, simply placing the CRAC fans on UPS will lessen the initial spike shown in Figure 1. And putting the pumps on UPS as well will eliminate the spike altogether.
Solving the 30-min challenge (or longer at higher densities) conveyed in Figure 2 will require chilled water storage and a control sequence that automatically switches from production to storage utilization and avoids CHW temperature degradation. The capacity of that storage will be dependent on design parameters and safety factors driven by practical assessment of the risk.
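As a first pass at that storage capacity, the same water-side rule of thumb (24/∆T gpm per ton) gives a volume for any ride-through target. This is a sketch only. The 500-ton load and the 1.25 safety factor are hypothetical placeholders, and a real tank also has to account for stratification and usable volume.

```python
# First-pass chilled water storage volume for a ride-through target.
# The load and safety factor used in the example are hypothetical.

def storage_gallons(load_tons: float, ride_through_min: float,
                    chw_delta_t_f: float, safety_factor: float = 1.0) -> float:
    """Gallons needed to carry the load for the given minutes at design delta-T."""
    gpm = load_tons * 24.0 / chw_delta_t_f  # design flow at full load
    return gpm * ride_through_min * safety_factor

# 8.5 min matches the restart timeline above; 30 min is the longer challenge.
for minutes in (8.5, 30):
    gallons = storage_gallons(500, minutes, 10, safety_factor=1.25)
    print(f"{minutes} min at 500 tons, 10 F delta-T: {gallons:,.0f} gal")
```

The takeaway is that volume scales linearly with both load and minutes, so the ride-through target you settle on in the risk assessment drives the tank size directly.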
And you can’t forget there must be an intelligent sequence allowing for changeover back to production and recharging of the tank (again, without degrading capacity in the process).
Last, consider the sizing of the secondary piping and pumps. Because of redundancy, you have “extra” capacity resident in the system to assist in recovery. In fact, if you are operating at full design load, you will need that additional capacity to outrun the thermal runaway. Oversizing the secondary piping and pump systems by an appropriate percentage is relatively cheap insurance.
There’s no getting around the fact that risk is resident in mission critical design. And while you can’t solve every problem ahead of time, you can certainly avoid stupid decisions. Ask the guys who had to carry fuel oil up 17 floors in buckets during Sandy, in order to serve a generator located on the roof because the pumps had been located in a basement that was now flooded.
We have discussed cooling failure and recovery in relatively simple terms, but shutdowns are never simple. Sometimes you clear the initial hurdle only to be challenged by a second and then a third, like zColo was.
But as designers, we have an opportunity to use these lessons learned and our own common sense … starting with the recognition that at higher densities, our world gets a little more complicated. ES