Thoughtful Simplification In The Data Center
Now that you’re back, I would like to make my second mistake and share an anecdote that may lead you to question my credibility. About five years ago, I was sitting in front of a young engineer named Rob, explaining the basics of chilled water design. I explained that old-school designers used a chilled water delta-T of 10°F or 12°, but more enlightened designers (like me) used a difference of 14° or maybe even 16°. Rob simply asked, “Why not 15?” My initial answer had something to do with odd vs. even numbers or some such nonsense.
ENLIGHTENED? NOT SO MUCH ...
On this particular topic, the difference between Steve Taylor and me is that Taylor made a decision to step away from rules of thumb and so-called standard design practice and delve back into the essentials of chilled water design, using tools that weren’t available to designers in the past.1 I, on the other hand, had simply tweaked an existing convention without putting much thought into the process. Instead of cutting a trail to enlightenment, I chose the path of least resistance, and the path of least resistance inevitably leads to designs of remarkable mediocrity.
Today, I work mostly in data centers, and things have changed so much in the last 10 years that I barely recognize those old designs. So, if the very definition of a data center is in flux, our designs and how they manage and respond to the environment should evolve as well. But do we just take the old concepts and tweak them, or should we re-examine the question with a focus on the fundamentals?
If I plan to recover from one of the most damaging article introductions I’ve ever crafted, I had better insist we get back to basics.
DATA CENTER DESIGN ABSOLUTES
But before we begin, I want to list what I consider the givens of modern data center operation and design. I have concluded that at every opportunity (and an article is an opportunity) I must be an evangelist for common-sense data center design, and the data center design absolutes are:
• Recognize the recommended
• Advance the allowed
• Separate the streams
The “recommended” is the recommended envelope of the ASHRAE Thermal Guidelines.2 Those ranges should no longer be a matter of debate. The “allowed” refers to the wider allowable limits in the same guidelines, and we should strive to approach those boundaries sooner rather than later.
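To put numbers on the first two absolutes, here is a minimal Python sketch that classifies a measured server inlet drybulb against the 2011 ASHRAE envelopes: recommended 18 to 27°C (roughly 64 to 81°F), and allowable 15 to 32°C for Class A1, 10 to 35°C for A2, 5 to 40°C for A3, and 5 to 45°C for A4. It deliberately ignores the humidity and dewpoint limits, and the function and constant names are illustrations of my own, not anything ASHRAE publishes; check the limits against the edition of the guidelines you are designing to.

```python
# Drybulb limits in °C from the 2011 ASHRAE Thermal Guidelines.
# Humidity/dewpoint limits are intentionally omitted from this sketch.
RECOMMENDED_C = (18.0, 27.0)
ALLOWABLE_C = {
    "A1": (15.0, 32.0),
    "A2": (10.0, 35.0),
    "A3": (5.0, 40.0),
    "A4": (5.0, 45.0),
}


def f_to_c(temp_f: float) -> float:
    """Convert °F to °C."""
    return (temp_f - 32.0) * 5.0 / 9.0


def classify_inlet(temp_f: float, equipment_class: str = "A2") -> str:
    """Return 'recommended', 'allowable', or 'out of range' for a server inlet drybulb."""
    temp_c = f_to_c(temp_f)
    lo, hi = RECOMMENDED_C
    if lo <= temp_c <= hi:
        return "recommended"
    lo, hi = ALLOWABLE_C[equipment_class]
    if lo <= temp_c <= hi:
        return "allowable"
    return "out of range"


print(classify_inlet(75.0))        # recommended
print(classify_inlet(90.0))        # allowable (Class A2)
print(classify_inlet(90.0, "A1"))  # out of range
```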
And the “streams” we must separate are the airstreams entering and leaving the IT rack. The only way to absolutely guarantee that they don’t mix is to provide physical demarcation using curtains, walls, chimneys, or some other barrier … pick your poison and employ it.
A BRIEF HISTORY OF DATA CENTER DESIGN
Data centers have their roots in the research universities, national labs, and vast military-industrial complex of the 1950s and ’60s. It wasn’t until the ’90s that information technology (IT) materialized as a resource that commercial entities needed to manage and control.
The hot aisle/cold aisle (HACA) concept was developed at IBM in 1992 but didn’t gain major traction for years. And it wasn’t until 2004 that ASHRAE for the first time budged from the data center design gold standard of 72°/50% rh. In the meantime, traditional designs included constant volume computer room air conditioning (CRAC) units, often with integrated humidifiers.
And for an industry that coined the term “precision cooling,” the sequence of operation was remarkably unsophisticated: the chilled-water coil control valve, humidifier, and reheat coil modulated in sequence to maintain the drybulb temperature and relative humidity at the CRAC return. Sometimes the controls for multiple CRACs were linked, but most often they operated independently and ended up fighting one another, as one cooled and dehumidified while another humidified.
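The fighting is easy to reproduce on paper. In this toy Python sketch, two CRACs each run their own stand-alone humidity logic against their own return sensor. The setpoint, deadband, readings, and function name are all invented for illustration; a few points of sensor drift between units is enough to produce readings like these.

```python
def crac_humidity_action(return_rh: float, setpoint_rh: float = 50.0,
                         deadband_rh: float = 5.0) -> str:
    """Stand-alone humidity logic for one CRAC, acting only on its own return sensor."""
    if return_rh > setpoint_rh + deadband_rh:
        return "dehumidify"
    if return_rh < setpoint_rh - deadband_rh:
        return "humidify"
    return "idle"


# Invented readings: same room, two return sensors that disagree by 12 points.
readings = {"CRAC-1": 56.0, "CRAC-2": 44.0}
actions = {unit: crac_humidity_action(rh) for unit, rh in readings.items()}
print(actions)  # {'CRAC-1': 'dehumidify', 'CRAC-2': 'humidify'}
```

Neither controller is wrong by its own lights; the waste comes from letting each one act alone.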
When you look at the timeline, is it really any wonder that we step into so many data centers today that are still operating like this? Heck, by today’s standards (see the data center absolutes above), a so-called modern data center built circa 2002 is a 10-yr-old energy-sucking antique. Throw in the trend toward high rack-power densities, and you can bet there are a number of field-engineered “solutions” that make an already questionable situation that much less tenable.
Most data centers today look similar to Figure 1. The graphic is telling: even though HACA is indicated, the CRACs are distributed in the space. So, as those pesky red arrows indicate, you will likely see mixing. And because a data center cooling system is only as efficient as its least efficient condition, whatever gains you might be seeing from HACA are undermined by this short circuit.
MOVE THE SENSORS
For the sake of brevity and to avoid a two-part series, we will assume that dewpoint control is handled by a centralized ventilation unit, as suggested by ASHRAE,3 and we will focus solely on the reconfiguration and control of the CRACs and drybulb temperature.
The first thing to understand is that the ambient conditions at the CRAC return are probably the thing you should care about least. They don’t represent anything of value, and any problem you can identify from an atypical reading at that location, you can detect more quickly and more accurately at a more appropriate sensor location.
Fortunately, ASHRAE has given you the exact new locations for those sensors. Specifically, as shown in Figure 2, we should provide at least one sensor every 10 to 30 ft or every fourth rack position in the cold aisle.4 ASHRAE calls for the sensor to be located in the center of the aisles, but that just isn’t workable. Instead, you should locate the sensors on or inside the doors of the cabinet, aligned with the air intakes of the servers at the top, middle, and bottom of the rack.
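The arithmetic behind the wire-and-sensor count is simple. This Python sketch assumes one monitored cabinet at every fourth rack position, starting with the first rack in the row, and three sensors (top, middle, bottom) on each monitored cabinet; the starting position and the helper name `sensor_points` are assumptions for illustration.

```python
HEIGHTS = ("top", "middle", "bottom")


def sensor_points(num_racks: int, every: int = 4) -> list[tuple[int, str]]:
    """Hypothetical helper: (rack number, height) for each cold-aisle sensor.

    Racks are numbered from 1; a monitored rack sits at every `every`th
    position starting with rack 1, with one sensor at each intake height.
    """
    monitored = range(1, num_racks + 1, every)
    return [(rack, height) for rack in monitored for height in HEIGHTS]


print(len(sensor_points(10)))  # 9  (racks 1, 5, 9)
print(len(sensor_points(13)))  # 12 (racks 1, 5, 9, 13)
```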
This seems easy enough, but it’s surprising what kind of pushback you can get when you suggest this move. For starters, it can amount to a lot of wire and sensors. Where you once had one sensor integrated into a CRAC, now you may be looking at a dozen located throughout the space. It’s a hurdle that must be overcome.
Second, if you are dealing with the typical data center staff, there may not be a separation between the hot aisles and the cold aisles, but there will be one heck of a wall between IT and facilities. Getting the IT folks to allow the wrench and screwdriver crowd to touch their cabinets may require some finagling. Perhaps an Xbox 360 at the bottom of a box of Dunkin’ Donuts will curry favor with the CTO.
One option that may make the CFO and CTO more amenable is wireless sensors (Figure 3). The Lawrence Berkeley National Lab has a great white paper that is worth a look.5
GIVEN #1: IT LOAD VARIES OVER TIME
So sensor relocations are fairly obvious. But should we just move the sensors and control the CRACs the same way we always have? Of course not. This is an engineering challenge after all, not just a geometry problem.
Although we design HVAC in data centers for a peak connected load, that load is never constant. And quite often it takes years for the projected load to arrive … if it ever does. While the load can be relatively steady, it is more likely to vary considerably over time. Consider that even though we live in a 24/7 world, the weekend still introduces a welcome rhythm to our existence.
In general, business-related processing like banking and corporate activities will wind down and back up over Saturday and Sunday, while entertainment-related activities like Netflix, Xbox, and online shopping trend up over the same two-day period.
And even during the day, there are rhythms to our digital universe. Social media peaks in the morning, at lunch, and during the evening (those darn kids and their Facebook). Search engines surge during the working hours. And as the various markets open and close around the globe, financial institutions and their respective operations respond in unison.
GIVEN #2: IT AIRFLOW VARIES OVER TIME
Something that may surprise a lot of folks is that even if you’re designing a constant volume data center conditioning system, VAV is actually in play. Because equipment manufacturers are under the same pressure we are all under to save energy, they have incorporated measures at the server level that most of us haven’t considered. Just like HVAC at our macro-scale, VAV is an intelligent strategy to employ directly at the server, so most servers now employ variable-speed fans.
In an open data-center configuration, server fans are in parallel with CRAC fans. If the amount of air you deliver into the cold aisle is equal to the amount of air drawn by IT equipment, then all is well. But if you are delivering less air than the IT fans draw, then those servers are going to get the air they need from somewhere, and that somewhere is the hot aisle. This will lead to hot spots, high hot-aisle temperatures, and possible equipment failure or degradation.
But too often, in order to solve the under-airing problem, over-airing is employed. The hot spots are gone, leaving a cold air bypass and lower hot-aisle temperatures … which is not a good thing. This means the fans are working harder and wasting energy.
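Both cases can be quantified with the sensible-heat equation for standard air (BTU/h = 1.08 × CFM × ΔT, with 1 W = 3.412 BTU/h). The Python sketch below lumps one cold aisle into a single mixing volume, assumes a constant temperature rise across the servers, and assumes perfect mixing of any recirculated or bypassed air. Real aisles are far lumpier, and the function names are hypothetical.

```python
from typing import NamedTuple


def it_airflow_cfm(it_load_w: float, server_delta_t_f: float) -> float:
    """Airflow drawn by IT gear, from BTU/h = 1.08 * CFM * dT and 1 W = 3.412 BTU/h."""
    return 3.412 * it_load_w / (1.08 * server_delta_t_f)


class AisleBalance(NamedTuple):
    condition: str       # "balanced", "under-aired", or "over-aired"
    mismatch_cfm: float  # recirculated (under-aired) or bypassed (over-aired) air
    inlet_f: float       # average server inlet temperature, °F
    return_f: float      # mixed temperature returned to the CRACs, °F


def aisle_balance(supply_cfm: float, it_cfm: float,
                  supply_f: float, server_delta_t_f: float) -> AisleBalance:
    if supply_cfm < it_cfm:
        # Servers make up the shortfall with hot-aisle air. A steady-state energy
        # balance with perfect mixing gives inlet = supply + recirc * dT / supply.
        recirc = it_cfm - supply_cfm
        inlet = supply_f + recirc * server_delta_t_f / supply_cfm
        return AisleBalance("under-aired", recirc, inlet, inlet + server_delta_t_f)
    # Excess supply air bypasses the servers and dilutes the return.
    bypass = supply_cfm - it_cfm
    ret = supply_f + server_delta_t_f * it_cfm / supply_cfm
    return AisleBalance("over-aired" if bypass > 0 else "balanced", bypass, supply_f, ret)


print(round(it_airflow_cfm(10_000, 20.0)))    # 1580 CFM for 10 kW at a 20°F rise
print(aisle_balance(1000, 1200, 65.0, 20.0))  # under-aired: average inlet 69.0°F
print(aisle_balance(1500, 1200, 65.0, 20.0))  # over-aired: return 81.0°F
```

In the under-aired case the average inlet climbs 4°F even though the CRAC supply never changed; in the over-aired case the inlet is fine, but the return drops from 85°F (balanced) to 81°F, which is exactly the bypass penalty described above.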
Throw into this mix the fact that IT load and airflow are modulating, and you can see that the only way to guarantee no hot spots is to overcool most of the time. And that is just silly.
It should be noted that if you incorporate true separation between the hot and cold aisles, the server fans will run in series with the CRAC fans. However, this doesn’t really change the dynamics, because even properly blanked-off IT racks, raised access floors, and aisle-containment systems cannot hold a vacuum. Temperature degradation may be dampened simply because the leakage pathways have been minimized, but you still have the potential for hot spots and wasted energy via under- and over-airing.
Therefore, with or without separation, air will go where it has to, and the variable air issue still needs to be acknowledged and addressed.
GIVEN #3: DATA CENTER CONTROL IS NOT PRECISE
Until we embrace direct fluid cooling, HVAC in the data center will always be a mushy mess. Everything in a typical data center fights the very notion of precise control. We blow air into a leaky raised floor where it eventually flows through uncontrolled openings located in a cold aisle with imaginary boundaries. This air is then drawn randomly at varying flows through IT equipment where it is discharged into an equally nebulous hot aisle and eventually returned to a CRAC or CRACs via an imperfect ceiling plenum or over the tops of other racks and aisles.
With comfort conditioning, we are actually working in a tighter band than in a data center. Even in its most conservative recommended ranges, ASHRAE allows a drybulb range of about 64° to 81°. So even though designers always want to be as close to the ideal condition as possible, there is a wider dead band to work within. Throw in the fact that typical OEM parameters are much wider than ASHRAE’s Recommended, and you have quite a bit of elbow room.
So how does this figure into our approach? Well, for starters, requiring precise measurements would be both overkill and frustrating. ASHRAE and many designers advocate controlling the VFDs to maintain a minimum floor pressure.6 But I would suggest that measuring pressure accurately is difficult to start with, and then throw in a less-than-perfect floor and a very low setpoint, and I think you have a control vendor’s dream but a facility manager’s nightmare.
I came across a pretty good T-shirt once that read, “Slow and steady wins the race, except in a real race.” But if the race is a race to save energy in the data center, then Aesop and his trusty tortoise may be right on target. Control loops, whatever they may be, should avoid hysteresis and hunting; tune them to be slow-acting, reacting only as fast as needed. This is common sense that flies in the face of the precision-cooling maxim “keep it tight.”
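One common way to make a loop slow-acting, sketched here in Python as an illustration rather than a prescription, is to smooth the hottest-rack reading with an exponentially weighted moving average and act only when the smoothed value leaves a deadband around setpoint. The class name, the smoothing factor `alpha`, and the 1°F deadband are assumptions to be tuned in the field.

```python
class SlowRackSignal:
    """Hypothetical helper: filtered hottest-rack inlet with a deadband decision."""

    def __init__(self, setpoint_f: float, deadband_f: float = 1.0, alpha: float = 0.1):
        self.setpoint_f = setpoint_f
        self.deadband_f = deadband_f
        self.alpha = alpha       # 0 < alpha <= 1; smaller means slower
        self.filtered_f = None   # no history until the first reading

    def update(self, reading_f: float) -> str:
        if self.filtered_f is None:
            self.filtered_f = reading_f
        else:
            # Exponentially weighted moving average of the raw reading.
            self.filtered_f += self.alpha * (reading_f - self.filtered_f)
        if self.filtered_f > self.setpoint_f + self.deadband_f:
            return "too warm"
        if self.filtered_f < self.setpoint_f - self.deadband_f:
            return "too cool"
        return "hold"


# Noisy readings swinging ±1.5°F around an 80°F setpoint never leave "hold",
# even though each raw reading on its own falls outside the deadband.
signal = SlowRackSignal(80.0)
decisions = [signal.update(r) for r in [80.0] + [78.5, 81.5] * 10]
print(set(decisions))
```

Smaller values of `alpha` make the loop slower and calmer; the deadband keeps ordinary sensor noise from ever reaching the actuators, while a sustained rise still gets through after a few intervals.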
Bottom line: Getting carried away with precise control in a space that, by industry definition, can swing at least 17° is like putting a micrometer on a party balloon: kind of pointless.
So first, establish airflow discipline to the maximum extent possible. Plug holes in the floor, use blanking panels in the racks, provide only as many perforated tiles as you need, and place them where the load is. Compare Figure 4 with Figure 1: if you can introduce a return path via a plenum, you will see improvements. And if you can add real separation, you are more than halfway to your target.
HVAC energy consumed in the average data center is split about 50/50 between the fans and the heat rejection method (DX, chiller, etc.). DX applications have issues with reduced airflow, so varying airflow, and in turn fan energy, isn’t usually an option. However, supply air reset strategies can be applied pretty much across the board. So if a lack of VSDs, cash, or nerve limits you to a single strategy, I would suggest you go the variable-temperature route.
With this strategy, feedback from the rack temperature sensors resets the CRAC supply temperature upward as far as possible while still keeping the most demanding rack satisfied. In smaller data centers, you would control all CRACs in unison, but in larger installations, you may employ local temperature zones, each controlling multiple units.
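Zoning is mostly bookkeeping. In this Python sketch, each CRAC is assigned to a zone and takes as its control input the hottest rack inlet in that zone; a small room is simply one zone that every unit shares. The zone names, unit names, and the function `crac_control_temps` are placeholders.

```python
def crac_control_temps(rack_inlets_f: dict[str, list[float]],
                       crac_zone: dict[str, str]) -> dict[str, float]:
    """Give each CRAC the hottest rack inlet (°F) in the zone it serves."""
    hottest_by_zone = {zone: max(temps) for zone, temps in rack_inlets_f.items()}
    return {crac: hottest_by_zone[zone] for crac, zone in crac_zone.items()}


# Larger room: two zones, each with its own group of units (placeholder names).
rack_inlets_f = {"north": [72.0, 75.5, 78.0], "south": [70.0, 71.5]}
crac_zone = {"CRAC-1": "north", "CRAC-2": "north", "CRAC-3": "south"}
print(crac_control_temps(rack_inlets_f, crac_zone))
# {'CRAC-1': 78.0, 'CRAC-2': 78.0, 'CRAC-3': 71.5}
```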
And what temperature should you control to? Well, even though ASHRAE and the vendors say you can go as high as 95°, there is a significant body of research that indicates that around 80° you see diminishing energy savings as server fans begin to ramp up.7 This makes a great case for real-time power monitoring of IT gear to find that sweet spot, but I digress.
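With real-time power monitoring, finding that sweet spot reduces to picking the setpoint with the lowest combined cooling and IT power from trended data. The Python sketch below does just that. The sample numbers are made up purely to exercise the function, not measurements, and in practice the readings must be taken at comparable IT workloads or the comparison is meaningless.

```python
def sweet_spot(trend):
    """Return the inlet setpoint (°F) with the lowest cooling kW + IT kW.

    trend: iterable of (setpoint_f, cooling_kw, it_kw) tuples.
    """
    best = min(trend, key=lambda row: row[1] + row[2])
    return best[0]


# Hypothetical trend data for illustration only, not measurements.
trend = [
    (75.0, 300.0, 1000.0),
    (80.0, 270.0, 1010.0),
    (85.0, 255.0, 1040.0),
]
print(sweet_spot(trend))  # 80.0
```

In this made-up data, 80°F wins even though cooling power keeps falling at 85°F, because the IT draw, server fans included, rises faster.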
With just the temperature-reset strategy, your compressorized equipment runs less, and when it is operating, it runs more efficiently. Plus, you have widened the economizer window, be it air-side or water-side. If your coils can handle the variation, you can double down with a VAV strategy using the same simple rack control point.
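The reason VAV is worth the extra nerve is the fan affinity law: on a fixed system curve, fan power varies roughly with the cube of speed. This Python sketch applies the ideal law; real installations lose some of that benefit to motor and drive losses and to any fixed static-pressure requirement, so treat its output as an upper bound on savings. The function name is illustrative.

```python
def fan_power_kw(full_speed_kw: float, speed_fraction: float) -> float:
    """Ideal fan affinity law: power scales with the cube of speed."""
    if not 0.0 <= speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be between 0 and 1")
    return full_speed_kw * speed_fraction ** 3


for speed in (1.0, 0.9, 0.8, 0.7):
    print(f"{speed:.0%} speed -> {fan_power_kw(10.0, speed):.2f} kW")
```

Slowing a fan that draws 10 kW at full speed to 80% speed ideally cuts its draw to about 5.1 kW.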
Specifically, using the slow-acting control discussed earlier, reset the CRAC fan speed and supply temperature based on the highest rack temperature. Because both adjustments act on the same rack temperature, they shouldn’t be made simultaneously, or the two loops will fight each other; a stepped strategy is recommended instead.
• On start-up, the fan shall run at its preset maximum speed, with the leaving air temperature (LAT) at its minimum.
• If the temperature at every rack is below setpoint, increase the LAT incrementally until it reaches the predetermined maximum.
• If the temperature at every rack remains below setpoint and the LAT is at its maximum, reduce the fan speed incrementally until it reaches the predetermined minimum.
• If the temperature at any rack exceeds setpoint, the sequence shall reverse: increase the fan speed first, then lower the LAT.
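Here is one way the stepped sequence above might look in code. This Python sketch runs once per control interval with the rack inlet readings as its input; the numeric limits and step sizes are placeholders to be replaced at commissioning, and the small deadband below setpoint is my own addition to keep the sequence from chattering. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CracCommand:
    lat_f: float    # leaving air temperature setpoint, °F
    fan_pct: float  # fan speed, % of full speed


@dataclass(frozen=True)
class ResetLimits:
    # Placeholder values; the real ones are set during commissioning.
    rack_setpoint_f: float = 80.0
    deadband_f: float = 1.0
    lat_min_f: float = 65.0
    lat_max_f: float = 75.0
    lat_step_f: float = 0.5
    fan_min_pct: float = 60.0
    fan_max_pct: float = 100.0
    fan_step_pct: float = 2.0


def start_up(lim: ResetLimits) -> CracCommand:
    # On start-up: maximum fan speed, minimum LAT.
    return CracCommand(lim.lat_min_f, lim.fan_max_pct)


def reset_step(cmd: CracCommand, rack_inlets_f: list[float],
               lim: ResetLimits) -> CracCommand:
    hottest = max(rack_inlets_f)
    lat, fan = cmd.lat_f, cmd.fan_pct
    if hottest < lim.rack_setpoint_f - lim.deadband_f:
        # Every rack is below setpoint: raise LAT first, then slow the fan.
        if lat < lim.lat_max_f:
            lat = min(lat + lim.lat_step_f, lim.lat_max_f)
        elif fan > lim.fan_min_pct:
            fan = max(fan - lim.fan_step_pct, lim.fan_min_pct)
    elif hottest > lim.rack_setpoint_f:
        # Some rack is above setpoint: reverse, fan speed up first, then LAT down.
        if fan < lim.fan_max_pct:
            fan = min(fan + lim.fan_step_pct, lim.fan_max_pct)
        elif lat > lim.lat_min_f:
            lat = max(lat - lim.lat_step_f, lim.lat_min_f)
    # Inside the deadband: hold both outputs.
    return CracCommand(lat, fan)


limits = ResetLimits()
cmd = start_up(limits)
for _ in range(21):                      # cool racks: LAT climbs, then fan slows
    cmd = reset_step(cmd, [72.0, 76.0], limits)
print(cmd)
```

Because only one output moves per interval, and only by a small step, the interval length and step sizes together set how slowly the loop acts.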
The setpoints for rack temperature, LAT, and fan speed shall all be estimated during design but determined during commissioning. Once the system is up and running, the operators will have the opportunity to optimize the setpoints to realize the most energy savings possible. Also, there may actually be two maximum fan speeds: one for normal operation, when multiple units are operating, and one for use during an event when a CRAC is offline.
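The two fan-speed ceilings can be a simple selection that feeds the reset sequence. In this hypothetical Python helper, the normal ceiling applies whenever the expected number of units is running and the higher failover ceiling applies when one is offline; both default values are placeholders for numbers established at commissioning.

```python
def max_fan_speed_pct(units_running: int, units_expected: int,
                      normal_max_pct: float = 80.0,
                      failover_max_pct: float = 100.0) -> float:
    """Return the fan-speed ceiling for the current operating condition."""
    if units_running < units_expected:
        # A CRAC is offline: let the remaining units run harder.
        return failover_max_pct
    return normal_max_pct


print(max_fan_speed_pct(4, 4))  # 80.0
print(max_fan_speed_pct(3, 4))  # 100.0
```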
OK, then. We have reviewed the history of data centers. We have a pretty good idea why so many legacy designs are out there and how, through no fault of their own, they have become outdated sooner than anyone expected.
We see that IT loads and their associated airflows vary over time, but we have also seen that the data center is a squishy setting where overkill often leads us to operator irritation and operating inefficiency.
I have argued that the only control point that really counts is the entering condition at the equipment. And while we all want to know a number of other points so that we can troubleshoot and optimize the systems, we mustn’t forget what matters most. So we use that entering rack temperature to drive a straightforward LAT reset strategy that can be combined with an integrated VAV approach to airflow, which in turn can provide a more balanced and efficient data center operation.
So wrapping things up … Occam’s Razor (also called the principle of economy of thought) suggests that, other things being equal, simpler explanations are generally more likely than more complex ones. A line often attributed to Einstein holds that “Everything should be made as simple as possible, but no simpler.” Well, I’m no Occam, Einstein, or Taylor, but I do hope this article makes you consider a more thoughtful (but less complicated) approach to your next data center opportunity. ES
1. Taylor, S. “Optimizing Design and Control of Chilled Water Plants, Part 3: Optimizing Pipe Sizing and Delta-T.” ASHRAE Journal 53(12): 22–34. 2011.
2. ASHRAE. 2011 Thermal Guidelines for Data Processing Environments – Expanded Data Center Classes and Usage Guidance. Developed by ASHRAE Technical Committee 9.9. 2011.
3. ASHRAE. Best Practices for Datacom Facility Energy Efficiency. American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc. Atlanta. 2009.
4. ASHRAE. Thermal Guidelines for Data Processing Environments. American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc. Atlanta. 2009.
7. Moss, D. “Data Center Operating Temperature: The Sweet Spot.” A Dell Technical White Paper. Dell Incorporated. 2011.