Risk. It’s an interesting concept in the mission critical sector. We speak of it like it’s this thing you can hold. In the broadest of terms we seek to measure it (quantify, assess, etc.) or manage it (minimize, mitigate, etc.). While on face value these two activities may seem the same, I would suggest that one’s possible and the other is fairly impossible.
Going all Old Testament on you, consider Moses and Noah. Moses managed risk. On the run from an army and cornered on a beach? Raise your staff and part the sea. Your entourage is screaming for water? Take that trusty staff and strike a rock. Booyah! Risk managed.
Noah, on the other hand, was more of a risk assessor. It’s gonna rain. Risk of slipping, better wear boots. It’s gonna rain a lot. Risk of flash flooding, better seek high ground. It ain’t gonna stop raining until the earth is covered with water. Oh…better build a boat.
The takeaway from my canonical comparison is that Moses kinda had a miracle thing happening. That’s something I would suggest you need in your hip pocket if you really think you are going to manage danger (a synonym for risk). Meanwhile, Noah had more of an “if-then” take on the situation, in that he measured the risk and then set off to design something in response.
Now, I would like to encourage you to embrace the Noah within. And the great news is, you won’t have to listen for the voice of God.
MISSION CRITICAL REVELATIONS
When ASHRAE TC 9.9 issued its 2011 Thermal Guidelines for Data Processing Environments - Expanded Data Center Classes and Usage Guidance¹ (henceforth referred to simply as The Guidelines), the big news of course was the expanded temperature and humidity ranges. No doubt the ambient expansion represented a significant shift, especially since it flung the window wide open (both figuratively and literally) on free cooling with outside air.
But hidden in plain sight was the gamechanger with the unfortunate moniker: The X-Factor. There on page 16, in the seemingly benign section titled Server Reliability Trend vs Ambient Temperature, the makers of IT equipment had collectively shown their cards, giving us a tool once and for all to quantify risk in thermodynamic and geographic terms without consulting our Magic 8-Ball®.
Perhaps it’s because the front page news (expanded ranges) was such a big deal and so deceptively easy to understand that the “Usage Guidance” part of The Guidelines has been virtually ignored in the trade journals. Who doesn’t understand that 80.6°F is higher than 77°F? Plus, there was a fair share of hyperbole out there that would make a National Enquirer editor blanch: “Alien Baby Found in Data Center Operating at 104°F!”
But seriously, the banner headlines don’t mean a great deal if you don’t understand the risks and opportunities that are baked into the metrics delineated in that now-famous Table so often referenced by the enlightened and those not so (see Table 1).
TWICE THE PAPER, TWICE THE FUN
The Guidelines is actually a two-part whitepaper. Following the well-known table presentation, there are five metrics discussed that can help the data center designer optimize his design. All are relevant, but in varying degrees. The additional usage guidance includes the following, but in this treatise we will discuss only the first three:
- Acoustical noise levels in the data center vs. ambient temperature
- Server power trend vs. ambient temperature
- Server reliability trend vs. ambient temperature
- Server reliability vs. moisture, contamination, and other temperature effects
- And finally, server cost trend vs. ambient temperature
Let’s start with the least common concern: noise. While it is something the designer should factor in, quite frankly it doesn’t come up a great deal, especially in concept design. However, as our society becomes more and more litigious and as our data centers become less like surgical suites and more like warehouses, expect all those IT guys who came of age inside the quiet white box to start throwing elbows.
Typically, the warmer the data center, the more air the servers move. That’s right, the increase in temperature ramps up the fans on the IT equipment … not the CRACs. And according to ASHRAE TC 9.9, higher fan speeds lead to more fan racket, and that correlation is broadly defined in Table 2. Refer to The Guidelines for the caveats, but you get the idea. Higher temps mean higher server airflows, which lead to more noise in the data center.
The second factor we will touch on briefly is power draw versus temperature. This is a big one, especially when we discuss PUE. In fact, entire articles have been dedicated to this very topic². But at the severe risk of vastly oversimplifying, I will say that there is a nominal tipping point at around 27°C (80.6°F) (Figure 1). Once again, this is related to the server fans speeding up.
Although the rule of thumb is helpful, I must add the qualifier that as servers become more efficient and the data center ambient norm continues to rise, you can expect this tipping point to move higher. But for the purposes of this article, which is to simply raise your awareness of the relationship, committing the 27°C (80.6°F) watermark to memory should suffice.
In a data center, when did it ever stop being about reliability? The honest answer is never.
Even though the PR guys at the big Internet firms are pushing out press releases on PUEs approaching 1, solar arrays the size of football fields, and love letters to Greenpeace promising to hug more trees, the real story has always been at the server level. And in particular, what matters most is the mission that that server supports. Hey, they don’t call it carbon critical, they call it mission critical.
So while awareness of potential noise problems is important and power optimization even more so, server reliability is far and away the primary driver of the top and bottom line … and all the lines in between.
Interestingly enough, what some folks don’t realize is that server reliability is not necessarily vendor-driven. In fact, the metrics found in The Guidelines make no distinction between manufacturers, processes, or applications. With all the major players playing along, the data was submitted and processed blindly by TC 9.9. In turn, what you get are vendor neutral metrics based on vendor specific data. That’s pretty good stuff.
So even though Vendor A might tell you that their machine is better than Vendor B’s, that really becomes a matter of functionality or features more than reliability per se — at least as far as the facility designer is concerned here at the macro level where we happily reside.
Remember when I said that the ambient ranges presented in The Guidelines were deceptively easy to understand? The point being that terms like “Recommended” and “Allowed” seem intuitive and engineers love tables with hard numbers that can be turned into hard lines on psychrometric charts. But like finding true love, it’s never that simple.
A great insight into the intellectual laziness that can be fed by a seemingly simple table is provided in The Guidelines:
“…There have been some misconceptions regarding the use of the recommended envelope. When it was first created, it was intended that within this envelope the most reliable, acceptable, and reasonable power-efficient operation could be achieved…. It was never intended that the recommended envelope would be the absolute limits (emphasis added)…”
They went on to state in part:
“…it is acceptable to operate outside the recommended envelope for short periods of time without affecting the overall reliability and operation of the IT equipment. However, some still felt the recommended envelope was mandatory, even though that was never the intent (emphasis added).”
But the good news is that this flub on the part of the design community forced the vendors and TC 9.9 to show their cards. In turn, The Guidelines included the power and reliability versus ambient temperature data that the expanded ranges had been based on in the first place.
In retrospect, you can understand why the vendors may not have wanted to share this data initially, since reliability becomes simply a function of vendor neutral environmental parameters. Kudos therefore to the vendors for partially leveling the playing field and potentially sacrificing a sales advantage in order to advance the science of mission critical design.
However, shame on us as designers if we do not take advantage of this hard-earned information in order to provide more enlightened designs for our clients and their customers.
In Appendix C of The Guidelines there is a table that has become near and dear to my heart. After years of relying on spotty anecdotal evidence to convey the potential risks of higher temperatures in the data center, I now have a straightforward table that I can point to with confidence (see Table 3).
It should come as no surprise that the higher the temperatures are, the higher the failure rate becomes. Assuming 24/7 operation and using 20°C (68°F) as the baseline, we can see that if we were to run our data center between 20 and 25°C (68 and 77°F), the X-factor would average nominally 1.13, which represents a potential increase in server failures of 13%.
Wow, that’s kind of interesting, isn’t it? I challenge you to find a data center article written since 2011 that touts running at higher temperatures (including those written by yours truly) that clearly states this rather damaging figure. Bump up to the magic 27°C (80.6°F) value and the average X-factor jumps to about 1.3!
But what does that mean? The Guidelines explains:
“The relative failure rate shows the expected increase in the number of failed servers, not the percentage of total servers failing (emphasis added); e.g. if a data center that experiences 4 failures per 1,000 servers incorporates warmer temperatures and the relative failure rate is 1.2, then the expected failure rate would be 5 failures per 1,000 servers.”
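For the arithmetically inclined, that passage boils down to one multiplication. The short Python sketch below is only an illustration of that reading; the function names are mine, not anything defined in The Guidelines, and the inputs are the numbers quoted in this article.

```python
def expected_failures(baseline_failures, x_factor):
    """Scale a baseline failure count by the relative failure rate (X-factor)."""
    return baseline_failures * x_factor


def percent_increase(x_factor):
    """Express an X-factor as a percentage increase over the 20°C baseline."""
    return (x_factor - 1.0) * 100.0


# The Guidelines' own example: 4 failures per 1,000 servers at an X-factor of 1.2.
failures_per_1000 = expected_failures(4, 1.2)
print(f"{failures_per_1000:.1f} failures per 1,000 servers")

# The 20 to 25°C band cited earlier: an average X-factor of 1.13.
print(f"{percent_increase(1.13):.0f}% more failed servers than at baseline")
```

The first figure works out to 4.8, which The Guidelines rounds to 5 failures per 1,000 servers. Note that the total share of servers failing is still small; it is the count of failures that scales.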
Relative or not, an increase in probable failure is still an increase. So perhaps it’s surprising that The Guidelines does not make a case for operating at higher temperatures alone. Rather, it makes the case for economization, which would allow you to operate at relatively cool temperatures of about 21°C (70°F) without running a chiller, in combination with short excursions outside that temperature range.
When you combine this data with Table 1, you come to understand why the “Recommended Range,” while wide, still has a somewhat conservative median of 22.5°C (72.5°F).
X MARKS (AND DRIVES) THE SPOT
So if The Guidelines is actually making the case for economizers combined with higher temperatures at times, then are economizers the way to go everywhere? Of course not. Once again it doesn’t require much reeducation to realize that some places on earth are more accommodating than others when it comes to free cooling.
The Guidelines presents the weather data for the Windy City and then presents the weighted average X-factor. Nifty that even in the Midwest where the summers can be pretty miserable at times, you can see a benign X-factor of 0.99.
Now, using the weather data for different cities across the U.S. and the X-factors found in Table 3 and accounting for heat pickup from the systems, you can graphically see where geography and risk cross paths (Figure 2). Seattle good. Miami bad. Probably saw that coming though, didn’t you?
But economizers are not all created equal, so if you consider dry coolers where efficiency is lost and water temperatures are higher, the numbers skew to the right (Figure 3). Now even dry and cool Helena, MT, sees a bump in risk.
The beauty of Table 3 is that you can create your own versions of Figures 2 and 3 when you have the weather data for a particular city. Through the magic of a spreadsheet, you can plug in the weather data from programs like HDPsyChart³ and more easily quantify risk and opportunity for a particular city and application.
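For those who would rather script it than spreadsheet it, here is a minimal Python sketch of that hour-weighted calculation. Treat it as an illustration under stated assumptions, not the method used in The Guidelines. The band limits, the `heat_pickup_c` offset, and the sample hour counts are all hypothetical. Only the 1.00 baseline, the 1.13 average, and the roughly 1.3 value come from this article; the rest are placeholders to be swapped for Table 3 and real weather data. It also assumes server inlet temperature simply tracks outdoor temperature plus the offset for every hour of the year, and it treats hours below 20°C as no better than baseline, which is conservative given that the Chicago figure of 0.99 implies cooler hours earn some credit.

```python
# Illustrative X-factor bands as (upper inlet temperature limit in °C, X-factor).
# Replace these with the values from Table 3 before drawing any conclusions.
X_FACTOR_BANDS = [
    (20.0, 1.00),          # at or below the 20°C baseline (placeholder: no credit for cooler air)
    (25.0, 1.13),          # 20 to 25°C average cited in the article
    (27.0, 1.30),          # approximate value cited at 27°C
    (float("inf"), 1.50),  # placeholder for anything hotter
]


def x_factor_for(inlet_c, bands=X_FACTOR_BANDS):
    """Return the X-factor of the first band whose upper limit covers inlet_c."""
    for upper_c, x_factor in bands:
        if inlet_c <= upper_c:
            return x_factor
    raise ValueError("bands must end with an open-ended upper limit")


def weighted_x_factor(weather_bins, heat_pickup_c=0.0, bands=X_FACTOR_BANDS):
    """Hour-weighted average X-factor for a year of binned weather data.

    weather_bins is a list of (outdoor temperature in °C, hours per year).
    heat_pickup_c is a hypothetical offset between outdoor air and server inlet.
    """
    total_hours = sum(hours for _, hours in weather_bins)
    if total_hours <= 0:
        raise ValueError("weather_bins must contain at least one hour")
    weighted_sum = sum(
        hours * x_factor_for(outdoor_c + heat_pickup_c, bands)
        for outdoor_c, hours in weather_bins
    )
    return weighted_sum / total_hours


# Made-up bin data for an imaginary city (8,760 hours in total).
SAMPLE_BINS = [(10.0, 5000), (22.0, 3000), (26.0, 600), (30.0, 160)]

print(round(weighted_x_factor(SAMPLE_BINS), 3))
print(round(weighted_x_factor(SAMPLE_BINS, heat_pickup_c=2.0), 3))
```

Raising `heat_pickup_c` pushes more hours into the warmer bands, which is the same rightward skew described for dry coolers: a less efficient economizer means warmer inlet air for the same weather.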
Information is a beautiful thing. Information with guidance is even better. The goal here has been to provide some perspective and direction when you approach the often referenced and quite often oversimplified values in The Guidelines and Table 1 in particular.
At the risk of providing one too many Biblical references, I will remind you that the devil is, in fact, in the details. The second half of The Guidelines provides some information that can give the designer so much more perspective when making decisions and recommendations. However, it requires more education on your part, and ideally, the development of site-specific tools.
As mentioned earlier, it’s called mission critical for a reason. And if you can convey to your client the real implications at the server level of changes made at the mechanical system level, you have bridged a divide between IT and facilities that has traditionally been an uncrossable chasm.
Hey, maybe we can be a little more like Moses than I thought. Have fun.
1) ASHRAE, 2011. 2011 Thermal Guidelines for Data Processing Environments – Expanded Data Center Classes and Usage Guidance. Developed by ASHRAE Technical Committee 9.9.
2) Moss, D., 2011. “A Dell Technical White Paper - Data Center Operating Temperature: The Sweet Spot.” Dell Incorporated.