Discovery vs. Disaster: Know The Limits Of Your Data Center
Try to whittle down the unknown unknowns.
Surprises can be a good thing.
To this day, the greatest surprise of my life occurred when I was a young student at Berkeley Junior High. At the year-end assembly, unexpectedly, I heard my name called as they announced the recipient of the Seventh Grade Citizenship Award. I can still remember my heart pounding in my chest as I made my way to the podium to accept my unframed certificate.
In the subsequent 35 years, I have recognized that such awards were merely churned out on a copier and that some teacher probably figured the quiet kid with a bad haircut might as well get the award since he hadn’t been caught with cigarettes like most of the student body. Nevertheless, I still consider the day a curious highlight in my life.
On the other hand, some examples of bad surprises include NOT winning the Eighth Grade Citizenship Award, being dumped by Kay Hamlin (for no good reason), and totaling my car in college.
Professionally, surprises have been rare (thankfully and knock on wood), although some time ago on a data center project I was surprised in a negative fashion. However, the bad mojo wasn’t due to what was discovered, but rather how we seemingly stumbled upon it.
Entropy in the Data Center
Once up and running, data centers take on a life of their own. When that last commissioning report has been issued and the racks go live with relevant data, a fear-induced cycle of entropy begins.
Entropy, as you may recall, is the gradual decline into disorder. And while disorder is anathema to the data center operator, the very act of avoiding it can contribute to it. In particular, when you do everything in your power to avoid even a hint of downtime, problems just below the surface can fester.
Data centers are notoriously overbuilt. When the IT folks are asked how many kW per rack they anticipate, they think of a number and then multiply it by two. When the PM for the project gets that number, he or she rounds it up. When the engineers get that number, they nix diversity and maybe even add a safety factor.
So on day one, there is likely more than twice the capacity needed. And it could be years before the limits are ever approached.
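That padding compounds multiplicatively, which is why "more than twice the capacity" is the floor, not the ceiling. A minimal sketch with hypothetical multipliers (none of these numbers come from an actual project):

```python
# Hypothetical illustration of how capacity padding compounds.
# Every multiplier below is an assumption for illustration only.

it_estimate_kw = 5.0       # what the racks actually need, per rack
it_padding = 2.0           # IT thinks of a number and doubles it
pm_roundup = 1.1           # the PM rounds the request up ~10%
engineering_safety = 1.25  # engineers nix diversity / add a safety factor

installed = it_estimate_kw * it_padding * pm_roundup * engineering_safety
print(f"Installed: {installed:.2f} kW per rack for a "
      f"{it_estimate_kw:.1f} kW load "
      f"({installed / it_estimate_kw:.2f}x overbuild)")
```

Three individually defensible fudge factors quietly become a 2.75x overbuild.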
So what do we tend to do with that excess capacity? Well, if you got it, flaunt it, right? And in the case of a data center, that is often demonstrated by overcooling the space and dumbing down the operation of the central plant. Moreover, if the CIO’s bonus is linked to maximizing uptime with no linkage to minimizing operating costs … then Katy bar the door, we have a new definition of thermal runaway.
Too Cold Is Too Cautious
Anytime you step into a data center where you expect to come across meat hanging in the aisles, your radar should start pinging. A cold data center is the domain of the chicken-hearted. I have walked into cold rooms and been told it was necessary because of equipment specs. One look at the ASHRAE Environmental Standards and you know that’s a red herring.
Some will say that they overcool the room for thermal mass to avoid overheating if the plant is lost. As if a cold wall were capable of cooling a chip — inside a server — inside a rack — 20 ft away.
Mostly though, I have simply been told that this is how things have always been, followed by the kicker, “… we have never had a problem.”
But masking a problem is not the same as solving one. Putting a picture over a hole in the wall doesn’t alter the fact that there is a hole in the wall. So when the day comes that the load starts to approach capacity, or the capacity drops because of an unplanned event, that hole is going to be exposed, and someone is going to ask, “Why is that there and why didn’t I know about it?”
This brings us back around to surprises.
Wow, the sky really is falling…
So back to that data center I was talking about. The space was cool, but not overly so. Because the delta-T was so small on the chilled water system, I was sure the plant was working harder than it had to. The existing space was set up in a hot-aisle/cold-aisle configuration, but there was no means of separation and there were gaps in the rows, so a great deal of mixing was likely occurring.
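A small delta-T is expensive because flow, not temperature, has to make up the difference. Here is a rough sketch using the common US rule of thumb for chilled water, gpm = BTU/hr ÷ (500 × ΔT°F); the load and delta-T figures are assumptions for illustration, not readings from this project:

```python
# Why a diluted chilled-water delta-T means the plant works harder.
# Rule of thumb for water: gpm = BTU/hr / (500 * delta_T_F), where
# 500 ≈ 8.33 lb/gal * 60 min/hr * 1 BTU/(lb·°F).

load_btuh = 100 * 3412  # an assumed 100 kW load, converted to BTU/hr

def required_flow_gpm(load_btuh: float, delta_t_f: float) -> float:
    """Chilled-water flow needed to carry a given load at a given delta-T."""
    return load_btuh / (500.0 * delta_t_f)

design_flow = required_flow_gpm(load_btuh, 12.0)  # a healthy design delta-T
actual_flow = required_flow_gpm(load_btuh, 4.0)   # a mixing-diluted delta-T

print(f"Flow at 12°F delta-T: {design_flow:.0f} gpm")
print(f"Flow at  4°F delta-T: {actual_flow:.0f} gpm (3x the flow)")
```

Same load, one-third the delta-T, three times the water the pumps have to move.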
Early in the construction schedule, it was necessary to shut down a single panel for a short period, in order to demonstrate that it could be isolated in the future when actual work on the panel would be required. The expected impact of this task was a loss of half of the cooling units in the space, which was a combination of computer room air conditioners (CRACs) on the perimeter and in the aisles.
Based on readings from the power distribution units (PDUs), we knew the critical operating load in the space was roughly equal to half of the installed cooling capacity. The math indicated that a loss of half the cooling should be a non-event. The ops folks were confident. The contractor felt justified in proceeding. And the engineer (me) was so oblivious to the potential threat that he wasn’t even on site.
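The pre-test arithmetic really was that simple. A sketch with illustrative numbers (the article only says the load was roughly half of installed capacity):

```python
# The back-of-the-envelope check before the shutdown test.
# Capacity and load values are illustrative assumptions.

installed_cooling_kw = 400.0  # total CRAC capacity in the space
critical_load_kw = 200.0      # PDU readings: roughly half of capacity

remaining_kw = installed_cooling_kw / 2  # lose half the units with the panel

# On paper, the remaining capacity still covers the load — "a non-event."
print(remaining_kw >= critical_load_kw)
```

The aggregate math was right. What it could not see was how that remaining capacity was distributed across the aisles.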
Within five minutes of pulling the plug, the problems brewing just below the surface for years suddenly bubbled to the surface.
Hot spots were noticed first, where concentrations of servers and gaps in the aisles allowed heat to build and mix. Instead of every other CRAC going off, entire rows shut down and cabinets were left with no cooling. Then, just to add the potential for injury to the ongoing insult, lenses on light fixtures began to fall from the ceiling!
That’s right … it was just warm enough at the ceiling that the lenses expanded the tiny fraction of an inch necessary to free them from the surly bonds of their meager retaining clips.
Perception and Reality
Well, the client had seen enough and demanded the panel be reenergized. The power was immediately restored and the space recovered relatively quickly.
The fact of the matter is that the test was a success. The process worked. The problems were identified, the back-out procedure was followed, and no damage or shutdown occurred.
But it felt like a failure, and the client was justifiably spooked when the design/build team looked so flatfooted. And this letdown was a direct result of an unfounded expectation of success. This wasn’t supposed to be a big deal.
As engineers, we can’t forget that we are in the business of communication, whether it’s through drawings, specs, or dialogue. And in order to manage risk, one needs to convey and manage expectations.
Uncovering the unexpected shouldn’t be a shocking jack-in-the-box moment.
In order to avoid the negative surprise, you must treat every event in a data center, whether routine or a one-timer, as a potential catastrophe. And you have to prepare yourself and your client both mentally and procedurally.
What You Don’t Know Can Hurt You
The problems we discovered that Saturday morning were not new. They had been there all along just waiting to be found. The only question was whether they would be carefully discovered or harshly exposed.
If instead of a sequential shutdown, the panel in question had fried for some reason, all of the problems we witnessed would still have arrived on cue. The difference is that thermal runaway would have occurred and the processors would have had to be shut down. Most data center operators would call that a career-limiting event.
Recall what was said earlier about data center entropy. Some of the problems were actually exacerbated because they were built on the backs of other problems that had never been rooted out.
For example, one of the primary contributors to the hot spots was the wiring of the in-row CRACs. Because many units were sequentially wired instead of in an alternating fashion, an entire row lost cooling. Throw in a cabinet with a concentration of servers, and now you have a hot spot in an already hot aisle.
HOT AISLE + HOT CABINET = REALLY HOT SPOT REALLY FAST
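The wiring flaw is easy to see in a sketch. Unit and panel names below are hypothetical:

```python
# Sequential vs. alternating panel assignment for in-row CRACs.
# Unit and panel names are hypothetical.

row = [f"CRAC-{i}" for i in range(1, 5)]  # four in-row units in one aisle

# Sequential wiring: the whole row lands on the same panel.
sequential = {unit: "A" for unit in row}

# Alternating wiring: every other unit goes to the other panel.
alternating = {unit: ("A" if i % 2 == 0 else "B")
               for i, unit in enumerate(row)}

def still_cooling(wiring, lost_panel):
    """Units still running after one panel is de-energized."""
    return [u for u, panel in wiring.items() if panel != lost_panel]

print(still_cooling(sequential, "A"))   # [] — the entire aisle goes dark
print(still_cooling(alternating, "A"))  # ['CRAC-2', 'CRAC-4'] — redundancy holds
```

Total capacity is identical in both cases; only the alternating assignment leaves every aisle with cooling after a panel loss.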
Another design flaw discovered was the location of the temperature sensors for the CRACs. They were located at the return air inlets instead of the ASHRAE-recommended location within the cold aisles.
So when the temperatures at the server inlets began to spike, the CRACs did not respond. They were seeing a diluted room-average temperature that was rising much more slowly, and before they could react, the power had already been restored.
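A toy example makes the dilution obvious (all temperatures here are illustrative, not readings from the event):

```python
# Why a return-air sensor responds late: it averages the whole room
# while individual server inlets spike. Temperatures are illustrative.

inlet_temps_f = [95.0, 92.0, 90.0]  # hot spots at a few server inlets
rest_of_room_f = [70.0] * 17        # most of the room still overcooled

# The return stream mixes everything into one number.
return_air_f = sum(inlet_temps_f + rest_of_room_f) / 20
setpoint_f = 75.0

print(f"Hottest server inlet:   {max(inlet_temps_f):.1f}°F")
print(f"Return-air sensor sees: {return_air_f:.1f}°F")
print(f"CRAC calls for cooling: {return_air_f > setpoint_f}")
```

Three cabinets are cooking, yet the averaged return temperature still sits below setpoint, so the unit idles. That is exactly why the recommended measurement point is at the server inlets in the cold aisle.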
One other cruelly ironic discovery was that some parts of the data center were still overcooled. In one corner, you have a server in alarm, and in another, the units were humming along as if nothing had happened. But why?
The culprit was the location of floor diffusers. They had been laid out over time with little thought given to the loads nearest them. As a result, there were too many tiles in some areas, leading to air starvation in others — a very common (but correctable) problem found in almost every raised-floor environment.
At the end of the event, there were three types of problems on the table: those associated with the installation and configuration of the original design, those related to how the facility was operated, and those created by a lack of communication between the design/build team and the client.
Interestingly enough, the toughest problem to remedy was the owner’s loss of trust in the team. I had experienced it enough with my boys to know that once trust is lost, it takes a real effort to regain (although a data center on the fritz is more consequential than beer in the basement fridge). Nonetheless, we all managed to get back on the same page.
Not surprisingly, the solutions to the configuration problems were obvious. The math had been correct all along; the capacity was in the room. The problems simply had to do with the fact that the cooling being provided was decoupled from the load it was meant to serve.
Floor tiles were intelligently relocated or simply removed. Blanking panels were added to empty cabinets. Rudimentary separation was employed to keep the hot side hot and the cool side cool. And the in-row CRACs were rewired to ensure logical redundancy in each aisle.
Regarding the second problem type … operational … nothing beats experience as a teacher. No matter how many articles I write or seminars I speak at, I cannot make the impression an alarming cabinet can. If the location of a temperature sensor can’t get your attention on paper, it will when your aisle is cooking and your CRAC is idling.
Thank goodness it was a controlled peek behind the uptime curtain that exposed the potential for chaos, instead of a rogue lightning strike. Because of that, we were able to address the weak links in the redundancy chain through a controlled process.
Early in my career, I felt that I had to convey infallibility. This in itself was a failure to communicate, because as my mother always reminded me, “… be sure, your sins will find you out.” You can’t bluff your way through engineering. People eventually see right through it.
So you’re encouraged to be as open about what you don’t know as what you do. And that includes telling your client that you don’t know exactly what’s going to happen when you flip the switch, but you know something might, and you are prepared to respond.
If you run a data center, or you advise someone who does, try to know what you don’t know yet. It is in our nature not to fix what isn’t broken, but you should peek under the hood to be sure. Just because you haven’t experienced a problem, it doesn’t mean one isn’t out there.
As Andrew Grove, the co-founder of Intel, said:
Success breeds complacency. Complacency breeds failure. Only the paranoid survive.
If you combine that with one of my favorite colloquialisms, you should be sufficiently on guard:
Just because you’re paranoid, it doesn’t mean they’re NOT out to get you.
In the beginning, we were hired to create a new data center space. But when we moved on, the existing data center spaces were more efficient, better understood, and operationally more reliable. None of which were included in the original scope of work.
And all because of a surprise that shouldn’t have been …except for the falling light fixtures. I never could have seen that coming.