Infrastructure Upgrades At Live Data Centers

Data centers are expected to operate 24 hours a day, seven days a week, 365 days a year, and availability is measured in the highest of percentages. In fact, 99.999% availability (“5x9” or “five nines”) is a benchmark that most data centers target. To put that in perspective, 99.999% availability allows no more than about 5.26 minutes of outage over a year.
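The five-nines figure follows from simple arithmetic. The short Python sketch below shows the conversion; the function name and the choice of a 365-day year are illustrative assumptions, not part of any standard.

```python
# Convert an availability target into the downtime it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare three, four, and five nines.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Each additional "nine" cuts the allowable downtime by a factor of ten, which is why a single brief interruption can consume an entire year's outage budget at the five-nines level.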

When equipment reaches the end of its useful life, data centers are prone to unplanned outages, and equipment replacement becomes critical. There are also instances where upgrades are necessary to become compliant with federal and other standards. Infrastructure components can be replaced to reduce preventive maintenance costs and to improve resiliency. Capacity and density can be increased, and loads can be consolidated across the same data center or across different data centers. In fact, there is a perfect opportunity to find and remediate points of failure, remove the Band-Aids, and figure out permanent solutions in existing data centers. There is also the potential to capture stranded capacity and space and maximize the ability to support IT loads.

Infrastructure upgrades in a live data center can be challenging since availability is paramount and an outage can be extremely detrimental, frequently resulting in exorbitant costs. For example, an outage at a U.S. carrier’s data center last year led to cancelation of roughly 2,000 flights over three days. The company announced the cost of the outage reached $150 million.

Disruptions associated with construction activity can affect business continuity. There are also safety concerns associated with working on or near live electrical equipment. Extended outages are not acceptable, as that would injure a company’s reputation and cause other non-monetary ill effects. Yet there are best practices and recommendations to minimize the risks and systematically implement upgrades in live data centers.

Though companies might neglect data center upgrades for various reasons, cost and risk aversion are the primary ones. Yet the money saved today can be lost many times over tomorrow, as past outages have shown. Also, owners must weigh the risk of experiencing a catastrophic failure at an unexpected time if they choose to delay upgrading the infrastructure.

The cost of data center outages has been increasing over the past few years. According to a recent study sponsored by Emerson Network Power and published by Ponemon Institute, which surveyed more than five dozen data centers that had suffered outages in the past year, unplanned data center outages cost companies nearly $9,000 per minute. The average cost of an outage is nearly $750,000, up from just over $500,000 in 2010.
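To get a feel for what a per-minute figure of this size means, the sketch below scales it to a few outage durations. It assumes cost grows linearly with duration, which is a simplification, and it rounds the cited figure to $9,000. The durations and the function name are hypothetical and chosen only for illustration.

```python
# Rounded from the "nearly $9,000 per minute" figure cited above.
COST_PER_MINUTE_USD = 9_000


def estimated_outage_cost(duration_minutes: float,
                          cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Linear estimate: assumes cost is proportional to outage duration."""
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    return duration_minutes * cost_per_minute


# Hypothetical outage durations, for illustration only.
for minutes in (5, 30, 90):
    print(f"{minutes:>3} min outage -> about ${estimated_outage_cost(minutes):,.0f}")
```

Actual costs depend on the business affected and rarely scale this neatly, so an estimate like this is best treated as an order-of-magnitude input to the risk discussion rather than a prediction.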

With this in mind, what are the strategies and recommendations when conducting infrastructure upgrades at live data centers? What challenges might be faced by stakeholders when the project is underway and how should they handle them to ensure successful completion? What are the best practices to pursue when working to reduce energy consumption, improve resiliency, increase capacity, increase power density, or replace equipment and beyond while still minimizing risk of outages? How should one tackle a live data center upgrade to ensure a high probability of success?

While each project is different and has its unique set of requirements and challenges, the following are a few important items to consider during the programming and planning phase.

1) Review the owner’s project requirements (OPR). The OPR should clearly define the overall objectives, goals, and expectations of the project. In addition, consider detailed interviews with the stakeholders to understand their current and future needs, planned initiatives, and interdependencies. Such projects typically have many stakeholders. The author was part of a team that upgraded the flagship data center of a Fortune 500 company, and Figure 1 lists the client groups that provided input. The stakeholders will have varied interests, objectives, and appetites for risk, so coordination and collaboration are essential for project success.

2) Validate the power and cooling requirements, both past and current. The BMS and EPMS trend logs, electricity bills, equipment readouts, and more can provide useful information. The past and current usage can be coupled with planned initiatives and deployment models to forecast the future requirements. It is especially important to be cognizant of future requirements when upgrading the infrastructure to ensure a future-proof facility.

3) Based on the forecasting results, establish milestones where additional capacity needs to be deployed or equipment needs to be replaced to meet the project objectives. These milestones will dictate the overall design and construction schedule. A phased deployment is recommended if there is a low-to-medium level of confidence in the forecast. For example, the projected UPS power and deployment requirements from a data center upgrade project the author was involved in are depicted in Figure 2.

4) Perform gap analysis and identify any major constraints that could pose a challenge to the infrastructure upgrade project. For example: limited utility capacity, insufficient space within or outside the facility, inadequate structural capacity, and other problems can pose serious challenges to the proposed upgrades. These issues might be unknown to the stakeholders, and expectations and requirements may need to be adjusted accordingly.

5) Review the facility’s original master plan that addressed expansion or densification as part of a future phase, if it exists, and confirm it is still valid. Master plans are occasionally abandoned or modified without documenting the changes or fully vetting the repercussions, especially when the business model changes as a result of mergers, acquisitions, or similar events. A thorough analysis prior to implementing the guidelines in the master plan is essential.

6) Review available documentation to get an in-depth understanding of the data center systems. Survey the data center in detail and conduct interviews with the data center operators. If there have been outages in the past or there are recurring issues, it is critical that the root cause be identified. All corrective actions should focus on the root cause and not the symptoms.

7) For projects requiring concurrent maintainability or fault tolerance, identify any points of failure that can impact the data center and review remedial options with the stakeholders. It is recommended that any existing temporary fixes and workarounds be replaced with permanent solutions. It is also an opportunity to incorporate failover scenarios or improve the existing ones. Figure 3 depicts the deficiencies, risk level, and remedial options from a project the author was involved in.

8) Verify the effectiveness of the data center cooling system if the project scope involves related upgrades. The available cooling capacity cannot be fully utilized if there are deficiencies. Consider a CFD analysis to validate the existing conditions and provide recommendations for increasing the effectiveness. Identify improvements to the existing infrastructure that can be achieved by optimizing operations and set points. There are several low-cost, high-impact strategies that can be implemented at data centers to improve the capability to support IT load, reduce energy consumption, and improve resiliency. Measures include (but are not limited to) confirming location of supply and return grilles and relocating as needed, installing blank-off panels at IT racks and cabinets, deploying equipment in a consistent hot-aisle/cold-aisle configuration, eliminating underfloor obstructions where possible, using brush grommets at cable openings, and more.

9) Occasionally, circumstances such as pending M&A, litigation, security requirements, government regulations, or other issues affect all or part of the upgrade project. Stakeholder expectations and requirements may need to be adjusted accordingly.

10) For colocation data centers, review the impact of upgrades and construction activity on agreements (e.g., SLAs, leases) between the customer and the provider. There are implications if the agreements are violated, and the potential impact should be reviewed in detail and weighed against the anticipated benefits of the upgrade project.

11) Review the financial impact of the upgrade project. For colocation data centers, it may be possible to pass all or a portion of the upgrade cost to the existing customers, subject to lease requirements. Utility companies frequently offer incentives if upgrades will lead to a reduction in energy usage (kWh) or peak power (kW), and the incentives can partially offset the cost of the project. A detailed financial analysis is extremely helpful to secure buy-in and sponsorship from the C-suite stakeholders.

12) Review options and alternates that best meet the project needs. For projects that involve an increase in capacity or density, expanding the existing systems and deploying technologies already on site might not be appropriate in certain situations. Criteria such as space requirements, structural implications, financial impact (CapEx, OpEx, TCO), potential risks, reliability, time-to-market, commissioning challenges, code impact, and much more need to be considered. Frequently, the project requirements need to be prioritized, as a single solution that meets every metric is unlikely. Discuss the options and alternates with the stakeholders to guide them towards the optimal solution.

13) Consider involving a preconstruction team (general contractor and sub-contractors) to help with pricing, logistical support, construction feasibility, equipment procurement, construction scheduling, development of MOPs/SOPs, creating safety protocols, and more during the planning phase. Data center upgrade projects are particularly well suited for integrated project delivery (IPD).

14) Tap into the O&M knowledge base. The upgrade project also offers an excellent opportunity for the data center facility and operations team to provide input and gain hands-on experience as MOPs/SOPs are developed and implemented in preparation for construction. Their participation during the entire planning phase is highly recommended.

15) It is important for the commissioning agent to be involved during the planning phase. Commissioning systems in a live data center has its own set of challenges, and the commissioning agent needs to develop strategies and provide input to the design team.

16) Consider a backup plan up front. Construction activity can be intrusive, and there are risks associated with working in a live data center. The risks can be minimized with detailed planning but can never be eliminated. Murphy’s Law can render even the best plan and strategy ineffective. Consider creating contingency plans and fallback options in collaboration with the stakeholders to help mitigate the risks. All stakeholders need to be cognizant of risks involved, and client acceptance is critical.
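The forecasting and milestone steps described in items 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the method used on the author's projects. It assumes constant compound annual load growth and a planning threshold expressed as a fraction of installed capacity. The function name, parameters, and numbers are all hypothetical.

```python
from typing import Optional


def first_year_over_threshold(
    current_load_kw: float,
    annual_growth_rate: float,
    installed_capacity_kw: float,
    max_utilization: float = 0.8,   # assumed planning threshold (80% of capacity)
    horizon_years: int = 10,        # assumed planning horizon
) -> Optional[int]:
    """Return the first year (1-based) in which the projected load exceeds
    the planning threshold, or None if it stays below it within the horizon."""
    limit_kw = installed_capacity_kw * max_utilization
    load_kw = current_load_kw
    for year in range(1, horizon_years + 1):
        load_kw *= 1.0 + annual_growth_rate  # compound growth, a simplification
        if load_kw > limit_kw:
            return year
    return None


# Hypothetical inputs: 500 kW today, 10% yearly growth, 1,000 kW of UPS capacity.
milestone = first_year_over_threshold(500.0, 0.10, 1000.0)
print(f"Additional capacity needed by year: {milestone}")
```

In practice the growth rate would come from BMS and EPMS trend logs and planned deployments rather than a single constant. Running the projection with low, expected, and high growth rates gives a range of milestone dates, which can help decide whether a phased deployment is warranted.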

By following the thought process and the underlying philosophy outlined above, upgrades to a live data center can be implemented successfully with little or no impact on its operations. Had companies harmed by data center outages known this in advance, no doubt they would be in better shape today.