The second weird thing is that it says cooling accounts for 40% of data center power usage, but this comes right after discussing PUE without putting any concrete numbers on PUE. State-of-the-art PUE is below 1.1. The article then links to a pretty flimsy source that actually says server loads are 40%, which implies a PUE of 2.5. That could be true for global IT loads including small commercial server rooms, but it hardly seems relevant when discussing new builds of large facilities.
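To spell out the arithmetic behind that (PUE is just total facility power divided by IT power, so the source's own figure pins it down):

    # PUE = total facility power / IT (server) power.
    # The linked source says servers are 40% of the load, i.e. IT/total = 0.4:
    implied_pue = 1 / 0.40
    print(implied_pue)        # 2.5

    # By contrast, a state-of-the-art build at PUE 1.1 spends under 10%
    # of its power on cooling and all other overhead combined:
    overhead_fraction = 1 - 1 / 1.1
    print(overhead_fraction)  # ~0.09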
Finally, it's irritating when these articles are grounded in equivalents of American homes. The fact is that a home just doesn't use a lot of energy, so it's a silly unit of measure. These figures should be based on something that actually uses energy, like cars or aircraft or something.
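Rough numbers on that point (my own ballpark figures, not from the article): an average American home draws on the order of 1 kW of electricity continuously, while a single gasoline car cruising on the highway burns fuel at tens of kW.

    # Very rough comparison using commonly cited ballpark figures (mine, not the article's):
    home_kwh_per_year = 10_500                 # roughly typical US household electricity use
    home_avg_kw = home_kwh_per_year / 8760
    print(round(home_avg_kw, 2))               # ~1.2 kW continuous

    gasoline_kwh_per_gallon = 33.7             # energy content of a gallon of gasoline
    car_gallons_per_hour = 60 / 30             # 60 mph at 30 mpg
    car_kw = car_gallons_per_hour * gasoline_kwh_per_gallon
    print(round(car_kw))                       # ~67 kW of fuel burn while cruising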
Google has been wrong a couple of times, and this is one area where I think what they said (18 years ago, by the way) has since had some time for the rubber to meet the road.
Google also famously took the position that ECC wasn't mandatory[0], but then quietly changed course[1].
In fact, even within memory itself: higher temperatures cause more errors[2], and leakage is more pronounced at higher temperatures in dense lithographic electronics (memory controllers, CPUs)[3].
Regardless: thermal expansion and contraction will degrade basically any material I can think of, so if you can utilise the machines 100% consistently and hold a steady temperature, then maybe the hardware doesn't age as aggressively as our desktop PCs that play games, assuming there's no leakage going on to crash things. (A rough sketch of that thermal-cycling intuition follows the links below.)
[0]: https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
[1]: https://news.ycombinator.com/item?id=14206811
[2]: https://dramsec.ethz.ch/papers/mathur-dramsec22.pdf
[3]: https://www.researchgate.net/publication/271300947_Analysis_...
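To put a rough shape on the thermal-cycling point: reliability work usually models cycling fatigue with a Coffin-Manson-style power law, cycles-to-failure proportional to (temperature swing)^-q. The exponent below is a common textbook assumption for solder joints, not a measured value:

    # Coffin-Manson style scaling: cycles to failure ~ C * dT**(-q).
    # q ~= 2 is a common assumption for solder-joint fatigue (my assumption, not a measurement).
    def relative_cycle_life(dt_small, dt_large, q=2.0):
        """How many times more thermal cycles a part survives if swings shrink from dt_large to dt_small."""
        return (dt_large / dt_small) ** q

    # Holding utilisation and room temperature steady so the die swings 10 C
    # instead of 40 C would, under this model, stretch fatigue life ~16x:
    print(relative_cycle_life(10, 40))   # 16.0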
There is absolutely no doubt that the failure rate of any semiconductor device increases very quickly with temperature. This is routinely tested by every manufacturer.
What happens is that at low enough temperatures the rate of temperature-induced failures may be small compared with the rate of failures from other causes, so you will see no temperature effect. However, once you raise the temperature enough, you will see an obvious temperature dependence in the failure rate.
Semiconductor devices are designed so that their failure rate at the crystal (junction) temperature specified in the datasheet, usually in the range of 90 to 110 degrees Celsius, is low enough that most devices will last at least 10 years, or whatever lifetime is targeted.
The ambient temperature at which that nominal maximum temperature is reached depends on the cooling and on the power consumption.
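A minimal sketch of that relationship, using the usual first-order thermal model (the wattage and thermal resistance below are made-up illustration values):

    # First-order model: junction temp = ambient temp + power * thermal resistance (junction to ambient).
    def junction_temp_c(ambient_c, power_w, theta_ja_c_per_w):
        return ambient_c + power_w * theta_ja_c_per_w

    # Hypothetical 200 W device with an effective theta_ja of 0.4 C/W:
    print(junction_temp_c(25, 200, 0.4))   # 105 C -- already at a typical datasheet limit
    print(junction_temp_c(25, 100, 0.4))   # 65 C  -- same cooling, half the power
    print(junction_temp_c(25, 200, 0.2))   # 65 C  -- same power, better cooling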
If the device runs above the nominal maximum temperature, it is pretty certain that you will see a strong dependence of the failure rate on temperature.
Whether you also see temperature effects at lower crystal temperatures, e.g. around 60 degrees Celsius, depends on the device, and it is unpredictable unless you do a costly experiment yourself.
In general, the expectation is that for low-quality devices you will not see temperature effects, because those fail for other reasons first, while for high-quality devices, which lack manufacturing defects, you will see a temperature dependence in the failure rate even at lower temperatures.
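The usual way to quantify that dependence is an Arrhenius acceleration factor; a minimal sketch, using an activation energy of 0.7 eV, which is a common generic assumption for silicon wear-out mechanisms rather than a number for any particular device:

    import math

    BOLTZMANN_EV_PER_K = 8.617e-5

    def arrhenius_af(t_low_c, t_high_c, ea_ev=0.7):
        """Factor by which wear-out accelerates going from t_low_c to t_high_c at the die."""
        t_low = t_low_c + 273.15
        t_high = t_high_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1 / t_low - 1 / t_high))

    # Going from 60 C to 85 C at the die speeds up temperature-driven
    # failure mechanisms by roughly 5-6x under these assumptions:
    print(round(arrhenius_af(60, 85), 1))   # ~5.5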
So Google might not have seen temperature effects because they were using the cheapest junk anyway.
> So, the methodology around temperature mitigation always starts at power reduction—which means that growth, IT efficiencies, right-sizing for your capacity...
https://www.asme.org/topics-resources/content/new-solar-ener...
What is the purpose of this article exactly?
Disappointed that the article continually confuses power and energy.
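For anyone skimming: power is a rate (watts), energy is that rate over time (watt-hours or joules), so the two aren't interchangeable. A quick sketch with a made-up 100 MW facility:

    # Power (MW) is a rate; energy (MWh) is power * time.
    facility_power_mw = 100          # hypothetical facility drawing a steady 100 MW
    hours_per_year = 8760
    energy_mwh = facility_power_mw * hours_per_year
    print(energy_mwh)                # 876,000 MWh/year, i.e. 876 GWh -- not "876 GW"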
I forgot to put it in the title and I can't edit anymore.