The second weird thing is that it says cooling accounts for 40% of data center power usage, but this comes right after discussing PUE without putting any concrete numbers on PUE. State-of-the-art PUE is below 1.1. The article then links to a pretty flimsy source that actually says server loads are 40%, which implies a PUE of 2.5. That could be true for global IT loads including small commercial server rooms, but it hardly seems relevant when discussing new builds of large facilities.
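To spell out the arithmetic behind that (PUE is just total facility power divided by IT power, so the source's own figure pins it down):

    # PUE = total facility power / IT (server) power.
    # The linked source says servers are 40% of the load, i.e. IT/total = 0.4:
    implied_pue = 1 / 0.40
    print(implied_pue)        # 2.5

    # By contrast, a state-of-the-art build at PUE 1.1 spends under 10%
    # of its power on cooling and all other overhead combined:
    overhead_fraction = 1 - 1 / 1.1
    print(overhead_fraction)  # ~0.09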
Finally, it's irritating when these articles are grounded in equivalents of American homes. The fact is that a home just doesn't use a lot of energy, so it's a silly unit of measure. These figures should be based on something that actually uses energy, like cars or aircraft or something.
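Rough numbers on that point (my own ballpark figures, not from the article): an average American home draws on the order of 1 kW of electricity continuously, while a single gasoline car cruising on the highway burns fuel at tens of kW.

    # Very rough comparison using commonly cited ballpark figures (mine, not the article's):
    home_kwh_per_year = 10_500                 # roughly typical US household electricity use
    home_avg_kw = home_kwh_per_year / 8760
    print(round(home_avg_kw, 2))               # ~1.2 kW continuous

    gasoline_kwh_per_gallon = 33.7             # energy content of a gallon of gasoline
    car_gallons_per_hour = 60 / 30             # 60 mph at 30 mpg
    car_kw = car_gallons_per_hour * gasoline_kwh_per_gallon
    print(round(car_kw))                       # ~67 kW of fuel burn while cruising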
Google has been wrong a couple of times, and this is one area where I think what they said (18 years ago, by the way) has since had some time for the rubber to meet the road.
Google also famously took the position that ECC wasn't mandatory[0], but then quietly changed course[1].
In fact, even within memory itself: higher temperatures cause more errors[2], and leakage is more pronounced at higher temperatures in dense lithographic electronics (memory controllers, CPUs)[3].
Regardless: thermal expansion and contraction will degrade basically any material I can think of, so if you can utilise the machines 100% consistently and hold a steady temperature, then maybe the hardware doesn't age as aggressively as our desktop PCs that play games, assuming there's no leakage going on to crash things. (A rough sketch of that thermal-cycling intuition follows the links below.)
[0]: https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
[1]: https://news.ycombinator.com/item?id=14206811
[2]: https://dramsec.ethz.ch/papers/mathur-dramsec22.pdf
[3]: https://www.researchgate.net/publication/271300947_Analysis_...
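To put a rough shape on the thermal-cycling point: reliability work usually models cycling fatigue with a Coffin-Manson-style power law, cycles-to-failure proportional to (temperature swing)^-q. The exponent below is a common textbook assumption for solder joints, not a measured value:

    # Coffin-Manson style scaling: cycles to failure ~ C * dT**(-q).
    # q ~= 2 is a common assumption for solder-joint fatigue (my assumption, not a measurement).
    def relative_cycle_life(dt_small, dt_large, q=2.0):
        """How many times more thermal cycles a part survives if swings shrink from dt_large to dt_small."""
        return (dt_large / dt_small) ** q

    # Holding utilisation and room temperature steady so the die swings 10 C
    # instead of 40 C would, under this model, stretch fatigue life ~16x:
    print(relative_cycle_life(10, 40))   # 16.0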
There is absolutely no doubt that the failure rate of any semiconductor device increases very quickly with temperature. This is routinely tested by every manufacturer.
What happens is that at low enough temperatures the rate of temperature-induced failures may be small compared with the rate of failures from other causes, so you will see no temperature effect. However, once you raise the temperature enough, you will see an obvious temperature dependence in the failure rate.
Semiconductor devices are designed so that their failure rate at the crystal (junction) temperature specified in the datasheet, usually in the range of 90 to 110 degrees Celsius, is low enough that most devices will last at least 10 years, or whatever lifetime is targeted.
The ambient temperature at which that nominal maximum temperature is reached depends on the cooling and on the power consumption.
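A minimal sketch of that relationship, using the usual first-order thermal model (the wattage and thermal resistance below are made-up illustration values):

    # First-order model: junction temp = ambient temp + power * thermal resistance (junction to ambient).
    def junction_temp_c(ambient_c, power_w, theta_ja_c_per_w):
        return ambient_c + power_w * theta_ja_c_per_w

    # Hypothetical 200 W device with an effective theta_ja of 0.4 C/W:
    print(junction_temp_c(25, 200, 0.4))   # 105 C -- already at a typical datasheet limit
    print(junction_temp_c(25, 100, 0.4))   # 65 C  -- same cooling, half the power
    print(junction_temp_c(25, 200, 0.2))   # 65 C  -- same power, better cooling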
If the device runs above the nominal maximum temperature, it is pretty certain that you will see a strong dependence of the failure rate on temperature.
Whether you also see temperature effects at lower crystal temperatures, e.g. around 60 degrees Celsius, depends on the device, and it is unpredictable unless you do a costly experiment yourself.
In general, the expectation is that for low-quality devices you will not see temperature effects, because those fail for other reasons first, while for high-quality devices, which lack manufacturing defects, you will see a temperature dependence in the failure rate even at lower temperatures.
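The usual way to quantify that dependence is an Arrhenius acceleration factor; a minimal sketch, using an activation energy of 0.7 eV, which is a common generic assumption for silicon wear-out mechanisms rather than a number for any particular device:

    import math

    BOLTZMANN_EV_PER_K = 8.617e-5

    def arrhenius_af(t_low_c, t_high_c, ea_ev=0.7):
        """Factor by which wear-out accelerates going from t_low_c to t_high_c at the die."""
        t_low = t_low_c + 273.15
        t_high = t_high_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1 / t_low - 1 / t_high))

    # Going from 60 C to 85 C at the die speeds up temperature-driven
    # failure mechanisms by roughly 5-6x under these assumptions:
    print(round(arrhenius_af(60, 85), 1))   # ~5.5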
So Google might not have seen temperature effects because they were using the cheapest junk anyway.
> So, the methodology around temperature mitigation always starts at power reduction—which means that growth, IT efficiencies, right-sizing for your capacity...
https://www.asme.org/topics-resources/content/new-solar-ener...
What is the purpose of this article exactly?
Disappointed that the article continually confuses power and energy.
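For anyone skimming: power is a rate (watts), energy is that rate over time (watt-hours or joules), so the two aren't interchangeable. A quick sketch with a made-up 100 MW facility:

    # Power (MW) is a rate; energy (MWh) is power * time.
    facility_power_mw = 100          # hypothetical facility drawing a steady 100 MW
    hours_per_year = 8760
    energy_mwh = facility_power_mw * hours_per_year
    print(energy_mwh)                # 876,000 MWh/year, i.e. 876 GWh -- not "876 GW"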
I forgot to put it in the title and I can't edit anymore.