Post-Silicon Validation of Static Lockstep Mode
37 points
by luu
4 days ago
| 3 comments
| intel.com
electricshampo1
4 days ago
Nice to see silent data corruption (SDC) concerns being taken more seriously by hardware folks. Once software reaches sufficient quality (which we have achieved in many cases), these kinds of random hardware issues are the only remaining causes of "impossible" bugs that waste endless engineering time to debug.

I wonder how much of this relies on, or is made easier by, the clustered core architecture of E-Core Xeons. In comparison, each physical core of a P-Core Xeon is basically its own island.

trebligdivad
4 days ago
Is this limited to lockstep between softcores on a die - so good for low level error failures like soft error, but no good if the package dies? (Still very neatly done)
addaon
3 days ago
> Is this limited to lockstep between softcores on a die - so good for low level error failures like soft error, but no good if the package dies? (Still very neatly done)

Depends on what you mean by "good for." The intent of lockstep is to convert essentially all undetectable errors to detectable errors, usually to allow fail-silent behavior, rather than to eliminate detectable errors. This property that all failures have defined failure modes is then used at the system level to build robust systems; for example downstream actuators can receive multiple command streams from multiple lockstep systems, and, relying on the invariant that a correctly received message came from a correctly operating system, can safely act on any of them, rather than needing to vote on the received messages. A package failure should be very unlikely to introduce an undetectable error in this context.
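
To make that concrete, here is a minimal Python sketch of the pattern, not anything taken from the paper. Every name in it (`lockstep_run`, `actuator_select`, `Command`) is made up for illustration, and the CRC is only my stand-in for whatever integrity check a real link would use. Each lockstep pair either emits a checked command or stays silent, and the actuator acts on the first valid command it sees instead of voting.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Command:
    payload: bytes
    crc: int  # integrity check attached by the sender (assumed, illustrative)


def lockstep_run(core_a: Callable[[bytes], bytes],
                 core_b: Callable[[bytes], bytes],
                 inputs: bytes) -> Optional[Command]:
    """Run the same computation on two cores and compare the results.

    A mismatch is a detected error: the pair goes fail-silent and
    emits nothing rather than emitting a possibly wrong command.
    """
    out_a = core_a(inputs)
    out_b = core_b(inputs)
    if out_a != out_b:
        return None  # fail-silent
    return Command(out_a, zlib.crc32(out_a))


def is_valid(cmd: Optional[Command]) -> bool:
    """A message counts as correctly received if it exists and its CRC matches."""
    return cmd is not None and zlib.crc32(cmd.payload) == cmd.crc


def actuator_select(streams: Iterable[Optional[Command]]) -> Optional[bytes]:
    """Act on the first valid command; no voting across streams is needed."""
    for cmd in streams:
        if is_valid(cmd):
            return cmd.payload
    return None  # every upstream system is silent or corrupted


# Hypothetical cores: one healthy computation and one that miscomputes.
healthy = lambda data: bytes([data[0] + 1])
faulty = lambda data: bytes([data[0] + 2])

silent = lockstep_run(healthy, faulty, b"\x01")   # cores disagree
good = lockstep_run(healthy, healthy, b"\x01")    # cores agree
chosen = actuator_select([silent, good])
```

The actuator never compares streams against each other. It relies only on the invariant that a valid message came from a pair whose cores agreed, which is why a whole-package failure shows up as silence rather than as a bad command.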

bombela
4 days ago
I wonder what the ratio of software bugs to these types of hardware bugs is in the wild. Maybe the results of this paper will help produce that metric.