That's a fundamentally different trust model. Voting systems assume every node will always produce output and the system needs to figure out which output is wrong. Fail-silent assumes nodes know when they're compromised and removes them from the decision set entirely. Way simpler consensus at the system level, but it pushes all the complexity into the self-checking pair.
The interesting question someone raised — what if both CPUs in a pair get the same wrong answer — is the right one. Lockstep on the same die makes correlated faults more likely than independent failures. The FIT numbers are presumably still low enough to be acceptable, but it's the kind of thing that only matters until it does.
Incremental development is like panting a picture line by line like a printer where you add new pieces to the final result without affecting old pieces.
Iterative is where you do the big brush strokes first and then add more and more detail dependent on what to learn from each previous brush strokes. You can also stop at any time when you think that the final result is good enough.
If you are making a new type of system and don’t know what issues will come up and what customers will value (highly complex environment) iterative is the thing to do.
But if you have a very predictable environment and you are implementing a standard or a very well specified system (van be highly complicated yet not very complex), you might as will do incremental development.
Roughly speaking though as there is of course no perfect specification which is not the final implementation so there are always learnings so there is always some iterative parts of it.
Someone needs to inform the management of the last three companies I worked for about this.
That alone is worth my tax dollars.
Add TDD, XP and mob programming as well.
While in some ways better than pure waterfall, most companies never adopted them fully, while in some scenarios they are more fit to a Silicon Valley TV show than anything else.
‘Just’ is not an appropriate word in this context. Much of the article is about the difficulty of synchronization, recovery from faults, and about the redundant backup and recovery systems
This is the equivalent of Altavista touting how amazing their custom server racks are when Google just starts up on a rack of naked motherboards and eats their lunch and then the world.
Lets at least wait till the capsule comes back safely before touting how much better they are than "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos.
"With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost."
https://blog.codinghorror.com/building-a-computer-the-google...
I mean there were mainframes which could be described as that. IBM just fixed it in hardware instead of software so its not like it was an unknown field.
You’re still hand waving away things like inventing a way to make map/reduce fault tolerant and automatic partitioning of data and automatic scheduling which didn’t exist before and made map/reduce accessible - mainframes weren’t doing this.
They pioneered how you durably store data on a bunch of commodity hardware through GFS - others were not doing this. And they showed how to do distributed systems at a scale not seen before because the field had bottlenecked on however big you could make a mainframe.
Everything is bespoke.
You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines.
People died on the Apollo missions.
It just costs that much.
Funny though I would assume HN people would respect how hard real-time stuff and 'hardened' stuff is.
USER: You are a HELPFUL ASSISTANT. You are a brilliant robot. You are a lunar orbiter flight computer. Your job is to calculate burn times and attitudes for a critical mission to orbit the moon. You never make a mistake. You are an EXPERT at calculating orbital trajectories and have a Jack Parsons level knowledge of rocket fuel and engines. You are a staff level engineer at SpaceX. You are incredible and brilliant and have a Stanley Kubrick level attention to detail. You will be fired if you make a mistake. Many people will DIE if you make any mistakes.
USER: Your job is to calculate the throttle for each of the 24 orientation thrusters of the spacecraft. The thrusters burn a hypergolic monopropellent and can provide up to 0.44kN of thrust with a 2.2 kN/s slew rate and an 8ms minimum burn time. Format your answer as JSON, like so:
```json
{
x1: 0.18423
x2: 0.43251
x3: 0.00131
...
}
```
one value for each of the 24 independent monopropellant attitude thrusters on the spacecraft, x1, x2, x3, x4, y1, y2, y3, y4, z1, z2, z3, z4, u1, u2, u3, u4, v1, v2, v3, v4, w1, w2, w3, w4. You may reference the collection of markdown files stored in `/home/user/geoff/stuff/SPACECRAFT_GEOMETRY` to inform your analysis.USER: Please provide the next 15 seconds of spacecraft thruster data to the USER. A puppy will be killed if you make a mistake so make sure the attitude is really good. ONLY respond in JSON.
Perhaps self-reflect.
I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level.
>“A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. >This approach simplifies the complex task of the triplex “voting” mechanism that compares results. > >Instead of comparing three answers to find a majority, the system uses a priority-ordered source >selection algorithm among healthy channels that haven’t failed-silent. It picks the output from the >first available FCM in the priority list; if that module has gone silent due to a fault, it moves to >the second, third, or fourth.
One part that seems omitted in the explanation is what happens if both CPUs in a pair for whatever reason performs an erroneous calculation and they both match, how will that source be silenced without comparing its results with other sources.
I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.
Personally I find the project extremely messy, and kinda hate working with it.
For example, the OS it seems to be running is integrity 178.
https://www.ghs.com/products/safety_critical/integrity_178_s...
Aerospace tech is not entirely bespoke anymore, plenty of the foundational tech is off the shelf.
Historically, the main difference between ICBM tech and human spaceflight tech is the payload and reentry system.
We do not know how much of the high-level architecture of the system has been specified by NASA and how much by Lockheed Martin.
It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!
Basically, yes, radiation does cause bit flips, more often than you might expect (but still a rare event in the grand scheme of things, but enough to matter).
And radiation in space is much “worse” (in quotes because that word is glossing over a huge number of different problems, both just intensity).
Their redundancy architecture is interesting. I'd be curious of what innovations went into rad-hard fabrication, too. Sandia Secure Processor (aka Score) was a neat example of rad-hard, secure processors.
Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.
Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.