I wonder if there's an available OS that parity-checks every operation, analogous to the error correction planned for quantum computers.
Unrelated, but the other day I read that the main computer for core calculations at Fukushima's nuclear plant used to run on a very old CPU with 4 cores. Every calculation was run on all four cores, and the results had to be exactly the same. If one of them differed, they knew there was a bit flip and could discard that one core's result.
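The scheme described above amounts to majority voting over redundant runs: discard any lone result that disagrees with the others. A minimal sketch, with a toy `calc` function standing in for the real computation (three-way here, but the compare-and-discard idea is the same for four cores):

```shell
#!/bin/sh
# Toy stand-in for the real computation; in the anecdote this would be
# the same calculation run independently on each core (hypothetical).
calc() { echo $((6 * 7)); }

# Run the computation redundantly and keep only a result that agrees
# with at least one other run -- a lone disagreeing result is discarded
# as a presumed bit flip.
a=$(calc); b=$(calc); c=$(calc)
if [ "$a" = "$b" ] || [ "$a" = "$c" ]; then
  result=$a
elif [ "$b" = "$c" ]; then
  result=$b
else
  echo "no two runs agree; discarding all" >&2
  exit 1
fi
echo "agreed result: $result"
```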
Interesting. I wonder why they didn't just move it somewhere with less radiation? And clearly, they have another more trustworthy machine doing the checking somehow. A self-correcting OS would have to parity-check its parity checks somehow, which I'm sure is possible, but would be kind of novel.
In a really ugly environment, you might have to abandon semiconductors entirely, and go back to vacuum as the magical medium, since it's radiation proof (false vacuum apocalypse aside). You could make a nuvistor integrated "chip" which could do the same stuff; the biggest challenge would be maintaining enough emissions from the tiny and quickly-cooling cathodes.
There was a bug like that in Linux, and someone restarted the machine I don't know how many times (IIRC around 2k) just to debug it.
We call this sort of test "fuzzy". If it's really bad they call it by my own personal identifier of "unstable".
One of my old programs produces a broken build unless you then compile it again.
Just had that happen to me today. Set up logging statements and reran the job, and it ran successfully.
I've had that happen; the logging statements stopped a race condition. After I removed them it came back...
If that doesn't work, sometimes your computer just needs a rest. Take the rest of the day off and try it again tomorrow.
All the time. Causes include:
- Test depends on an external system (database, package manager)
- Race conditions
- Failing the test cleared bad state (test expects test data not to be in the system and clears it when it exits)
- Failing test set up an unknown prerequisite (Build 2's tests depend on changes in Build 1, but the build system built them out of order)
- External forces messing with the test runner (test machine going to sleep or running out of resources)
We call those "flaky tests" and only fail a build if a given test cannot pass after 2 retries. (We also flag the test runs for manual review)
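That retry policy can be sketched as a small wrapper. A hedged sketch, assuming a hypothetical `flaky_test` (simulated here with a counter file so it fails twice and then passes):

```shell
#!/bin/sh
# Sketch of the policy above: retry a flaky test up to 2 times, and flag
# any test that needed a retry for manual review. flaky_test is a
# hypothetical test, simulated to fail on attempts 1 and 2 and then pass.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky_test() {
  n=$(($(cat "$count_file") + 1))
  echo "$n" > "$count_file"
  [ "$n" -ge 3 ]   # fails on attempts 1 and 2, passes on attempt 3
}

max_retries=2
attempt=0
status=fail
while [ "$attempt" -le "$max_retries" ]; do
  if flaky_test; then
    status=pass
    break
  fi
  attempt=$((attempt + 1))
done
rm -f "$count_file"

if [ "$status" = pass ] && [ "$attempt" -gt 0 ]; then
  echo "flaky: passed after $attempt retries (flagged for review)"
elif [ "$status" = fail ]; then
  echo "hard failure: did not pass within $max_retries retries"
else
  echo "pass"
fi
```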
My way: wrap it in a shell script and add a condition: if the exit status is not 0, say "try clearing the cache and run it again".
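A minimal sketch of that wrapper, with a `build` function standing in for the real command (hypothetical, hard-coded to fail so the hint path is visible):

```shell
#!/bin/sh
# Wrapper sketch: run the command, and if the exit status is not 0,
# print the suggested hint. `build` is a hypothetical stand-in for
# the real flaky command; `false` makes it fail with status 1.
build() { false; }

if build; then
  msg="build ok"
else
  status=$?
  msg="build failed (exit $status): try clearing the cache and run it again"
fi
echo "$msg"
```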