[1] With weak memory hardware, it is effectively impossible to write correct programs without using atomic read-modify-write operations or memory barriers, because both compilers and hardware can reorder both reads and writes, to a maddening degree.
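To make that concrete, here is a minimal Go sketch (my own illustration, not from the article) of the usual fix: publishing a value through a plain flag would be a data race that the compiler and weak hardware are free to reorder, while an atomic flag establishes the ordering the footnote is talking about.

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // Publishing a value to another goroutine. With a plain (non-atomic) flag
    // this would be a data race, and on weakly ordered hardware the consumer
    // could observe the flag set while `data` still looks uninitialized.
    // The atomic store/load pair establishes the happens-before edge that
    // makes the publication safe under the Go memory model.

    var (
        data  int
        ready atomic.Bool
    )

    func producer() {
        data = 42         // plain write, ordered before the atomic store below
        ready.Store(true) // release the value to the consumer
    }

    func consumer() {
        for !ready.Load() { // spin until the producer publishes
        }
        fmt.Println(data) // prints 42
    }

    func main() {
        go producer()
        consumer()
    }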
(I previously posted this comment as a reply to a comment that got flagged.)
I can see where the parent is coming from.
I think you can both be right - it can be valuable in any case.
That's not quite true. The article explicitly mentions problems with the interleaving of instructions between two processes (aka threads), which is the fundamental problem of concurrency. If you consider only a single thread in isolation, its program order is always maintained, even though compiler optimizations and hardware OOO execution actually execute the instructions in data-dependency order.
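A tiny Go sketch (mine, not the article's) of why that interleaving alone is already enough to break a program, no exotic reordering required:

    package main

    import (
        "fmt"
        "sync"
    )

    // Two goroutines each increment a shared counter without synchronization.
    // An increment is really load-add-store, so the steps of the two goroutines
    // interleave and updates get lost: the final value is usually well below
    // 200000. `go run -race` flags the data race as well.
    func main() {
        var counter int
        var wg sync.WaitGroup
        for i := 0; i < 2; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 100000; j++ {
                    counter++ // non-atomic read-modify-write
                }
            }()
        }
        wg.Wait()
        fmt.Println(counter) // usually < 200000
    }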
Consistency Model - https://en.wikipedia.org/wiki/Consistency_model
Again, with weak memory models, writes from another thread can be observed out of order due to hardware reordering, even without compiler optimizations; there can be behaviors that correspond to no valid interleaving of the two threads.
I am explicitly pointing out that what you (along with the author) are assuming, what is often called sequential consistency (https://en.wikipedia.org/wiki/Sequential_consistency), hasn't been true of hardware for decades. It's a common misconception that most systems obey sequential consistency. Your comment just seems to repeat that again, and it isn't true. See here (https://preshing.com/20120930/weak-vs-strong-memory-models/) for a deeper explanation.
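The classic demonstration is the store-buffering litmus test; here it is as a Go sketch (a deliberate data race, shown only to illustrate hardware behavior; it is not a reliable reproducer and the race detector will rightly complain):

    package main

    import (
        "fmt"
        "sync"
    )

    // Store-buffering litmus test. Under sequential consistency at least one of
    // r1/r2 must end up as 1, because every interleaving of the four statements
    // executes some store before the other thread's load. On weakly ordered
    // hardware (and on x86, thanks to store buffers) the outcome r1 == 0 &&
    // r2 == 0 is also possible, which corresponds to no valid interleaving.
    // Making x and y atomic (sequentially consistent in Go) forbids it.
    func main() {
        for i := 0; i < 1_000_000; i++ {
            var x, y, r1, r2 int
            var wg sync.WaitGroup
            wg.Add(2)
            go func() { defer wg.Done(); x = 1; r1 = y }()
            go func() { defer wg.Done(); y = 1; r2 = x }()
            wg.Wait()
            if r1 == 0 && r2 == 0 {
                fmt.Printf("non-SC outcome on iteration %d\n", i)
                return
            }
        }
        fmt.Println("no non-SC outcome observed this run")
    }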
Once you start thinking about a program as a sequence of loads/stores (i.e. reads/writes to shared memory) and note that pipelining/OOO/superscalar execution are techniques to parallelize these (and other) instructions for a single thread of control, you start getting an idea of how sequential order can be preserved even though the actual execution is not quite so sequential.
Some relevant details in a previous comment of mine - https://news.ycombinator.com/item?id=42422867
See also this book - https://news.ycombinator.com/item?id=42472334
I disagree. If we want people to understand, then we should teach computer systems bottom-up, so that they understand the bottom-level behaviors before being surprised later when our bad (i.e. wrong) abstractions leak. If you want to train programmers to accomplish parallel programming tasks by plugging together nice abstractions, that's all well and good, but that's a completely different kind of instruction than trying to teach them how things work. If we only do the latter then we'll never have experts again.
And while I will agree that mutexes and the like are useful in their own right, they are not the most fundamental for understanding concurrency.
It's not in most programming languages. A big part of the linked Wikipedia article is actually about when and how program order is allowed to be modified.
A great portion of the JVM memory model definition is dedicated to laying out exactly when and how read/write barriers are established and when/how ordering is enforced. Without any sort of barrier (in most cases) the JVM is fairly free to completely reorder program execution.
[1] https://github.com/amilajack/reading/blob/master/Computer_Sc...
Visualizing Concurrency in Go (2016)
The instruction pointer is synchronized across all the SIMD lanes, giving you fewer states to reason about.
Then GPUs mess that up by letting us run blocks/thread groups independently, but now GPUs have highly efficient barrier instructions that line everyone back up.
It turns out that SIMD's innate assurance of instruction synchronization at the SIMD-lane level is why warp-based / wavefront coding is so efficient, though, as none of those barriers are necessary anymore.
We use threads to solve all kinds of things, including 'More Compute'.
SIMD is limited to 'More Compute' (unable to process I/O like sockets concurrently, or other such thread patterns). But as it turns out, more compute is a problem that many programmers are still interested in.
Similarly, you can use Async patterns for the I/O problem (which seem to be more efficient than threads anyway).
--------
So when we think about a 2024-style program, you'd have SIMD for compute-limited problems (Neural Nets, Matrices, Raytracing). Then Async for Sockets, I/O, etc. etc.
Which puts traditional threads in this weird jack-of-all-trades position: not as good as SIMD methods for raw compute, not as good as Async for I/O. But threads do both.
Fortunately, there seem to be plenty of problems that involve both a lot of I/O and a lot of compute simultaneously.
IMHO the jury is still out on whether async I/O is worth it, either in terms of performance or the potential complexity that applications might incur in trying to do it via callback hell. Many programmers find synchronous I/O to be a really, really intuitive programming model, and the lowest levels of the software stack (i.e. syscalls) are almost always synchronous.
-------
But let's zoom into Raytracing for a minute. Intel's Raytracing (and indeed, the DirectX model of Raytracing) is for Ray Dispatches to be consolidated in rather intricate ways.
Intel will literally move the stack between SIMD lanes, consolidating rays into shared misses and shared hits (to minimize branch divergence).
There are some new techniques being presented in today's SIMD models that cannot easily be described by the traditional threading models.
GPUs in particular have a very hyperthread/SMT-like model where multiple true threads (aka instruction pointers) are juggled while waiting for RAM to respond.
Still, the intermediate organizational step where SIMD gives you a simpler form of parallelism is underrated and understudied IMO.
> How concurrency works: A text guide
This is a comforting idea, but enumerating every possible state of a real-world (integrated/distributed) software system is a fool's task.
Spend less time on formal verification (fancy word for perfectionism) and more time on risk analysis and cybernetics. Instead of praying that nothing ever goes wrong, plan for faults and design systems that integrate human and machine to respond rapidly and effectively.
"Correctable" is a much more practical target than "always correct".
The 1st edition divides the algorithms under two broad sections, viz. "Message Passing" (e.g. the chapter Mutual Exclusion) and "Shared Memory" (e.g. the chapter Mutual Exclusion II), while the 2nd edition removes these section headings and simply groups the algorithms under functional categories (e.g. both of the above under one chapter, Mutual Exclusion).
Book website - https://mitpress.mit.edu/9780262037662/distributed-algorithm...
So is the ability to compute the entire state space and prove that liveness and/or safety properties hold the single main basis of a model checker's effectiveness? Or are there other fundamental capabilities as well?
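For concreteness, the "compute the entire state space and check a property" part is essentially an exhaustive graph search. Here is a toy Go sketch (my own, greatly simplified) that enumerates every interleaving of a broken test-then-set "lock" and finds the mutual-exclusion violation; real model checkers add much more on top of this.

    package main

    import "fmt"

    // A toy explicit-state model checker: breadth-first search over every
    // reachable state of two processes running a (deliberately broken)
    // test-then-set lock, checking the safety property "both processes are
    // never in the critical section at once". Real model checkers add
    // liveness/fairness checking, partial-order and symmetry reduction,
    // symbolic states, counterexample traces, and so on.

    type state struct {
        pc   [2]int // per-process program counter: 0=test, 1=set, 2=critical, 3=done
        flag int    // the shared "lock" variable
    }

    // step advances proc by one statement, if it can move in state s.
    func step(s state, proc int) (state, bool) {
        switch s.pc[proc] {
        case 0: // test: proceed only if the lock looks free
            if s.flag == 0 {
                s.pc[proc] = 1
                return s, true
            }
            return s, false
        case 1: // set: acquire the lock (the test and the set are separate steps: the bug)
            s.flag = 1
            s.pc[proc] = 2
            return s, true
        case 2: // leave the critical section
            s.flag = 0
            s.pc[proc] = 3
            return s, true
        }
        return s, false // done
    }

    func main() {
        start := state{}
        seen := map[state]bool{start: true}
        queue := []state{start}
        for len(queue) > 0 {
            s := queue[0]
            queue = queue[1:]
            if s.pc[0] == 2 && s.pc[1] == 2 {
                fmt.Println("safety violated, both in critical section:", s)
                return
            }
            for proc := 0; proc < 2; proc++ {
                if next, ok := step(s, proc); ok && !seen[next] {
                    seen[next] = true
                    queue = append(queue, next)
                }
            }
        }
        fmt.Println("explored", len(seen), "states, no violation")
    }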
1. "It's not visual, it's text". Yeah, but: how many "visual" representations have no text? And there _are_ visuals in there: the depictions of state space. They include text (hard to see how they'd be useful without) but aren't solely so.
2. "Meh, verification is for well paid academics, it's not for the real world". First off, I doubt those "academics" are earning more than median sw devs, never mind those in the SV bubble. More importantly: there are well-publicised examples of formal verification being used for real-world code, see e.g. [1].
It's certainly true that verification isn't widespread. It has various barriers, from use of formal maths theory and presentation to the compute load arising from combinatorial explosion of the state space. Even if you don't formally verify, understanding the state space size and non-deterministic path execution of concurrent code is fundamentally important. As Dijkstra said [2]:
> our intellectual powers are rather geared to master static relations and that our powers to visualise processes evolving in time are relatively poorly developed. For that reason we should do (as wise programmers aware of our limitations) our utmost to shorten the conceptual gap between the static process and the dynamic program, to make the correspondence between the program (spread out in space) and the process (spread out in time) as trivial as possible.
He was talking about sequential programming: specifically, motivating the use of structured programming. It's equally applicable to concurrent programming though.
[1]: https://github.com/awslabs/aws-lc-verification
[2]: https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.p...
What does this mean and how does it excuse a 'visual' article being all text?
I doubt those "academics" are earning more than median sw devs
Do you think the point might have been more about the pragmatism of formal verification?
I find the BEAM/OTP process model relatively simple to use, and it moves some of the friction from logic bugs to a throughput problem.
Just... no. First off, I'll bet a vanishingly small number of application developers on BEAM languages do anything to manage their own worker pools. The BEAM does it for them: it's one of its core strengths. You also imply that "spinning up a new BEAM process" has significant overhead, either computationally or cognitively. That, again, is false. Spinning up processes is intrinsic to Erlang/Elixir/Gleam and other BEAM languages. It's encouraged, and the BEAM has been refined over the years to make it fast and reliable. There are mature, robust facilities for doing so - as well as communicating among concurrent processes during their execution and/or retrieving results at termination.
You've made clear before that you believe async/await is a superior concurrency model to the BEAM process-based approach. Or, more specifically, async/await as implemented in .Net [1]. Concurrent programming is still in its infancy, and it's not yet clear which model(s) will win out. Your posts describing .Net's async approach are helpful contributions to the discussion. Statements such as that quoted above are not.
[1]: https://news.ycombinator.com/item?id=40427935
--
EDIT: fixed typo.
Spawning 1M processes with Elixir is anywhere between ~2.7 and 4 GiB with significant CPU overhead. Alternatives are much more efficient at both: https://hez2010.github.io/async-runtimes-benchmarks-2024/tak... (a revised version).
Go, for example, probably does not want you to spawn goroutines that are too short-lived, as each one carries at least 2 KiB of allocations by default, which is not very kind to Go's GC (or any GC or allocator, really). So it expects you to be at least somewhat modest with such concurrency patterns.
In Erlang and Elixir the process model implementation follows similar design choices. However, Elixir lets you idiomatically await task completion - its approach to concurrency is much more pleasant to use and less prone to user error. Unfortunately, the per-process cost remains very Goroutine-like and is the highest among the concurrency abstractions implemented by other languages. Unlike Go, it is also very CPU-heavy: when a high number of processes are spawned and exit quickly, it takes quite a bit of overhead for the runtime to keep up.
Sure, BEAM languages are a poor fit for compute-intensive tasks without bridging to NIFs but, at least in my personal experience, the use cases for "task interleaving" are everywhere. They are just naturally missed because usually the languages do not make it terse and/or cheap - you need to go out of your way to dispatch operations concurrently or in parallel, so the vast majority doesn't bother, even after switching to languages where it is easy and encouraged.
(I'm not stating that the async/await model in .NET is the superior option, I just think it has good tradeoffs for the convenience and efficiency it brings. Perhaps a more modern platform will come up with an inverted approach where async tasks/calls are awaited by default without syntax noise, and the cost of doing "fork" (or "go" if you will) and then "join" is the same as letting the task continue in parallel and then awaiting it, without the downsides of the virtual threads we have today)
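For what it's worth, the explicit "fork"/"join" described above looks roughly like this in Go (a hedged sketch; fork, fetchA and fetchB are made-up names for illustration):

    package main

    import (
        "fmt"
        "time"
    )

    // Minimal fork/join: each call is dispatched concurrently ("fork") and its
    // result is collected from a channel ("join"). The point above stands: you
    // must opt into this plumbing explicitly, it is not how a call behaves by
    // default.
    func fork[T any](f func() T) <-chan T {
        ch := make(chan T, 1)
        go func() { ch <- f() }()
        return ch
    }

    func fetchA() string { time.Sleep(50 * time.Millisecond); return "A" }
    func fetchB() string { time.Sleep(50 * time.Millisecond); return "B" }

    func main() {
        a := fork(fetchA) // both calls start immediately and run concurrently
        b := fork(fetchB)
        fmt.Println(<-a, <-b) // joined here, roughly 50ms total instead of 100ms
    }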
What do you mean by "tightly interleaving"? It sounds like a technical term of some importance, but when I search the web it seems people are saying it means that the order of execution or data access is arbitrary, which you can let the built-in preemptive multitasking or a queue with a worker pool figure out for you.
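(For reference, the "queue with a worker pool" version is a few lines in Go; a generic sketch, not tied to any particular runtime or to the BEAM:)

    package main

    import (
        "fmt"
        "sync"
    )

    // A fixed pool of workers drains a queue (channel) of jobs. The order in
    // which jobs execute is arbitrary; the preemptive scheduler decides who
    // runs when, and callers only rely on every job eventually completing.
    func main() {
        jobs := make(chan int)
        results := make(chan int)
        var wg sync.WaitGroup

        for w := 0; w < 4; w++ { // four workers share one queue
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := range jobs {
                    results <- j * j // stand-in for real work
                }
            }()
        }

        go func() { // feed the queue, then signal "no more jobs"
            for i := 1; i <= 10; i++ {
                jobs <- i
            }
            close(jobs)
        }()

        go func() { // close results once every worker has exited
            wg.Wait()
            close(results)
        }()

        for r := range results {
            fmt.Println(r)
        }
    }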
In the alternatives you have in mind, how good is the monitoring, observability and IPC tooling? What's their hardware clustering story? I've done IPC in POSIX and that was absolutely dreadful so I hope you're working with something better than that at least.