Does it though? Assuming no torn reads/writes at those sizes, and given that the location should be strictly increasing, are there situations where you could read a higher-than-stored value that would cause a necessary update to be skipped?
AFAIK, on all of x86, ARM, and RISC-V, an atomic load of a word-sized datum is just a regular load.
It doesn't need to be strictly increasing; some other thread could be performing arbitrary other operations. Still, even in that case, as Dylan16807 pointed out, it likely doesn't matter.
If you are implementing a library function atomic<T>::fetch_max, you cannot assume that every other thread is also performing a fetch_max on that object. There might be little reason for it, but other operations are allowed, so the sequence of modifications might not be strictly increasing (but then again, it doesn't matter for this specific optimization).
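A minimal Rust sketch of the optimization under discussion (names are illustrative, not any library's actual API): probe with a cheap relaxed load and only fall back to the expensive RMW when it could actually raise the stored value. If every writer goes through fetch_max, the location is monotonically non-decreasing, so a stale read is always <= the true current value and the skip is safe; with arbitrary concurrent stores, the probe can observe a transiently higher value and skip an update a strict fetch_max would have applied.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn fetch_max_hint(target: &AtomicU64, value: u64) -> u64 {
    // Cheap probe first: if the stored value is already >= value,
    // the RMW could not change the maximum, so skip the RMW entirely.
    let current = target.load(Ordering::Relaxed);
    if current >= value {
        return current;
    }
    // Otherwise fall back to the real read-modify-write.
    target.fetch_max(value, Ordering::Relaxed)
}
```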
It's a common mistake to reason about memory models strictly in terms of hardware.
Sequential consistency is a property of a programming language's semantics and cannot simply be inferred from hardware. Even if every hardware operation were SC, the compiler could still provide weaker memory orderings through compiler-specific optimizations.
I'm referring to the performance implications of the hardware instruction, not the programming language semantics. Incrementing or decrementing the reference count is going to require an RMW instruction, which is expensive on x86 regardless of the ordering.
The concept of sequential consistency only exists within the context of a programming language's memory model. It makes no sense to speak about the performance of sequentially consistent operations without reference to the semantics of a programming language.
Yes, what I meant was that the same instruction is generated by the compiler regardless of whether the RMW operation is performed with relaxed or sequentially consistent ordering, because that instruction is strong enough in terms of hardware semantics to enforce C++'s definition of sequential consistency.
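A concrete illustration in Rust (the equivalent C++ std::atomic code compiles the same way; easy to verify on a compiler explorer): on x86-64 both functions below lower to the same `lock xadd`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// On x86-64 both functions compile to the same `lock xadd`: the ordering
// argument constrains what the compiler may reorder around the RMW, not
// which instruction is emitted, so the hardware cost is identical.
pub fn incr_relaxed(c: &AtomicU64) -> u64 {
    c.fetch_add(1, Ordering::Relaxed)
}

pub fn incr_seqcst(c: &AtomicU64) -> u64 {
    c.fetch_add(1, Ordering::SeqCst)
}
```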
There is a pretty clear mapping in terms of C++ atomic operations to hardware instructions, and while the C++ memory model is not defined in terms of instruction reordering, that mapping is still useful to talk about performance. Sequential consistency is also a pretty broadly accepted concept outside of the C++ memory model, I think you're being a little too nitpicky on terminology.
The presentation you are making is both incorrect and highly misleading.
There are algorithms whose correctness depends on sequential consistency and which cannot be implemented on x86 without explicit barriers, for example Dekker's algorithm.
What x86 does provide is TSO semantics, not sequential consistency.
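The problematic pattern at the heart of Dekker's entry protocol, sketched in Rust (just the store/load pair, not the full algorithm with its turn variable):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static FLAG_A: AtomicBool = AtomicBool::new(false);
static FLAG_B: AtomicBool = AtomicBool::new(false);

// Thread A's side; thread B is symmetric with the flags swapped.
fn try_enter_a() -> bool {
    // Store to my flag, then load the *other* thread's flag. Under TSO
    // the store can still be sitting in the store buffer when the load
    // executes, so with anything weaker than SeqCst (which supplies the
    // StoreLoad barrier) both threads may read `false` and both enter.
    FLAG_A.store(true, Ordering::SeqCst);
    !FLAG_B.load(Ordering::SeqCst) // true => safe to enter
}
```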
I did not claim that x86 provides sequential consistency in general, I made that claim only for RMW operations. Sequentially consistent stores are typically lowered to an XCHG instruction on x86 without an explicit barrier.
From the Intel SDM:
> Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).
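You can see that lowering directly (a Rust sketch; C++ std::atomic stores compile the same way):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// On x86-64, rustc/LLVM emit a plain `mov` for the release store (TSO
// already gives every store release semantics), but an `xchg` for the
// SeqCst store: the implicit LOCK of `xchg` provides the StoreLoad
// barrier that sequential consistency requires, with no separate MFENCE.
pub fn store_release(x: &AtomicU64) {
    x.store(1, Ordering::Release);
}

pub fn store_seqcst(x: &AtomicU64) {
    x.store(1, Ordering::SeqCst);
}
```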
> And futexes aren’t the only way to get there. Alternatives:
> - thin locks (what JVMs use)
> - ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)
Worth noting that somewhere under the hood, any modern lock is going to be using a futex (if supported). futex is the most efficient way to park on Linux, so you want to be using it even on the slow path. Your language's thread.park() primitive is almost certainly using a futex.
That's interesting, I'm more familiar with the Rust parking-lot implementation, which uses futex on Linux [0].
> Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you
It's a little more than that though: using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly. A futex lets you manage the atomic condition yourself, while generic parking utilities encode that state internally. A mutex implementation generally already has a built-in atomic condition with simpler state transitions for each thread in the queue, and so can avoid the additional overhead by making the futex call directly.
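To make that concrete, here's a sketch of the classic three-state futex mutex (assuming Linux and the `libc` crate; error handling, EINTR retries, and FUTEX_PRIVATE_FLAG are omitted for brevity). The point is that the lock's own atomic doubles as the futex word, so no extra parking state is needed:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
struct RawFutexLock {
    state: AtomicU32,
}

fn futex_wait(addr: &AtomicU32, expected: u32) {
    unsafe {
        // Sleep only if *addr still equals `expected` (checked atomically
        // by the kernel), which closes the race with unlock().
        libc::syscall(
            libc::SYS_futex,
            addr as *const AtomicU32,
            libc::FUTEX_WAIT,
            expected,
            std::ptr::null::<libc::timespec>(),
        );
    }
}

fn futex_wake_one(addr: &AtomicU32) {
    unsafe {
        libc::syscall(libc::SYS_futex, addr as *const AtomicU32, libc::FUTEX_WAKE, 1);
    }
}

impl RawFutexLock {
    fn lock(&self) {
        // Fast path: uncontended CAS, no syscall at all.
        if self
            .state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: mark the lock contended and sleep until woken.
        // swap() returning 0 means we took the lock ourselves.
        while self.state.swap(2, Ordering::Acquire) != 0 {
            futex_wait(&self.state, 2);
        }
    }

    fn unlock(&self) {
        // Only pay for the wake syscall if someone may be waiting.
        if self.state.swap(0, Ordering::Release) == 2 {
            futex_wake_one(&self.state);
        }
    }
}
```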
> It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly.
No, it absolutely isn’t.
The dominant cost of parking is whatever happens in the kernel and at the microarchitectural level when your thread goes to sleep. That cost is so dominant that whether you park with a futex wait or with a condition variable doesn't matter at all.
(Source: I’ve done that experiment to death as a lock implementer back when I maintained Jikes RVM’s thin locks and then again when I wrote and maintained ParkingLot.)
> The message has some weird mentions in (alloc565), but the actual useful information is there: a pointer is dangling.
The allocation ID is actually very useful for debugging. You can use the flags `-Zmiri-track-alloc-id=alloc565 -Zmiri-track-alloc-accesses` to track the allocation, deallocation, and any reads/writes to/from this location.
> Every single Future you look at will look like this,
That's not true. A Future is supposed to schedule itself to be woken up again when it's ready. This Future schedules itself to be woken immediately. Most runtimes, like Tokio, will put a Future that acts like this at the end of the run queue, so in practice it's not as egregious. However, it's unquestionably a spin lock, equivalent to backing off with thread::yield.
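The pattern in question, as a self-contained Rust sketch (hypothetical names, modeled on the usual yield_now idiom): the future registers no real wakeup source and instead wakes itself on the spot, so the executor just keeps re-polling it.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct YieldNow {
    polled: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.polled {
            Poll::Ready(())
        } else {
            self.polled = true;
            // Wake immediately instead of waiting on an actual event:
            // the runtime will re-poll us as soon as it gets around to it
            // (Tokio pushes such a task to the back of the run queue).
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```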
It's quite common for concurrent algorithms to only implement a subset of operations, for example forgoing removal or iteration. It's also common to put limitations on the data structure, such as limiting keys and values to 64 bits. Papaya being feature-complete means that it does not have any of these limitations when compared to std::collections::HashMap.
Looks very interesting, but seems to serve a pretty different use case:
> This is an ordered data structure, and supports very high throughput iteration over lexicographically sorted ranges of values. If you are looking for simple point operation performance, you may find a better option among one of the many concurrent hashmap implementations that are floating around. Pay for what you actually use :)
Java atomics are actually sequentially consistent. C# relaxes this to acquire/release. Still, the general concept of happens-before is immensely useful for learning atomics, since sequential consistency is a superset of acquire/release.
All of the memory models in question are based on data-race-free, which says (in essence) that as long as all cross-thread interactions follow happens-before, then you can act as if everybody is sequentially-consistent.
The original Java 5 memory model only offered sequentially-consistent atomics to establish cross-thread happens-before in a primitive way. The C++11 memory model added three more kinds of atomics: acquire/release, consume/release (which was essentially a mistake [1]), and relaxed atomics (which, to oversimplify, establish atomicity without happens-before). Pretty much every memory model since C++11--which includes the Rust memory model--has based its definition on that memory model, with most systems defaulting an otherwise unadorned atomic operation to sequentially-consistent. Even Java has retrofitted ways to get weaker atomic semantics [2].
As a practical matter, most atomics could probably safely default to acquire/release over fully sequentially-consistent. The main difference between the two is that sequentially-consistent is safer if you've got multiple atomic variables in play (e.g., you're going with some fancy lockless algorithm), whereas acquire/release tends to largely be safe if there's only one atomic variable of concern (e.g., you're implementing locks of some kind).
[1] A consume operation is an acquire, but only for loads data-dependent on the load operation. This is supposed to represent a situation that requires no fences on any system not named Alpha, but it turns out for reasons™ that compilers cannot reliably preserve source-level data dependencies, so no compiler really implemented consume/release.
[2] Even Java 5 may have had it in sun.misc.Unsafe; I was never familiar with that API, so I don't know for certain.
> as long as all cross-thread interactions follow happens-before, then you can act as if everybody is sequentially-consistent.
I don't think that's the actual guarantee. You can enforce happens-before with just acquire/release, but AFAIK that's not enough to recover SC in the general case [1].
As far as I understand, the Data Race Free - Sequentially Consistent memory model (DRF-SC) used by C++11 (and, I think, Java) says that as long as all operations on atomics are SC and the program is data-race-free, then the whole program can be proven to be sequentially consistent.
[1] but it might in some special cases, for example when all operations are mutex lock and unlock.
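To make the gap concrete, here's the classic IRIW (independent reads of independent writes) litmus test as a Rust sketch: every cross-thread interaction here establishes happens-before via acquire/release, yet the program still has an outcome no sequentially consistent execution can produce.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn iriw() -> (bool, bool, bool, bool) {
    let w1 = thread::spawn(|| X.store(true, Ordering::Release));
    let w2 = thread::spawn(|| Y.store(true, Ordering::Release));
    // Each reader's two acquire loads happen in program order; the
    // question is whether the readers agree on the order of the writes.
    let r1 = thread::spawn(|| (X.load(Ordering::Acquire), Y.load(Ordering::Acquire)));
    let r2 = thread::spawn(|| (Y.load(Ordering::Acquire), X.load(Ordering::Acquire)));
    w1.join().unwrap();
    w2.join().unwrap();
    let (a, b) = r1.join().unwrap();
    let (c, d) = r2.join().unwrap();
    // Under acquire/release, (a, b, c, d) == (true, false, true, false)
    // is allowed: reader 1 sees X's store first, reader 2 sees Y's store
    // first, so there is no single total order of the writes. With
    // Ordering::SeqCst everywhere, that outcome is forbidden. (x86
    // hardware happens to forbid it too, but the guarantee is stated at
    // the language level, not the hardware level.)
    (a, b, c, d)
}
```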
Not a bad idea actually! I would be happy with this too! It would require changing Go's fmt to allow inline code in braces if it consists of a single return... that would be backward compatible with other code too (just make fmt not rewrite the existing multi-line style into a single line).