Does it though? Assuming no torn reads/writes at those sizes, and given that the location should be strictly increasing, are there situations where you could read a higher-than-stored value that would cause a necessary update to be skipped?
AFAIK, on all of x86, ARM, and RISC-V, an atomic load of a word-sized datum is just a regular load.
It doesn't need to be strictly increasing; some other thread could be performing arbitrary other operations. Still, even in that case, as Dylan16807 pointed out, it likely doesn't matter.
If you are implementing a library function atomic<T>::fetch_max, you cannot assume that every other thread is also performing a fetch_max on that object. There might be little reason for it, but other operations are allowed, so the sequence of modifications might not be strictly increasing (but then again, it doesn't matter for this specific optimization).
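A minimal Rust sketch of the optimization under discussion (names are illustrative, not any library's actual API): probe with a cheap relaxed load and only fall back to the expensive RMW when it could actually raise the stored value. If every writer goes through fetch_max, the location is monotonically non-decreasing, so a stale read is always <= the true current value and the skip is safe; with arbitrary concurrent stores, the probe can observe a transiently higher value and skip an update a strict fetch_max would have applied.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn fetch_max_hint(target: &AtomicU64, value: u64) -> u64 {
    // Cheap probe first: if the stored value is already >= value,
    // the RMW could not change the maximum, so skip the RMW entirely.
    let current = target.load(Ordering::Relaxed);
    if current >= value {
        return current;
    }
    // Otherwise fall back to the real read-modify-write.
    target.fetch_max(value, Ordering::Relaxed)
}
```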
It's a common mistake to reason about memory models strictly in terms of hardware.
Sequential consistency is a property of a programming language's semantics and cannot simply be inferred from hardware. Even if every hardware operation were SC, the compiler could still provide weaker memory orderings through compiler-specific optimizations.
I'm referring to the performance implications of the hardware instruction, not the programming language semantics. Incrementing or decrementing the reference count is going to require an RMW instruction, which is expensive on x86 regardless of the ordering.
The concept of sequential consistency only exists within the context of a programming language's memory model. It makes no sense to speak about the performance of sequentially consistent operations without reference to the semantics of a programming language.
Yes, what I meant was that the same instruction is generated by the compiler regardless of whether the RMW operation is performed with relaxed or sequentially consistent ordering, because that instruction is strong enough in terms of hardware semantics to enforce C++'s definition of sequential consistency.
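A concrete illustration in Rust (the equivalent C++ std::atomic code compiles the same way; easy to verify on a compiler explorer): on x86-64 both functions below lower to the same `lock xadd`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// On x86-64 both functions compile to the same `lock xadd`: the ordering
// argument constrains what the compiler may reorder around the RMW, not
// which instruction is emitted, so the hardware cost is identical.
pub fn incr_relaxed(c: &AtomicU64) -> u64 {
    c.fetch_add(1, Ordering::Relaxed)
}

pub fn incr_seqcst(c: &AtomicU64) -> u64 {
    c.fetch_add(1, Ordering::SeqCst)
}
```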
There is a pretty clear mapping in terms of C++ atomic operations to hardware instructions, and while the C++ memory model is not defined in terms of instruction reordering, that mapping is still useful to talk about performance. Sequential consistency is also a pretty broadly accepted concept outside of the C++ memory model, I think you're being a little too nitpicky on terminology.
The presentation you are making is both incorrect and highly misleading.
There are algorithms whose correctness depends on sequential consistency and which cannot be implemented on x86 without explicit barriers, for example Dekker's algorithm.
What x86 does provide is TSO semantics, not sequential consistency.
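The problematic pattern at the heart of Dekker's entry protocol, sketched in Rust (just the store/load pair, not the full algorithm with its turn variable):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static FLAG_A: AtomicBool = AtomicBool::new(false);
static FLAG_B: AtomicBool = AtomicBool::new(false);

// Thread A's side; thread B is symmetric with the flags swapped.
fn try_enter_a() -> bool {
    // Store to my flag, then load the *other* thread's flag. Under TSO
    // the store can still be sitting in the store buffer when the load
    // executes, so with anything weaker than SeqCst (which supplies the
    // StoreLoad barrier) both threads may read `false` and both enter.
    FLAG_A.store(true, Ordering::SeqCst);
    !FLAG_B.load(Ordering::SeqCst) // true => safe to enter
}
```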
I did not claim that x86 provides sequential consistency in general, I made that claim only for RMW operations. Sequentially consistent stores are typically lowered to an XCHG instruction on x86 without an explicit barrier.
From the Intel SDM:
> Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).
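You can see that lowering directly (a Rust sketch; C++ std::atomic stores compile the same way):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// On x86-64, rustc/LLVM emit a plain `mov` for the release store (TSO
// already gives every store release semantics), but an `xchg` for the
// SeqCst store: the implicit LOCK of `xchg` provides the StoreLoad
// barrier that sequential consistency requires, with no separate MFENCE.
pub fn store_release(x: &AtomicU64) {
    x.store(1, Ordering::Release);
}

pub fn store_seqcst(x: &AtomicU64) {
    x.store(1, Ordering::SeqCst);
}
```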
> And futexes aren’t the only way to get there. Alternatives:
> - thin locks (what JVMs use)
> - ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)
Worth noting that somewhere under the hood, any modern lock is going to be using a futex (if supported). futex is the most efficient way to park on Linux, so you want to be using it even on the slow path. Your language's thread.park() primitive is almost certainly using a futex.
That's interesting, I'm more familiar with the Rust parking-lot implementation, which uses futex on Linux [0].
> Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you
It's a little more than that though: using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly. A futex lets you manage the atomic condition yourself, while generic parking utilities encode that state internally. A mutex implementation generally already has a built-in atomic condition with simpler state transitions for each thread in the queue, and so can avoid the additional overhead by making the futex call directly.
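To make that concrete, here's a sketch of the classic three-state futex mutex (assuming Linux and the `libc` crate; error handling, EINTR retries, and FUTEX_PRIVATE_FLAG are omitted for brevity). The point is that the lock's own atomic doubles as the futex word, so no extra parking state is needed:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
struct RawFutexLock {
    state: AtomicU32,
}

fn futex_wait(addr: &AtomicU32, expected: u32) {
    unsafe {
        // Sleep only if *addr still equals `expected` (checked atomically
        // by the kernel), which closes the race with unlock().
        libc::syscall(
            libc::SYS_futex,
            addr as *const AtomicU32,
            libc::FUTEX_WAIT,
            expected,
            std::ptr::null::<libc::timespec>(),
        );
    }
}

fn futex_wake_one(addr: &AtomicU32) {
    unsafe {
        libc::syscall(libc::SYS_futex, addr as *const AtomicU32, libc::FUTEX_WAKE, 1);
    }
}

impl RawFutexLock {
    fn lock(&self) {
        // Fast path: uncontended CAS, no syscall at all.
        if self
            .state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: mark the lock contended and sleep until woken.
        // swap() returning 0 means we took the lock ourselves.
        while self.state.swap(2, Ordering::Acquire) != 0 {
            futex_wait(&self.state, 2);
        }
    }

    fn unlock(&self) {
        // Only pay for the wake syscall if someone may be waiting.
        if self.state.swap(0, Ordering::Release) == 2 {
            futex_wake_one(&self.state);
        }
    }
}
```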
> It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly.
No, it absolutely isn’t.
The dominant cost of parking is whatever happens in the kernel and at the microarchitectural level when your thread goes to sleep. That cost is so dominant that whether you park with a futex wait or with a condition variable doesn't matter at all.
(Source: I’ve done that experiment to death as a lock implementer back when I maintained Jikes RVM’s thin locks and then again when I wrote and maintained ParkingLot.)
> The message has some weird mentions in (alloc565), but the actual useful information is there: a pointer is dangling.
The allocation ID is actually very useful for debugging. You can use the flags `-Zmiri-track-alloc-id=alloc565 -Zmiri-track-alloc-accesses` to track the allocation, deallocation, and any reads/writes to/from this location.
> Every single Future you look at will look like this,
That's not true. A Future is supposed to schedule itself to be woken up again when it's ready. This Future schedules itself to be woken immediately. Most runtimes, like Tokio, will put a Future that acts like this at the end of the run queue, so in practice it's not as egregious. However, it's unquestionably a spin lock, equivalent to backing off with thread::yield.
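The pattern in question, as a self-contained Rust sketch (hypothetical names, modeled on the usual yield_now idiom): the future registers no real wakeup source and instead wakes itself on the spot, so the executor just keeps re-polling it.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct YieldNow {
    polled: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.polled {
            Poll::Ready(())
        } else {
            self.polled = true;
            // Wake immediately instead of waiting on an actual event:
            // the runtime will re-poll us as soon as it gets around to it
            // (Tokio pushes such a task to the back of the run queue).
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```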
It's quite common for concurrent algorithms to only implement a subset of operations, for example forgoing removal or iteration. It's also common to put limitations on the data structure, such as limiting keys and values to 64 bits. Papaya being feature-complete means that it does not have any of these limitations when compared to std::collections::HashMap.
Looks very interesting, but seems to serve a pretty different use case:
> This is an ordered data structure, and supports very high throughput iteration over lexicographically sorted ranges of values. If you are looking for simple point operation performance, you may find a better option among one of the many concurrent hashmap implementations that are floating around. Pay for what you actually use :)
Java atomics are actually sequentially consistent. C# relaxes this to acquire/release. Still, the general concept of happens-before is immensely useful for learning atomics, since sequential consistency is a superset of acquire/release.
All of the memory models in question are based on data-race-free, which says (in essence) that as long as all cross-thread interactions follow happens-before, then you can act as if everybody is sequentially-consistent.
The original Java 5 memory model only offered sequentially-consistent atomics to establish cross-thread happens-before in a primitive way. The C++11 memory model added three more kinds of atomics: acquire/release, consume/release (which was essentially a mistake [1]), and relaxed atomics (which, to oversimplify, establish atomicity without happens-before). Pretty much every memory model since C++11--which includes the Rust memory model--has based its definition on that memory model, with most systems defaulting an otherwise unadorned atomic operation to sequentially-consistent. Even Java has retrofitted ways to get weaker atomic semantics [2].
As a practical matter, most atomics could probably safely default to acquire/release over fully sequentially-consistent. The main difference between the two is that sequentially-consistent is safer if you've got multiple atomic variables in play (e.g., you're going with some fancy lockless algorithm), whereas acquire/release tends to largely be safe if there's only one atomic variable of concern (e.g., you're implementing locks of some kind).
[1] A consume operation is an acquire, but only for loads data-dependent on the load operation. This is supposed to represent a situation that requires no fences on any system not named Alpha, but it turns out for reasons™ that compilers cannot reliably preserve source-level data dependencies, so no compiler really implemented consume/release.
[2] Even Java 5 may have had it in sun.misc.Unsafe; I was never familiar with that API, so I don't know for certain.
> as long as all cross-thread interactions follow happens-before, then you can act as if everybody is sequentially-consistent.
I don't think that's the actual guarantee. You can enforce happens-before with just acquire/release, but AFAIK that's not enough to recover SC in the general case [1].
As far as I understand, the Data Race Free - Sequentially Consistent memory model (DRF-SC) used by C++11 (and, I think, Java) says that as long as all operations on atomics are SC and the program is data-race-free, then the whole program can be proven to be sequentially consistent.
[1] but it might in some special cases, for example when all operations are mutex lock and unlock.
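To make the gap concrete, here's the classic IRIW (independent reads of independent writes) litmus test as a Rust sketch: every cross-thread interaction here establishes happens-before via acquire/release, yet the program still has an outcome no sequentially consistent execution can produce.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn iriw() -> (bool, bool, bool, bool) {
    let w1 = thread::spawn(|| X.store(true, Ordering::Release));
    let w2 = thread::spawn(|| Y.store(true, Ordering::Release));
    // Each reader's two acquire loads happen in program order; the
    // question is whether the readers agree on the order of the writes.
    let r1 = thread::spawn(|| (X.load(Ordering::Acquire), Y.load(Ordering::Acquire)));
    let r2 = thread::spawn(|| (Y.load(Ordering::Acquire), X.load(Ordering::Acquire)));
    w1.join().unwrap();
    w2.join().unwrap();
    let (a, b) = r1.join().unwrap();
    let (c, d) = r2.join().unwrap();
    // Under acquire/release, (a, b, c, d) == (true, false, true, false)
    // is allowed: reader 1 sees X's store first, reader 2 sees Y's store
    // first, so there is no single total order of the writes. With
    // Ordering::SeqCst everywhere, that outcome is forbidden. (x86
    // hardware happens to forbid it too, but the guarantee is stated at
    // the language level, not the hardware level.)
    (a, b, c, d)
}
```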
Not a bad idea actually! I would be happy with this too! It would require changing Go's fmt to allow inline code in braces if it consists of a single return... that would be backward compatible with other code too (just make fmt not rewrite the existing multi-line style into a single line).