You Can Do Any Kind of Atomic Read-Modify-Write Operation Comments

Charles

Mon, 15 May 2017 16:34:10 +0000

Hi Jeff, when I use memory_order_acq_rel on an RMW, my understanding is that it prevents memory operations from moving down below the operation (release) and from moving up above the operation (acquire). Does that mean it acts as a full barrier (including #StoreLoad)? If so, how does it differ from using memory_order_seq_cst on an RMW call?

Francesco

Tue, 7 Feb 2017 11:27:51 +0000

Thanks Jeff, sorry for the strange questions, but in my day-job language (Java) there are no atomic reads/writes with relaxed ordering (until now), only read-acquire/write-release ones, hence the Java memory model accustoms devs to using read-acquire/write-release for every thread communication... About the reordering logic: does the CAS loop I've written, but using relaxed order, not suffer from reordering of any kind? This is the part that scares me the most...

Francesco

Tue, 7 Feb 2017 09:54:48 +0000

Hi Jeff! Thanks for the blog, it is pure gold!!!! I'm a bit confused by the failure and success parameters in:

    bool compare_exchange_strong( T& expected, T desired, std::memory_order success, std::memory_order failure ) volatile;

In a loop that does:

    local_x = x.loadAcquire
    do {
        ...check if local_x is valid to compute next_x and exit if not...
        next_x = ...cool computation that uses local_x...
    } while (!x.cas(&local_x, next_x, Release, Relaxed));

the acquire and release orders (at the start and at the end) are not necessary to perform a correct CAS loop, but to ensure that:

1) all the instructions inside the (failed iterations of the) loop can't be pushed outside of it (it is a single-thread compiler/HW constraint)
2) correct acquire and release semantics hold for any other shared memory access inside the loop

Is that true? Or am I using acquire-release semantics when they are not necessary? I generally tend to use a write-release operation when I want to transfer data between threads, but if the data is x itself, could relaxed order be enough to maintain the correct single-threaded logic (no hoisting etc.) and thread-safe operation atomicity? Sorry for the long question :(
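For reference, the loop shape described in this comment can be written as valid C++ roughly as follows. This is only a sketch: the function name `increment_below`, the `limit` check and the "+1" computation are hypothetical stand-ins for the "is local_x valid" test and the "cool computation". It illustrates where the success and failure orders go; it does not settle which orders a given algorithm actually needs.

```cpp
#include <atomic>

// Hypothetical example: increment `x` only while it stays below `limit`.
// Returns true if this call performed the increment.
inline bool increment_below(std::atomic<int>& x, int limit) {
    int local_x = x.load(std::memory_order_acquire);
    do {
        if (local_x >= limit) {
            return false;  // local_x is not valid: exit without writing
        }
        // next_x (local_x + 1) is recomputed from the latest local_x below.
    } while (!x.compare_exchange_weak(local_x, local_x + 1,
                                      std::memory_order_release,    // used if the exchange succeeds
                                      std::memory_order_relaxed));  // used for the reload on failure
    return true;
}
```

On a failed attempt the call reloads `local_x` using the failure order, so in this sketch every retry after the first reads `x` with relaxed ordering rather than acquire.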

Zab

Mon, 3 Oct 2016 09:23:55 +0000

Hi Jeff, can you please help me understand the guarantees behind read-write atomic operations? Here is my simple scenario: 1. A single writer process is allowed to increment a shared 32-bit unsigned integer in a non-atomic fashion (i = i + 1). 2. On the other side, N reader processes read this shared unsigned integer using an atomic read (__atomic_add_fetch). Is that OK? I mean, does the CPU guarantee that the readers will be able to read valid values? Or can something strange happen in between, while writing to this 32-bit integer and reading from it? I'm completely lost with the guarantees here. Maybe you can blog about it. Thank you.

AR4

Tue, 29 Dec 2015 23:26:26 +0000

Jeff, I love your blog, and it's been in my RSS list for years now. Unfortunately it looks like my reader is starting to choke on the size of the feed. Any chance you could cut it down to just the last year or so? Looking forward to more posts!

@jcdickinson

Thu, 8 Oct 2015 10:43:27 +0000

Shouldn't fetch_multiply re-load oldValue in the while loop body?

EugeneZ

Thu, 24 Sep 2015 13:23:53 +0000

Jeff, thanks for the read! I composed some comments to the draft (N4296) of a proposed C++17 Standard. Are they correct? >>> https://parallella.org/forums/viewtopic.php?f=13&...

Sean

Mon, 10 Aug 2015 10:29:04 +0000

Hi Jeff, What do you use for the graphics/pictures on your blog? It's presented very well.

Gautam Goel

Mon, 22 Jun 2015 05:25:15 +0000

Hi Jeff, Love your blog. Quick question: what Octopress theme are you using? The blog looks great! Thanks.

Eugene

Wed, 17 Jun 2015 11:43:10 +0000

What would be the correct memory_order for a simple bounded stack implementation?

    size_t stack_buffer[256];
    std::atomic_uint_fast8_t stack_pointer(0);

    void push(const size_t value) {
        // the out-of-bounds write is assumed to never happen
        stack_buffer[stack_pointer.fetch_add(1, std::memory_order_release)] = value;
    }

    bool pop(size_t& value) {
        uint_fast8_t oldStatus(stack_pointer.load(std::memory_order_relaxed));
        do {
            if (oldStatus == 0) return false;
            value = stack_buffer[oldStatus - 1];
        } while (!stack_pointer.compare_exchange_weak(oldStatus, oldStatus - 1, std::memory_order_acquire, std::memory_order_relaxed));
        return true;
    }

I've noticed that using the default memory_order for compare_exchange_weak and fetch_add can really hurt performance on mobile platforms, but I'm not particularly sure my use is correct. Thanks.

Jeff Preshing

Mon, 11 May 2015 11:34:06 +0000

std::atomic<>::fetch_add() is an RMW. So is InterlockedIncrement() in Win32. I don't know what API exposes atomic_increment(), but based on the name, it's likely to be an RMW too.
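As a rough illustration of the point above, here is a minimal sketch of a stats counter updated from several threads with fetch_add. The `Stats` struct mirrors the one in the question; the `count_events` driver, the thread count and the choice of relaxed ordering are assumptions made for the example, not something taken from the post.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct Stats {
    std::atomic<std::uint64_t> count{0};
};

// Hypothetical driver: each of `threads` threads bumps the counter
// `perThread` times, then the final total is returned.
inline std::uint64_t count_events(unsigned threads, std::uint64_t perThread) {
    Stats s;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&s, perThread] {
            for (std::uint64_t i = 0; i < perThread; ++i) {
                // fetch_add is a single atomic RMW, so no increment is lost.
                s.count.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    // join() synchronizes with each thread's completion, so a relaxed load
    // here observes every increment.
    return s.count.load(std::memory_order_relaxed);
}
```

Calling `count_events(4, 10000)` returns 40000. Relaxed ordering is enough here only because the counter is not used to publish other data between threads.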

Hari

Mon, 11 May 2015 11:12:31 +0000

I have a question. If you are updating stats from multiple threads, say: struct { uint64_t count; }; Would atomic_increment(&s.count) suffice or would you have to do RMW? Seems to me you have to use RMW.

Jeff Preshing

Fri, 3 Apr 2015 09:15:08 +0000

Yes. There is actually a limit on the number of RMW operations that can happen on the same address per second, no matter how many threads you throw at it. On my quad-core MacBook, I once measured it at around 20 ns per RMW operation. With four threads running on four cores, that means each thread could only get in there once every 80 ns, on average. On my dual-processor PowerMac G5, the rate was much lower and I'm pretty sure individual threads could be made to starve indefinitely. So, I think you can make it happen in synthetic tests, but even in that case, at least there's still progress being made across the whole system.

Nathan Reed

Fri, 3 Apr 2015 05:29:09 +0000

Do you ever run into starvation issues with CAS loops, if several threads are hammering on the same shared variable with CAS loops? Suppose thread A has to do a bunch of extra work in the "modify" step for some reason, such that one of threads B, C, ... has almost always gotten in and updated the variable by the time A gets to the CAS call. Then A might have to retry many, many times before it succeeds. Maybe this situation is contrived enough that it's not actually a problem in practice. :)

Jeff Preshing

Fri, 3 Apr 2015 03:15:17 +0000

You're right. Thanks for the precision! I've since changed that sentence to "If it fails, it loads the current value of shared back into expected." It doesn't make any difference when you pass a local variable to expected (as is done here), but it's obviously an important difference in the cases described in that Stack Overflow link, where a shared variable is passed in.

bames53

Thu, 2 Apr 2015 17:22:27 +0000

This quote is not quite correct: "Simultaneously, it always loads the current value of shared back into expected". compare_exchange_weak/strong should not write to `expected` if the exchange is successful. `expected` may be a shared variable, and a successful write may be used to indicate that the thread is finished with its work and that other threads can now safely read and write that shared variable. In such a case, if compare_exchange were to write to expected again, it could create a race and stomp on writes by other threads. The C++ spec states that expected is written to only if the exchange fails [n4140 § 29.6.5 / 21]. Originally, implementations did not correctly guarantee this, but MSVC, gcc, and clang have since been fixed. See my Stack Overflow answer for more details: http://stackoverflow.com/a/21946549/365496
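The visible contract described here can be checked with a small single-threaded sketch. The `CasResult` struct and the `try_cas` helper are hypothetical names introduced for the example. A single-threaded test cannot show that no write to `expected` happens on success, only that its value is unchanged; the race described above needs a shared `expected` to become observable.

```cpp
#include <atomic>

// Small result record so a caller can inspect what one CAS did.
struct CasResult {
    bool exchanged;  // return value of compare_exchange_strong
    int expected;    // value of `expected` after the call
    int stored;      // value held by the atomic after the call
};

// Hypothetical helper: run one compare_exchange_strong on a fresh atomic
// initialised to `initial` and report the outcome.
inline CasResult try_cas(int initial, int expected, int desired) {
    std::atomic<int> shared{initial};
    bool ok = shared.compare_exchange_strong(expected, desired);
    return CasResult{ok, expected, shared.load()};
}
```

With these definitions, `try_cas(5, 5, 9)` reports a successful exchange with `expected` still 5 and 9 stored, while `try_cas(5, 3, 9)` reports a failure with `expected` updated to 5 and the stored value left at 5.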

Jeff Preshing

Thu, 2 Apr 2015 16:30:00 +0000

Well, C++11 atomic operations are always atomic. They're just not always lock-free. The library was designed that way to achieve maximum portability while maintaining correctness. It only guarantees lock-freedom if std::atomic<T>::is_lock_free() returns a non-zero value for that type. When C++11 atomic operations are not lock-free, then yea, they will compile to code that uses something equivalent to a mutex. You can test it by wrapping a very large struct in an atomic<>. In VS2012/x86, the compiler implements a simple per-object spinlock, for example.
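One way to try the "very large struct" experiment is sketched below. Note the substitution: instead of the is_lock_free() member function mentioned above, it uses the C++17 compile-time constant std::atomic<T>::is_always_lock_free, which needs no runtime object. The `Small` and `Big` types and the `always_lock_free` helper are hypothetical names for the example, and the result for any given type depends on the compiler and target.

```cpp
#include <atomic>

struct Small { int a; };           // typically fits in one machine word
struct Big   { char bytes[64]; };  // much wider than common atomic instructions

// Reports whether std::atomic<T> is lock-free for every object of type T.
// is_always_lock_free is a C++17 compile-time constant.
template <typename T>
constexpr bool always_lock_free() {
    return std::atomic<T>::is_always_lock_free;
}
```

On mainstream desktop and mobile toolchains `always_lock_free<Big>()` is expected to be false, which means operations on `std::atomic<Big>` fall back to some form of locking. On some toolchains, actually performing those operations may also require linking an extra support library.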

zeuxcg

Thu, 2 Apr 2015 15:39:30 +0000

This is interesting. Is there any reason why std::atomic<T> can't fail to compile if lock-free code can't be generated? Silently making atomic operations non-atomic seems... bad. Or will atomic<T> with an unsupported T compile to code that uses a global mutex or something of the sort?

Jeff Preshing

Thu, 2 Apr 2015 14:05:24 +0000

compare_exchange_weak always reloads oldValue on failure. oldValue is passed by reference.
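To make this concrete, here is a minimal sketch of a fetch_multiply built on a CAS loop. It is written from the description in this thread rather than copied from the post, and the signature and relaxed ordering are assumptions for illustration. Because `oldValue` is passed by reference, the loop body can stay empty.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically multiplies `shared` by `multiplier`
// and returns the value that was stored before the multiplication.
inline std::uint32_t fetch_multiply(std::atomic<std::uint32_t>& shared,
                                    std::uint32_t multiplier) {
    std::uint32_t oldValue = shared.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the current value of `shared`
    // into oldValue, so the next attempt recomputes oldValue * multiplier
    // from fresh data.
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier,
                                         std::memory_order_relaxed)) {
    }
    return oldValue;
}
```

Starting from a value of 6, `fetch_multiply(shared, 7)` returns 6 and leaves 42 in `shared`.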

Matt Fisher

Thu, 2 Apr 2015 13:16:10 +0000

I especially like the concurrent fetch_multiply thread graphic. There appears to be a potential infinite loop in your first while(!compare_exchange_weak()); code, since it never reloads oldValue in the case of failure. If the other thread sets a new value, it will never succeed. The other compare_exchange_weak instances use do-while, so they don't suffer from the same danger.