Memory barriers

Jun 14, 2012 at 8:27pm

Hi all.

Not being a real computer programmer, I have only just discovered the existence of memory barriers (and, more importantly, of the problem they are designed to solve). This seriously endangers the self-confidence I had gained in dealing with concurrency issues.

But it also leaves me with some questions, Max-wise:

- I have noticed that we have two macros, ATOMIC_INCREMENT_BARRIER and ATOMIC_DECREMENT_BARRIER, besides the “non-barrier” ones (the two sets actually differ only under OSX). But why do the examples dealing with buffer~, at least in the Max5 SDK, use the non-barrier versions? As I understand it, this is exactly the case in which barrier increments are needed…

- AFAIK the Max API doesn’t contain any cross-platform barrier mechanism. How does the Cycling code deal with the issue? And how do people more experienced than me deal with it in their externals? Simply by building their own macros?

… or does all this mean (as I think I have read somewhere) that this is not a real problem on the x86 architecture (while it was on the PPC)?

Thank you very much for any enlightenment!
aa

#46597
Jun 15, 2012 at 5:55am

Hello,

As a “non real computer programmer” too, I struggled a lot with memory barriers.

I tried to solve a problem with instructions reordered by compiler optimisation, expecting to be able to force the compiler to keep them in the correct order, but with no result at all. The only solution I found was to change the function declarations in the code concerned, to avoid passing automatic variables as arguments:

long val; stackPop (stack); val = stackPopped (stack);

Instead of :

long val; stackPop (stack, &val);

But I never experienced a multi-threading situation like your “buffer~” case.

Anybody else with better knowledge?

#167601
Jun 15, 2012 at 6:34am

Hi Nicolas.

I’m not sure I understand your example. Do you mean that in the case

long val; stackPop (stack, &val);

it might happen that stackPop returned before val was correctly allocated on the stack? According to what I have read, both compiler and CPU reordering are guaranteed to keep the logical order of operations consistent in a single-threaded context, which should be the case with your code. Did you notice otherwise? I mean, if your code has a chance to get messed up then programming is black magic… Would you like to explain to me further the issues you found with that?

On the other hand, I’m afraid that taming compiler reordering is not enough to be sure that memory operations are actually performed in the order you meant. The CPU does its share of reordering as well, and it seems that there is no way to control it besides placing memory barriers.
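To make this concrete, here is a minimal sketch of what such a barrier looks like in portable C11, assuming a compiler that provides `<stdatomic.h>`. It is not Max SDK code: `payload`, `ready`, `publish` and `try_consume` are hypothetical names made up for the illustration. One thread writes a value and then raises a flag; the release/acquire pair is what guarantees that a thread which sees the flag also sees the value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: a plain payload plus an atomic "ready" flag. */
static long payload;
static atomic_bool ready;

/* Writer thread: fill the payload, then publish it. */
void publish(long value)
{
    payload = value;
    /* Release store: the payload write above may not be reordered
       (by the compiler or the CPU) to after this flag write. */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader thread: returns true and fills *out once the payload is visible. */
bool try_consume(long *out)
{
    assert(out != NULL);
    /* Acquire load: the payload read below may not be reordered
       to before this flag read. */
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

As far as I know, on x86 this pair usually compiles to ordinary loads and stores (the hardware already keeps this particular ordering, so only the compiler has to be restrained), while on weaker architectures such as PPC the compiler has to emit real barrier instructions.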

But frankly, I’m just trying to make sense of stuff I have read here and there on the internet… I’d really love someone knowledgeable to explain more about this to me…

aa

#167602
Jun 15, 2012 at 12:28pm

Hello,

“it might happen that stackPop returned before val was correctly allocated in the stack?”

More or less, but to be honest I cannot remember precisely how it happened, and after a few investigations I’m not able to reproduce the behavior anymore :-(

The only thing I know is that it happened only with local automatic variables (never heap nor bss) and aggressive optimisation (-O3).

With the trick explained above and a POSIX lock on my shared data, I never had to go deeper into memory barrier documentation, even with (-O3 -Os) optimisations.

Sorry, not a very useful post indeed ;-)

#167603
Jun 15, 2012 at 3:32pm

Hello again,

I searched a little bit more in my archives and it seems the example I gave you was not really correct … oops … I cannot find _exactly_ the error I had … so forget it ;-)

Ciao.

#167604
Jun 19, 2012 at 12:25am

Tim… you answered everyone’s posts but mine… :(
(just joking, of course, but bump!)
thank you
aa

#167605
Jun 19, 2012 at 4:12pm

Haha — Not trying to avoid you Andrea, just trying to avoid memory barriers ;-)

Memory barriers are pretty complex and I can’t hope to explain them clearly or comprehensively here. To try and give the boiled-down answer: software engineering is always a series of compromises. Nowhere is this more true than in multithreading. When there are multiple threads operating on shared data there is a tradeoff between speed and absolute safety. (Just to muddy the waters, there are also differences between theoretical speed and real-world speed, and between theoretical safety and real-world safety.)

The scenarios will be different depending on:

* how many threads are accessing the data?
* is any given thread read-only? write-only? or read-write?
* what is/are the other thread(s) activity (read/write/both)?
* how long do the operations on the two threads take?
* etc.
* etc.
* etc.

Mutexes and critical regions provide the most safety, and are appropriate in many cases. They are, however, problematic for realtime performance-sensitive code (e.g. audio). The atomic inc/dec can be used as a lighter-weight mechanism. There are scenarios, however, where instructions may still end up mis-ordered. A slightly heavier-weight way to help with this is to use the barrier variants.
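As a rough illustration of the difference between the plain and the barrier variants, here is a sketch using C11 atomics rather than the SDK macros. Everything in it (`shared_buf`, `buf_retain`, and so on) is a hypothetical name invented for the example, and it is only an analogy for how an “in use” counter might guard a buffer, not the actual implementation of ATOMIC_INCREMENT or ATOMIC_INCREMENT_BARRIER.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared buffer with an "in use" count. */
typedef struct {
    atomic_long inuse;
    float samples[64];
} shared_buf;

void buf_init(shared_buf *b)
{
    assert(b != NULL);
    atomic_init(&b->inuse, 0);
}

/* Plain atomic increment: the count itself is updated atomically, but
   no ordering is imposed on the surrounding reads and writes. */
long buf_retain_relaxed(shared_buf *b)
{
    return atomic_fetch_add_explicit(&b->inuse, 1, memory_order_relaxed) + 1;
}

/* Barrier variant: sequentially consistent, so accesses to b->samples
   written after this call cannot be moved before the increment. */
long buf_retain(shared_buf *b)
{
    return atomic_fetch_add_explicit(&b->inuse, 1, memory_order_seq_cst) + 1;
}

/* Barrier decrement: accesses to b->samples written before this call
   cannot be moved after the decrement. Returns the new count. */
long buf_release(shared_buf *b)
{
    long prev = atomic_fetch_sub_explicit(&b->inuse, 1, memory_order_seq_cst);
    assert(prev > 0); /* catches a release without a matching retain */
    return prev - 1;
}
```

The relaxed version is enough when the counter is only a statistic; the barrier version matters when the counter is what tells another thread whether the buffer contents may be touched.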

In some cases you can use a structure like a non-locking queue (see http://www.rossbencina.com/code/lockfree) which is fast and doesn’t require a mutex or critical region. If you dig into this you will find several implementations online which all basically come around to the same algorithm, except that some use a memory barrier and some don’t. Is it really needed? Is it there out of superstition? No one that I know of has written an article specifically explaining their use of a memory barrier or not. Were there real-world examples from runs of their program that exhibited instruction re-ordering? Or was it just a fear of incorrect instruction ordering? I don’t know.
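For reference, a single-producer/single-consumer ring buffer of this kind can be sketched as follows in C11. This is not taken from the page linked above; the names (`spsc_ring`, `ring_push`, `ring_pop`, `RING_SIZE`) are hypothetical, and it assumes exactly one thread pushes and exactly one thread pops. The acquire/release orderings are where the “barrier or no barrier” question shows up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u /* must be a power of two */

typedef struct {
    long slots[RING_SIZE];
    atomic_size_t head; /* next slot to read; written by the consumer only */
    atomic_size_t tail; /* next slot to write; written by the producer only */
} spsc_ring;

void ring_init(spsc_ring *r)
{
    assert(r != NULL);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false if the ring is full. */
bool ring_push(spsc_ring *r, long value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return false;
    r->slots[tail & (RING_SIZE - 1)] = value;
    /* Release: the slot write must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
bool ring_pop(spsc_ring *r, long *out)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire: pairs with the release in ring_push. */
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_SIZE - 1)];
    /* Release: the slot read must finish before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Neither function ever blocks or takes a lock, which is what makes this shape attractive for passing data to or from an audio thread.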

So now you know why I avoided your question ;-)

All of that said, are you experiencing a particular problem in your code that you suspect is related to memory barriers?

Cheers,
Tim

#167606
Jun 20, 2012 at 8:09am

Hi Tim.

Thank you very much for taking the time for this. Now I have a much clearer picture.

First of all, no, I’m not experiencing any particular practical problem. It was just that I was reading around and stumbled upon this thing I didn’t really know about, and I felt like omg, my code might be full of this kind of problem everywhere. And so I wanted to understand more… But on the other hand it’s true that we routinely test our bach externals with fast metros and qmetros working in parallel, and we never met a problem that we couldn’t solve with proper thread-locking.

So all in all I understand that in everyday situations instruction reordering is more a theoretical than a practical issue… but ok, nonetheless I think I’ll start paying attention to this from now on!

Thanks again
aa

#167607
Jun 20, 2012 at 9:07am

Hello,

I do not think it is only theory, but it occurs in very specific scenarios. I guess it will never happen if you do not write real-time multithreaded DSP code and if you properly use a mutex for shared data …

In my case (compiler reordering) I remember that I was also using “extern inlining” in that scenario ;-)

Now, each time a really, really strange behavior occurs, I do not forget to run a test with optimisation off.

This article from Ross Bencina is very interesting too:

http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing

#167608
Jul 4, 2012 at 4:22pm

Hello folks,

A lock-free introduction for noobs (with a few lines about memory reordering): http://preshing.com/20120612/an-introduction-to-lock-free-programming and several other good papers (at least for me): http://preshing.com/

Ciao ;-)

#167609
Jul 4, 2012 at 8:17pm

Thank you for sharing this, Nicolas!
Cheers,
Ádám

#167610
Jul 5, 2012 at 5:26am

Very interesting, thanks!
aa

#167611