
More efficient than: if (++index >= arraylength) index = 0; ?

September 7, 2006 | 1:40 am

Hi,

Peter, you wrote that James McCartney wrote an article on something
more efficient than:

if (++index >= arraylength) index = 0;

for setting the index back to zero rather than going off the end of
the array. I’m curious to see what his method is. I had a quick look
in the music dsp archives but had no idea what to search for. Can you
post a link to it?

Thanks,

Aengus.


September 7, 2006 | 7:56 am

On 7-Sep-2006, at 3:40, Aengus Martin wrote:

> Peter, you wrote that James McCartney wrote an article on something
> more efficient than:
>
> if (++index >= arraylength) index = 0;

The problem is doing that inside your inner loop.

I had a quick google and didn’t find the article either, I’m afraid.

Remind me next week and I can post some example code showing the
technique. Sorry, but not this week.

-------------- http://www.bek.no/~pcastine/Litter/ --------------
Peter Castine --> Litter Power & Litter Bundle for Jitter
                  (Universal Binaries on the way)
iCE: Sequencing, Recording & Interface Building for Max/MSP
http://www.dspaudio.com/          http://www.castine.de


September 7, 2006 | 4:49 pm

Well, the easy way to do this, if your array length is a power of two,
would be:

say len = 256 (1 << 8)...

++index;
index &= 0x00ff;

or if len = 4096 (1 << 12, the size of a memory page in OS X)
++index;
index &= 0x0fff;

the bitwise-and then store operation is comparatively cheap,
one to two instructions depending on optimization.

one can easily make a mask given a power-of-two length like:

#define make_mask(len) ( ~( (~(len) + 1) ) )

that will work for most of the powers of two (until you start hitting
overflow)
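
so a complete sketch of the trick might look like this (untested; the
write_ring / BUF_LEN names are just mine for illustration):

#include <stddef.h>

/* equivalent to (len - 1); only valid when len is a power of two */
#define make_mask(len) ( ~( (~(len) + 1) ) )

#define BUF_LEN  256                  /* must be a power of two */
#define BUF_MASK make_mask(BUF_LEN)   /* 0x00ff */

/* copy n samples into the circular buffer, wrapping with a mask
   instead of a branch */
static void write_ring(float *buf, size_t *index, const float *in, size_t n)
{
    size_t i = *index;

    while (n-- > 0) {
        buf[i] = *in++;
        i = (i + 1) & BUF_MASK;       /* wrap without an if */
    }
    *index = i;
}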

But… before you go optimizing, have you done any profiling?
Do you know that your bounds check is really slowing you down?

_Mark

On Sep 7, 2006, at 12:56 AM, Peter Castine wrote:

> On 7-Sep-2006, at 3:40, Aengus Martin wrote:
>
>> Peter, you wrote that James McCartney wrote an article on something
>> more efficient than:
>>
>> if (++index >= arraylength) index = 0;
>
> The problem is doing that inside your inner loop.
>
> I had a quick google and didn’t find the article either, I’m afraid.
>
> Remind me next week and I can post some example code showing the
> technique. Sorry, but not this week.


September 10, 2006 | 7:08 pm

Hi Aengus,
I don't think there are any reasonably general ways to do this other than
either avoiding the branch or steering the CPU's branch prediction in the
right direction.
For the sake of maintainability I normally prefer the latter... modern
gcc compilers have a pseudo-function for "branch hints", called
__builtin_expect. You can use it as shown in
http://kratcode.wordpress.com/2006/02/09/likelyunlikely-gcc-extensions/
The compiler knows how CPUs implement branch prediction and will emit
the proper code for this.
If arraylength is >> 0, then your code becomes
if (unlikely(++index >= arraylength)) index = 0;
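
(The likely/unlikely wrappers from that page are just thin shells around
__builtin_expect; roughly, assuming GCC:)

#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)   /* expect x to be true  */
#define unlikely(x) __builtin_expect(!!(x), 0)   /* expect x to be false */
#else
#define likely(x)   (x)                          /* no-op elsewhere */
#define unlikely(x) (x)
#endif

/* advance a circular index, hinting that the wrap is the rare case */
static int advance(int index, int arraylength)
{
    if (unlikely(++index >= arraylength))
        index = 0;
    return index;
}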

best greetings,
Thomas


September 12, 2006 | 11:01 am

On 7-Sep-2006, at 9:56, Peter Castine wrote:
> On 7-Sep-2006, at 3:40, Aengus Martin wrote:
>> Peter, you wrote that James McCartney wrote an article on something
>> more efficient than:
>>
>> if (++index >= arraylength) index = 0;
>
> The problem is doing that inside your inner loop.
>
> Remind me next week and I can post some example code showing the
> technique.

Mark’s trick with bit-masking rather than branching is not what I was
talking about, although it is another approach for power-of-two arrays.

Thomas’ idea of branch hints is OK, but only works for some chips
(granted, the chips you’re writing for with Max/MSP/Jitter externals
have this).

What I’m on about is a general technique that you should have in your
programming vocabulary.

BEFORE you attack this, make sure that your code is correct and don’t
worry about efficiency. I cannot emphasize this enough.

And *of course* you should profile your code to make sure that you
actually do have a performance problem before attacking efficiency
concerns. OTOH, DSP-loops are always a bottleneck because someone is
bound to instantiate 100 copies of your MSP external sooner or later.

So, if you have a perform loop that looks like:

===========================
int   n = vectorSize,
      m = 0;
float *inputVector,
      buf[kArraySize];

// Initialize inputVector and, if necessary, buf.
// then...

while (n-- > 0) {
    m += 1;
    if (m >= kArraySize)
        m = 0;
    buf[m] = *inputVector++;
    // whatever other processing follows...
}

===========================

The code is less efficient than it could be because you have two
branch instructions in your inner loop (the while condition is a
branch, as is the if statement). Ideally you should never have more
than a single branch in your inner loop: the loop condition itself.

The solution is to unwrap the loop into something like the below:

===========================
// Initialize as before, then:

while (n > 0) {                         // Decrement inside loop
    int sampsThisTime = (n > kArraySize) ? kArraySize : n;
    n -= sampsThisTime;

    m = 0;
    while (sampsThisTime-- > 0) {
        buf[m++] = *inputVector++;
        // whatever other processing follows...
    }
}

===========================

The inner loop now only has a single branch instruction for the loop
condition. You have branch overhead for the outer loop, but those
branch instructions will, on average, only be executed once every
kArraySize samples.

Obviously not worth the effort if kArraySize is small but the
assumption is that the buffer is large enough to justify the extra work.

There are zillions of ways to further tweak the code (use an
incremented pointer instead of buf[m], etc.) Left as an exercise for
the reader.
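
For instance, the pointer version (equally untested; it just mirrors the
loop above with a pointer p in place of the index m) might look like:

===========================
while (n > 0) {                         // Decrement inside loop
    int   sampsThisTime = (n > kArraySize) ? kArraySize : n;
    float *p            = buf;          // start at the head of the buffer,
                                        // as m = 0 did above
    n -= sampsThisTime;

    while (sampsThisTime-- > 0) {
        *p++ = *inputVector++;
        // whatever other processing follows...
    }
}
===========================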

I hear lunch is on the table. The code above is untested and hasn’t
even been proof-read. Caveat lector.

-------------- http://www.bek.no/~pcastine/Litter/ --------------
Peter Castine --> Litter Power & Litter Bundle for Jitter
                  (Universal Binaries on the way)
iCE: Sequencing, Recording & Interface Building for Max/MSP
http://www.dspaudio.com/          http://www.castine.de


September 12, 2006 | 7:38 pm

Also, it just occurred to me that you should make very sure that your
circular buffer is at least as large as the DSP vector. I know this
is a silly and simple suggestion, but you might notice nasty clicks
and other artifacts when you overwrite the starting point in your
buffer for the current dsp call.
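
A cheap way to catch that during development (just a sketch, reusing the
vectorSize and kArraySize names from Peter's code) is an assert:

#include <assert.h>

enum { kArraySize = 4096 };            /* illustrative buffer size */

/* the circular buffer must hold at least one full signal vector,
   otherwise a single dsp call overwrites its own starting point */
static void check_buffer_size(int vectorSize)
{
    assert(kArraySize >= vectorSize);
}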

One more thing ;)

If you're using Peter's excellent suggestion to pre-compute the wrap
value, and the element size of your circular buffer is the same as the
element size of the dsp vector, you may as well use memcpy. You may
see up to a 2x improvement that way, as OS X auto-vectorizes memcpy
when possible.

so the code would look like:

while (n > 0) {                         // Decrement inside loop
    int    sampsThisTime;
    size_t newCurBufPos;

    if ((n + curBufPos) >= kArraySize) {
        // we're wrapping
        sampsThisTime = (kArraySize - curBufPos);
        newCurBufPos  = 0;
    }
    else {
        // it all fits
        sampsThisTime = n;
        newCurBufPos  = curBufPos + n;
    }

    memcpy(buf + curBufPos,
           inputVector + curInputPos,
           sampsThisTime * sizeof(*inputVector));

    n           -= sampsThisTime;
    curInputPos += sampsThisTime;
    curBufPos    = newCurBufPos;
}

This way we don't even have an inner loop (though memcpy does);
I think the code will loop at most twice, and if it loops more than
that then you're losing data.
I'm pretty sure this will work, but I ran no tests :)

_Mark

