More efficient than: if (++index >= arraylength) index = 0; ?


    Sep 07 2006 | 1:40 am
    Hi,
    Peter, you wrote that James McCartney wrote an article on something more efficient than:
    if (++index >= arraylength) index = 0;
    for setting the index back to zero rather than going off the end of the array. I'm curious to see what his method is. I had a quick look in the music dsp archives but had no idea what to search for. Can you post a link to it?
    Thanks,
    Aengus.

    • Sep 07 2006 | 7:56 am
      On 7-Sep-2006, at 3:40, Aengus Martin wrote:
      > Peter, you wrote that James McCartney wrote an article on something > more efficient than: > > if (++index >= arraylength) index = 0;
      The problem is doing that inside your inner loop.
      I had a quick google and didn't find the article either, I'm afraid.
      Remind me next week and I can post some example code showing the technique. Sorry, but not thi week.
      -------------- http://www.bek.no/~pcastine/Litter/ ------------- Peter Castine +--> Litter Power & Litter Bundle for Jitter Universal Binaries on the way iCE: Sequencing, Recording & Interface Building for |home | chez nous| Max/MSP Extremely cool |bei uns | i nostri| http://www.dspaudio.com/ http://www.castine.de
    • Sep 07 2006 | 4:49 pm
      Well, the easy way to do this if your array length is an even power of two would be:
      say len = 256 (1 << 7)...
      ++index; index &= 0x00ff;
      or if len = 4096 (1 << 13, the size of a memory page in os x) ++index; index &= 0x0fff;
      the bitwise-and then store operation is comparatively cheap, one to two instructions depending on optimization.
      one can easily make a mask given an even power of two length like:
      #define make_mask(len) ( ~( (~(len) + 1) ) )
      that will work for most of the powers of two (until you start hitting overflow)
      But... before you go optimizing, have you done any profiling? Do you know that your bounds check is really slowing you down?
      _Mark
      On Sep 7, 2006, at 12:56 AM, Peter Castine wrote:
      > On 7-Sep-2006, at 3:40, Aengus Martin wrote: > >> Peter, you wrote that James McCartney wrote an article on something >> more efficient than: >> >> if (++index >= arraylength) index = 0; > > The problem is doing that inside your inner loop. > > I had a quick google and didn't find the article either, I'm afraid. > > Remind me next week and I can post some example code showing the > technique. Sorry, but not thi week. > > -------------- http://www.bek.no/~pcastine/Litter/ ------------- > Peter Castine +--> Litter Power & Litter Bundle for Jitter > Universal Binaries on the way > iCE: Sequencing, Recording & > Interface Building for |home | chez nous| > Max/MSP Extremely cool |bei uns | i nostri| > http://www.dspaudio.com/ http://www.castine.de > >
    • Sep 10 2006 | 7:08 pm
      Hi Aengus, i don't think there are any other resonably general ways for this than either to avoid branching or to direct the branch prediction of the cpu into the right direction. For the sake of maintainability i normally prefer the latter... modern gcc compilers have a pseudo-function for "branch hints", called __builtin_expect. You can use it as shown in http://kratcode.wordpress.com/2006/02/09/likelyunlikely-gcc-extensions/ The compiler knows how cpus implement branch prediction and will emit proper code for this. if arraylength is >> 0, then your code will be if (unlikely(++index >= arraylength)) index = 0;
      best greetings, Thomas
    • Sep 12 2006 | 11:01 am
      On 7-Sep-2006, at 9:56, Peter Castine wrote: > On 7-Sep-2006, at 3:40, Aengus Martin wrote: >> Peter, you wrote that James McCartney wrote an article on something >> more efficient than: >> >> if (++index >= arraylength) index = 0; > > The problem is doing that inside your inner loop. > > Remind me next week and I can post some example code showing the > technique.
      Mark's trick with bit-masking rather than branching is not what I was talking about, although it is another approach for power-of-two arrays.
      Thomas' idea of branch hints is OK, but only works for some chips (granted, the chips you're writing for with Max/MSP/Jitter externals have this).
      What I'm on about is a general technique that you should have in your programming vocabulary.
      BEFORE you attack this, make sure that your code is correct and don't worry about efficiency. I cannot emphasize this enough.
      And *of course* you should profile your code to make sure that you actually do have a performance problem before attacking efficiency concerns. OTOH, DSP-loops are always a bottleneck because someone is bound to instantiate 100 copies of your MSP external sooner or later.
      So, if you have a perform loop that looks like:
      =========================== int n = vectorSize, m = 0; float *inputVector, buf[kArraySize];
      // Initialize inputVector and, if necessary, buf. // then...
      while (n-- > 0) { m += 1; if (m >= kArraySize) m = 0; buf[m] = *inputVector++; // whatever other processing follows... }
      ===========================
      The code is less efficient than it could be because you have two branch instructions in your inner loop (the while condition is a branch, as is the if statement). Ideally you will never have more than a single branch in your inside loop, the loop condition.
      The solution is to unwrap something like the below:
      =========================== // Initialize as before, then:
      while (n > 0) { // Decrement inside loop int sampsThisTime = (n > kArraySize) ? kArraySize : n; n -= sampsThisTime;
      m = 0; while (sampsThisTime-- > 0) { buf[m++] = *inputVector++; // whatever other processing follows... } }
      ===========================
      The inner loop now only has a single branch instruction for the loop condition. You have branch overhead for the outer loop, but those branch instructions will, on average, only be executed once every kArraySize samples.
      Obviously not worth the effort if kArraySize is small but the assumption is that the buffer is large enough to justify the extra work.
      There are zillions of ways to further tweak the code (use an incremented pointer instead of buf[m], etc.) Left as an exercise for the reader.
      I hear lunch is on the table. The code above is untested and hasn't even been proof-read. Caveat lector.
      -------------- http://www.bek.no/~pcastine/Litter/ ------------- Peter Castine +--> Litter Power & Litter Bundle for Jitter Universal Binaries on the way iCE: Sequencing, Recording & Interface Building for |home | chez nous| Max/MSP Extremely cool |bei uns | i nostri| http://www.dspaudio.com/ http://www.castine.de
    • Sep 12 2006 | 7:38 pm
      Also, it just occurred to me that you should make very sure that your circular buffer is at least as large as the DSP vector. I know this is a silly and simple suggestion, but you might notice nasty clicks and other artifacts when you overwrite the starting point in your buffer for the current dsp call.
      One more thing ;)
      If you're using Peter's excellent suggestion to pre-compute the wrap value, and the element size of your circular buffer is the same as the element size of the dsp vector, you may as well use memcpy. You may see up to a 2x improvement with that method, as os x auto-vectorizes memcpy when possible.
      so the code would look like:
      while (n > 0) { // Decrement inside loop int sampsThisTime; size_t newCurBufPos; if( (n + curBufPos) => kArraySize) { // we're wrapping sampsThisTime = (kArraySize - curBufPos); newCurBufPos = 0; } else { // it all fits sampsThisTime = n; newCurBufPos += n; }
      memcpy(buf + curBufPos, inputVector + curInputPos, sampsThisTime * sizeof(*inputVector));
      n -= sampsThisTime; curInputPos += sampsThisTime; curBufPos = newCurBufPos; }
      This way we don't even have an internal loop (though memcpy does); I think that the code will only loop twice at most, if it loops more than that then you're losing data. I'm pretty sure this will work, but I ran no tests :)
      _Mark