## The most efficient way to obtain quasi-random number between 0. and 1.

Nov 29 2006 | 9:55 pm
Dear all
I am looking for C code to obtain 4 quasi-random number between 0. and 1. per sample.
Right now, I use the good old
seed = seed * 1103515245 + 12345;
and I divise by 4294967296.0
but is is cycle consuming...
Is there a good place to find some help on this? I do not care that much if the random is pure or not, I just want some good number crushing...
Does it worth me having an altivec call for 4 randoms (if such a thing exist)?
I have an intuition that I could avoid the division by casting somehow, but I am not sure...
Any help welcome, even sending me to a funky manual ;-)
p.a.

• Nov 29 2006 | 10:08 pm
On 06-11-29, at 1355, Pierre Alexandre Tremblay wrote: > > Any help welcome, even sending me to a funky manual ;-) > http://www.google.com/search?q=random+number+c+-c%2B%2B+code http://www.ciphersbyritter.com/NEWS4/ RANDC.HTM#369B5E30.65A55FD1@stat.fsu.edu
r.
• Nov 29 2006 | 10:57 pm
• Nov 30 2006 | 1:39 pm
The easiest way to avoid division is to multiply with
const double kOneOverMaxULong = 1.0 / 4294967296.0;
(The division is calculated once at compile time rather than on every call to your RNG). A really smart optimizing compiler would do that for you, but since you're probably using XCode, better do the above.-)
There is a technique to map unsigned ints to double by typecasting and some fancy bitshifting. In my experience it is _not_ significantly faster than converting to double and multiplying. I can dig up the code if you really want it, I originally saw it in a post by James McCartney to the musicdsp list (which is probably still available to the determined Googler).
As for fast RNGs...
There is the feedback shift register technique which is probably the fastest out there. However it only produces 1s and 0s, not 32-bit words.
Tausworthe 88 is very fast (a few shifts and XORs) and has a very good pedigree.
LC (what you're doing) is one of the fastest but has poor random quality (too many correlations pop up). You say you don't care.
I have numerous random processes in MSP objects in Litter Power that require as many as 4 pseudo-randoms per sample and, in my experience, the core RNG is not the real bottleneck. I recently implemented a major speed-up to lp.gsss~ by using a clever algorithm that avoids the trig functions required in the standard Box-Muller algorithms. What I have done for MSP (and Jitter) objects is to write the code such that all data is kept in registers for the duration of a perform loop. That made a significant performance improvement.
What are you trying to do that Litter Power doesn't already do?
On 29-Nov-2006, at 22:55, Pierre Alexandre Tremblay wrote:
> Dear all > > I am looking for C code to obtain 4 quasi-random number between 0. > and 1. per sample. > > Right now, I use the good old > > seed = seed * 1103515245 + 12345; > > and I divise by 4294967296.0 > > but is is cycle consuming... > > Is there a good place to find some help on this? I do not care > that much if the random is pure or not, I just want some good > number crushing... > > Does it worth me having an altivec call for 4 randoms (if such a > thing exist)? > > I have an intuition that I could avoid the division by casting > somehow, but I am not sure... > > Any help welcome, even sending me to a funky manual ;-) > > p.a.
-------------- http://www.bek.no/~pcastine/Litter/ ------------- Peter Castine +--> Litter Power & Litter Bundle for Jitter Universal Binaries on the way iCE: Sequencing, Recording & Interface Building for |home | chez nous| Max/MSP Extremely cool |bei uns | i nostri| http://www.dspaudio.com/ http://www.castine.de
• Nov 30 2006 | 1:46 pm
On 29-Nov-2006, at 23:57, Pierre Alexandre Tremblay wrote:
> any other hints?
Oh, the other obvious thing... you *are* inlining the RNG in your perform method, aren't you?
-------------- http://www.bek.no/~pcastine/Litter/ ------------- Peter Castine +--> Litter Power & Litter Bundle for Jitter Universal Binaries on the way iCE: Sequencing, Recording & Interface Building for |home | chez nous| Max/MSP Extremely cool |bei uns | i nostri| http://www.dspaudio.com/ http://www.castine.de
• Nov 30 2006 | 2:16 pm
Dear Peter (and other readers I hope...)
I was hoping for your answer to my call for help ;-)
> const double kOneOverMaxULong = 1.0 / 4294967296.0;
indeed, I will document the gain here.
> There is a technique to map unsigned ints to double by typecasting > and some fancy bitshifting. In my experience it is _not_ > significantly faster than converting to double and multiplying. I > can dig up the code if you really want it, I originally saw it in a > post by James McCartney to the musicdsp list (which is probably > still available to the determined Googler).
I was exploring that avenue yesterday. A great site to get my head around my intuition was http://babbage.cs.qc.edu/IEEE-754/. I was thinkin of masking with 3FFF FFFF and then typecast to float and substract one...
> Tausworthe 88 is very fast (a few shifts and XORs) and has a very > good pedigree. > LC (what you're doing) is one of the fastest but has poor random > quality (too many correlations pop up).
Which algorithm is faster? T88 or LC
> You say you don't care.
at all! I just want 4 random values per sample.
> I have numerous random processes in MSP objects in Litter Power > that require as many as 4 pseudo-randoms per sample and, in my > experience, the core RNG is not the real bottleneck. I recently > implemented a major speed-up to lp.gsss~ by using a clever > algorithm that avoids the trig functions required in the standard > Box-Muller algorithms. What I have done for MSP (and Jitter) > objects is to write the code such that all data is kept in > registers for the duration of a perform loop. That made a > significant performance improvement.
any hints of how I should do this? My next step was to use altivec for this number crushing, because when I remove my random stuff, I divide the processor load by 4...
> Oh, the other obvious thing... you *are* inlining the RNG in your > perform method, aren't you?
nope. I have done a function called alea() that does my LC, but where can I read about inlining? Sorry to be thick here.
> What are you trying to do that Litter Power doesn't already do?
I want to improve my butt~ external. It is a dirty filter - it simply adds some noise on the coefficient of a biquad. The code will be public domain as soon as I release this version 2. It is shamelessly simple, but sounds so good!
pa
• Nov 30 2006 | 5:08 pm
> Which algorithm is faster? T88 or LC
It's pretty close. LC looks like it should be a bit faster, but I would want to benchmark before making a call. It could even depend on processor architecture.
In any case, replacing div by multiplication and inlining (see below) will bring far more gain than the difference between the two algorithms.
>> I have numerous random processes in MSP objects in Litter Power >> that require as many as 4 pseudo-randoms per sample and, in my >> experience, the core RNG is not the real bottleneck. I recently >> implemented a major speed-up to lp.gsss~ by using a clever >> algorithm that avoids the trig functions required in the standard >> Box-Muller algorithms. What I have done for MSP (and Jitter) >> objects is to write the code such that all data is kept in >> registers for the duration of a perform loop. That made a >> significant performance improvement. > > any hints of how I should do this? My next step was to use altivec > for this number crushing, because when I remove my random stuff, I > divide the processor load by 4...
Really depends on what you're doing. I spent many hours on the web and in the library (yes! physical relocation to a depository of dead trees!!) researching alternatives in the literature before I found an appropriate algorithm.
BTW, for all we go on about being efficient in MSP, it was doing Gaussian noise in Jitter where the optimizations really paid off. Audio is _pippifax_ compared to video.
>> Oh, the other obvious thing... you *are* inlining the RNG in your >> perform method, aren't you? > > nope. I have done a function called alea() that does my LC, but > where can I read about inlining? Sorry to be thick here.
static inline double alea(void /*or whatever*/) { // whatever }
That one reserved word will speed up processing in this case by something in the 20-50% ballpark.
While we're here, you're not calling it in a loop, are you?
for (i = 0; i < 4; i += 1) x[i] = alea();
is slower than
x0 = alea(); x1 = alea(); x2 = alea(); x3 = alea();
>> What are you trying to do that Litter Power doesn't already do? > > I want to improve my butt~ external. It is a dirty filter - it > simply adds some noise on the coefficient of a biquad. The code > will be public domain as soon as I release this version 2. It is > shamelessly simple, but sounds so good!
Cute. Wish I'd thought of that.
-------------- http://www.bek.no/~pcastine/Litter/ ------------- Peter Castine +--> Litter Power & Litter Bundle for Jitter Universal Binaries on the way iCE: Sequencing, Recording & Interface Building for |home | chez nous| Max/MSP Extremely cool |bei uns | i nostri| http://www.dspaudio.com/ http://www.castine.de
• Dec 01 2006 | 1:14 pm
Dear Peter and all
The multiplication instead of division, but more important once instead of 4 times, did save more than 25% of horsepower!
So to make it clear, the version one was:
while (n--) { in0 = *entree++; yn = b0a0*in0 + b1a0*in1 + b2a0*in2 - a1a0*ou1 - a2a0*ou2; *sortie++ = yn; in2 = in1 * (1. - (sale * alea() / 4294967296.0)); in1 = in0 * (1. - (sale * alea() / 4294967296.0)); ou2 = ou1 * (1. - (sale * alea() / 4294967296.0)); ou1 = yn * (1. - (sale * alea() / 4294967296.0)); }
where alea gives me an unsigned long and sale is a dirtiness factor.
and now is:
while (n--) { in0 = *entree++; yn = b0a0*in0 + b1a0*in1 + b2a0*in2 - a1a0*ou1 - a2a0*ou2; *sortie++ = yn; in2 = in1 * (1. - (sale * alea())); in1 = in0 * (1. - (sale * alea())); ou2 = ou1 * (1. - (sale * alea())); ou1 = yn * (1. - (sale * alea())); }
where sale is now the dirtiness factor multiplied by the kOneOverMaxULong constant as proposed...
--
The inline did not make any difference, which surprised me. If I understood well, I just have to change my declaration by adding 'static inline' to it?
-- I still want to optimise, because the following code is still 75% faster. The only thing I remove is the dirtiness in the biquad memory swap...
while (n--) { in0 = *entree++; yn = b0a0*in0 + b1a0*in1 + b2a0*in2 - a1a0*ou1 - a2a0*ou2; *sortie++ = yn; in2 = in1; // * (1. - (sale * alea())); in1 = in0; // * (1. - (sale * alea())); ou2 = ou1; // * (1. - (sale * alea())); ou1 = yn; // * (1. - (sale * alea())); }
-- I am thinking of going vectorial since I have 4 times the same maths, but maybe it is not relevant...
any hints welcome again!
pa
• Dec 02 2006 | 2:54 pm
Hi PA,
I haven't time to read this thread totally now, but I do have an audio rate rand generator (between -1 an 1 I think) for max which could be modified to produce numbers between 0 and 1. It's a replacemnet for noise~ that seeds differently (but not randomly) for each new instance, up to 600 odd different versions. It's not ultra efficient, but it's not that much more expensive than noise~, so if your results are not as efficient as noise~ than might be useful to you. I can send it to you off-list if you'd like, both as a UB external, and also as code.
Cheers
Alex
• Dec 03 2006 | 2:31 pm
On 1-Dec-2006, at 14:14, Pierre Alexandre Tremblay wrote: > The inline did not make any difference, which surprised me. If I > understood well, I just have to change my declaration by adding > 'static inline' to it?
Surprises me a little, too. But some details of inline handling are compiler dependent.
There is a very slight chance that a really intelligent compiler would already be inlining automatically, but that seems unlikely to me. You could look at the disassembled code to check. (Reading assembly code is one of my favorite pastimes. Pas.)
I would check:
1) The definition of alea() (not just the declaration) needs to be qualified as inline.
2) The definition of alea() needs to be available to the file using it (either directly in the file or #include'd)
3) For some compilers it may be necessary for the definition to occur before the calling function.
If you're only using alea() your perform method, then something like
static UInt32 sSeed = 0; // Initialize to uptime() // or something from main() static inline double alea() { const double kFactor = 1.0 / 4294967296.0;
sSeed = 123456789 * sSeed + 987654321; // *really* bad RNG return (double) sSeed * kFactor; }
long* ButtPerform( long* iParams)
{ // whatever }
should convince any compiler to do the right thing. Defining the function static inline has the side-effect that you should not need to declare it, providing the definition occurs before the first call to the function.
Hope this helps, P.
(Who really regrets aiding and abetting anyone using Linear Congruence as a RNG... c'est comme manger de sanglier avec sauce de minthe.)
-------------- http://www.bek.no/~pcastine/Litter/ ------------- Peter Castine +--> Litter Power & Litter Bundle for Jitter Universal Binaries on the way iCE: Sequencing, Recording & Interface Building for |home | chez nous| Max/MSP Extremely cool |bei uns | i nostri| http://www.dspaudio.com/ http://www.castine.de
• Dec 03 2006 | 4:26 pm
• Dec 04 2006 | 11:34 am
• Dec 04 2006 | 8:24 pm
> Vectorizing algorithms is something of a black art: I've read > reports where uncautious vectorization even resulted in worse > performance.
Vectorising is not so hard. I have done quite a lot of this, and the apple docs are very good - I learnt everything from there.
It is indeed possible to write less efficient code for altivec than the same algoritm in straightforward scalar code (I've done it - but I've also written very efficient code). This basically occurs where you are processing data that is not easy to process in parallel (either because of the algorithm, or the way the data is stored) and so you iccur CPU costs for re-packing your data into vectors and/or pulling out separete elements, or when you do operations across vectors (like summing all four values). If you are processing four values in parallel all the time and hence can avoid all these bad practices then you should see an improvement.......
For UB purposes it is possible to write code which identifies the platform to compile for and use macros for vector functions which are replaced by the pre-compiler for the appropriate platform, thus allowing the use of SSE on intel macs (note that you can assume SSE2 for intel macs, as it is a requirement of the OS - the extra functions in SSE3 will probably be irrelevant in this case and also most others). For simple adds and multiplies etc. the correspondence between altivec and SSE is very direct, except there is no vec_madd equivalent for SSE, you must use two seperate operations instead. For me porting to SSE was very easy from altivec, but if you're starting vectorising now I would think about this straight away, as you might as well be efficient whatever the processor.
Good Luck,
Alex
• Dec 04 2006 | 9:02 pm
Hi guys,
I am loving this thread... Since we are on the topic, I have done quite some vectorization myself, both using Altivec and SSE. As Alex pointed out, it's not too hard to go from one to the other. However something that I've always been uncertain about is if in the perform method of any MSP external the audio vectors are passed 16-byte aligned ready for vector calculations or if we have to move data around to align them.
I checked the times~.c example in the SDK folder, but I couldn't find any mention of memory alignment.
Does anybody know?
Thanks.
- Luigi
--- Alex Harker wrote:
> > > > Vectorizing algorithms is something of a black > art: I've read > > reports where uncautious vectorization even > resulted in worse > > performance. > > Vectorising is not so hard. I have done quite a lot > of this, and the apple docs are very good - I learnt > everything from there. > > It is indeed possible to write less efficient code > for altivec than the same algoritm in > straightforward scalar code (I've done it - but I've > also written very efficient code). This basically > occurs where you are processing data that is not > easy to process in parallel (either because of the > algorithm, or the way the data is stored) and so you > iccur CPU costs for re-packing your data into > vectors and/or pulling out separete elements, or > when you do operations across vectors (like summing > all four values). If you are processing four values > in parallel all the time and hence can avoid all > these bad practices then you should see an > improvement....... > > For UB purposes it is possible to write code which > identifies the platform to compile for and use > macros for vector functions which are replaced by > the pre-compiler for the appropriate platform, thus > allowing the use of SSE on intel macs (note that you > can assume SSE2 for intel macs, as it is a > requirement of the OS - the extra functions in SSE3 > will probably be irrelevant in this case and also > most others). For simple adds and multiplies etc. > the correspondence between altivec and SSE is very > direct, except there is no vec_madd equivalent for > SSE, you must use two seperate operations instead. > For me porting to SSE was very easy from altivec, > but if you're starting vectorising now I would think > about this straight away, as you might as well be > efficient whatever the processor. > > Good Luck, > > Alex > >
------------------------------------------------------------ THIS E-MAIL MESSAGE IS FOR THE SOLE USE OF THE INTENDED RECIPIENT AND MAY CONTAIN CONFIDENTIAL AND/OR PRIVILEGED INFORMATION. ANY UNAUTHORIZED REVIEW, USE, DISCLOSURE OR DISTRIBUTION IS PROHIBITED. IF YOU ARE NOT THE INTENDED RECIPIENT, CONTACT THE SENDER BY E-MAIL AT SUPERBIGIO@YAHOO.COM AND DESTROY ALL COPIES OF THE ORIGINAL MESSAGE. WITHOUT PREJUDICE UCC1-207. ------------------------------------------------------------
Yahoo! Music Unlimited Access over 1 million songs. http://music.yahoo.com/unlimited
• Dec 04 2006 | 9:15 pm
Instead of writing Altivec or SSE specific code, you can use Apple's Accelerate framework for more cross-platform code: http://developer.apple.com/performance/accelerateframework.html . It does the platform specific business for you. From the link above:
"The Accelerate framework is the solution for computationally intensive universal binaries. The functions it provides eliminate the need for you to include hardware-dependent AltiVec or SSE code in your application."
wes
• Dec 04 2006 | 10:04 pm
> However something that I've always been uncertain > about is if in the perform method of any MSP external > the audio vectors are passed 16-byte aligned ready for > vector calculations or if we have to move data around > to align them. > > I checked the times~.c example in the SDK folder, but > I couldn't find any mention of memory alignment. > > Does anybody know?
I have no certain answer, but everything in the max/msp sdk and my personal experience with msp externals suggests that yes this is the case, or at least this is definitely the case for ppc (in that all the altivec stuff from c74 I've seen just takes the pointers and casts them as *vFloat or *(vector float) which implies that they are already 16-byte aligned). I'm 99.999% sure this is the case for intel as well (in that I've never had to do non-aligned loads to make it work), but perhaps an official answer from one of the c74 guys would absolutely clear this up if you want to know for sure.
If you try to read from a non 16-byte aligned address it should just read from the nearest aligned address, so it is unlikely that one wouldn't notice if this was happening as you would either incur a sample delay of 2-3 with dropped samples at the beginning of every msp vector (if it didn't crash from writing to protected memory on the last vector), or in my case (working with FFTs a lot) a bin shift of 2-3 bins. Both of these would be clearly audible.....
Alex
P.S. Thanks wes - yes that's the proper way to do it, which I should have known having used the FFT stuff cross-platform, but I will now check out stuff for basic vector ops as well.
• Dec 04 2006 | 11:54 pm
On 4-Dec-2006, at 22:15, Wesley Smith wrote:
> Instead of writing Altivec or SSE specific code, you can use Apple's > Accelerate framework for more cross-platform code: > http://developer.apple.com/performance/accelerateframework.html . It > does the platform specific business for you.
What do you do for your externals that run on XP?
-------------- http://www.bek.no/~pcastine/Litter/ ------------- Peter Castine +--> Litter Power & Litter Bundle for Jitter Universal Binaries on the way iCE: Sequencing, Recording & Interface Building for |home | chez nous| Max/MSP Extremely cool |bei uns | i nostri| http://www.dspaudio.com/ http://www.castine.de
• Dec 04 2006 | 11:58 pm
On 12/4/06, Peter Castine wrote: > On 4-Dec-2006, at 22:15, Wesley Smith wrote: > > > Instead of writing Altivec or SSE specific code, you can use Apple's > > Accelerate framework for more cross-platform code: > > http://developer.apple.com/performance/accelerateframework.html . It > > does the platform specific business for you. > > What do you do for your externals that run on XP?
Not use windows :p
Just kidding. I've never written a vectorized piece of code, so I have no idea about windows. wes
• Dec 15 2006 | 1:47 pm
Dear all
I give up! I just can't get my head around optimisation now. My main issue is with the Linear Congruence PRNG, that uses unsigned long truncation feature to stay within control. I just cannot code what I wanted to do, which is a vector random generator between 0. and 1. because
1) I cannot do it with a floating seed (it will go to infinite!)
2) I cannot perform (vector unsigned long) Altivec multiplication
-- It is probably due to my inexperience, but this is too hard for me from now.
As I write, I have this idea that come back of masking my 128 bit word to bring it where I want... masking a 32 bit word with 3FFF FFFF and then typecast to float and substract one gives me 0 to 0.9999999 so I will give it a try.
any input is appreciated
pa
• Dec 15 2006 | 5:08 pm
One last note for today...
I found this algorythm and was able to code it in vector, but it does not sound as good as a LC, is it just me or does it make sense?
I have even code this noise generator alone to compare it with noise~ and it has some grain in it that the LC has not... I need to get my maths together to understand why...
any help welcome !
pa
float g_fScale = 2.0f / 0xffffffff; int g_x1 = 0x67452301; int g_x2 = 0xefcdab89;
void whitenoise( float* _fpDstBuffer, // Pointer to buffer unsigned int _uiBufferSize, // Size of buffer float _fLevel ) // Noiselevel (0.0 ... 1.0) { _fLevel *= g_fScale;
while( _uiBufferSize-- ) { g_x1 ^= g_x2; *_fpDstBuffer++ = g_x2 * _fLevel; g_x2 += g_x1; } }
my version:
vec_semance1 = vec_xor(vec_semance1, vec_semance2); vec_semance2 = vec_add(vec_semance1, vec_semance2);
• Dec 23 2006 | 9:05 pm
Dear all
OK, sorry to revive an old thread, but I was wrong in my last post, so for the posterity, let's put thing straight...
2 things I said that were not totally true:
> masking a 32 bit word with 3FFF FFF and then typecast to float and > substract one gives me 0 to 0.9999999
Actually, after reading the excellent Numerical Recipes in C (Press, W. and al) I found out that my binary operation were not right. It is masking with 0x003F FFFF then adding the 1 base by OR 0x3F80 0000, then the typecast and subtraction.
> 2) I cannot perform (vector unsigned long) Altivec multiplication
I found on Apple website this link, that explain how to do it.
So I am now able to do 4 Linear Congruence Pseudo-Random Generators in parallel, saving much CPU (about another 50%)
I am pretty sure that my code could be optimised again, but I might leave this for later (v.3?). I will post my code on this thread if anyone is interested.
pa
• Dec 25 2006 | 2:10 am
On 12/23/06, Pierre Alexandre Tremblay wrote: > > Dear all > > OK, sorry to revive an old thread, but I was wrong in my last post, > so for the posterity, let's put thing straight... > > 2 things I said that were not totally true: > > > masking a 32 bit word with 3FFF FFF and then typecast to float and > > substract one gives me 0 to 0.9999999 > > Actually, after reading the excellent Numerical Recipes in C (Press, > W. and al) I found out that my binary operation were not right. It > is masking with 0x003F FFFF then adding the 1 base by OR 0x3F80 0000, > then the typecast and subtraction. > > > 2) I cannot perform (vector unsigned long) Altivec multiplication > > I found on Apple website this link, that explain how to do it. > > http://developer.apple.com/hardwaredrivers/ve/code_optimization.html > > So I am now able to do 4 Linear Congruence Pseudo-Random Generators > in parallel, saving much CPU (about another 50%) > > I am pretty sure that my code could be optimised again, but I might > leave this for later (v.3?). I will post my code on this thread if > anyone is interested. > > pa >
• Dec 25 2006 | 11:08 am
I will comment it in English later today or tomorrow, and will post it
pa
• Dec 26 2006 | 12:31 pm
Have you experimented with bit shifting? Perhaps adding a shifted version to itself and then shifting the entire deal. Keeping the float's decimal point all the way to the left data-wise.
Never tried it or anything, but if you need a random idea...
• Jan 02 2007 | 4:46 am
On 26-Dec-2006, at 7:31, Chuck Ritola wrote:
> Have you experimented with bit shifting? Perhaps adding a shifted > version to itself and then shifting the entire deal. Keeping the > float's decimal point all the way to the left data-wise. > > Never tried it or anything, but if you need a random idea...
As John von Neuman said, generating random numbers should not be left to chance.
Just adding a bit-shifter may work, but it's just as likely to shorten the cycle of your RNG faster than you can say "Linear Congruence".
There is a rich literature on generating pseudo-randoms, starting with the Knuth I referred to very early in this thread (a more solid resource than Numerical Recipes, btw).