## The most efficient way to obtain quasi-random number between 0. and 1.

Nov 29, 2006 at 9:55pm

# The most efficient way to obtain quasi-random number between 0. and 1.

Dear all

I am looking for C code to obtain 4 quasi-random number between 0.
and 1. per sample.

Right now, I use the good old

seed = seed * 1103515245 + 12345;

and I divise by 4294967296.0

but is is cycle consuming…

Is there a good place to find some help on this? I do not care that
much if the random is pure or not, I just want some good number
crushing…

Does it worth me having an altivec call for 4 randoms (if such a
thing exist)?

I have an intuition that I could avoid the division by casting
somehow, but I am not sure…

Any help welcome, even sending me to a funky manual ;-)

p.a.

#28970
Nov 29, 2006 at 10:08pm

On 06-11-29, at 1355, Pierre Alexandre Tremblay wrote:
>
> Any help welcome, even sending me to a funky manual ;-)
>
http://www.ciphersbyritter.com/NEWS4/
RANDC.HTM#369B5E30.65A55FD1@stat.fsu.edu

r.

#89568
Nov 29, 2006 at 10:57pm

#89569
Nov 30, 2006 at 1:39pm

The easiest way to avoid division is to multiply with

const double kOneOverMaxULong = 1.0 / 4294967296.0;

(The division is calculated once at compile time rather than on every
call to your RNG). A really smart optimizing compiler would do that
for you, but since you’re probably using XCode, better do the above.-)

There is a technique to map unsigned ints to double by typecasting
and some fancy bitshifting. In my experience it is _not_
significantly faster than converting to double and multiplying. I can
dig up the code if you really want it, I originally saw it in a post
by James McCartney to the musicdsp list (which is probably still

As for fast RNGs…

There is the feedback shift register technique which is probably the
fastest out there. However it only produces 1s and 0s, not 32-bit words.

Tausworthe 88 is very fast (a few shifts and XORs) and has a very
good pedigree.

LC (what you’re doing) is one of the fastest but has poor random
quality (too many correlations pop up). You say you don’t care.

I have numerous random processes in MSP objects in Litter Power that
require as many as 4 pseudo-randoms per sample and, in my experience,
the core RNG is not the real bottleneck. I recently implemented a
major speed-up to lp.gsss~ by using a clever algorithm that avoids
the trig functions required in the standard Box-Muller algorithms.
What I have done for MSP (and Jitter) objects is to write the code
such that all data is kept in registers for the duration of a perform
loop. That made a significant performance improvement.

What are you trying to do that Litter Power doesn’t already do?

On 29-Nov-2006, at 22:55, Pierre Alexandre Tremblay wrote:

> Dear all
>
> I am looking for C code to obtain 4 quasi-random number between 0.
> and 1. per sample.
>
> Right now, I use the good old
>
> seed = seed * 1103515245 + 12345;
>
> and I divise by 4294967296.0
>
> but is is cycle consuming…
>
> Is there a good place to find some help on this? I do not care
> that much if the random is pure or not, I just want some good
> number crushing…
>
> Does it worth me having an altivec call for 4 randoms (if such a
> thing exist)?
>
> I have an intuition that I could avoid the division by casting
> somehow, but I am not sure…
>
> Any help welcome, even sending me to a funky manual ;-)
>
> p.a.

————– http://www.bek.no/~pcastine/Litter/ ————-
Peter Castine +–> Litter Power & Litter Bundle for Jitter
Universal Binaries on the way
iCE: Sequencing, Recording &
Interface Building for |home | chez nous|
Max/MSP Extremely cool |bei uns | i nostri|
http://www.dspaudio.com/ http://www.castine.de

#89570
Nov 30, 2006 at 1:46pm

On 29-Nov-2006, at 23:57, Pierre Alexandre Tremblay wrote:

> any other hints?

Oh, the other obvious thing… you *are* inlining the RNG in your
perform method, aren’t you?

————– http://www.bek.no/~pcastine/Litter/ ————-
Peter Castine +–> Litter Power & Litter Bundle for Jitter
Universal Binaries on the way
iCE: Sequencing, Recording &
Interface Building for |home | chez nous|
Max/MSP Extremely cool |bei uns | i nostri|
http://www.dspaudio.com/ http://www.castine.de

#89571
Nov 30, 2006 at 2:16pm

Dear Peter (and other readers I hope…)

I was hoping for your answer to my call for help ;-)

> const double kOneOverMaxULong = 1.0 / 4294967296.0;

indeed, I will document the gain here.

> There is a technique to map unsigned ints to double by typecasting
> and some fancy bitshifting. In my experience it is _not_
> significantly faster than converting to double and multiplying. I
> can dig up the code if you really want it, I originally saw it in a
> post by James McCartney to the musicdsp list (which is probably
> still available to the determined Googler).

I was exploring that avenue yesterday. A great site to get my head
around my intuition was http://babbage.cs.qc.edu/IEEE-754/. I was
thinkin of masking with 3FFF FFFF and then typecast to float and
substract one…

> Tausworthe 88 is very fast (a few shifts and XORs) and has a very
> good pedigree.
> LC (what you’re doing) is one of the fastest but has poor random
> quality (too many correlations pop up).

Which algorithm is faster? T88 or LC

> You say you don’t care.

at all! I just want 4 random values per sample.

> I have numerous random processes in MSP objects in Litter Power
> that require as many as 4 pseudo-randoms per sample and, in my
> experience, the core RNG is not the real bottleneck. I recently
> implemented a major speed-up to lp.gsss~ by using a clever
> algorithm that avoids the trig functions required in the standard
> Box-Muller algorithms. What I have done for MSP (and Jitter)
> objects is to write the code such that all data is kept in
> registers for the duration of a perform loop. That made a
> significant performance improvement.

any hints of how I should do this? My next step was to use altivec
for this number crushing, because when I remove my random stuff, I
divide the processor load by 4…

> Oh, the other obvious thing… you *are* inlining the RNG in your
> perform method, aren’t you?

nope. I have done a function called alea() that does my LC, but

> What are you trying to do that Litter Power doesn’t already do?

I want to improve my butt~ external. It is a dirty filter – it
simply adds some noise on the coefficient of a biquad. The code will
be public domain as soon as I release this version 2. It is
shamelessly simple, but sounds so good!

pa

#89572
Nov 30, 2006 at 5:08pm

> Which algorithm is faster? T88 or LC

It’s pretty close. LC looks like it should be a bit faster, but I
would want to benchmark before making a call. It could even depend on
processor architecture.

In any case, replacing div by multiplication and inlining (see below)
will bring far more gain than the difference between the two algorithms.

>> I have numerous random processes in MSP objects in Litter Power
>> that require as many as 4 pseudo-randoms per sample and, in my
>> experience, the core RNG is not the real bottleneck. I recently
>> implemented a major speed-up to lp.gsss~ by using a clever
>> algorithm that avoids the trig functions required in the standard
>> Box-Muller algorithms. What I have done for MSP (and Jitter)
>> objects is to write the code such that all data is kept in
>> registers for the duration of a perform loop. That made a
>> significant performance improvement.
>
> any hints of how I should do this? My next step was to use altivec
> for this number crushing, because when I remove my random stuff, I
> divide the processor load by 4…

Really depends on what you’re doing. I spent many hours on the web
and in the library (yes! physical relocation to a depository of dead
trees!!) researching alternatives in the literature before I found an
appropriate algorithm.

BTW, for all we go on about being efficient in MSP, it was doing
Gaussian noise in Jitter where the optimizations really paid off.
Audio is _pippifax_ compared to video.

>> Oh, the other obvious thing… you *are* inlining the RNG in your
>> perform method, aren’t you?
>
> nope. I have done a function called alea() that does my LC, but
> where can I read about inlining? Sorry to be thick here.

static inline double alea(void /*or whatever*/)
{
// whatever
}

That one reserved word will speed up processing in this case by
something in the 20-50% ballpark.

While we’re here, you’re not calling it in a loop, are you?

for (i = 0; i < 4; i += 1)
x[i] = alea();

is slower than

x0 = alea();
x1 = alea();
x2 = alea();
x3 = alea();

>> What are you trying to do that Litter Power doesn’t already do?
>
> I want to improve my butt~ external. It is a dirty filter – it
> simply adds some noise on the coefficient of a biquad. The code
> will be public domain as soon as I release this version 2. It is
> shamelessly simple, but sounds so good!

Cute. Wish I’d thought of that.

————– http://www.bek.no/~pcastine/Litter/ ————-
Peter Castine +–> Litter Power & Litter Bundle for Jitter
Universal Binaries on the way
iCE: Sequencing, Recording &
Interface Building for |home | chez nous|
Max/MSP Extremely cool |bei uns | i nostri|
http://www.dspaudio.com/ http://www.castine.de

#89573
Dec 1, 2006 at 9:08am

#89574
Dec 1, 2006 at 1:14pm

Dear Peter and all

The multiplication instead of division, but more important once
instead of 4 times, did save more than 25% of horsepower!

So to make it clear, the version one was:

while (n–)
{
in0 = *entree++;
yn = b0a0*in0 + b1a0*in1 + b2a0*in2 – a1a0*ou1 – a2a0*ou2;
*sortie++ = yn;
in2 = in1 * (1. – (sale * alea() / 4294967296.0));
in1 = in0 * (1. – (sale * alea() / 4294967296.0));
ou2 = ou1 * (1. – (sale * alea() / 4294967296.0));
ou1 = yn * (1. – (sale * alea() / 4294967296.0));
}

where alea gives me an unsigned long and sale is a dirtiness factor.

and now is:

while (n–)
{
in0 = *entree++;
yn = b0a0*in0 + b1a0*in1 + b2a0*in2 – a1a0*ou1 – a2a0*ou2;
*sortie++ = yn;
in2 = in1 * (1. – (sale * alea()));
in1 = in0 * (1. – (sale * alea()));
ou2 = ou1 * (1. – (sale * alea()));
ou1 = yn * (1. – (sale * alea()));
}

where sale is now the dirtiness factor multiplied by the
kOneOverMaxULong constant as proposed…

The inline did not make any difference, which surprised me. If I
understood well, I just have to change my declaration by adding
‘static inline’ to it?

I still want to optimise, because the following code is still 75%
faster. The only thing I remove is the dirtiness in the biquad
memory swap…

while (n–)
{
in0 = *entree++;
yn = b0a0*in0 + b1a0*in1 + b2a0*in2 – a1a0*ou1 – a2a0*ou2;
*sortie++ = yn;
in2 = in1; // * (1. – (sale * alea()));
in1 = in0; // * (1. – (sale * alea()));
ou2 = ou1; // * (1. – (sale * alea()));
ou1 = yn; // * (1. – (sale * alea()));
}

I am thinking of going vectorial since I have 4 times the same maths,
but maybe it is not relevant…

any hints welcome again!

pa

#89575
Dec 2, 2006 at 2:54pm

Hi PA,

I haven’t time to read this thread totally now, but I do have an audio rate rand generator (between -1 an 1 I think) for max which could be modified to produce numbers between 0 and 1. It’s a replacemnet for noise~ that seeds differently (but not randomly) for each new instance, up to 600 odd different versions. It’s not ultra efficient, but it’s not that much more expensive than noise~, so if your results are not as efficient as noise~ than might be useful to you. I can send it to you off-list if you’d like, both as a UB external, and also as code.

Cheers

Alex

#89576
Dec 3, 2006 at 2:31pm

On 1-Dec-2006, at 14:14, Pierre Alexandre Tremblay wrote:
> The inline did not make any difference, which surprised me. If I
> understood well, I just have to change my declaration by adding
> ‘static inline’ to it?

Surprises me a little, too. But some details of inline handling are
compiler dependent.

There is a very slight chance that a really intelligent compiler
would already be inlining automatically, but that seems unlikely to
me. You could look at the disassembled code to check. (Reading
assembly code is one of my favorite pastimes. Pas.)

I would check:

1) The definition of alea() (not just the declaration) needs to be
qualified as inline.

2) The definition of alea() needs to be available to the file using
it (either directly in the file or #include’d)

3) For some compilers it may be necessary for the definition to occur
before the calling function.

If you’re only using alea() your perform method, then something like

static UInt32 sSeed = 0; // Initialize to uptime()
// or something from main()
static inline double alea()
{
const double kFactor = 1.0 / 4294967296.0;

sSeed = 123456789 * sSeed + 987654321; // *really* bad RNG
return (double) sSeed * kFactor;
}

long*
ButtPerform(
long* iParams)

{
// whatever
}

should convince any compiler to do the right thing. Defining the
function static inline has the side-effect that you should not need
to declare it, providing the definition occurs before the first call
to the function.

Hope this helps,
P.

(Who really regrets aiding and abetting anyone using Linear
Congruence as a RNG… c’est comme manger de sanglier avec sauce de
minthe.)

————– http://www.bek.no/~pcastine/Litter/ ————-
Peter Castine +–> Litter Power & Litter Bundle for Jitter
Universal Binaries on the way
iCE: Sequencing, Recording &
Interface Building for |home | chez nous|
Max/MSP Extremely cool |bei uns | i nostri|
http://www.dspaudio.com/ http://www.castine.de

#89577
Dec 3, 2006 at 4:26pm

#89578
Dec 4, 2006 at 10:31am

#89579
Dec 4, 2006 at 11:34am

#89580
Dec 4, 2006 at 8:24pm

> Vectorizing algorithms is something of a black art: I’ve read
> reports where uncautious vectorization even resulted in worse
> performance.

Vectorising is not so hard. I have done quite a lot of this, and the apple docs are very good – I learnt everything from there.

It is indeed possible to write less efficient code for altivec than the same algoritm in straightforward scalar code (I’ve done it – but I’ve also written very efficient code). This basically occurs where you are processing data that is not easy to process in parallel (either because of the algorithm, or the way the data is stored) and so you iccur CPU costs for re-packing your data into vectors and/or pulling out separete elements, or when you do operations across vectors (like summing all four values). If you are processing four values in parallel all the time and hence can avoid all these bad practices then you should see an improvement…….

For UB purposes it is possible to write code which identifies the platform to compile for and use macros for vector functions which are replaced by the pre-compiler for the appropriate platform, thus allowing the use of SSE on intel macs (note that you can assume SSE2 for intel macs, as it is a requirement of the OS – the extra functions in SSE3 will probably be irrelevant in this case and also most others). For simple adds and multiplies etc. the correspondence between altivec and SSE is very direct, except there is no vec_madd equivalent for SSE, you must use two seperate operations instead. For me porting to SSE was very easy from altivec, but if you’re starting vectorising now I would think about this straight away, as you might as well be efficient whatever the processor.

Good Luck,

Alex

#89581
Dec 4, 2006 at 9:02pm

Hi guys,

Since we are on the topic, I have done quite some
vectorization myself, both using Altivec and SSE.
As Alex pointed out, it’s not too hard to go from one
to the other.
However something that I’ve always been uncertain
about is if in the perform method of any MSP external
the audio vectors are passed 16-byte aligned ready for
vector calculations or if we have to move data around
to align them.

I checked the times~.c example in the SDK folder, but
I couldn’t find any mention of memory alignment.

Does anybody know?

Thanks.

- Luigi

— Alex Harker wrote:

>
>
> > Vectorizing algorithms is something of a black
> > reports where uncautious vectorization even
> resulted in worse
> > performance.
>
> Vectorising is not so hard. I have done quite a lot
> of this, and the apple docs are very good – I learnt
> everything from there.
>
> It is indeed possible to write less efficient code
> for altivec than the same algoritm in
> straightforward scalar code (I’ve done it – but I’ve
> also written very efficient code). This basically
> occurs where you are processing data that is not
> easy to process in parallel (either because of the
> algorithm, or the way the data is stored) and so you
> iccur CPU costs for re-packing your data into
> vectors and/or pulling out separete elements, or
> when you do operations across vectors (like summing
> all four values). If you are processing four values
> in parallel all the time and hence can avoid all
> these bad practices then you should see an
> improvement…….
>
> For UB purposes it is possible to write code which
> identifies the platform to compile for and use
> macros for vector functions which are replaced by
> the pre-compiler for the appropriate platform, thus
> allowing the use of SSE on intel macs (note that you
> can assume SSE2 for intel macs, as it is a
> requirement of the OS – the extra functions in SSE3
> will probably be irrelevant in this case and also
> most others). For simple adds and multiplies etc.
> the correspondence between altivec and SSE is very
> direct, except there is no vec_madd equivalent for
> SSE, you must use two seperate operations instead.
> For me porting to SSE was very easy from altivec,
> but if you’re starting vectorising now I would think
> efficient whatever the processor.
>
> Good Luck,
>
> Alex
>
>

————————————————————
THIS E-MAIL MESSAGE IS FOR THE SOLE USE OF THE INTENDED RECIPIENT AND MAY CONTAIN CONFIDENTIAL AND/OR PRIVILEGED INFORMATION. ANY UNAUTHORIZED REVIEW, USE, DISCLOSURE OR DISTRIBUTION IS PROHIBITED. IF YOU ARE NOT THE INTENDED RECIPIENT, CONTACT THE SENDER BY E-MAIL AT SUPERBIGIO@YAHOO.COM AND DESTROY ALL COPIES OF THE ORIGINAL MESSAGE. WITHOUT PREJUDICE UCC1-207.
————————————————————

Yahoo! Music Unlimited
Access over 1 million songs.

http://music.yahoo.com/unlimited

#89582
Dec 4, 2006 at 10:04pm

> However something that I’ve always been uncertain
> about is if in the perform method of any MSP external
> the audio vectors are passed 16-byte aligned ready for
> vector calculations or if we have to move data around
> to align them.
>
> I checked the times~.c example in the SDK folder, but
> I couldn’t find any mention of memory alignment.
>
> Does anybody know?

I have no certain answer, but everything in the max/msp sdk and my personal experience with msp externals suggests that yes this is the case, or at least this is definitely the case for ppc (in that all the altivec stuff from c74 I’ve seen just takes the pointers and casts them as *vFloat or *(vector float) which implies that they are already 16-byte aligned). I’m 99.999% sure this is the case for intel as well (in that I’ve never had to do non-aligned loads to make it work), but perhaps an official answer from one of the c74 guys would absolutely clear this up if you want to know for sure.

If you try to read from a non 16-byte aligned address it should just read from the nearest aligned address, so it is unlikely that one wouldn’t notice if this was happening as you would either incur a sample delay of 2-3 with dropped samples at the beginning of every msp vector (if it didn’t crash from writing to protected memory on the last vector), or in my case (working with FFTs a lot) a bin shift of 2-3 bins. Both of these would be clearly audible…..

Alex

P.S. Thanks wes – yes that’s the proper way to do it, which I should have known having used the FFT stuff cross-platform, but I will now check out stuff for basic vector ops as well.

#89584
Dec 4, 2006 at 11:54pm

On 4-Dec-2006, at 22:15, Wesley Smith wrote:

> Instead of writing Altivec or SSE specific code, you can use Apple’s
> Accelerate framework for more cross-platform code:
> http://developer.apple.com/performance/accelerateframework.html . It
> does the platform specific business for you.

What do you do for your externals that run on XP?

————– http://www.bek.no/~pcastine/Litter/ ————-
Peter Castine +–> Litter Power & Litter Bundle for Jitter
Universal Binaries on the way
iCE: Sequencing, Recording &
Interface Building for |home | chez nous|
Max/MSP Extremely cool |bei uns | i nostri|
http://www.dspaudio.com/ http://www.castine.de

#89585
Dec 15, 2006 at 1:47pm

Dear all

I give up! I just can’t get my head around optimisation now. My
main issue is with the Linear Congruence PRNG, that uses unsigned
long truncation feature to stay within control. I just cannot code
what I wanted to do, which is a vector random generator between 0.
and 1. because

1) I cannot do it with a floating seed (it will go to infinite!)

2) I cannot perform (vector unsigned long) Altivec multiplication

It is probably due to my inexperience, but this is too hard for me
from now.

As I write, I have this idea that come back of masking my 128 bit
word to bring it where I want… masking a 32 bit word with 3FFF FFFF
and then typecast to float and substract one gives me 0 to 0.9999999
so I will give it a try.

any input is appreciated

pa

#89587
Dec 15, 2006 at 5:08pm

One last note for today…

I found this algorythm and was able to code it in vector, but it does
not sound as good as a LC, is it just me or does it make sense?

I have even code this noise generator alone to compare it with noise~
and it has some grain in it that the LC has not… I need to get my
maths together to understand why…

any help welcome !

pa

float g_fScale = 2.0f / 0xffffffff;
int g_x1 = 0×67452301;
int g_x2 = 0xefcdab89;

void whitenoise(
float* _fpDstBuffer, // Pointer to buffer
unsigned int _uiBufferSize, // Size of buffer
float _fLevel ) // Noiselevel (0.0 … 1.0)
{
_fLevel *= g_fScale;

while( _uiBufferSize– )
{
g_x1 ^= g_x2;
*_fpDstBuffer++ = g_x2 * _fLevel;
g_x2 += g_x1;
}
}

my version:

vec_semance1 = vec_xor(vec_semance1, vec_semance2);

#89588
Dec 23, 2006 at 9:05pm

Dear all

OK, sorry to revive an old thread, but I was wrong in my last post,
so for the posterity, let’s put thing straight…

2 things I said that were not totally true:

> masking a 32 bit word with 3FFF FFF and then typecast to float and
> substract one gives me 0 to 0.9999999

Actually, after reading the excellent Numerical Recipes in C (Press,
W. and al) I found out that my binary operation were not right. It
is masking with 0x003F FFFF then adding the 1 base by OR 0x3F80 0000,
then the typecast and subtraction.

> 2) I cannot perform (vector unsigned long) Altivec multiplication

I found on Apple website this link, that explain how to do it.

http://developer.apple.com/hardwaredrivers/ve/code_optimization.html

So I am now able to do 4 Linear Congruence Pseudo-Random Generators
in parallel, saving much CPU (about another 50%)

I am pretty sure that my code could be optimised again, but I might
leave this for later (v.3?). I will post my code on this thread if
anyone is interested.

pa

#89589
Dec 25, 2006 at 2:10am

On 12/23/06, Pierre Alexandre Tremblay
wrote:
>
> Dear all
>
> OK, sorry to revive an old thread, but I was wrong in my last post,
> so for the posterity, let’s put thing straight…
>
> 2 things I said that were not totally true:
>
> > masking a 32 bit word with 3FFF FFF and then typecast to float and
> > substract one gives me 0 to 0.9999999
>
> Actually, after reading the excellent Numerical Recipes in C (Press,
> W. and al) I found out that my binary operation were not right. It
> is masking with 0x003F FFFF then adding the 1 base by OR 0x3F80 0000,
> then the typecast and subtraction.
>
> > 2) I cannot perform (vector unsigned long) Altivec multiplication
>
> I found on Apple website this link, that explain how to do it.
>
> http://developer.apple.com/hardwaredrivers/ve/code_optimization.html
>
> So I am now able to do 4 Linear Congruence Pseudo-Random Generators
> in parallel, saving much CPU (about another 50%)
>
> I am pretty sure that my code could be optimised again, but I might
> leave this for later (v.3?). I will post my code on this thread if
> anyone is interested.
>
> pa
>

#89590
Dec 25, 2006 at 11:08am

I will comment it in English later today or tomorrow, and will post it

pa

#89591
Dec 26, 2006 at 12:00pm

#89592
Dec 26, 2006 at 12:31pm

Have you experimented with bit shifting? Perhaps adding a shifted version to itself and then shifting the entire deal. Keeping the float’s decimal point all the way to the left data-wise.

Never tried it or anything, but if you need a random idea…

#89593
Jan 2, 2007 at 4:46am

On 26-Dec-2006, at 7:31, Chuck Ritola wrote:

> Have you experimented with bit shifting? Perhaps adding a shifted
> version to itself and then shifting the entire deal. Keeping the
> float’s decimal point all the way to the left data-wise.
>
> Never tried it or anything, but if you need a random idea…

As John von Neuman said, generating random numbers should not be left
to chance.

Just adding a bit-shifter may work, but it’s just as likely to
shorten the cycle of your RNG faster than you can say “Linear
Congruence”.

There is a rich literature on generating pseudo-randoms, starting
with the Knuth I referred to very early in this thread (a more solid
resource than Numerical Recipes, btw).

P.

————– http://www.bek.no/~pcastine/Litter/ ————-
Peter Castine +–> Litter Power & Litter Bundle for Jitter
Universal Binaries on the way
iCE: Sequencing, Recording &
Interface Building for |home | chez nous|
Max/MSP Extremely cool |bei uns | i nostri|
http://www.dspaudio.com/ http://www.castine.de

#89594
Jan 6, 2007 at 4:54pm

Dear

> There is a rich literature on generating pseudo-randoms, starting
> with the Knuth I referred to very early in this thread (a more
> solid resource than Numerical Recipes, btw).

Thanks! But Numerical Recipes helped me with the job… I am quite
happy with my Altivec zero-filling right shifting bit mapping…
theoritically, I use the best of the LC RNG using its high order
bits,a nd keep all the precision of the IEEE float…