John MacCallum

I was wondering if anyone would like to share their solutions for dealing with denormals. It's recently come up as a problem in CNMAT's smooth-biquad~:

https://cycling74.com/forums/index.php?t=msg&th=33661&start=0&rid=4586&S=2551c966f7c03e3c1250963adb02068f

and it seems like the same problem was in cascade~:

https://cycling74.com/forums/index.php?t=msg&th=34447&start=0&rid=4586&S=2551c966f7c03e3c1250963adb02068f

We fixed the problem in smooth-biquad~ by doing this:

    #ifdef WINDOWS
    #include <xmmintrin.h>
    #else
    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON
    #endif

    ...

    t_int *biquad2_perform(t_int *w){

    #ifdef WINDOWS
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    #else
        fenv_t oldEnv;
        // Read the old environment and set the new environment using default flags and denormals off
        fegetenv( &oldEnv );
        fesetenv( FE_DFL_DISABLE_SSE_DENORMS_ENV );
    #endif

    // do inner loop calculations here

    #ifdef WINDOWS
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
    #else
        fesetenv( &oldEnv );
    #endif

    ...
    }

This solution seems to work well, but the object is a little more expensive--about 1.2x the number of cpu cycles. I suspect the performance hit is due to the compiler translating the code to SSE instructions and that we will need to tune our code for SSE.

If anyone else would be willing to share the ways in which they're dealing with this problem, I'd love to hear them and would be happy to benchmark them against the code above.
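For anyone who wants to reproduce the underlying problem before benchmarking fixes, here is a minimal sketch in plain C. It is not taken from smooth-biquad~: the function name `count_subnormal_states` and the bare feedback recurrence are assumptions made for illustration. It lets a feedback state decay and counts how often that state lands in the subnormal (denormal) range, using the standard `fpclassify` macro.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: run the decaying feedback y[n] = coeff * y[n-1],
   starting from y = 1.0f, and count how many of the first n states are
   subnormal. Assumes flush-to-zero is NOT enabled when it is called. */
size_t count_subnormal_states(float coeff, size_t n)
{
    volatile float y = 1.0f;  /* volatile: force each result back to float */
    size_t count = 0;

    for (size_t i = 0; i < n; ++i) {
        y = y * coeff;
        if (fpclassify(y) == FP_SUBNORMAL)
            ++count;
    }
    return count;
}
```

With a coefficient of 0.5 the state stays normal for the first hundred or so samples and only then drops into the subnormal range, which is why the slowdown tends to show up in the silent tail of a filter rather than while signal is present.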

denormals

This might have some useful information, possibly a bit dated though:

http://www.musicdsp.org/files/other001.txt

On Jan 30, 2009, at 3:49 PM, John MacCallum wrote:


>
> Hi all,
>
> I was wondering if anyone would like to share their solutions for  
> dealing with denormals.  It's recently come up as a problem in  
> CNMAT's smooth-biquad~:
>
> https://cycling74.com/forums/index.php?t=msg&th=33661&start=0&rid=4586&S=2551c966f7c03e3c1250963adb02068f
>
> and it seems like the same problem was in cascade~:
>
> https://cycling74.com/forums/index.php?t=msg&th=34447&start=0&rid=4586&S=2551c966f7c03e3c1250963adb02068f
>
> We fixed the problem in smooth-biquad~ by doing this:
>
> #ifdef WINDOWS
> #include <xmmintrin.h>
> #else
> #include <fenv.h>
> #pragma STDC FENV_ACCESS ON
> #endif
>
> ...
>
> t_int *biquad2_perform(t_int *w){
>
> #ifdef WINDOWS
> 	_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
> #else
> 	fenv_t oldEnv;
> 	// Read the old environment and set the new environment using default flags and denormals off
> 	fegetenv( &oldEnv );
> 	fesetenv( FE_DFL_DISABLE_SSE_DENORMS_ENV );
> #endif
>
> // do inner loop calculations here
>
> #ifdef WINDOWS
> 	_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
> #else
> 	fesetenv( &oldEnv );
> #endif
>
> ...
> }
>
> This solution seems to work well, but the object is a little more  
> expensive--about 1.2x the number of cpu cycles.  I suspect the  
> performance hit is due to the compiler translating the code to SSE  
> instructions and that we will need to tune our code for SSE.
>
> If anyone else would be willing to share the ways in which they're  
> dealing with this problem, I'd love to hear them and would be happy  
> to benchmark them against the code above.
>
> Thanks in advance,
> JM
>


Thanks a lot Brad--that's a good resource. Anyone using anything other than some variant of one of these techniques?


On 31 Jan 09, at 01:47, John MacCallum wrote:

> Thanks a lot Brad--that's a good resource. Anyone using anything other than some variant of one of these techniques?

Also have a look at this thread. Graham posted a link to an interesting article, as well as showing the standard macros.

https://cycling74.com/forums/index.php?t=tree&th=36065&mid=154847&rid=0&S=a4b487a0345094eeccd60bcf0479a709&rev=&reveal=



I've been using the "flipping number solution" from the link that Brad provided (alternatively known as square injection) in my [gverb~] object and others for a while. Works like a charm. Here it is in pseudo-code.

    // fix for denormals through square injection of a dc offset
    // define value during preprocessing for easy updates
    #define TINY_DC		0.0000000000000000000000001f

    // maintain dc_offset for square injection in object struct
    double sqinject_val;

    // initialize square injection value in object "new" method
    x->sqinject_val = TINY_DC;

    // in "perform" method before while loop
    // flip sign on square injection for each block and store it back in the struct
    x->sqinject_val = -x->sqinject_val;
    sqinject_val = x->sqinject_val;

    // inside the while loop, for each sample
    val_dry = *in_dry;  // grab input values
    val_dry += sqinject_val; // add very small dc offset
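To make the pseudo-code above concrete, here is a self-contained C sketch of the same square-injection idea. It is not the [gverb~] source: the struct `t_demo`, the functions `demo_init` and `demo_perform`, and the one-pole feedback filter are all invented for illustration, and plain float buffers stand in for the Max signal vectors.

```c
#include <stddef.h>

/* Same literal as in the post above. */
#define TINY_DC 0.0000000000000000000000001f

/* Hypothetical stand-in for the object struct. */
typedef struct {
    float sqinject_val;  /* tiny offset; its sign flips once per block */
    float state;         /* feedback state of an illustrative one-pole filter */
    float coeff;         /* feedback coefficient, |coeff| < 1 */
} t_demo;

/* Counterpart of the object "new" method. */
void demo_init(t_demo *x, float coeff)
{
    x->sqinject_val = TINY_DC;
    x->state = 0.0f;
    x->coeff = coeff;
}

/* Counterpart of the "perform" method:
   y[n] = (in[n] + offset) + coeff * y[n-1] */
void demo_perform(t_demo *x, const float *in, float *out, size_t n)
{
    /* Flip the sign once per block and store it back in the struct. */
    x->sqinject_val = -x->sqinject_val;

    const float sqinject_val = x->sqinject_val;
    float y = x->state;

    for (size_t i = 0; i < n; ++i) {
        float val_dry = in[i] + sqinject_val;  /* add very small dc offset */
        y = val_dry + x->coeff * y;
        out[i] = y;
    }
    x->state = y;
}
```

Because the offset is far above the smallest normal float, the feedback state settles near a multiple of the offset instead of decaying into the subnormal range, and flipping the sign every block keeps the injected signal free of a net DC component over time.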

For info: SuperCollider uses an inline function something like:

    #include <cmath>

    inline float zap(float x) throw()
    {
        float absx = std::abs(x);
        return (absx > (float)1e-15 && absx < (float)1e15) ? x : (float)0.;
    }

.. which gets rid of denorms and other nasties.
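For a Max external written in C rather than C++, the same range test could be sketched as below. This is an adaptation, not SuperCollider code: the name `zapf` is made up here, and `fabsf` from `<math.h>` replaces `std::abs`.

```c
#include <math.h>

/* C sketch of the range test quoted above (hypothetical name zapf).
   Values with magnitude outside (1e-15, 1e15) are replaced by 0. */
float zapf(float x)
{
    float absx = fabsf(x);
    /* Both comparisons are false for NaN, so NaN is zeroed as well. */
    return (absx > 1e-15f && absx < 1e15f) ? x : 0.0f;
}
```

Note that the lower threshold is far above the smallest normal float (roughly 1.2e-38), so this zeroes very small normal values too, not only true denormals. It would typically be applied to the feedback state once per sample, for example `y = zapf(in + coeff * y);`.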


> This solution seems to work well, but the object is a little more expensive--about 1.2x the number of cpu cycles. I suspect the performance hit is due to the compiler translating the code to SSE instructions and that we will need to tune our code for SSE.

From my experience this seems an unlikely reason for the code to run slower, on Mac OS at least, as most floating point code will most likely generate (non-vectorised) SSE code anyway on Mac OS X. You can see this by examining the generated assembly in Shark or something similar. Here's a relevant quote from the Apple SSE/AltiVec document:

"The scalar-on-vector feature is used by MacOS X on Intel to do most scalar floating point arithmetic. So, if you write a normal floating point expression, such as float a = 2.0f; that will be done on XMM. (For compiler illuminati, the GCC compiler flag, -mfpmath=sse, is on by default.) Single and double precision scalar floating point arithmetic is done on the SSE unit both for speed and also so as to deliver computational results much more like those obtained from PowerPC. The legacy x87 scalar floating point unit is still used for long double, because of its enhanced precision."

So I'd imagine the cost you're seeing is actually the cost of changing the floating point environment on such a frequent basis.

If you KNOW that your code is generating SSE instructions (or I suppose if you are using SSE intrinsics) there is what I believe might be a more lightweight way to turn denormal flushing on for the SSE unit. Here's the code I'm using, in my case for SSE vector code (so I know it's SSE instructions I'm generating):

    #if defined( __i386__ ) || defined( __x86_64__ )
    unsigned int oldMXCSR = _mm_getcsr();        // read the old MXCSR setting
    unsigned int newMXCSR = oldMXCSR | 0x8040;   // set DAZ and FZ bits
    _mm_setcsr( newMXCSR );                      // write the new MXCSR setting to the MXCSR
    #endif

    #if defined( __i386__ ) || defined( __x86_64__ )
    _mm_setcsr( oldMXCSR );                      // restore the old MXCSR setting once processing is done
    #endif

I remember trying a few things (I think including setting the floating point environment), and this was the fastest for what I wanted to do. It would be faster still not to have to set the bits every signal vector, but this unfortunately is necessary.

Branching in loops is generally slow, so selectively flushing will likely be slower than adding noise of any kind. Adding noise/DC/"flipped numbers"/square wave may well be negligible in terms of CPU (you often get very small ops "for free", because the bottleneck in your code is reading from and writing to memory, rather than the actual operations). It's up to you whether you mind adding noise to the filter or not.

There is a slightly different noise algorithm used in 2up.svf~ (code here: http://2uptech.com/objects.html) that I used to generate noise to feed to the standard svf~ object to fix a denormal issue I was having. With a filter you may be able to add noise at only the input/one stage to avoid denormals, depending on the filter and the magnitude of the noise. In the svf case there is a shaping calculation that takes the 4th power of the signal, which causes most of the denormals.
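The MXCSR approach described in this post can be wrapped in a pair of small helpers so that the save and restore steps cannot get out of sync. The following is only a sketch under stated assumptions: the names `denormals_off`, `denormals_restore` and `perform_block` are hypothetical, the one-pole filter is a placeholder for real DSP code, and on non-x86 targets (or x86 builds without SSE) the helpers deliberately do nothing.

```c
#if (defined(__i386__) || defined(__x86_64__)) && defined(__SSE__)
#include <xmmintrin.h>
#define HAVE_MXCSR 1
#else
#define HAVE_MXCSR 0
#endif

/* FZ (flush-to-zero) is bit 15 and DAZ (denormals-are-zero) is bit 6. */
#define MXCSR_FZ_DAZ 0x8040u

/* Turn on FZ and DAZ; return the previous MXCSR value for later restore. */
unsigned int denormals_off(void)
{
#if HAVE_MXCSR
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old | MXCSR_FZ_DAZ);
    return old;
#else
    return 0u;  /* no-op fallback on other architectures */
#endif
}

/* Put back the MXCSR value returned by denormals_off(). */
void denormals_restore(unsigned int old)
{
#if HAVE_MXCSR
    _mm_setcsr(old);
#else
    (void)old;
#endif
}

/* Hypothetical perform routine: y[n] = in[n] + coeff * y[n-1],
   with denormal flushing enabled only for the duration of the block. */
void perform_block(float *state, float coeff,
                   const float *in, float *out, int n)
{
    unsigned int old = denormals_off();
    float y = *state;

    for (int i = 0; i < n; ++i) {
        y = in[i] + coeff * y;
        out[i] = y;
    }
    *state = y;
    denormals_restore(old);
}
```

Restoring the saved value at the end of each block leaves the host's floating point settings as they were found, which matches the save/set/restore pattern in the post, at the cost of two MXCSR writes per signal vector.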
