christripledot

Max 5: 256 [saw~] objects consume roughly 6% CPU at 64/64 vector settings (w/overdrive + scheduler in interrupt) on a 2.2GHz i7 MBP (Snow Leopard).

I've noticed quite dramatic increases in CPU consumption in almost all of my patches.

I realise that MSP now uses 64 bit signals, so some additional overhead is unavoidable, but is a 3x slowdown something we just have to accept?

Somebody awesome (I forget who, no disrespect intended) recently shared a bunch of externals that make use of SSE parallelism to improve performance. When I saw this, I must confess that I was dismayed to learn that Max 5 didn't already make use of SSE parallelism. I would have thought that the 64 bit overhead could have been overcome, or even bettered, by using SSE instructions to process two doubles at once (Max 5 processing only a single float at once).

Now that Max 6 is Intel Mac only, I think it's time to start leveraging some extra processing power. Could someone confirm whether or not Max 6's code uses SSE intrinsics, and can we expect to see some performance improvements in the release version?

I've seen a couple of externals which attempted SSE optimizations, but the ones I saw were for MacOS only so I couldn't test them.

When Native Instruments added SSE2 support in Reaktor 3, a little over 10 years ago, they garnered astonishing performance improvements. In the modular design environment there's still little that can approach them in terms of polyphonic performance. However, Reaktor has no SDK or ability to compile standalones, and Native Instruments has indicated it will not make anything comparable to Jitter.

@christripledot - I guess you might mean me...

So, I can say a few relevant things here.

1 - I will port to Max 6 soon (or as soon as I have the time) with double precision SSE versions of stuff.

2 - I have been a fan of the idea of Max natively supporting SSE for some time, and had various conversations with the dev team about it. The last of these was at the expo a few days ago. The internal testing that c74 carried out suggested that the performance gains were very very small. I'm not sure about that (as it is not my experience), but they do not have time to pursue avenues that do not look fruitful. That is understandable. On the other hand they are very receptive to possible performance improvement ideas, and at the least are willing to do things to ensure that 3rd party devs (like me) can leverage SSE vectorisation effectively.

3 - I think it is fairly safe to say that AFAIK c74 code does not make use of SSE intrinsics. It turns out that on a Mac a lot/most of it will compile to SSE code, but by this I mean SSE scalar instructions, not vectorised code. That is not true under Windows, which may lead to different speed results from vectorising bits of DSP under that platform (due to switching between FPU and the vector unit).

4 - I have stuff in-house for Windows now, which I hope to release soon. Probably all further releases I make (personally and as part of the HISSTools stuff) will be dual-platform, although generally (and probably for several good reasons) I am seeing worse performance under Windows with the same source code. I'll try to improve what I can, but there may be limits there I cannot surmount, or do not have time to deal with.

5 - The speed-up with SSE vectorised doubles is not going to be as good as 2x for most things, so it is not necessarily going to be a massive difference. I hear that 4 doubles together is possible on the newest Intel processors, which will be great, but would only be available to a few users with new machines...
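
To make the scalar-versus-vectorised distinction in point 3 concrete, here is a minimal sketch (not Cycling '74's code) of the same multiply written once with scalar SSE2 operations and once with packed ones. The function names are made up for illustration; the intrinsics are the standard SSE2 ones from emmintrin.h. Unaligned loads and stores are used only to keep the example short.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar SSE: one multiply per instruction (mulsd); the upper lane is unused. */
void mul_scalar_sse(const double *a, const double *b, double *out, long n)
{
    for (long i = 0; i < n; i++) {
        __m128d va = _mm_load_sd(a + i);
        __m128d vb = _mm_load_sd(b + i);
        _mm_store_sd(out + i, _mm_mul_sd(va, vb));
    }
}

/* Packed SSE: two multiplies per instruction (mulpd). n must be even. */
void mul_packed_sse(const double *a, const double *b, double *out, long n)
{
    assert(n % 2 == 0);
    for (long i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_mul_pd(va, vb));
    }
}
```

Both versions run on the same XMM registers; the packed one simply issues half as many multiply instructions, which is why the ceiling for doubles is 2x and the real gain is usually lower.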

Oh and you should read Joshua's post in this thread:

https://cycling74.com/forums/msp-performances-in-maxmsp-6

which I just saw - the SIMD comment pleasantly surprised me!

Yes, I did mean you - sorry for not acknowledging you by name! :)

Re: Point 1:
I've actually started doing this myself - not to steal your idea or anything ;) - if you like we can swap source code and see what gains are achievable. You might not like my source though! I prefer to work in inline assembly where possible; the C intrinsics seem like more of a mouthful than the mnemonics! Plus now that Macs and PCs have the same instruction set under the hood, assembly is a bit more portable than it used to be.

Re: Point 5:
The potential for speedup depends greatly on how far you can unroll your loops, and how much data you can prefetch into the cache. What works well on one machine won't be so quick on another. Having said that, even a ~20% speedup of core objects that tend to get nested deeply (like +~, *~, etc.) can boost the performance of large patches significantly. I don't think there's much excuse not to use SIMD for add, sub, mul, div, round, floor, ceil, etc.

As you say, on the latest Intel chips (Sandy Bridge) it is possible to process 4 doubles at a time, using the new AVX instruction set. Unfortunately these instructions aren't supported in 32-bit code, and there aren't very many assemblers, let alone C compilers, that support the mnemonics/intrinsics yet. Deciphering Intel's docs and writing machine instructions byte by byte is too painful for me. One day though...


1 - sure. I actually use common source for the SSE objects and just replace the necessary ops by setting some defines, so I can port quite quickly when I need to. I also use macros for all vector ops, as for a while I was supporting Altivec too, and the SSE names are pretty nasty. Feel free to email me offlist to discuss this further though.

5 - Yes, obviously there are other factors that I didn't mention. Actually I don't loop unroll for anything other than very small loops, as I don't tend to see a speed-up from it. I think for the stuff I'm doing now cache behaviour is likely to be much more of an issue, but my experiments with prefetching have also never yielded significant speed-ups...
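
The "common source plus defines" approach from point 1 could look roughly like the sketch below. This is an assumption about the general shape, not the actual HISSTools source: the `VEC_*` macro names, the `vdouble` type and the `muladd` function are all hypothetical. The DSP loop is written once against the macros, and the backend is chosen at compile time.

```c
#include <assert.h>

#if defined(__SSE2__)
  #include <emmintrin.h>
  typedef __m128d vdouble;               /* two doubles per vector */
  #define VEC_WIDTH 2
  #define VEC_LOAD(p)      _mm_loadu_pd(p)
  #define VEC_STORE(p, v)  _mm_storeu_pd((p), (v))
  #define VEC_ADD(a, b)    _mm_add_pd((a), (b))
  #define VEC_MUL(a, b)    _mm_mul_pd((a), (b))
#else
  typedef double vdouble;                /* plain scalar fallback */
  #define VEC_WIDTH 1
  #define VEC_LOAD(p)      (*(p))
  #define VEC_STORE(p, v)  (*(p) = (v))
  #define VEC_ADD(a, b)    ((a) + (b))
  #define VEC_MUL(a, b)    ((a) * (b))
#endif

/* out[i] = a[i] * b[i] + c[i], written once against the macro layer. */
void muladd(const double *a, const double *b, const double *c,
            double *out, long n)
{
    assert(n % VEC_WIDTH == 0);
    for (long i = 0; i < n; i += VEC_WIDTH) {
        vdouble prod = VEC_MUL(VEC_LOAD(a + i), VEC_LOAD(b + i));
        VEC_STORE(out + i, VEC_ADD(prod, VEC_LOAD(c + i)));
    }
}
```

Adding another backend (Altivec, or a float build) then means adding one more block of defines rather than touching each object's perform routine.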

Yeah, the prefetch instructions are a bit of a mystery, aren't they?

Re: unrolling, I tend to just unroll until every available XMM register is filled.

The best all-round tradeoff I've found so far is to use negative addressing offsets for the output, like this:

void fastmul_perform64(t_fastmul *x,
                       t_object *dsp64,
                       double **ins,
                       long numins,
                       double **outs,
                       long numouts,
                       long sampleframes,
                       long flags,
                       void *userparam) {

    t_double *in1 = ins[0];
    t_double *in2 = ins[1];
    t_double *out = outs[0];
    int n = sampleframes;

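    asm {
            mov    ecx,    n            // loop counter
            mov    esi,    in1          // first input vector
            mov    edi,    in2          // second input vector
            mov    ebx,    out          // output vector
            shr    ecx,    3            // n / 8: each pass handles 8 doubles

        loopStart:
            // load 8 doubles from in1 (movapd needs 16-byte aligned addresses)
            movapd    xmm0,    [esi]
            movapd    xmm1,    [esi + 16]
            movapd    xmm2,    [esi + 32]
            movapd    xmm3,    [esi + 48]

            // load 8 doubles from in2
            movapd    xmm4,    [edi]
            movapd    xmm5,    [edi + 16]
            movapd    xmm6,    [edi + 32]
            movapd    xmm7,    [edi + 48]

            // multiply, with the pointer increments interleaved between the mulpd ops
            mulpd    xmm0,    xmm4
            add    ebx,    64
            mulpd    xmm1,    xmm5
            add    esi,    64
            mulpd    xmm2,    xmm6
            add    edi,    64
            mulpd    xmm3,    xmm7

            // non-temporal stores; ebx has already advanced, hence the negative offsets
            movntpd    [ebx - 64],    xmm0
            movntpd    [ebx - 48],    xmm1
            movntpd    [ebx - 32],    xmm2
            movntpd    [ebx - 16],    xmm3

            sub    ecx,    1
            jnz    loopStart
    }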
}

(This code assumes a minimum signal vector size of 8.)
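
For readers who prefer intrinsics to inline assembly, the loop above can be sketched as follows. This is a rough equivalent rather than the poster's code: the function name `fastmul_intrin` is hypothetical, the Max `perform64` signature is replaced by plain pointers, and the closing `_mm_sfence()` is an addition that the assembly version does not have. Like the assembly, it assumes the vector size is a multiple of 8 and that all buffers are 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_sfence */

/* Multiply two signal vectors, 8 doubles per pass (4 XMM pairs x 2 lanes). */
void fastmul_intrin(const double *in1, const double *in2, double *out, long n)
{
    assert(n % 8 == 0);
    assert((((uintptr_t)in1 | (uintptr_t)in2 | (uintptr_t)out) % 16) == 0);

    for (long i = 0; i < n; i += 8) {
        /* aligned loads (movapd) */
        __m128d a0 = _mm_load_pd(in1 + i);
        __m128d a1 = _mm_load_pd(in1 + i + 2);
        __m128d a2 = _mm_load_pd(in1 + i + 4);
        __m128d a3 = _mm_load_pd(in1 + i + 6);

        __m128d b0 = _mm_load_pd(in2 + i);
        __m128d b1 = _mm_load_pd(in2 + i + 2);
        __m128d b2 = _mm_load_pd(in2 + i + 4);
        __m128d b3 = _mm_load_pd(in2 + i + 6);

        /* packed multiplies (mulpd) followed by non-temporal stores (movntpd) */
        _mm_stream_pd(out + i,     _mm_mul_pd(a0, b0));
        _mm_stream_pd(out + i + 2, _mm_mul_pd(a1, b1));
        _mm_stream_pd(out + i + 4, _mm_mul_pd(a2, b2));
        _mm_stream_pd(out + i + 6, _mm_mul_pd(a3, b3));
    }

    _mm_sfence();  /* order the streaming stores before any later stores */
}
```

The compiler decides register allocation and instruction scheduling here, so the generated code will not match the hand-written version exactly; comparing the two in a benchmark is the only way to tell whether that matters.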

Interesting. I don't hand-code assembly, for various reasons (though I can read the above no problem), but I'd be interested to see the results in action in a comparison.

This is the kind of loop where unrolling is worth thinking about (it's pretty small), but it's possible that I'd leave this to the compiler in an object like this (it's a flag in Xcode gcc)....

I'll email you tonight or tomorrow with some bits and bobs, if that's OK.

Obviously, you are some very talented programmers! I can understand the performance gains for a single voice are rather limited; but what are your thoughts on implementing SIMD to parallelize voices?

There are no data dependencies between voices until they are summed into a monophonic voice. So instead of unrolling a loop across the pipelines for each sample, multiple samples can be processed, one for each voice, in parallel, and theoretically at least, SIMD should then be able to do the same for Max as it did for Reaktor 3.0 over a decade ago.

@Ernest:
Parallelising voices is a perfect application for SIMD.

Alex is the master here; he's already implemented sample-accurate voice management. I haven't the foggiest what goes on under [poly~]'s hood...


So SIMD to parallelize voices/channels is totally doable, and in some cases a bigger gain than single channel SIMD because you can do sample level feedback etc. without issue. I have used this once to do a double precision stereo filter external for the same cost roughly as a single channel scalar one.

However, the way voices work in poly~ or in my dynamicdsp~ object means that this is not possible without changing the underlying model. It would be hard to make it work that way, as in the MaxMSP model the voices are made up of a copy of all the objects in the patch, each with its own state, which needs to be maintained and dealt with. This is not impossible theoretically, but is not possible within the max object model as it exists and so is unlikely to ever happen.

Multichannel parallelisation with SIMD is much more realistic, especially as this would most likely mean stereo for most people, and the packed 64bit type covers two lots of data.

I doubt c74 is going to do either of those things anytime soon though.
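
As an illustration of the stereo case described above, here is a minimal sketch of a one-pole lowpass where the left and right channels occupy the two lanes of one SSE register. It is not the external mentioned in the post; the function name, the interleaved buffer layout and the filter itself are assumptions chosen to keep the example short. The point is that the feedback term `y` stays per-sample, yet both channels are updated by the same instructions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One-pole lowpass, y += g * (x - y), on an interleaved stereo buffer
   (L R L R ...). Lane 0 holds the left state, lane 1 the right state. */
void onepole_stereo(const double *in, double *out, long frames,
                    double g, double state[2])
{
    assert(g > 0.0 && g <= 1.0);

    __m128d y = _mm_loadu_pd(state);
    const __m128d coeff = _mm_set1_pd(g);

    for (long i = 0; i < frames; i++) {
        __m128d x = _mm_loadu_pd(in + 2 * i);              /* one stereo frame */
        y = _mm_add_pd(y, _mm_mul_pd(coeff, _mm_sub_pd(x, y)));
        _mm_storeu_pd(out + 2 * i, y);
    }

    _mm_storeu_pd(state, y);                               /* keep state for the next block */
}
```

Because the recursion runs sample by sample within each lane, this costs roughly the same number of arithmetic instructions as a mono scalar version of the same filter.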

Indeed! I haven't been able to squeeze more than ~30% more speed out of a single-channel SIMD multiply, but I'm sure there are lots of people whose patches could make use of multichannel arithmetic objects (I'm thinking 8 SSE registers = 16 channel mixer gain block...)

30% might be it. For multichannel you need to do quite a bit of processing to beat the cost of interleaving the channels and deinterleaving, so for a single multiply I'm not sure whether this will be the limit.

I have some time tonight and I will try to use some of it to port a whole load of vector objects and benchmark. I'll report back, although packed doubles is never going to give you the 3-4x win that packed floats is capable of - 1.5-2x would be my expectation...
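
The interleaving cost mentioned above can be seen in a small sketch. It assumes two separate (non-interleaved) channel buffers, as MSP delivers them, and a hypothetical two-channel gain function. The channels are shuffled into per-frame vectors, multiplied, and shuffled back: four unpack operations to support two multiplies, which is why a single multiply is unlikely to pay for the rearrangement.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Apply a separate gain to each of two channels, one frame per vector. */
void gain2_interleaved(const double *chA, const double *chB,
                       double *outA, double *outB,
                       long n, double gainA, double gainB)
{
    assert(n % 2 == 0);
    const __m128d gains = _mm_set_pd(gainB, gainA);  /* lane 0 = A, lane 1 = B */

    for (long i = 0; i < n; i += 2) {
        __m128d a = _mm_loadu_pd(chA + i);           /* a0 a1 */
        __m128d b = _mm_loadu_pd(chB + i);           /* b0 b1 */

        /* interleave: one vector per frame */
        __m128d f0 = _mm_unpacklo_pd(a, b);          /* a0 b0 */
        __m128d f1 = _mm_unpackhi_pd(a, b);          /* a1 b1 */

        f0 = _mm_mul_pd(f0, gains);
        f1 = _mm_mul_pd(f1, gains);

        /* deinterleave back to per-channel layout */
        _mm_storeu_pd(outA + i, _mm_unpacklo_pd(f0, f1));
        _mm_storeu_pd(outB + i, _mm_unpackhi_pd(f0, f1));
    }
}
```

With more work per frame (a filter, say) the shuffles are amortised over more arithmetic and the balance shifts in favour of the multichannel layout.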

Would it be a good request to make the gen~ object work in simd with some limited set of objects, do you think?

Slight concerns about MSP performance