BENCHMARKING max4 vs max5 etc.

Kessito's icon

Hi All,

To finally get clear on exactly what the difference in audio performance is between Max 4 and Max 5, I made a patch that tests CPU usage with four very common audio calculations: biquads, divides, multiplies and adds. It runs with the non-realtime driver, so you don't get differences caused by audio drivers. There is also no graphical stuff going on, so we're testing pure audio performance. I would like as many people as possible to run this test on their machines and report the results along with a description of their machine. My results are:

MAX 4:
500 biquads: 55%
1000 multiplies: 49%
5000 divisions: 61%
5000 adds: 58%

MAX 5:
500 biquads: 68%
1000 multiplies: 63%
5000 divisions: 62%
5000 adds: 62%

Machine: Pentium 4, 2.4 GHz, 1 GB RAM, WinXP Pro SP2

I hope that this way we can help the devs gain insight into what is going on, so that they can fix this strange difference between Max 4 and Max 5.

cheers,
Kessito

[deleted patch as text - it's attached below]

Kessito's icon

I'm sorry, I didn't realize it would be so much text. Here is the patcher file.

Kessito's icon

A little more info: you may have to press "start test" twice. Each test takes 5 seconds.

One thing I find very strange is that 5000 divisions take up about the same CPU as 1000 multiplies, while I was always taught in C/C++/DSP classes that a division takes 20 times as many CPU cycles as a multiply.

Kessito's icon

Doesn't anyone find this interesting? Remember that this could give a very nice overview of performance on Mac vs PC, Pentium vs Athlon vs Core 2 Duo vs Xeon, etc. The more you measure, the more you know!

ComfortableInClouds's icon

i definitely find this interesting, and am curious to see how others compare.

for some reason, this patch wouldn't open in max4 for me. i would right click it, choose max 4.6.3 to open it with, and nothing would happen. ?

with Max 5 here is roughly how my cpu performed (i say roughly because i ran the test multiple times and got slightly different values each time):

biquad 48, multiplies 38, division 64, addition 64.

I'm on a mac laptop, OS X 10.4.11, processor 2.16 GHz Intel Core Duo.

it surprised me how much CPU addition takes.

cool experiment btw :D

Jakob Riis's icon

I ran the test on a MacBook Pro, 2.2 GHz, Mac OS X 10.4.11.

Max 4.6.3:

1000 multiplies: 25
5000 divisions: 41
5000 adds: 42

Max 5.0.3:
500 biquads: 41
1000 multiplies: 39
5000 divisions: 47
5000 adds: 45

/J

Jakob Riis's icon

the missing line should be

martinrobinson's icon

Quote: Kessito wrote on Thu, 14 August 2008 13:21
----------------------------------------------------
> One thing I find very strange is that 5000 divisions take up about the same CPU as 1000 multiplies, while I was always taught in C/C++/DSP classes that a division takes 20 times as many CPU cycles as a multiply.
----------------------------------------------------

When [/~] is operating on a signal and a float as in your patch (rather than two signals) it is possible that c74 take 1./float and then do a multiply instead of a divide in the DSP loop. Of course you can't do this optimisation when operating on two signals. This doesn't explain why [/~] is apparently cheaper than [*~]. (I modified the patch to have 1000 [*~] and 1000 [/~]. [/~] uses the same or less CPU on my machine.)

Perhaps [/~] is super optimised to try to counteract the relative slowness of division, whereas optimisations for [*~] have been paid less attention since it's assumed to be fast.

So perhaps we can still learn from Aesop in MaxMSP?:
http://en.wikipedia.org/wiki/The_Tortoise_and_the_Hare

I'll be using [/~] and a preceding [!/ 1.] in its right inlet for scalar multiplication in future :)
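
For readers who want to see the reciprocal trick discussed above in code, here is a minimal C sketch. It is not Cycling '74's source: the function names `div_by_scalar` and `div_by_signal` are made up for illustration, and the sketch only assumes that the divisor is a constant for the duration of one signal vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical perform routine for "signal / float".
   The divide happens once per vector, outside the loop. */
void div_by_scalar(const float *in, float *out, size_t n, float divisor)
{
    assert(divisor != 0.0f);
    const float recip = 1.0f / divisor;   /* one divide per vector */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * recip;           /* one multiply per sample */
}

/* Hypothetical perform routine for "signal / signal".
   The divisor changes every sample, so the divide stays in the loop. */
void div_by_signal(const float *num, const float *den, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = num[i] / den[i];         /* one divide per sample */
}
```

With this structure the scalar case costs about the same as a multiply, which would be consistent with the benchmark numbers. Multiplying by a reciprocal can differ from a true division in the last bit of the result.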

Kessito's icon

I have done the test on my studio machine:
Core 2 Quad processor, 2.4 GHz, 4 GB RAM, WinXP Pro SP2

MAX 4.6.3:
biquads: 17%
multiplies: 17%
divisions: 20%
adds: 20%

MAX 5.0.4:
biquads: 26%
multiplies: 26%
divisions: 24%
adds: 24%

When there is enough info from different systems I will create a graph and upload it, so we have a nice overview of all the different systems.

sub0's icon

Hello Kessito, good idea this benchmarking.

My results:

Macbook OSX 10.5.4, 2GHz, 2GB
Max MSP 4.6.3
biquad 62, multiplies 29, division 46, addition 52.
Max MSP 5
biquad 100, multiplies 45, division 56, addition 49.

Best Lucas

Andrew Pask's icon

Could someone please post this patch as an attachment so I can delete that scrolling nightmare? My copy of FF locks up trying to select it all.

-A

Kessito's icon

Hi all,

Thanks to everyone who has already responded and done the test. I hope a lot more people will make this small effort for a good cause: the more input, the more reliable the results!

One little thing: since the CPU meter in Max (just like the ones in XP and OS X) can be a bit wobbly, it's a good idea to run the test a couple of times to see whether the results are more or less consistent.

greetings,
Kessito

ComfortableInClouds's icon

the patch is already uploaded as an attachment in kessito's second post.

If someone gets 100% for biquads, try running the test again. I found that the first time I run the test I get 100%, but in subsequent trials biquad stays consistently slightly below 40.

Peter Castine's icon

Quote: martinrobinson wrote on Fri, 15 August 2008 10:15
----------------------------------------------------

> When [/~] is operating on a signal and a float as in your patch (rather than two signals) it is possible that c74 take 1./float and then do a multiply instead of a divide in the DSP loop.

This is indeed the case. It is actually documented. Possibly in the 4.x SDK, which would explain why relatively few people are aware of it.

> This doesn't explain why [/~] is apparently cheaper than [*~]. (I modified the patch to have 1000 [*~] and 1000 [/~]. [/~] uses the same or less CPU on my machine.)

This is odd, and I am not aware of anything documented that would explain it. I can make a couple of guesses, but it's all stabbing in the dark. Avoiding a single/double conversion inside the DSP loop? Only Max's hairdresser knows for sure.

(How old does one have to be to get the hairdresser joke?)

-- P.

nathan wolek's icon

Only Max's hairdresser knows for sure.
(How old does one have to be to get the hairdresser joke?)

I am 31 and I got it. I can't remember the specific product for that advert though. Was it Vidal Sassoon?
--Nathan

Kessito's icon

I have done the test on my old laptop:
P-3 processor, 1.8 GHz (overclocked), 512 MB RAM, WinXP Home SP1

MAX 4.6.3:
biquads: 47%
multiplies: 48%
divisions: 66%
adds: 66%

MAX 5.0.4:
biquads: 59%
multiplies: 60%
divisions: 78%
adds: 79%

Kessito

zoid's icon

Max 4.6.3

27%-biquads
25%-multiplication
41%-division
40%-addition

**********

Max 5.0.3:

39%-biquads
38%-multiplication
47%-division
41%-addition

**********
MacIntel 2.16 GHz Intel Core 2 Duo
2 GB 667 MHz
Mac OSX 10.4.11

zoid's icon

meant to add that it's a MacBook Pro i'm using.

MuShoo's icon

Max 4.6.3: (ranges of multiple tests)
Biquad: 40 - 12
Multiplies: 12 - 10
Divides: 23 - 18
Adds: 25 - 20

Max5.0.3:
Biquads: 35 - 21
Multiples: 25 - 18
Divides: 25 - 20
Adds: 28 - 23

8-core 3ghz Mac Pro with 5gb RAM, leopard 10.5.4

I wonder how Max5 would fare under Snow Leopard compared to Max4. Or if Max5 is ever going to get multiprocessor support on its own. Anyone know how Max5 deals with multithreading? It definitely wasn't using 21% on the biquads across all my CPUs: my total CPU usage (I have another meter) never went above about 10%, including all the other things I have running (iTunes, Adium, both versions of Max open at the same time, etc).

jonathan segel's icon

gee, on my 1.5Ghz G4 Powerbook, OS 10.4.11 (2GB ram)
with 4.6.3 it gives vastly different results every time i run the
test. first time it's 100% every thing. then it calms down some but
still every run it gives different results.

with 5.0.3 it's pinned from the moment i turn on the audio, but
running it open i can see that there is a low end to the cpu
readings. it always peaks at 100% of course, cuz of spikes while
running.

4.6.3
biquads 66-74
multiplies 42 -62
divisions 43--60
adds ~45-67

5.0.3
biquads ~70-100
multiplies ~40-100
divisions ~98-100
adds ~97-100

great. that really tells me something.
__________________________
jonathan segel
jsegel-at-magneticmotorworks.com
etc.

Kessito's icon

Thanks for all the replies so far. I would just like to see some more replies from PC users so we have more data to average.
It is clear that the tests are not perfectly stable, but if enough people join in, the average results will be fairly reliable.
If about 5 more people join in, I will put the results in an Excel sheet and post the sheet here.

Keep up the good work, and for all the people who haven

Rob Ramirez's icon

what exactly do you expect the devs to be able to do with the results from your experiment?

it's already been determined and addressed that max5 will run slower than max4 for some tasks on some computers. i'm sure all the devs are well aware of the performance differences between the two

so...

if you are trying to determine the fastest way to perform a specific task, then empirical tests like these are very useful, to your specific task on your specific hardware. but other than that, not so useful.

of course, imho.

Kessito's icon

I do not agree. When max5 was released, the devs stated that graphical stuff would be slower because of all the 32-bit graphics and the like. But the devs also stated that the audio performance should be the same as in max4. When I said that the audio performance had gone down, the devs replied that this shouldn

kjg's icon

Quote: Kessito wrote on Tue, 19 August 2008 20:09
----------------------------------------------------
Come on guys, it

dondelion's icon

max 4 -

500 biquads 75
1000 multiplies 24
5000 divisions 39
5000 adds 32

max 5 -

500 biquads 92
1000 multiplies 59
5000 divisions 38
5000 adds 37

win xp,2gb,core duo 1.6ghz,using cacky on-board sound card...

Andrew Pask's icon

Here are some notes from a conversation I have had with Joshua which might prove illuminating. We'll try and have a more detailed look at this later on.

One *huge* important thing which is lacking in people's reports is their signal vector size, which *must* be the same for these things to have any meaning whatsoever.

However, some major things have changed for these simple objects:

- For Macintosh PPC users: We no longer use Altivec optimization on this new version due to compiler differences. On PPC Max 4 would use CodeWarrior CFM code for the altivec optimization. This is no longer done.

- For biquad there is a new and improved internal algorithm: "It includes smooth coef changes, and synchronous coef changes, as well as a stoke feature allowing sample memory to be artificially set, thereby accommodating things like ringing oscillators."

- In some places there is more rigorous testing and denormal detection, in order to avoid huge performance spikes when numbers get really small in a feedback situation like reverb or a delay line.
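
To illustrate the last point, here is a minimal C sketch of denormal protection in a feedback loop. It is not the MSP implementation: the `TINY` threshold, the helper `flush_tiny`, and the one-pole structure are assumptions chosen for illustration. The extra comparison per sample is the kind of small cost the note above refers to.

```c
#include <assert.h>
#include <stddef.h>

/* Anything smaller than this is treated as silence.
   The value is an assumption, chosen well above the float denormal range. */
#define TINY 1e-20f

/* Replace very small values with exactly zero. */
static float flush_tiny(float x)
{
    return (x < TINY && x > -TINY) ? 0.0f : x;
}

/* One-pole feedback, y[n] = x[n] + g * y[n-1], with the state flushed
   every sample so a decaying tail cannot drift into denormal numbers. */
void onepole_feedback(const float *in, float *out, size_t n,
                      float g, float *state)
{
    assert(state != NULL);
    float y = *state;
    for (size_t i = 0; i < n; i++) {
        y = flush_tiny(in[i] + g * y);    /* extra compare per sample */
        out[i] = y;
    }
    *state = y;                           /* carry state to the next vector */
}
```

Without the flush, the tail of an impulse keeps shrinking until it reaches the denormal range, where many CPUs process arithmetic far more slowly.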

However, the important thing to realize is that, since these simple objects are usually greatly overwhelmed by more expensive objects, even while the performance discrepancies of 5000 multiplication objects seem to be large, an average "real world" patch is not that much slower for DSP processing.

Cheers

-A

sub0's icon

So maybe it is time to create a benchmarking system that tests both
raw mathematics and real-world situations? I've been
thinking lately of a benchmark system for Jitter as well. It would be
great to have relevant test results for Max/MSP/Jitter regarding its
performance on various hardware and OS configurations.

>
> However, the important thing to realize is that since these simple
> objects are usually greatly overwhelmed by more expensive objects,
> that even while the performance discrepancies of 5000 multiplication
> objects seem to be large, an average "real world" patch is not that
> much slower for DSP processing.
>
>
> Cheers
>
> -A

Telcosystems
PO box 174
3000 AD
Rotterdam
www.telcosystems.net
info@telcosystems.net

Stefan Tiedje's icon

Andrew Pask schrieb:
> Could someone please post this patch as an attachment so I can delete
> that scrolling nightmare? My copy of FF locks up trying to select it
> all.

On the mailing list it didn't show up at all...

I'd love to see why it's faster in Max 4, I can only guess, no patch till
now...

--
Stefan Tiedje------------x-------
--_____-----------|--------------
--(_|_ ----|-----|-----()-------
-- _|_)----|-----()--------------
----------()--------www.ccmix.com

Kessito's icon

>One *huge* important thing which is lacking in people's reports is their signal vector size, which *must* be the same for these things to have any meaning whatsoever.

This is exactly the reason why I put comments in the test telling people which vector size to use, so I'm sure everyone is using the same settings.

>However, some major things have changed for these simple objects:

This I find very funny. When I originally commented in a different thread that pure audio performance had gone down, I was told by the devs that this should not be possible since the audio calculations hadn't changed.

>- For Macintosh PPC users: We no longer use Altivec optimization on this new version due to compiler differences. On PPC Max 4 would use CodeWarrior CFM code for the altivec optimization. This is no longer done.

>- For biquad there is a new and improved internal algorithm: "It includes smooth coef changes, and synchronous coef changes, as well as a stoke feature allowing sample memory to be artificially set, thereby accommodating things like ringing oscillators."
>
>- In some places there is more rigorous testing and denormal detection, in order to avoid huge performance spikes when numbers get really small in a feedback situation like reverb or a delay line.

I have never ever in all the years that I have used MSP experienced any denormal problems, except for externals that had not been written by Cycling, so it seems that this was already covered fine in MSP.

>However, the important thing to realize is that, since these simple objects are usually greatly overwhelmed by more expensive objects, even while the performance discrepancies of 5000 multiplication objects seem to be large, an average "real world" patch is not that much slower for DSP processing.
>
>Cheers
>
>-A

I'm sorry, but I really have to disagree with you on this. The exact reason I made this benchmark test is that all my "real world" patches became much slower when I moved them from Max 4 to Max 5. I think it is very hard to say what exactly a "real world" patch is, since this differs greatly between users. I always find myself making patches with lots of biquads in them, for example. It seems only logical to do these kinds of tests with very basic objects, since these are the fundamentals of every patch and it would be virtually impossible to make a good comparison in any other way.

Cheers

Kessito

Kessito's icon

Hi to you all,

I made a graph with the results of all the benchmark tests so far.
For people who posted multiple results I have taken the lowest values. Go have a look!

p.s. this doesn't mean that we don't need more results, keep em coming folks!

Cheers,
Kessito

Andrew Pask's icon

Joshua sent me another email about this thread and the patch in it this morning. I present it to you in all its glory for your delectation.

-A

If I change the "peak" object to a running average (e.g. slide 4 4), which makes more sense for effective benchmarking, I get much closer numbers for Max 4 and Max 5 benchmarking, with the exception of biquad. Biquad is completely explained by the new smoothing algorithm costs. The other objects' slight differences in performance are attributable to compiler changes (gcc 4.0 vs gcc 3.3 on OS X and VS 2005 vs 2008 on PC), and/or some additional small overhead (2-3%) we have in the dsp chain for things like signal probing.
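
To show why the choice of meter matters, here is a small C sketch comparing a peak hold with a slide-style running average over a series of CPU readings. It uses the smoothing formula documented for the slide object, y[n] = y[n-1] + (x[n] - y[n-1]) / slide. The function names and the choice to start the average at the first reading are assumptions made for this illustration, not taken from the benchmark patch.

```c
#include <assert.h>
#include <stddef.h>

/* Peak hold: the highest reading seen, so a single spike sets the result. */
float peak_of(const float *x, size_t n)
{
    assert(n > 0);
    float p = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > p)
            p = x[i];
    return p;
}

/* Slide-style smoothing: y += (x - y) / slide, with separate factors for
   rising and falling input. Starting at x[0] is an assumption. */
float slide_of(const float *x, size_t n, float slide_up, float slide_down)
{
    assert(n > 0 && slide_up >= 1.0f && slide_down >= 1.0f);
    float y = x[0];
    for (size_t i = 1; i < n; i++) {
        float s = (x[i] > y) ? slide_up : slide_down;
        y += (x[i] - y) / s;
    }
    return y;
}
```

For the readings 40, 40, 100, 40, 40, 40 the peak meter reports 100, while the slide 4 4 average ends near 46. That matches the reports in this thread of tests pinned at 100% by a single spike.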

If users wish to use the older and cheaper biquad algorithm in version 5.0.4, they can most likely send the smooth 0 message to biquad with the universal object or similar to regain performance similar to the old version (this must be done before turning on DSP). However, I've made it so that the new smooth algorithm only incurs greater cost when necessary in version 5.0.5.

Max 5
500 biquads: 43 (33 with smooth 0, or new 5.0.5 handling)
1000 multiplies: 29
5000 divisions: 50
5000 adds: 48

Max 4
500 biquads: 30
1000 multiplies: 27
5000 divisions: 44
5000 adds: 48

AdrianFreed's icon

My filter objects (and for that matter my additive synthesis ones) have always had coefficient interpolation but this feature is optional so as not to penalize the common case of fixed parameters.

This kind of smoothing is hard to do right with all the details of overdrive atomicity etc. I hope that the new biquad source is part of the 5.0 SDK so we can have a single model of how to do it.
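
As a rough picture of what optional coefficient interpolation can look like, here is a C sketch of a direct-form biquad that ramps its coefficients linearly across one signal vector, and only when smoothing is enabled and a coefficient has actually changed. This is an illustration under stated assumptions, not the biquad~ source and not the filter objects mentioned above: the type `biquad_t`, the function `biquad_perform`, and the linear per-vector ramp are all hypothetical, and thread safety (the overdrive issue) is ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float c[5];        /* current coefficients: a0 a1 a2 b1 b2 */
    float target[5];   /* coefficients to reach by the end of the vector */
    float x1, x2;      /* previous two inputs */
    float y1, y2;      /* previous two outputs */
    int   smooth;      /* 0 = jump to target, 1 = ramp to target */
} biquad_t;

/* y[n] = a0*x[n] + a1*x[n-1] + a2*x[n-2] - b1*y[n-1] - b2*y[n-2] */
void biquad_perform(biquad_t *f, const float *in, float *out, size_t n)
{
    assert(f != NULL && n > 0);
    float step[5] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    int ramp = 0;

    for (int k = 0; k < 5; k++) {
        if (!f->smooth) {
            f->c[k] = f->target[k];            /* cheap path: no ramp */
        } else if (f->target[k] != f->c[k]) {
            step[k] = (f->target[k] - f->c[k]) / (float)n;
            ramp = 1;                          /* pay for ramp only if needed */
        }
    }

    for (size_t i = 0; i < n; i++) {
        if (ramp)
            for (int k = 0; k < 5; k++)
                f->c[k] += step[k];            /* extra work per sample */
        float x = in[i];
        float y = f->c[0] * x + f->c[1] * f->x1 + f->c[2] * f->x2
                - f->c[3] * f->y1 - f->c[4] * f->y2;
        f->x2 = f->x1; f->x1 = x;
        f->y2 = f->y1; f->y1 = y;
        out[i] = y;
    }

    if (ramp)
        for (int k = 0; k < 5; k++)
            f->c[k] = f->target[k];            /* land exactly on the target */
}
```

When the coefficients are fixed, the inner loop is the plain five-multiply biquad, so the common case is not penalized. The per-sample ramp is where a smoothing algorithm spends its extra CPU.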

Kessito's icon

Thanks for the response, Joshua! It's good to hear that the main difference is in the biquad and can be solved. Would it be possible for you to upload the modified patch that you've used so we can try this too?
Many Thanks,
Kessito

zerox_'s icon

could you post the patch again ? please

zerox_'s icon

bump

Roald Baudoux's icon

Kessito wrote:
"I have never ever in all the years that I have used MSP experienced any denormal problems, except for externals that had not been written by Cycling, so it seems that this was already covered fine in MSP".

I am not so sure about it. I have often encountered NaNs with average~ in Max 4.x and I suspect it had something to do with poor denormalization.

Best,

Roald Baudoux

Emmanuel Jourdan's icon

If you have an example in Max 5, we'll be happy to look into it.