BENCHMARKING max4 vs max5 etc.
To finally get clear on what exactly the difference in audio performance is between max4 and max5, I made a patch which tests the CPU usage of 4 very common audio calculations: biquads, divide, multiply and add. It tests the audio performance with the non-realtime driver, so you don’t get differences caused by drivers. Also, there is no graphical stuff going on, so we’re just testing pure audio performance. I would like as many people as possible to do this test on their machines and report the results along with their machine description. My results are:
500 biquads: 55%
1000 multiplies: 49%
5000 divisions: 61%
5000 adds: 58%
500 biquads: 68%
1000 multiplies: 63%
5000 divisions: 62%
5000 adds: 62%
Machine: Pentium 4, 2.4 GHz, 1 GB RAM, WinXP Pro SP2
I hope that in this way we can help the devs gain insight into what is going on, so that they can fix this strange difference between max4 and max5.
[deleted patch as text - it's attached below]
I’m sorry, I didn’t realize that it would be so much text. Here is the patcher file.
A little more info: you may have to press start test twice. Each test takes 5 seconds.
One thing that I find very strange is that 5000 divisions take up the same CPU as 1000 multiplies, while I was always taught in C/C++/DSP classes that a division takes 20 times as many CPU cycles as a multiply.
Doesn’t anyone find this interesting? Remember that this could be a very nice overview of the performance of Mac vs PC, Pentium vs Athlon vs Core 2 Duo vs Xeon, etc. The more you measure, the more you know!
i definitely find this interesting, and am curious to see how others compare.
for some reason, this patch wouldn’t open in max4 for me. i would right click it, choose max 4.63 to open it with, and nothing would happen. ?
with Max 5 here is roughly how my cpu performed (i say roughly because i ran the test multiple times and got slightly different values each time):
biquad 48, multiplies 38, division 64, addition 64.
I’m on a mac laptop, OS X 10.4.11, processor 2.16 GHz Intel Core Duo.
it surprised me how much CPU addition takes.
cool experiment btw :D
I ran the test on a MacBook Pro, 2.2 GHz, Mac OS X 10.4.11.
1000 multiplies: 25
5000 divisions: 41
5000 adds: 42
500 biquads: 41
1000 multiplies: 39
5000 divisions: 47
5000 adds: 45
the missing line should be
Quote: Kessito wrote on Thu, 14 August 2008 13:21
> One thing that I find very strange is that 5000 divisions take up the same CPU as 1000 multiplies, while I was always taught in C/C++/DSP classes that a division takes 20 times as many CPU cycles as a multiply.
When [/~] is operating on a signal and a float as in your patch (rather than two signals) it is possible that c74 take 1./float and then do a multiply instead of a divide in the DSP loop. Of course you can’t do this optimisation when operating on two signals. This doesn’t explain why [/~] is apparently cheaper than [*~]. (I modified the patch to have 1000 [*~] and 1000 [/~]. [/~] uses the same or less CPU on my machine.)
Perhaps [/~] is super optimised to try and counteract the relative slowness of division, whereas optimisations for [*~] have been paid less attention since it’s assumed to be fast.
So perhaps we can still learn from Aesop in MaxMSP?:
I’ll be using [/~] and a preceding [!/ 1.] in its right inlet for scalar multiplication in future :)
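The reciprocal trick discussed above (divide once outside the loop, then multiply per sample) can be sketched as follows. This is only an illustration in Python, not the actual MSP source: the function names are made up for the example, and a real DSP perform routine would be written in C and operate on signal vectors.

```python
import math

def divide_by_scalar(signal, divisor):
    """Naive version: one division per sample."""
    return [x / divisor for x in signal]

def multiply_by_reciprocal(signal, divisor):
    """Optimised version: one division up front, then one multiply per sample.

    This only works when the divisor is a constant (a float), not a signal.
    """
    reciprocal = 1.0 / divisor
    return [x * reciprocal for x in signal]

# Hypothetical input vector, chosen only for illustration.
signal = [0.5, -0.25, 1.0, 0.125]

# With a power-of-two divisor the reciprocal is exact, so both give identical output.
exact_a = divide_by_scalar(signal, 4.0)
exact_b = multiply_by_reciprocal(signal, 4.0)

# With other divisors the two results can differ in the last bit,
# so they should be compared with a tolerance rather than with ==.
approx_a = divide_by_scalar(signal, 3.0)
approx_b = multiply_by_reciprocal(signal, 3.0)
close = all(math.isclose(p, q, rel_tol=1e-12) for p, q in zip(approx_a, approx_b))
```

Note that the optimisation is not bit-for-bit identical to a true division for every divisor, which is one reason it cannot simply be applied everywhere.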
I have done the test on my studio machine:
Core 2 Quad processor 2.4 GHz, 4 GB RAM, WinXP Pro SP2
When there is enough info from different systems I will create an overview graph and upload it, so we have a nice overview of all the different systems.
Hello Kessito, good idea this benchmarking.
Macbook OSX 10.5.4, 2GHz, 2GB
Max MSP 4.6.3
biquad 62, multiplies 29, division 46, addition 52.
Max MSP 5
biquad 100, multiplies 45, division 56, addition 49.
Could someone please post this patch as an attachment so I can delete that scrolling nightmare? My copy of FF locks up trying to select it all.
Thanks to the ones who have already responded and done the test. I hope that a lot more people will make this small effort for the good cause: the more input, the more reliable the results!
One little thing: since the CPU counter in Max (just like the ones in XP and OS X) can be a bit wobbly, it’s a good idea to run the test a couple of times to see whether the results are reasonably consistent.
the patch is already uploaded as an attachment in kessito’s second post.
If someone gets 100% for biquads, try running the test again. I found that the first time I run the test, I get 100%, but in subsequent trials, biquad stays consistently slightly less than 40.
Quote: martinrobinson wrote on Fri, 15 August 2008 10:15
> When [/~] is operating on a signal and a float as in your patch (rather than two signals) it is possible that c74 take 1./float and then do a multiply instead of a divide in the DSP loop.
This is indeed the case. It is actually documented. Possibly in the 4.x SDK, which would explain why relatively few people are aware of it.
> This doesn’t explain why [/~] is apparently cheaper than [*~]. (I modified the patch to have 1000 [*~] and 1000 [/~]. [/~] uses the same or less CPU on my machine.)
This is odd, and I am not aware of anything documented that would explain it. I can make a couple of guesses, but it’s all stabbing in the dark.
(How old does one have to be to get the hairdresser joke?)
Only Max’s hairdresser knows for sure.
(How old does one have to be to get the hairdresser joke?)
I am 31 and I got it. I can’t remember the specific product for that advert though. Was it Vidal Sassoon?
I have done the test on my Old laptop:
P-3 processor 1.8 GHz (overclocked), 512 MB RAM, WinXP Home SP1
MacIntel 2.16 GHz Intel Core 2 Duo
2 GB 667 MHz
Mac OSX 10.4.11
meant to add that it’s a MacBook Pro i’m using.
Max 4.6.3: (ranges of multiple tests)
Biquad: 40 – 12
Multiplies: 12 – 10
Divides: 23 – 18
Adds: 25 – 20
Biquads: 35 – 21
Multiplies: 25 – 18
Divides: 25 – 20
Adds: 28 – 23
8-core 3ghz Mac Pro with 5gb RAM, leopard 10.5.4
I wonder how Max5 would fare under Snow Leopard compared to Max4. Or if Max5 is ever going to get multiprocessor support on its own. Anyone know how Max5 deals with multithreading? It definitely wasn’t using 21% on the biquads across all my CPUs; my total CPU usage (I have another meter) never went above about 10%, including all the other things I have running (iTunes, Adium, both versions of Max open at the same time, etc).
gee, on my 1.5Ghz G4 Powerbook, OS 10.4.11 (2GB ram)
with 4.6.3 it gives vastly different results every time i run the
test. first time it’s 100% every thing. then it calms down some but
still every run it gives different results.
with 5.0.3 it’s pinned from the moment i turn on the audio, but
running it open i can see that there is a low end to the cpu
readings. it always peaks at 100% of course, cuz of spikes while
multiplies 42 -62
great. that really tells me something.
Thanks for all the replies so far, I would only like to see some more replies from PC users so we have some more averaging.
It is clear that the tests are not perfect in stability, but if enough people join in, the average results will be fairly reliable.
If only like 5 more people join in, I will put the results in an Excel sheet and post the sheet here.
Keep up the good work, and for all the people who haven
what exactly do you expect the devs to be able to do with the results from your experiment?
it’s already been determined and addressed that max5 will run slower than max4 for some tasks on some computers. i’m sure all the devs are well aware of the performance differences between the two.
if you are trying to determine the fastest way to perform a specific task, then empirical tests like these are very useful, to your specific task on your specific hardware. but other than that, not so useful.
of course, imho.
I do not agree. When max5 was released it was stated by the devs that graphical stuff would be slower, because of all the 32-bit graphical stuff and the like. But it was also stated by the devs that the audio performance should be the same as in max4. When I said that the audio performance had gone down, the devs replied that this shouldn
Quote: Kessito wrote on Tue, 19 August 2008 20:09
Come on guys, it
Pentium M 1.8 Ghz 2 GB RAM (Dell Inspiron 9300 running XP SP2)
Biquads x / +
100 66 94 92
87 68 94 82
100 56 93 82
100 93 94 82
97 67 83 82
Biquads x / +
54 33 50 49
56 32 43 49
73 39 50 41
54 39 50 48
54 39 51 44
This is absolutely not a marginal difference!!! Nearly twice as efficient!!
max 4 -
500 biquads 75
1000 multiplies 24
5000 divisions 39
5000 adds 32
max 5 -
500 biquads 92
1000 multiplies 59
5000 divisions 38
5000 adds 37
win xp,2gb,core duo 1.6ghz,using cacky on-board sound card…
Here are some notes from a conversation I have had with Joshua which might prove illuminating. We’ll try and have a more detailed look at this later on.
One *huge* important thing which is lacking in people’s reports is their signal vector size, which *must* be the same for these things to have any meaning whatsoever.
However, some major things have changed for these simple objects:
- For Macintosh PPC users: We no longer use AltiVec optimization in this new version due to compiler differences. On PPC, Max 4 would use CodeWarrior CFM code for the AltiVec optimization. This is no longer done.
- For biquad there is a new and improved internal algorithm: "It includes smooth coef changes, and synchronous coef changes, as well as a stoke feature allowing sample memory to be artificially set, thereby accommodating things like ringing oscillators."
- In some places there is more rigorous testing and denormal detection, in order to avoid huge performance spikes when numbers get really small in a feedback situation like reverb or a delay line.
However, the important thing to realize is that since these simple objects are usually greatly overwhelmed by more expensive objects, even though the performance discrepancies of 5000 multiplication objects seem to be large, an average "real world" patch is not that much slower for DSP processing.
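The denormal point in the list above can be illustrated with a small sketch of a decaying feedback value, as in a reverb tail or delay line. It is a Python illustration under stated assumptions, not the MSP implementation: MSP in these versions works on 32-bit floats while Python uses 64-bit floats, and the helper names (`is_denormal`, `flush_denormal`, `decay`) are invented for the example.

```python
import sys

# Smallest positive *normal* 64-bit float; anything nonzero below this is denormal.
SMALLEST_NORMAL = sys.float_info.min

def is_denormal(x):
    """True if x is nonzero but too small to be stored as a normal float."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL

def flush_denormal(x):
    """Replace denormal values by exactly zero, leave everything else alone."""
    return 0.0 if is_denormal(x) else x

def decay(feedback, steps, flush):
    """Feed a value back through a gain < 1, optionally flushing denormals."""
    y = 1.0
    for _ in range(steps):
        y = y * feedback
        if flush:
            y = flush_denormal(y)
    return y

# Without the check the value lingers in the denormal range, which is where
# many CPUs become much slower; with the check it snaps to zero.
lingering = decay(0.5, 1030, flush=False)
flushed = decay(0.5, 1030, flush=True)
```

The extra comparison per sample is the kind of small, constant cost that shows up in a benchmark of thousands of trivial objects, while protecting against the much larger spikes described above.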
So maybe it is time to create a benchmarking system that tests both
raw mathematics as well as real world situations? I’ve been
thinking lately of a benchmark system for Jitter as well. It would be
great to have relevant test results for Max/MSP/Jitter regarding its
performance on various hardware and OS configurations.
> However, the important thing to realize is that since these simple
> objects are usually greatly overwhelmed by more expensive objects,
> that even while the performance discrepancies of 5000 multiplication
> objects seem to be large, an average "real world" patch is not that
> much slower for DSP processing.
Andrew Pask wrote:
> Could someone please post this patch as an attachment so I can delete
> that scrolling nightmare? My copy of FF locks up trying to select it
On the mailing list it didn’t show up at all…
I’d love to see why its faster in Max 4, I can only guess, no patch till
>One *huge* important thing which is lacking in people’s reports is their signal vector size, which *must* be the same for these things to have any meaning whatsoever.
This is exactly the reason why I put comments in the test telling people which vector size to use, so I’m sure everyone is using the same settings.
>However, some major things have changed for these simple objects:
This I find very funny: when I originally commented in a different thread that pure audio performance had gone down, I was told by the devs that this should not be possible since the audio calculations hadn’t changed.
>- For Macintosh PPC users: We no longer use AltiVec optimization in this new version due to compiler differences. On PPC, Max 4 would use CodeWarrior CFM code for the AltiVec optimization. This is no longer done.
>- For biquad there is a new and improved internal algorithm: "It includes smooth coef changes, and synchronous coef changes, as well as a stoke feature allowing sample memory to be artificially set, thereby accommodating things like ringing oscillators."
>- In some places there is more rigorous testing and denormal detection, in order to avoid huge performance spikes when numbers get really small in a feedback situation like reverb or a delay line.
I have never ever in all the years that I have used MSP experienced any denormal problems, except for externals that had not been written by Cycling, so it seems that this was already covered fine in MSP
>However, the important thing to realize is that since these simple objects are usually greatly overwhelmed by more expensive objects, even though the performance discrepancies of 5000 multiplication objects seem to be large, an average "real world" patch is not that much slower for DSP processing.
I’m sorry, but I really have to disagree with you on this. The exact reason why I made this benchmark test is that all my "real world" patches had become much slower when moving them from max4 to max5. I think it is very hard to say what exactly a "real world" patch is, since this differs greatly between users. I find myself always making patches with lots of biquads in them, for example. It seems only logical to do these kinds of tests with very basic objects, since these are the fundamentals of every patch and it would be virtually impossible to make a good comparison in any other way.
Hi to you all,
I made a graph with the results of all the benchmark tests so far.
For people who have posted multiple results I have taken the lowest values. Go have a look!
P.S. This doesn’t mean that we don’t need more results, keep ’em coming folks!
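Taking the lowest value per test from repeated runs, as described above, can be sketched like this. The numbers are the two five-run tables posted earlier in the thread for the Pentium M machine; the thread as captured does not say which table belongs to which Max version, so they are simply called set A and set B here. The function name is made up, and this is not necessarily how the graph was actually built.

```python
# Column order follows the posted tables: Biquads, x, /, +
TESTS = ("biquads", "multiplies", "divisions", "adds")

# Five runs each, copied from the Pentium M post (version labels unknown).
set_a = [
    (100, 66, 94, 92),
    (87, 68, 94, 82),
    (100, 56, 93, 82),
    (100, 93, 94, 82),
    (97, 67, 83, 82),
]
set_b = [
    (54, 33, 50, 49),
    (56, 32, 43, 49),
    (73, 39, 50, 41),
    (54, 39, 50, 48),
    (54, 39, 51, 44),
]

def lowest_per_test(runs):
    """Return the lowest CPU reading seen for each test across all runs."""
    columns = zip(*runs)  # turn rows (one per run) into columns (one per test)
    return {name: min(column) for name, column in zip(TESTS, columns)}

lowest_a = lowest_per_test(set_a)
lowest_b = lowest_per_test(set_b)

# Ratio of set A to set B per test, as a rough measure of the gap.
ratio = {name: lowest_a[name] / lowest_b[name] for name in TESTS}
```

Using the minimum rather than the mean discards runs that were disturbed by other processes, at the cost of being optimistic about typical load.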
Joshua sent me another email about this thread and the patch in it this morning. I present it to you in all its glory for your delectation.
If I change the "peak" object to a running average (e.g. slide 4 4), which makes more sense for effective benchmarking, I get much closer numbers for Max 4 and Max 5 benchmarking, with the exception of biquad. Biquad is completely explained by the new smoothing algorithm costs. The other objects’ slight differences in performance are attributable to compiler changes (gcc 4.0 vs gcc 3.3 on OS X and VS 2005 vs 2008 on PC), and/or some additional small overhead (2-3%) we have in the dsp chain for things like signal probing.
If users wish to use the older and cheaper biquad algorithm in version 5.0.4, they can most likely send the smooth 0 message to biquad with the universal object or similar to regain performance similar to the old version (this must be done before turning on DSP). However, I’ve made it so that the new smooth algorithm only incurs greater cost when necessary in version 5.0.5.
500 biquads: 43 (33 with smooth 0, or new 5.0.5 handling)
1000 multiplies: 29
5000 divisions: 50
5000 adds: 48
500 biquads: 30
1000 multiplies: 27
5000 divisions: 44
5000 adds: 48
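The difference between reading the CPU meter through a peak and through a running average, as suggested in the email above, can be sketched as follows. The smoothing formula y(n) = y(n-1) + (x(n) - y(n-1)) / slide is my understanding of how a slide-style filter behaves and should be treated as an assumption, and the meter readings are hypothetical values chosen to contain a single spike.

```python
def peak(readings):
    """Peak-style measurement: a single spike determines the result."""
    return max(readings)

def slide(readings, slide_up=4.0, slide_down=4.0, start=0.0):
    """Running average in the style of 'slide 4 4' (assumed formula).

    Each output moves a fraction 1/slide of the way toward the new input,
    using slide_up when the input rises and slide_down when it falls.
    """
    y = start
    out = []
    for x in readings:
        factor = slide_up if x > y else slide_down
        y = y + (x - y) / factor
        out.append(y)
    return out

# Hypothetical CPU meter readings with one spike to 100%.
readings = [40, 41, 100, 39, 40, 41, 40, 40]

peak_value = peak(readings)                    # pinned by the single spike
smoothed = slide(readings, start=readings[0])  # spike is damped and decays
settled_value = smoothed[-1]                   # drifts back toward the steady level
```

This also fits the reports earlier in the thread of tests showing 100% on the first run: with a peak measurement, one start-up spike is enough to pin the reading.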
My filter objects (and for that matter my additive synthesis ones) have always had coefficient interpolation but this feature is optional so as not to penalize the common case of fixed parameters.
This kind of smoothing is hard to do right with all the details of overdrive atomicity etc. I hope that the new biquad source is part of the 5.0 SDK so we can have a single model of how to do it.
Thanks for the response Joshua! It’s good to hear that the main difference is in the biquad and can be solved. Would it be possible for you to upload the modified patch that you’ve used so we can try this too?
could you post the patch again, please?
Stefan Tieje wrote:
"I have never ever in all the years that I have used MSP experienced any denormal problems, except for externals that had not been written by Cycling, so it seems that this was already covered fine in MSP".
I am not so sure about it. I have often encountered NaNs with average~ in Max 4.x and I suspect it had something to do with poor denormalization.
If you have an example in Max 5, we’ll be happy to look into it.