CPU use of += before clip 0 nn in gen~?
Hi!
I'm puzzled by gen~.curve.maxpat: the += continues to count after the clipped value is reached. Maybe in gen~ this has no consequence, and adding 0 or 1 costs the same CPU? Or maybe, once compiled, a "branch" of the calculation that is no longer used is skipped? Is it "saner" to use a counter and stop it at the end of a ramp?
Sorry for my English (I learned it mostly on the Max forums :-)). Thank you all!
In the Max domain, I learned to optimize and to carefully stop processes once the goal is reached. Maybe connecting something after the += tells it to continue?
Best to you all!
Hi Mizu,
It is true that the += will keep counting on every sample, even though the result is clipped. If you want, you can avoid this behaviour by using a codebox to do the equivalent, with an if() statement that skips the addition once the target is reached. HOWEVER, addition is incredibly cheap on a modern CPU, and in this case quite likely cheaper than or equal to the cost of an if() condition. Basically -- don't worry!
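For instance, a minimal codebox sketch of that idea (the in1 step size and the 0..1 clip range are my assumptions, not taken from gen~.curve.maxpat):

History count(0);

if (count < 1) {              // skip the addition once the top is reached
    count = count + in1;      // accumulate, like the += operator
}
out1 = clip(count, 0, 1);

As noted above, on a modern CPU this if() may well cost as much as the addition it skips.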
General guide for optimizing:
Optimize last. Make it work, make it work right, and only then worry about performance.
Use profiling metrics: see the "performance" tab of the gen~ help patcher for getting CPU performance measurements. The simplest thing is to delete operators in the patch (or replace them with pass objects, which do nothing) to see if it makes any noticeable difference to the CPU cost. Only optimize with informed knowledge about the patch!
Pareto rule: usually less than 20% of your code/patch is responsible for more than 80% of the CPU cost, so find out what that 20% is and focus only on optimizing that.
Once it is time to optimize, and you know what parts need attention, here are some things you can consider:
In a gen patch, everything is running all of the time; without using codebox you can't "turn things off" (and even using a codebox if() to do this might actually make performance worse, see below). So one of the best things you can do is work out how to share as much code as you can between algorithms.
E.g. let's say you want an LFO that can switch between different waveform options -- saw, triangle, sine, pulse, etc. Instead of making 4 different LFO subpatches, one for each waveform, and using a selector to choose between them, it's much better to make a single LFO patch driven by a single phasor, feeding a triangle, a cycle @index phase, and a > operator. Each of these gives you a different waveshape that you can select between. Not only is this cheaper than 4 different subpatches, it also opens up more shape variations by blending and morphing between them. The GO book chapter 2 shows a couple of variations on this theme (and see the codebox sketch below).
Or likewise, going the other way: if several different sub-parts tend to have the same kinds of processing near the output, can you find a way to merge those parts? Or can you re-arrange the order of operations to make that possible? E.g. if you end up smoothing lots of parameters, is it possible instead to apply the algorithm with the unsmoothed parameters and then put the smoothing on the result? It's not always possible, because the behaviour may sound different, but in a general linear case it may work.
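To make the LFO idea concrete, here is a rough codebox sketch of one ramp feeding several waveshapes (the shape param, its 0..3 mapping, and the unipolar 0..1 ranges are assumptions for illustration):

Param shape(0);                    // 0 = saw, 1 = triangle, 2 = sine, 3 = pulse

ph = phasor(in1);                  // one shared 0..1 ramp drives everything
saw = ph;                          // the ramp itself is already a saw
tri = triangle(ph);                // triangle shaped from the same ramp
sine = cycle(ph, index="phase");   // sine looked up by phase
pls = ph > 0.5;                    // a comparison gives the pulse

out1 = (shape == 0)*saw + (shape == 1)*tri
     + (shape == 2)*sine + (shape == 3)*pls;

Replacing the equality tests with crossfade weights would give the blending and morphing mentioned above.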
In general, the performance of a patch and the equivalent codebox is exactly the same (codebox is not more efficient). The exceptions are the things that you can only do in codebox -- branching structures like if() and for(), and functions (which are similar to, but not the same as, subpatchers).
But watch out -- CPUs can be surprising, and often a simple branchless version of an algorithm can actually be cheaper than trying to turn things off with if() etc. That is, using if() might actually make CPU performance worse. The only way to know is to measure.
Similarly, using codebox functions vs. using subpatchers might make CPU performance worse -- you have to measure the performance, because it is not predictable.
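As a toy illustration (my example, not from the thread), here are branching and branchless codebox versions of a half-wave rectifier; only measuring tells you which wins on a given CPU:

// with a branch: skip the work when the input is not positive
if (in1 > 0) {
    y = in1;
} else {
    y = 0;
}
out1 = y;

// branchless: the comparison result (0 or 1) does the gating
out2 = (in1 > 0) * in1;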
Param outlets update at "control rate", once per signal vector, rather than every sample. Operations downstream of a param may also compute once per signal vector, unless they also have an audio-rate input (downstream of an in or history etc.) or are inherently audio-rate operations (e.g. using an accum or delta etc.). For a typical vector size of 64, that's quite a big saving. So, for example, if you are re-scaling a param or sending it through a lookup function, it might be worth doing that before applying audio-rate smoothing.
In general, operations like + and * and logic tests are so cheap it's not worth worrying about them. In contrast, operations like division, trigonometry and power functions are significantly more expensive. Also note that conversion operations like scale, atodb, mtof, mstosamps etc. often have an expensive division and/or power operation inside.
Some of these expensive ops have much cheaper alternatives, like fastpow and fastsin -- but these are not as mathematically accurate. You'll need to test whether using them makes your patch sound worse or behave badly. Sometimes you can replace them with lookup tables (see below).
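For example, a quick codebox comparison (a sketch; whether fastpow's reduced accuracy is acceptable depends on the patch):

out1 = pow(in1, 3);       // exact power function: relatively expensive
out2 = fastpow(in1, 3);   // cheaper approximation, less accurate
out3 = in1 * in1 * in1;   // for small integer powers, plain multiplies
                          // are usually the cheapest option of all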
Again -- only bother doing this if these really are the problem part of the patch!
Sometimes you can save CPU by "baking" math into a lookup table. The cycle operator is an example of this -- instead of calling the expensive sin, it looks up the sinusoid value from a pre-computed block of memory. You can do the same for any function of a single bounded parameter -- either reading it from a buffer created in Max, or from a data that you fill when the patch loads. This is often useful for wavetables, window/envelope shaping functions, quantization maps, etc. However, if the math is simple, it is probably cheaper to just do the math than to read from a data.
How big does a table need to be? The larger the table, the more accurate the result, but also the more expensive to run. Make tables as small as you can without causing any audible distortion. Power-of-two sizes are usually better for the CPU. Using linear interpolation when reading the table is usually the best compromise between table size and signal distortion. The wavetable used by cycle is 16384 samples long, which is probably far bigger than it needs to be in many situations (but it had to be that big for some very sensitive patches). Sometimes you can get away with just 1024 samples.
A typical pattern is to use a codebox testing for if (elapsed == 0) and, if true, to run a for() loop to fill the entire data. However, this may cause quite a large CPU spike, as it computes all of this work in a single sample. That can especially be an issue when exporting code for an embedded device. An alternative method is to spread the work one sample at a time until the data is filled, e.g. using if (elapsed < dim(mydata)) and writing one sample at the position of elapsed in the data (see the sketch below).
One minor tip: reading multiple values from a multi-channel buffer/data by using the @channels attribute is almost the same cost as reading a single value -- so you might be able to use this to reduce the number of table lookups in a patch.
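Putting the spread-out fill pattern together, a hedged codebox sketch (the table name, its 1024-sample size, and the quarter-sine contents are all my assumptions):

Data mydata(1024);

if (elapsed < dim(mydata)) {
    // write one table entry per audio sample instead of all at once,
    // avoiding a big CPU spike on the first sample
    x = elapsed / dim(mydata);              // normalized position, 0..1
    poke(mydata, sin(halfpi * x), elapsed); // e.g. a quarter-sine shape
}

out1 = sample(mydata, in1);                 // interpolated read; in1 assumed 0..1

Note that the table is only partially filled during its first 1024 samples, which is usually acceptable right after the patch loads.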
Graham