Forums > Dev

Max-specific C code optimization guidelines ?

July 26, 2012 | 10:11 am

Hello,

There are many Web pages providing guidelines to optimize C code, like this one for instance:

http://www.eventhelix.com/realtimemantra/basics/optimizingcandcppcode.htm

However, are there Max-specific guidelines ?


July 26, 2012 | 12:09 pm

Thanks for the links Nicolas.

However what is the difference between profiling and optimizing ?


July 26, 2012 | 12:26 pm

Hi.

the first two things I can think of are:

- use gensym() as little as possible (store whatever symbol you use often into some global variables, and always use commonsyms.c)
- critical regions are reentrant mutexes, so they’re much more expensive than regular mutexes; the global critical region is the worst of all, because it gets locked by a lot of stuff inside max. btw, in some cases (when the protected region is small) locking your thread with an atomic counter instead than a mutex can be noticeably more efficient.

and for the rest…
http://en.wikipedia.org/wiki/Profiling_(computer_programming)

http://en.wikipedia.org/wiki/Program_optimization

cheers
aa


July 26, 2012 | 1:40 pm

Thanks Andrea.

Is it correct to write that mutexes and critical regions matter only when one explicitely uses multiple threads ?


July 26, 2012 | 2:40 pm

I would reitarate Nicolas advice, don’t optimize first. The assumptions are usually wrong, the compiler does a lot of work to optimize the code for you. Then if you have performance issues, do some profiling. Making the code easier to read is far better than 0.001 % of CPU cycle saved ;-)


July 26, 2012 | 4:11 pm

I understand the dangers of a bad optimization. However, I think my current project really needs it (if possible) because my object interpolates between lists (four or eight) and it takes 8% or 9% CPU. Maybe it’s not possible to improve those figures however.


July 26, 2012 | 6:31 pm

How do you get the 8-9% figure?
Is your object a DSP object and you’re looking at the Status window?

In my experience the micro-optimization suggested in the article:
http://www.eventhelix.com/realtimemantra/basics/optimizingcandcppcode.htm
are pretty much useless nowadays.

I would revise data structures and algorithms instead.
If you post some code we might be able to give you more practical suggestions.

BTW, did you make sure you compiled your objects with the appropriate optimization level flags set in Xcode?
That can make a big difference for DSP objects…

- Luigi


July 26, 2012 | 7:53 pm

Hi,

8-9% doesn’t tell you anything as per se. It can be good, it can be too much. I developed an additive synthesizer a couple of years ago where the most CPU consuming part was the interpolation between lists and it consumed normally 50-60% of my CPU, which was a quite good result compared with the amount of lists to be managed (a couple of hundreds/thousands). If you’re doing _very_ sophisticated stuff, don’t expect to get a CPU usage of 0.0001%…

With that said, I can just second Nicolas’s advice: first, just do the object and see that it works as expected. Then try to find the ‘bottleneck’ of your process. Maybe you can do some of the interpolation off-line, before going to the MSP? Or maybe you could simplify your lists? That really depends on the final purpose of your code; maybe if you could give us more insight to your project, you could get some specific help.

Another small trick with mutexes, what I also do quite often: if you _really_ need to access shared data in your DSP routine, the best way to do this is to start with a mutexed block and duplicate all relevant information into local variables (memcpy might be your friend depending on your actual data structures). IMHO calling a mutex once per each DSP cycle shouldn’t be that CPU-consuming. But of course, if you have the choice, you should _always_ go with atomic operations instead of mutexed blocks. Sometimes that needs some re-designing of your code, but it might be worth…

But again: don’t start optimizing your code until you can’t locate the actual bottlenecks. You might end up spending weeks in figuring out the smart way of mutexing while the actual CPU-cost is produced somewhere completely else…

Best,
Ádám


July 26, 2012 | 9:12 pm

Have to be careful when examining profiling stats. Real-tome drivers tend to cause peaks when the interrupt kicks in, so profilers which only provide average utilization can be misleading.


July 26, 2012 | 10:06 pm

@nicolas:

1. I have tried Shark yet but the results are not very meaningful for what I understand. Anyway I think that almost only two function are used because I am only getting lists and combining them.
2. I have put most variables and arrays in the object structure yet, I figured out it would spare some operations. However I don’t think I am using cache pointers (excepted if I am doing without knowing as a kind of M. Jourdain ). Please check that in the code below.

/*
	@file
	rb.interpol2d.c

	TODO :
	- make right inlets compatible with single ints and floats
	- add argument to set listlength with optional zeropadding
	- add argument to change maximum list size

	© Roald Baudoux 2012
*/

#include "ext.h"				// you must include this - it contains the external object's link to available Max functions
#include "ext_obex.h"			// this is required for all objects using the newer style for writing objects.
#include "ext_systhread.h"

#define MAXSIZE 256

typedef struct _rb_interpol2d {				// defines our object's internal variables for each instance in a patch
	t_object	ob;							// object header - ALL objects MUST begin with this...
    t_atom		l;							// stored value from left inlet
    t_atom		r;							// stored value from right inlet
	t_atom		pos2d[2];					// 2d position
	t_atom		input_list1[MAXSIZE];	    // first list to be interpolated (second inlet)
	long		list1_length;
	t_atom		input_list2[MAXSIZE];		// second list to be interpolated (third inlet)
	long		list2_length;
	t_atom		input_list3[MAXSIZE];		// third list to be interpolated (fourth inlet)
	long		list3_length;
	t_atom		input_list4[MAXSIZE];		// fourth list to be interpolated (fifth inlet)
	long		list4_length;
	t_atom		output_list[MAXSIZE];		// list resulting from interpolation
	void		*outlet;					// outlet creation - inlets are automatic, but objects must "own" their own outlets
    void		*proxy;						// proxy inlet
    long		proxy_inletnum;				// # of inlet currently in use
	long		min_length;					// length of list to be output
	t_atom		temp_atom[MAXSIZE];			// temporary container for list handling
	int			inited;						// flag for the reception of the four lists
} t_rb_interpol2d;

// these are prototypes for the methods that are defined below
void *rb_interpol2d_new(long n);
void rb_interpol2d_free(t_rb_interpol2d *x);
void rb_interpol2d_assist(t_rb_interpol2d *x, void *b, long m, long a, char *s);
void rb_interpol2d_bang(t_rb_interpol2d *x);
void rb_interpol2d_int(t_rb_interpol2d *x, long n);
void rb_interpol2d_float(t_rb_interpol2d *x, double f);
void rb_interpol2d_list(t_rb_interpol2d *x, t_symbol *s, long ac, t_atom *av);
void all_lists_inited_test(t_rb_interpol2d *x);

t_class *rb_interpol2d_class;		// global pointer to the object class - so max can reference the object 

//--------------------------------------------------------------------------

int main(void)
{
	t_class *c;

	c = class_new("rb.interpol2d", (method)rb_interpol2d_new, (method)rb_interpol2d_free, sizeof(t_rb_interpol2d), 0L, A_GIMME, 0);

    class_addmethod(c, (method)rb_interpol2d_bang,	"bang",		0);									// the method it uses when it gets a bang in the left inlet
    class_addmethod(c, (method)rb_interpol2d_int,	"int",		A_LONG, 0);							// the method for ints in any inlet
    class_addmethod(c, (method)rb_interpol2d_float,	"float",	A_FLOAT, 0);						// the method for floats in any inlet
	class_addmethod(c, (method)rb_interpol2d_list,	"list",		A_GIMME, 0);				// the method for lists in any inlet
    class_addmethod(c, (method)rb_interpol2d_assist,	"assist",	A_CANT, 0);							// (optional) assistance method needs to be declared like this

	class_register(CLASS_BOX, c);
	rb_interpol2d_class = c;
	finder_addclass("Roald Baudoux","rb.interpol2d");

	post("rb.interpol2d v26/07/2012 - © Roald Baudoux",0);	// post any important info to the max window when our class is loaded
	return 0;
}

//--------------------------------------------------------------------------

void *rb_interpol2d_new(long n)		// n = int argument typed into object box (A_DEFLONG) -- defaults to 0 if no args are typed
{
	t_rb_interpol2d *x;				// local variable (pointer to a t_rb_interpol2d data structure)
	int i ;

	x = (t_rb_interpol2d *)object_alloc(rb_interpol2d_class); // create a new instance of this object
	if(x){
		x->proxy = proxy_new(x, 4, &x->proxy_inletnum);	// fully-flexible inlet for any type
        x->proxy = proxy_new(x, 3, &x->proxy_inletnum);	// fully-flexible inlet for any type
		x->proxy = proxy_new(x, 2, &x->proxy_inletnum);	// fully-flexible inlet for any type
		x->proxy = proxy_new(x, 1, &x->proxy_inletnum);	// fully-flexible inlet for any type
        //x->outlet = outlet_new(x, NULL);				// fully-flexible outlet for any type
		x->outlet = listout((t_object *)x);				// specialized output for lists only

	    // initialize L and R inlet atoms to (int)0
        atom_setlong(&x->l, 0);
        atom_setlong(&x->r, 0);

		// initialize input_lists and output list to (double) 0
		for (i = 0; i < MAXSIZE ; i++) {
			x->input_list1[i].a_type = A_FLOAT;
			x->input_list2[i].a_type = A_FLOAT;
			x->input_list3[i].a_type = A_FLOAT;
			x->input_list4[i].a_type = A_FLOAT;
			x->output_list[i].a_type = A_FLOAT;
			atom_setfloat(&x->input_list1[i], 0.);
			atom_setfloat(&x->input_list2[i], 0.);
			atom_setfloat(&x->input_list3[i], 0.);
			atom_setfloat(&x->input_list4[i], 0.);
			atom_setfloat(&x->output_list[i], 0.);
		}
		// initialize input_lists to (double) 0
		for (i = 0; i < 2 ; i++) {
			x->pos2d[i].a_type = A_FLOAT;
			atom_setfloat(&x->pos2d[i], 0.5);
		}

		// resets list sizes until lists arrive
		x->list1_length = 0;
		x->list2_length = 0;
		x->list3_length = 0;
		x->list4_length = 0;

		//we haven't received a list in each of the four rightmost inlets yet
		x->inited = 0;
	}

	return(x);					// return a reference to the object instance
}

void rb_interpol2d_free(t_rb_interpol2d *x)
{
    object_free(x->proxy);		// frees all resources associated with the proxy
}

//--------------------------------------------------------------------------

void rb_interpol2d_assist(t_rb_interpol2d *x, void *b, long m, long a, char *s) // 4 final arguments are always the same for the assistance method
{
	if (m == ASSIST_INLET) {
		switch (a) {
			case 0:
				sprintf(s,"Inlet %ld: 2d position (float, 0. to 1.)", a);
				break;
			case 1:
				sprintf(s,"Inlet %ld: First list to be interpolated", a);
				break;
			case 2:
				sprintf(s,"Inlet %ld: Second list to be interpolated", a);
				break;
			case 3:
				sprintf(s,"Inlet %ld: Third list to be interpolated", a);
				break;
			case 4:
				sprintf(s,"Inlet %ld: Fourth list to be interpolated", a);
				break;
		}

	} else
		sprintf(s,"Interpolation of the four input lists");
}

//--------------------------------------------------------------------------

void rb_interpol2d_list(t_rb_interpol2d *x, t_symbol *s, long ac, t_atom *av)
{
	long inlet = proxy_getinlet((t_object *)x); // what inlet did this message come in through?
	int i;
	double partie1;
	double partie2;
	double partie3;
	double partie4;

	// converts incoming values to float if necessary
	for (i = 0; i < ac; i++) {
		switch(av[i].a_type){
			case A_FLOAT:
				atom_setfloat(&x->temp_atom[i], (av[i].a_w.w_float));
				break;
			case A_LONG:
				atom_setfloat(&x->temp_atom[i], (double)av[i].a_w.w_long);
				break;
			case A_SYM:
				post("Numbers only!");
				break;
		}
	}

	switch (inlet) {
		case 0: // first inlet (position for interpolation)
			if (ac == 2) {
				for(i=0;ipos2d[i] = x->temp_atom[i];
					if (x->pos2d[i].a_w.w_float > 1.) {
						atom_setfloat(x->pos2d+i, 1.0);
					} else if (x->pos2d[i].a_w.w_float < 0.) {
						atom_setfloat(x->pos2d+i, 0.0);
					}
				}
			} else post ("List in first inlet should have 2 items rather than %d.n", ac);
			break;
		case 1:
			x->list1_length = ac;
            all_lists_inited_test(x);
			for(i=0;iinput_list1[i].a_w.w_float = x->temp_atom[i].a_w.w_float;
			}

			break;
		case 2:
			x->list2_length = ac;
            all_lists_inited_test(x);
			for(i=0;iinput_list2[i].a_w.w_float = x->temp_atom[i].a_w.w_float;
			}

			break;
		case 3:
			x->list3_length = ac;
			all_lists_inited_test(x);
			for(i=0;iinput_list3[i].a_w.w_float = x->temp_atom[i].a_w.w_float;
			}

			break;
		case 4:
			x->list4_length = ac;
			all_lists_inited_test(x);
			for(i=0;iinput_list4[i].a_w.w_float = x->temp_atom[i].a_w.w_float;
			}

			break;
	}

	if (x->inited == 1) {

		// calculate results
		for(i=0;imin_length;++i)
		{
			partie1 = (1. - (x->pos2d[0].a_w.w_float)) * (x->input_list1[i].a_w.w_float);
		    partie2 = (x->pos2d[0].a_w.w_float) * (x->input_list2[i].a_w.w_float);
			partie1 = partie1 + partie2;
			partie1 = (1. - (x->pos2d[1].a_w.w_float)) * partie1;
			partie3 = (1. - (x->pos2d[0].a_w.w_float)) * (x->input_list3[i].a_w.w_float);
			partie4 = (x->pos2d[0].a_w.w_float) * (x->input_list4[i].a_w.w_float);
			partie3 = partie3+partie4;
		    partie3 = (x->pos2d[1].a_w.w_float) * partie3;
			partie1 = partie1+partie3;
			atom_setfloat(x->output_list+i, partie1);
		}
		outlet_list(x->outlet, s, x->min_length, x->output_list);

	} else post ("A list should be sent in each of the four rightmost inlets first.");

}

//--------------------------------------------------------------------------

	void rb_interpol2d_bang(t_rb_interpol2d *x)
	{
		outlet_list(x->outlet, 0 , x->min_length, x->output_list);
	}

//--------------------------------------------------------------------------

	void rb_interpol2d_int(t_rb_interpol2d *x, long n)
	{
		long inlet = proxy_getinlet((t_object *)x); // what inlet did this message come in through?

    if (inlet != 0) { // INLETS 2 TO 5
		// transmit to list
    }
	else post("Inlet 1 demands a list of two floats");

}

//--------------------------------------------------------------------------

void rb_interpol2d_float(t_rb_interpol2d *x, double f)
{
	long inlet = proxy_getinlet((t_object *)x); // what inlet did this message come in through?

    if (inlet != 0) { // INLETS 2 TO 5
        // transmit to list
    }
	else post("Inlet 1 demands a list of two floats");
}

//--------------------------------------------------------------------------

void all_lists_inited_test(t_rb_interpol2d *x) {
	if (x->list1_length != 0 && x->list2_length != 0 && x->list3_length != 0 && x->list4_length)
	{
		x->inited = 1;
		x->min_length = x->list1_length;
		if (x->list2_length < x->min_length) x->min_length = x->list2_length;
		if (x->list3_length < x->min_length) x->min_length = x->list3_length;
		if (x->list4_length < x->min_length) x->min_length = x->list4_length;
	}
}

July 26, 2012 | 10:17 pm

@Luigi : 8%-9% is the difference of CPU percentage I got when sending many messages to it with a pictslider instead of letting it idle in my patcher. I used the activity monitor.

I realize it might be "normal" for the object to use this percentage but as I want to use five to ten of these objects I’d be glad if they could be less hungry.

By the way here is a patch for testing purposes :

– Pasted Max Patch, click to expand. –

July 27, 2012 | 12:07 pm

Yes, that’s why I asked how you got the 8-9% figure…
Your method of measuring CPU usage is not viable.
When you start moving your pictslider and send messages to your object there’s so much more that goes on in the patcher.
All of that is reflected in your measurement.

Looking through your code there’s nothing very wrong that I can see… except the proxy creation.
I would fix that and move on. Sure, some things might be done a little better like Nicolas suggested but overall it’s fine.

About the proxy inlets:
if you want 4 proxy inlets you need to declare them in your data structure as follows:

void *proxy1;
void *proxy2;
void *proxy3;
void *proxy4;

and then you create them like this:

x->proxy4 = proxy_new(x, 4, &x->proxy_inletnum);
x->proxy3 = proxy_new(x, 3, &x->proxy_inletnum);
x->proxy2 = proxy_new(x, 2, &x->proxy_inletnum);
x->proxy1 = proxy_new(x, 1, &x->proxy_inletnum);  // notice the reversed order...

in your free method of course you will have:

object_free(x->proxy1);
object_free(x->proxy2);
object_free(x->proxy3);
object_free(x->proxy4); // order doesn't matter now

The bottom line:
probably your external works a lot faster than you think.
I wouldn’t spend any more time optimizing such a simple external.
What you could do instead is to improve the overall design by making it more general for example.
As a suggestion, you could make it work with any number of lists of any length.

Hope this helps

- Luigi


July 27, 2012 | 11:38 pm

Ok for referemnce, on windows I use sleepy for profiling. With default audio settings, it shows almost half cpu cycles in sched_setqueuethrottle, with the following breakdown within that thread for a test patch

sched_set_takeover 14.6%
systhread_mutex_newlock 16.6%
memset 10.6%
TlsGetValue 8.4386%
main 8.3%
TlsGetValue 4.2%
TlsSetValue 4..2%
RtlEnterCriticalSection 4.2%
qelem_set 4.1%
GetCurrentThreadId 2.1%
getbytes 2.1%
drawstr 8.4%

Is sched_set_takeover an interrupt handler, and systhread_mutex_newlock is blocking?

Do shorter symbol names reduce time in drawstr significantly?


July 28, 2012 | 7:40 am

I don’t know what does what or exactly how, I’m trying to underrstand, and my Mac is on loan. Thanks for the picture

drawstr is a string routine, so I thought maybe if what some call ‘symbols’-the names of variables, objects etc–affect Max performance. What I do know is from a designer of javascript parsers. In them, the more different object names there were, the longer it would take for the parser to traverse the object list, and it would start a new object list traversal every time any insteance of any object was invoked. The designer of the parser said the most effective way to improve performance was sometimes to reduce the number of object names. After that, the search across names AA,AB would be significnatly faster than objects named AAB, AAC, as examples, because the number of chars to examine is less.

That’s partly because javascript is interpreted in an environment that shares global objects which can change at any time. I am wondering if that is so true for Max. She demonstrated 10x faster code in some cases simply by object and name reorganization. I don’t understand enough of the details of Max yet to know if it would affect Max performance or data interchange with a C++ DLL.


July 28, 2012 | 4:19 pm

Thanks Nicolas and Luigi.

I have questions, especially concerning answers from Nicolas:
1. "it is not a good idea to break encapsulation rules with code like that :
x->input_list2[i].a_w.w_float ; "
I don’t understand what you mean at all. Which rule do I break ?
2. Proxies : In the "Inlets and outlets" part of Max 6 API Documentation I read: "proxies are generally a better way to create an inlet that can respond to any message". That’s why I decided to use proxies. Is it possible to have inputs compatible with both ints, floats and lists without them ?
3. "you can see that most part is used by [multislider] connected to the outlet." : do you mean that if I’d connect the external to something else than a multislider the values could be different ? Or do you mean something else ?


July 28, 2012 | 5:16 pm

1. He means that the standard way to assign/retrieve values to/from atoms is by using functions such as:

atom_setfloat(x->input_list2 + i, f);
f = atom_getfloat(x->input_list2 + i);

which hide the implementation details to you by making the code easier to read and generally clearer. Also if C74 one day decided to change something about how values get assigned or retrieved from atoms, your code would not need to be changed. All the details are encapsulated in those functions. Under the hood they do pretty much what can be done more explicitly with something like:

x->input_list2[i].a_type = A_FLOAT;
x->input_list2[i].a_w.w_float = f;

Good advice, I would say, but not critical.

2. No. You are exactly right. Proxies are the way to go in your situation.

3. I think Nicolas means that multislider is consuming most of the CPU cycles…

Cheers.

- Luigi


July 28, 2012 | 6:57 pm

Thanks, I am beginning to understand )


July 30, 2012 | 7:52 am

Hi Nicolas.

Well, first of all in principle you should use ATOMIC_INCREMENT_BARRIER, which will ensure the correct order of the memory transactions.

Anyway, what you should do for a lock (not a trylock) is

while (ATOMIC_INCREMENT_BARRIER(&mycounter) != 1)
 ATOMIC_DECREMENT_BARRIER(&mycounter);
/*
 protected region
*/
ATOMIC_DECREMENT_BARRIER(&mycounter);

This is safe, as the increment is performed atomically, and the result you get is not shared among threads. So everything is ok – am I wrong?

aa


Viewing 18 posts - 1 through 18 (of 18 total)