Zachary Seldess

Newbish question follows... sorry if it's been answered in the forums or docs...

So, I'm building a custom panning external (Max5 SDK) modeled on an abstraction I've been using lately, in hope that I'll get a performance boost. Anyways, it appears that the majority of the CPU hit occurs in passing the output signals out of the object's outlet on to the inlets of the dac~, or other MSP objects.

Rather than creating outlets for the external, grabbing inlet/outlet pointers from the t_signal array in the dsp method and passing those to the perform routine, can I somehow access the pointers to the actual sample memory Max allocates for the dac~'s inputs, and pass those to the perform routine along with the inlet pointer(s)?

This is idiosyncratic, I know, but I'm guessing, if it's possible, it will make the object much less expensive (ideally I'd like to have 1000+ of these externals instantiated in a patch, and they'll just be going to the dac~ directly anyways). I've taken a look at the dac~ source from Pd, just for some insight, but that's not triggering any lightbulbs.

Ok, so that was such an uninformed question that no one deemed it deserving of a response (I can't blame you :} ).

Anyways, I quickly realized that what I was really looking for was an implementation of a kind of send~/receive~ functionality inside my panner object. I want to have the option in the panner to sum all like-numbered channels in each panner to one or more panner.receive~ objects (i.e. summing and reading all panner outputs to shared buffers), avoiding the overhead of passing each signal out of its object's outlets, etc. I won't go into more detail here yet, until I have more to show.

I looked around for sample custom send~/receive~ implementations, to help me come up with my own internal panner version, and there seems to be a real lack of such examples. So, with much theft and help from Toshiro Yamada, I have a pretty good start: zns.send~ and zns.receive~

Source, patches, and externals (for Mac) are all there. I'm actually currently beating the performance of send~/receive~ on my machines, when instantiating lots and lots of them. These externals work across multiple cores (i.e. [poly~ @parallel 1] ), with a few exceptions that I'll ask about at the end:

Here's an overview of the multi-thread syncing scenario (using mutexes):

When a zns.send~ object gets instantiated with a symbol argument, an instance of a shared CLASS_NOBOX object zns_send_impl gets created and bound to that symbol. All later zns.send~s with that name will be bound to the same shared zns_send_impl. That shared object has a primary t_sample array which all zns.receive~s read from, with an associated t_systhread_mutex (l_mx_final), and an array of 64 t_sample arrays with an array of 64 associated t_systhread_mutexes (*l_mx). I've found that 64 gets the best performance. More than that crashes Max (I think this might have something to do with the SYSPARALLEL_MAXWORKERS #define in ext_sysparallel.h…).

The basic idea with having multiple mutexes is minimizing each thread's chances of being locked out and sleeping. It also seems to matter in what order and quantity I associate the mutexes to the zns.send~ instances (that's where the second #define comes in in zns_send_impl.c, if you're interested in looking at the code).
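For anyone trying to picture the layout described above, here is a minimal sketch of the general idea (several partial buffers, each with its own lock, folded into one final buffer). It is not the actual zns_send_impl code: it uses plain pthreads instead of Max's systhread API, and the names shared_bus, bus_send, bus_collect, bus_receive, VEC_SIZE and the "sender id modulo stripe count" assignment are all assumptions made for illustration. How and when the real object folds the partial buffers together isn't described in the post, so bus_collect is just one plausible way to do it.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define NUM_STRIPES 64   /* number of partial buffers, as in the post */
#define VEC_SIZE    64   /* hypothetical signal vector size */

typedef struct {
    float           partial[NUM_STRIPES][VEC_SIZE]; /* one accumulation buffer per stripe */
    pthread_mutex_t stripe_mx[NUM_STRIPES];          /* one lock per stripe (cf. l_mx) */
    float           final_buf[VEC_SIZE];             /* buffer the receivers read from */
    pthread_mutex_t final_mx;                        /* lock for final_buf (cf. l_mx_final) */
} shared_bus;

void bus_init(shared_bus *b) {
    memset(b->partial, 0, sizeof b->partial);
    memset(b->final_buf, 0, sizeof b->final_buf);
    for (int i = 0; i < NUM_STRIPES; i++)
        pthread_mutex_init(&b->stripe_mx[i], NULL);
    pthread_mutex_init(&b->final_mx, NULL);
}

/* Sender side: add one vector into the stripe picked from the sender id.
   Senders on different stripes never contend for the same lock. */
void bus_send(shared_bus *b, int sender_id, const float *in) {
    assert(sender_id >= 0);
    int s = sender_id % NUM_STRIPES;
    pthread_mutex_lock(&b->stripe_mx[s]);
    for (int i = 0; i < VEC_SIZE; i++)
        b->partial[s][i] += in[i];
    pthread_mutex_unlock(&b->stripe_mx[s]);
}

/* Fold every stripe into the final buffer and clear the stripes
   so the next vector starts from zero. */
void bus_collect(shared_bus *b) {
    pthread_mutex_lock(&b->final_mx);
    memset(b->final_buf, 0, sizeof b->final_buf);
    for (int s = 0; s < NUM_STRIPES; s++) {
        pthread_mutex_lock(&b->stripe_mx[s]);
        for (int i = 0; i < VEC_SIZE; i++) {
            b->final_buf[i] += b->partial[s][i];
            b->partial[s][i] = 0.0f;
        }
        pthread_mutex_unlock(&b->stripe_mx[s]);
    }
    pthread_mutex_unlock(&b->final_mx);
}

/* Receiver side: copy out the summed vector. */
void bus_receive(shared_bus *b, float *out) {
    pthread_mutex_lock(&b->final_mx);
    memcpy(out, b->final_buf, sizeof b->final_buf);
    pthread_mutex_unlock(&b->final_mx);
}
```

With more stripes than worker threads, two threads rarely land on the same lock at the same moment, which is the point of having many mutexes rather than one.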

1. Can anyone tell me how to determine inside the perform routine, if an instance (which happens to be inside a poly~) is muted or not (using the poly~ mute message…)? I'm currently checking for dsp muting, but poly~ muting doesn't seem to trigger the z_disabled flag. Right now, when you do mute any instance of zns.send~ inside a poly~, it screws up the multi-thread stuff (try it and you'll hear what I'm talking about).

2. If you copy the poly~ while dac is on, funky behavior ensues. It seems the dsp chain doesn't get reset when doing this, or something like that. Toggling the dac fixes it. This is a very obscure case… If you copy the poly~ in zns.send~.maxhelp WITH the sfplay~, everything's fine. sfplay~ being copied seems to force a dsp chain reset. Or maybe it's something else (I have one other cockeyed theory, but this is a long post). Does anyone know how to prevent this?

Any comments, pointers, answers would be really appreciated. I'm stuck regarding the above problems.

Thanks Vanille. Those links are both helpful.

Yes, I do think it would be much better to avoid the mutexes (and I'm sure, even with the mutexes the way they are, I could come up with a way to avoid memory allocation inside the perform, which I know is ugly and a bad idea in general).

But so far, I haven't found a way to make these objects synchronized across multiple threads without mutexes. The best I could come up with (that works in poly~ @parallel 1) is minimizing the chance of a thread being locked out by using 64 mutexes, and allocating them in a specific way. It does perform well, from my tests, but it is non-ideal for sure.

One way to do this is simply pre-allocate a hardcoded amount of memory (suggested in the memory allocation section of that link):

*** Pre-allocate a big chunk of memory and implement your own deterministic dynamic allocator that’s only invoked from the audio callback (and hence doesn’t need locks).
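To make that suggestion a bit more concrete, here is a rough sketch of one common way to do it: a fixed-size block pool with a free list, where both allocate and free are a couple of pointer operations and never call malloc. This is only an illustration of the quoted idea, not code from zns.send~ or from the linked article; sample_pool, pool_alloc, pool_free, POOL_BLOCKS and BLOCK_FLOATS are made-up names and sizes, and the pool is assumed to be used from a single thread (the audio callback), as the quote says.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS  256  /* hypothetical number of pre-allocated blocks */
#define BLOCK_FLOATS 64   /* hypothetical block size: one signal vector */

/* A block is either on the free list (next is valid) or in use (samples is valid). */
typedef union pool_block {
    union pool_block *next;
    float             samples[BLOCK_FLOATS];
} pool_block;

typedef struct {
    pool_block  storage[POOL_BLOCKS]; /* the big chunk, reserved up front */
    pool_block *free_list;            /* head of the list of unused blocks */
} sample_pool;

/* Thread every block onto the free list; call once, outside the audio thread. */
void pool_init(sample_pool *p) {
    p->free_list = NULL;
    for (int i = POOL_BLOCKS - 1; i >= 0; i--) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
    }
}

/* O(1): pop a block off the free list, or return NULL when the pool is exhausted. */
float *pool_alloc(sample_pool *p) {
    pool_block *b = p->free_list;
    if (!b) return NULL;
    p->free_list = b->next;
    return b->samples;
}

/* O(1): push a block back onto the free list. */
void pool_free(sample_pool *p, float *samples) {
    pool_block *b = (pool_block *)samples;
    assert(b >= p->storage && b < p->storage + POOL_BLOCKS);
    b->next = p->free_list;
    p->free_list = b;
}
```

Because every block has the same size, there is no fragmentation and the worst-case cost of an allocation is known in advance, which is what "deterministic" is getting at.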

So what I might try first is to have each send instance write to its own unique t_sample array in my shared zns_send_impl CLASS_NOBOX object, and sum when all writes are done, as I am doing now. I'd apply some hardcoded limit to the number of arrays (maybe 1024), above which I won't guarantee playing well across threads.

Another way, which I'd prefer, would be to dynamically adjust the number of shared t_sample arrays (10 if there are only 10 sends, 200 if 200 sends, 1024 if 1024 sends, etc.). Both variations remove the need for mutexes, but increase the memory requirements of the object (depending on how many sends are in use).
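A minimal sketch of the "one array per send" variant, assuming the hardcoded limit of 1024: each sender registers once and gets a private slot, so no two senders ever write the same memory and the write path needs no lock. The names slot_bus, slot_bus_register, slot_bus_write, slot_bus_sum and VEC_SIZE are hypothetical, and the sketch deliberately leaves out the part that is still an open question in this thread, namely how the summing side knows that all writes for the current vector are done.

```c
#include <assert.h>
#include <string.h>

#define MAX_SENDERS 1024  /* hardcoded limit floated in the post */
#define VEC_SIZE    64    /* hypothetical signal vector size */

typedef struct {
    float slot[MAX_SENDERS][VEC_SIZE]; /* one private buffer per sender */
    int   num_senders;                 /* slots handed out so far */
} slot_bus;

void slot_bus_init(slot_bus *b) {
    memset(b, 0, sizeof *b);
}

/* Hand out the next free slot, or -1 once the limit is reached.
   Meant to run at object creation time, not in the perform routine. */
int slot_bus_register(slot_bus *b) {
    if (b->num_senders >= MAX_SENDERS) return -1;
    return b->num_senders++;
}

/* Sender side: overwrite this sender's own slot. No lock is needed
   because no other sender touches this slot. */
void slot_bus_write(slot_bus *b, int slot, const float *in) {
    assert(slot >= 0 && slot < b->num_senders);
    memcpy(b->slot[slot], in, sizeof b->slot[slot]);
}

/* Receiver side: sum all registered slots into out. Assumes every
   sender has already written its vector for this block. */
void slot_bus_sum(const slot_bus *b, float *out) {
    memset(out, 0, VEC_SIZE * sizeof *out);
    for (int s = 0; s < b->num_senders; s++)
        for (int i = 0; i < VEC_SIZE; i++)
            out[i] += b->slot[s][i];
}
```

With these sizes the bus takes 1024 × 64 × 4 bytes = 256 KB per shared name, which is the memory cost being traded for dropping the mutexes; the dynamic variant would size the slot array to the number of registered sends instead.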

Any other thoughts on this from the community would be great. And of course, it would be great to know how to detect that your object is inside a muted instance of a poly~ (question 1 from my last post).

i think sending and summing 1000 channels of audio will always take up about the same CPU, no matter how you do it.

maybe you want to build your external so that it directly talks to a hardware driver (itself, with no dac~ needed)?

send signals directly to dac~ from within perform routine?