Question about ext_sysparallel.h
Hello,
I'm writing an MSP external which has a fair amount of computation that can be parallelized. I was looking at the SDK and found a file called ext_sysparallel.h with t_sysparallel_task and t_sysparallel_worker structs. Looking at the methods, this looks like what I want to do, which is to distribute processing over multiple cores. However, I can't seem to find any documentation or examples of how to use these.
I would appreciate if someone could provide me with more information or examples.
Thank you!
Toshiro
Hello Toshiro,
We probably won't be completely documenting this anytime soon, but here is a simplified example of creating, executing, and freeing a t_sysparallel_task. This is reduced from poly~, which alternates patcher voices between the threads.
If you can describe what you want to accomplish, or have some attempt at using this code with specific questions, I would be happy to try and answer them.
-Joshua
typedef struct _myobj_parallel
{
    t_myobj *myobj;
    long count;
} t_myobj_parallel;

void myobj_workerproc(t_sysparallel_worker *w)
{
    t_myobj *x = ((t_myobj_parallel *)(w->data))->myobj;
    long count = ((t_myobj_parallel *)(w->data))->count;
    long i, threadcount;

    threadcount = w->task->workercount;

    // alternate every threadcount voices between threads
    for (i = w->id; i < count; i += threadcount) {
        if (x->p_patchers[i].r_mute)
            continue;
        if (x->p_patchers[i].r_chain) {
            // compute dsp chain
        }
    }
}

void myobj_run(t_myobj *x, long count)
{
    t_max_err err;
    t_myobj_parallel p;

    // setup our workerproc data pointer
    p.myobj = x;
    p.count = count;

    if (!x->p_paralleltask)
        x->p_paralleltask = sysparallel_task_new(&p, (method)myobj_workerproc, x->p_threadcount);

    // set task priority to audio
    x->p_paralleltask->priority = SYSPARALLEL_PRIORITY_HIGH;
    sysparallel_task_data(x->p_paralleltask, &p);

    // execute task. if there is an error it means we're trying to run parallel in nested instances
    if ((err = sysparallel_task_execute(x->p_paralleltask))) {
        if (!x->p_parallelerror) {
            object_error((t_object *)x, "use of nested, parallel enabled objects is not supported. disabling for inner myobj~ object(s)");
            x->p_parallelerror = TRUE;
        }
        // do your fallback here
    }
}

void myobj_free(t_myobj *x)
{
    dsp_free((t_pxobject *)x);
    if (x->p_paralleltask)
        sysparallel_task_free(x->p_paralleltask);
}
Many thanks, Joshua!
The example looks pretty straightforward. I'll try to implement this in my code and see how it goes.
Toshiro
Hi!
sorry guys, but I'm not as advanced as that, I have some more questions...
let's presume that I have an object which does additive synthesis, and I want to make it multithreaded. what I suppose I should do is have each thread calculate one half of the oscillators, then sum together the results from the threads and send the sum out of my signal outlet.
so, in random order:
at which moments should I create and execute the parallel task - dsp method, perform method, ...? am I wrong for guessing that the parallel task should be executed once per vector, from within the perform method?
should the perform method wait for each parallel task to finish, then add the partial results together and send the sum out of the outlet? or, maybe, is it more clever to have the perform method itself doing a part of the computation, and at the end of it wait for the other threads?
if so, how do I know if one thread has finished its job? should it set/increment a variable somewhere, to be checked by the perform method? or is there a more direct way to do it?
... or am I totally misunderstanding the whole thing???
thank you!
aa
jkc... please... some hints...
* what I suppose I should do is have each thread calculate one half of the oscillators, then sum together the results from the threads and send the sum out of my signal outlet.
Yes. However keep in mind if there are more cores available, there will be more than two threads, so it would be 1/Nth rather than half.
* at which moments should I create and execute the parallel task - dsp method, perform method, ...?
We create the task the first time the perform method is called. In the above example, myobj_run() would be called from myobj_perform().
* am I wrong for guessing that the parallel task should be executed once per vector, from within the perform method?
You are correct. sysparallel_task_execute should be called once from your perform method.
*should the perform method wait for each parallel task to finish, then add the partial results together and send the sum out of the outlet?
Yes. However, you don't need to do anything special w/r/t synchronization. sysparallel_task_execute will run one portion in the main audio thread and additional portions in other threads, waiting for the other threads to complete before returning. No need to do any of the additional management you mention in your message. It's all done for you.
I would say start implementing your solution, and let us know any issues you encounter.
-Joshua
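For reference, the 1/Nth split described above can be sketched in plain C without the Max SDK. Everything here is hypothetical and only mirrors the `w->id` / `workercount` stride loop from the earlier example: `split_worker` is a stand-in for the worker struct, `SPLIT_NUM_OSCS` is an arbitrary oscillator count, and the workers are simply called one after another to show which indices each one gets (in Max the parallel task would run them on separate threads).

```c
#include <assert.h>

#define SPLIT_NUM_OSCS 10   /* hypothetical number of oscillators */

/* Hypothetical stand-in for the worker fields used in the example above. */
typedef struct {
    long id;            /* index of this worker, 0 .. workercount - 1 */
    long workercount;   /* total number of workers in the task */
    int *hits;          /* shared array: how often each oscillator was processed */
} split_worker;

/* Each worker handles oscillators id, id + workercount, id + 2 * workercount, ... */
static void split_workerproc(split_worker *w)
{
    assert(w->workercount > 0);
    for (long i = w->id; i < SPLIT_NUM_OSCS; i += w->workercount)
        w->hits[i]++;   /* stands in for "compute oscillator i" */
}

/* Calls every worker in turn; returns 1 if each oscillator was visited exactly once. */
static int split_covers_all(long workercount)
{
    int hits[SPLIT_NUM_OSCS] = {0};

    for (long id = 0; id < workercount; id++) {
        split_worker w = { id, workercount, hits };
        split_workerproc(&w);
    }
    for (long i = 0; i < SPLIT_NUM_OSCS; i++) {
        if (hits[i] != 1)
            return 0;
    }
    return 1;
}
```

Calling `split_covers_all(3)` checks that three workers together touch every oscillator once and none twice. The same holds when there are more workers than oscillators; the extra workers just have nothing to do.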
thank you! I'll check it out and let you know...
aa
... it works! thank you very much for the example and explanations!
now I have another couple of questions:
- would it be a very bad idea to put a mutex in the DSP chain?
- do I have a way to know which worker is running in the main audio thread? maybe the one with w->id == 0?
aa
If you do use a mutex, make sure you limit the locking to as small a region as possible, both in your audio perform routine and in any other functions that access the lock elsewhere.
For buffer access we use an atomic increment as a cheaper, but less robust, locking mechanism. Perhaps if you presented some sample of your code we could offer some suggestions of ways to limit your locking time, or perhaps even the need for it at all.
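As a rough illustration of the "atomic increment instead of a mutex" idea, here is a minimal sketch using C11 `<stdatomic.h>`. It is not the Max buffer code: the `inuse_buffer` type, its `inuse` counter, and the function names are all made up for this example. The reader bumps a counter while it is inside the buffer, and the writer skips its update if the counter is non-zero.

```c
#include <assert.h>
#include <stdatomic.h>

#define INUSE_LEN 4   /* hypothetical buffer length */

/* Hypothetical buffer guarded by an in-use counter rather than a mutex. */
typedef struct {
    atomic_long inuse;          /* number of readers currently inside the buffer */
    float samples[INUSE_LEN];
} inuse_buffer;

static void inuse_init(inuse_buffer *b)
{
    atomic_init(&b->inuse, 0);
    for (int i = 0; i < INUSE_LEN; i++)
        b->samples[i] = 0.0f;
}

/* Reader side (e.g. the perform routine): mark in use, read, release. */
static float inuse_read(inuse_buffer *b, int index)
{
    assert(index >= 0 && index < INUSE_LEN);
    atomic_fetch_add(&b->inuse, 1);
    float v = b->samples[index];
    atomic_fetch_sub(&b->inuse, 1);
    return v;
}

/* Writer side: only modify when no reader is active.
   Returns 1 if the write happened, 0 if it was skipped.
   Not fully robust: a reader could still enter between the check and the write. */
static int inuse_try_write(inuse_buffer *b, int index, float v)
{
    assert(index >= 0 && index < INUSE_LEN);
    if (atomic_load(&b->inuse) != 0)
        return 0;
    b->samples[index] = v;
    return 1;
}
```

The appeal is that the audio thread never blocks; the cost is the small window noted in the comment, which is why this is cheaper but weaker than a real lock.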
Well, in fact I probably don't need locking... I was blaming threading for a much more stupid problem.
Any clue about the other point - knowing which worker runs in the main audio thread?
Thank you again
aa
Yes, it should be worker zero.
great, thank you
Hi Joshua,
You said the buffer is atomic, do you mean the outs** array from the perform function?
I am not sure if I should be placing locks in my worker threads - every worker thread will be incrementing the value of the same output signal array, for example:
worker_proc(…)
{
    for (int i = 0; i < vectorsize; i++)
    {
        x->out[i] += random();
    }
}
Should I lock the region around the increment? Or what is the recommended way of handling this? Thanks in advance!
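One way to sidestep the lock entirely, building on the earlier point that the execute call only returns once all workers are done, is to give each worker its own scratch vector and sum the scratch vectors afterwards. The sketch below is plain C with hypothetical names (`mix_shared`, `MIX_VS`, `MIX_MAX_WORKERS`), each "oscillator" just adds 1.0 per sample so the result is easy to check, and the workers are called in a plain loop where a Max external would call `sysparallel_task_execute`.

```c
#include <assert.h>

#define MIX_VS 8            /* hypothetical signal vector size */
#define MIX_MAX_WORKERS 4   /* hypothetical upper bound on worker count */

/* Hypothetical shared state: one private scratch vector per worker. */
typedef struct {
    long workercount;
    long num_oscs;
    float scratch[MIX_MAX_WORKERS][MIX_VS];
} mix_shared;

/* A worker writes only to its own scratch row, so no two workers share a sample. */
static void mix_workerproc(mix_shared *s, long id)
{
    float *mine = s->scratch[id];

    for (long n = 0; n < MIX_VS; n++)
        mine[n] = 0.0f;
    for (long osc = id; osc < s->num_oscs; osc += s->workercount) {
        for (long n = 0; n < MIX_VS; n++)
            mine[n] += 1.0f;   /* stands in for one oscillator's output sample */
    }
}

/* Runs all workers, then sums their scratch rows into the output vector. */
static void mix_perform(mix_shared *s, float *out)
{
    assert(s->workercount > 0 && s->workercount <= MIX_MAX_WORKERS);

    /* Sequential here; in Max this is where the parallel task would execute
       and return only after every worker has finished. */
    for (long id = 0; id < s->workercount; id++)
        mix_workerproc(s, id);

    /* Single-threaded from here on, so the accumulation needs no lock. */
    for (long n = 0; n < MIX_VS; n++) {
        out[n] = 0.0f;
        for (long id = 0; id < s->workercount; id++)
            out[n] += s->scratch[id][n];
    }
}
```

With 10 oscillators and 3 workers, every sample of `out` ends up as 10.0. The extra memory is one vector per worker, which is usually a small price compared with taking a lock inside the sample loop.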