Question about ext_sysparallel.h

Toshiro Yamada

Hello,

I'm writing an MSP external that has a fair amount of computation that can be parallelized. I was looking at the SDK and found a file called ext_sysparallel.h with t_sysparallel_task and t_sysparallel_worker structs. Judging by the methods, this looks like what I want to do, which is to distribute processing over multiple cores. However, I can't seem to find any documentation or examples of how to use these.

I would appreciate it if someone could provide me with more information or examples.

Thank you!

Toshiro

Joshua Kit Clayton

Hello Toshiro,

We probably won't be completely documenting this anytime soon, but here is a simplified example that covers creating, executing, and freeing a t_sysparallel_task. It is reduced from poly~, which alternates patcher voices between the threads.

If you can describe what you want to accomplish, or have some attempt at using this code with specific questions, I would be happy to try and answer them.

-Joshua

typedef struct _myobj_parallel
{
    t_myobj    *myobj;
    long    count;
} t_myobj_parallel;

void myobj_workerproc(t_sysparallel_worker *w)
{
    t_myobj *x = ((t_myobj_parallel *)(w->data))->myobj;
    long count = ((t_myobj_parallel *)(w->data))->count;
    long i,threadcount;

    threadcount = w->task->workercount;

    // alternate every threadcount voices between threads
    for (i = w->id; i < count; i+=threadcount) {
        if (x->p_patchers[i].r_mute)
            continue;
        if (x->p_patchers[i].r_chain) {
            // compute dsp chain
        }
    }
}

void myobj_run(t_myobj *x, long count)
{
    t_max_err err;
    long i;
    t_myobj_parallel p;

    // setup our workerproc data pointer
    p.myobj = x;
    p.count = count;

    if (!x->p_paralleltask)
        x->p_paralleltask = sysparallel_task_new(&p,(method)myobj_workerproc,x->p_threadcount);

    // set task priority to audio
    x->p_paralleltask->priority = SYSPARALLEL_PRIORITY_HIGH;

    sysparallel_task_data(x->p_paralleltask,&p);

    // execute task. if there is an error it means we're trying to run parallel in nested instances
    if ((err = sysparallel_task_execute(x->p_paralleltask))) {
        if (!x->p_parallelerror) {
            object_error((t_object *)x,"use of nested, parallel enabled objects is not supported. disabling for inner myobj~ object(s)");
            x->p_parallelerror = TRUE;
        }
        // do your fallback here
    }
}

void myobj_free(t_myobj *x)
{
    dsp_free((t_pxobject *)x);

    if (x->p_paralleltask)
        sysparallel_task_free(x->p_paralleltask);

}
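
To make the strided loop in myobj_workerproc() concrete, here is a minimal standalone sketch of how `for (i = w->id; i < count; i += threadcount)` divides the voices. It does not use the Max SDK at all; `voices_for_worker` and `print_partition` are hypothetical helpers written only for illustration.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the Max SDK: counts how many of `count`
   voices the worker with index `id` visits when `workercount` workers
   stride through the voices the same way myobj_workerproc() does. */
long voices_for_worker(long id, long workercount, long count)
{
    long i, n = 0;

    for (i = id; i < count; i += workercount)
        n++;
    return n;
}

/* Hypothetical helper: prints the voice indices each worker would handle. */
void print_partition(long workercount, long count)
{
    long id, i;

    for (id = 0; id < workercount; id++) {
        printf("worker %ld:", id);
        for (i = id; i < count; i += workercount)
            printf(" %ld", i);
        printf("\n");
    }
}
```

Calling `print_partition(3, 8)` lists voices 0 3 6 for worker 0, 1 4 7 for worker 1, and 2 5 for worker 2. Every voice is visited by exactly one worker, so the workers never touch the same voice.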

Toshiro Yamada

Many thanks, Joshua!

The example looks pretty straightforward. I'll try to implement this in my code and see how it goes.

Toshiro

andrea agostini

Hi!
sorry guys, but I'm not as advanced as that, I have some more questions...

let's presume that I have an object which does additive synthesis, and I want to make it multithreaded. what I suppose I should do is have each thread calculate one half of the oscillators, then sum together the results from the threads and send the sum out of my signal outlet.

so, in random order:

at which moments should I create and execute the parallel task - dsp method, perform method, ...? am I wrong for guessing that the parallel task should be executed once per vector, from within the perform method?

should the perform method wait for each parallel task to finish, then add the partial results together and send the sum out of the outlet? or, maybe, is it more clever to have the perform method itself doing a part of the computation, and at the end of it wait for the other threads?

if so, how do I know if one thread has finished its job? should it set/increment a variable somewhere, to be checked by the perform method? or is there a more direct way to do it?

... or am I totally misunderstanding the whole thing???

thank you!
aa

andrea agostini

jkc... please... some hints...

Joshua Kit Clayton

* what I suppose I should do is have each thread calculate one half of the oscillators, then sum together the results from the threads and send the sum out of my signal outlet.

Yes. However, keep in mind that if more cores are available there will be more than two threads, so each thread would handle 1/Nth of the oscillators rather than half.

* at which moments should I create and execute the parallel task - dsp method, perform method, ...?

We create the task the first time the perform method is called. In the above example, myobj_run() would be called from myobj_perform().

* am I wrong for guessing that the parallel task should be executed once per vector, from within the perform method?

You are correct. sysparallel_task_execute() should be called once per vector from your perform method.

* should the perform method wait for each parallel task to finish, then add the partial results together and send the sum out of the outlet?

Yes. However, you don't need to do anything special w/r/t synchronization. sysparallel_task_execute() will run one portion in the main audio thread and additional portions in other threads, waiting for the other threads to complete before returning. No need to do any of the additional management you mention in your message. It's all done for you.
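
As a rough sketch of the "compute 1/Nth per worker, then sum" idea: each worker renders its share of oscillators into a private scratch buffer, and the perform method mixes the scratch buffers once the execute call has returned. Everything below is hypothetical and independent of the SDK: `t_addsynth`, `addsynth_worker`, `addsynth_mix`, the fixed sizes, and the naive sawtooth standing in for a real oscillator are all assumptions made for illustration.

```c
#include <string.h>

#define MAX_WORKERS 8    /* assumed upper bound on the number of workers */
#define MAX_OSC     64   /* assumed upper bound on the number of oscillators */
#define VECTOR_SIZE 64   /* assumed signal vector size */

/* Hypothetical synth state; none of these names come from the Max SDK. */
typedef struct {
    long   workercount;
    long   osccount;
    double phase[MAX_OSC];                    /* current phase, 0..1 */
    double inc[MAX_OSC];                      /* phase increment per sample */
    double scratch[MAX_WORKERS][VECTOR_SIZE]; /* one private buffer per worker */
} t_addsynth;

/* What a worker proc could do with its id: render every workercount-th
   oscillator into this worker's own buffer, so no two workers ever
   write to the same memory. */
void addsynth_worker(t_addsynth *x, long id)
{
    double *buf = x->scratch[id];
    long i, n;

    memset(buf, 0, sizeof(double) * VECTOR_SIZE);
    for (i = id; i < x->osccount; i += x->workercount) {
        for (n = 0; n < VECTOR_SIZE; n++) {
            buf[n] += 2.0 * x->phase[i] - 1.0; /* naive sawtooth as a stand-in */
            x->phase[i] += x->inc[i];
            if (x->phase[i] >= 1.0)
                x->phase[i] -= 1.0;
        }
    }
}

/* Run by the perform method after all workers are done:
   sum the private buffers into the output vector. */
void addsynth_mix(t_addsynth *x, double *out)
{
    long id, n;

    for (n = 0; n < VECTOR_SIZE; n++) {
        double sum = 0.0;
        for (id = 0; id < x->workercount; id++)
            sum += x->scratch[id][n];
        out[n] = sum;
    }
}
```

In an actual external, the worker proc would presumably call something like `addsynth_worker(x, w->id)`, and the perform method would call `addsynth_mix()` after sysparallel_task_execute() returns. Because each worker owns its scratch buffer and its oscillators, no locking is needed in this arrangement.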

I would say start implementing your solution, and let us know any issues you encounter.

-Joshua

andrea agostini

thank you! I'll check it out and let you know...
aa

andrea agostini

... it works! thank you very much for the example and explanations!

now I have another couple of questions:

- would it be a very bad idea to put a mutex in the DSP chain?

- do I have a way to know which worker is running in the main audio thread? maybe the one with w->id == 0?

aa

Joshua Kit Clayton

If you do use a mutex, make sure you limit the locking to as small a region as possible, both in your audio perform routine and in any other functions that access the lock elsewhere.

For buffer access we use an atomic increment as a cheaper, but less robust, locking mechanism. Perhaps if you presented a sample of your code we could suggest ways to limit your locking time, or perhaps even remove the need for it altogether.
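
For readers unfamiliar with the idea, an atomic "in use" counter might look roughly like the sketch below. This is not the Max implementation: it uses C11 `<stdatomic.h>` as a stand-in, and `t_sharedbuf` and all of the `sharedbuf_*` functions are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared buffer; the field names are illustrative only. */
typedef struct {
    atomic_long inuse;   /* number of readers currently using the buffer */
    float      *samples;
    long        frames;
} t_sharedbuf;

void sharedbuf_init(t_sharedbuf *b, float *samples, long frames)
{
    atomic_init(&b->inuse, 0);
    b->samples = samples;
    b->frames = frames;
}

/* Audio thread: flag the buffer as in use for the duration of a read. */
void sharedbuf_begin_read(t_sharedbuf *b)
{
    atomic_fetch_add(&b->inuse, 1);
}

void sharedbuf_end_read(t_sharedbuf *b)
{
    atomic_fetch_sub(&b->inuse, 1);
}

/* Other thread: swap in new storage only if no reader is active.
   Returns false when the caller should try again later.
   A reader can still slip in between the check and the swap,
   which is why this is cheaper but less robust than a mutex. */
bool sharedbuf_try_replace(t_sharedbuf *b, float *samples, long frames)
{
    if (atomic_load(&b->inuse) != 0)
        return false;
    b->samples = samples;
    b->frames = frames;
    return true;
}
```

The audio thread never blocks here; it only pays for two atomic operations per read, and the cost of contention is pushed onto the non-audio thread, which has to retry.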

andrea agostini

Well, in fact I probably don't need locking... I was blaming threading for a much more stupid problem.

Any clue about the other point - knowing which worker runs in the main audio thread?

Thank you again
aa

Joshua Kit Clayton

Yes, it should be worker zero.

andrea agostini

great, thank you

MF

Hi Joshua,

You said the buffer is atomic, do you mean the outs** array from the perform function?

I am not sure if I should be placing locks in my worker threads - every worker thread will be incrementing the value of the same output signal array, for example:

worker_proc(…)
{
    for (int i = 0; i < vectorsize; i++)
    {
        x->out[i] += random();
    }
}

should i lock the region around the increment? or what is the recommended way of handling this? Thanks in advance!