convolutional neural network in max/msp
Does anyone have experience using CNN recognition for sound in Max/MSP? Or is it possible to achieve the same result with the MuBu library?
This is so specialized and processor-intensive that it would not be practical in Max. Even if you wrote your own external in C with graphics-card processing, it could take hours, days, or even weeks to train the network on thousands of standardized training examples (and it might not even work that well, depending on the CNN architecture you chose). And forget trying to do it using standard Max/Jitter objects.
It might be fun to try something simple in Max, though, like the standard handwritten digit-recognition task (i.e. MNIST), using convolution kernels in Jitter and building the network in JavaScript. But even that would most likely be more nightmarish than fun...
Hi floating point
Thanks for the advice. From my understanding, maybe it's not too heavy a process to apply a CNN to audio in Max? I was using the MuBu library to recognize signals from a CCTV camera, and it was fun! The model was a GMM, but it loses the recognition if I move the camera (it only recognizes the whole audio as one feature). That's why I was looking for a solution to recognize particular audio features within one audio file.
https://vimeo.com/290429577
OK, why don't you try doing an FFT of the audio signal in Max, converting it to a Jitter matrix, and putting it through your GMM? That would be a reliable path. But for actually doing a CNN with back-propagation, I'm not aware of any Max-friendly packages out there.
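If it helps, here is roughly what that GMM step would look like outside Max -- just a minimal sketch of fitting a GMM to FFT frames, assuming NumPy and scikit-learn (this is not MuBu's API, and the signal here is a placeholder):

```python
# Minimal sketch: magnitude FFT frames -> GMM clustering.
# Mirrors the "fft -> matrix -> GMM" idea outside Max; all names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def stft_magnitude(signal, n_fft=1024, hop=512):
    """Return one magnitude-spectrum row per analysis frame."""
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(signal[i:i + n_fft] * window))
              for i in range(0, len(signal) - n_fft, hop)]
    return np.array(frames)                    # shape: (num_frames, n_fft // 2 + 1)

signal = np.random.randn(48000)                # placeholder: replace with your audio
features = stft_magnitude(signal)

gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(features)
print(gmm.predict(features[:10]))              # component assigned to each frame
```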
It was already in FFT format, and that indeed improved the performance a lot. But could you please explain why to convert it into a Jitter matrix, since MuBu is for audio machine-learning processing? The tricky thing is that the scan lines of CCTV aren't processed the same way pixel data is, so I didn't plan to deal with it visually; the audio processing is still acceptably fast in Max.
Sorry, I'm not familiar with MuBu. I was just responding to your question about CNNs, and thought you wanted to adapt an audio signal to a CNN. In Max a feasible way is to convert the FFT signal into a Jitter matrix format, so that 2D convolution can be applied afterwards. But evidently you'd then have to export that to another software environment to do the actual training, if MuBu doesn't do CNNs or can't handle 2D GMM representations.
You are right, MuBu doesn't do CNNs, so as an alternative I'm also trying to route the signal out to Python for a CNN solution. But I'm still interested in solving this in Max. What puzzles me is: do I really need to convert it to a 2D matrix first, given the scanline structure? Converting between video and audio doesn't make sense here; it's only fast as an audio process. I have no experience with CNN audio processing -- isn't it a 1D process too?
You seem to be asking two different questions -- one about CCD scanlines and one about audio signals.
A CNN can theoretically have any dimensionality, but it is usually understood as 2D, because that's how it was developed: as a 2D image-analysis method. The spectral analysis of an audio signal can be treated as a 2D image, and as such it can be used in a CNN. But as far as I know that cannot be done in Max. You would need to determine the convolution kernel(s) via training, which is done via back-propagation, and you'd need to write your own external to do that.
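To make that concrete, here is a minimal sketch of that kind of 2D CNN over spectrogram "images", assuming PyTorch (the layer sizes and class names are only illustrative). The kernels start random and are learned by back-propagation, which is exactly the part with no Max-friendly equivalent:

```python
# Minimal sketch of a 2-D CNN over spectrogram "images", assuming PyTorch.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution kernels to be learned
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x):            # x: (batch, 1, freq_bins, time_frames)
        return self.classifier(self.features(x).flatten(1))

# One back-propagation step on dummy data (4 fake spectrograms, 2 classes)
model = SpectrogramCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
specs = torch.randn(4, 1, 513, 128)                 # stand-ins for magnitude spectrograms
labels = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(model(specs), labels)
loss.backward()                                     # gradients for every kernel weight
optimizer.step()
```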
Here are some examples of how audio can be represented as a 2D image:
https://cycling74.com/tools/charles-spectral-tutorials/
As you say, you could then export that sort of representation to a CNN implemented in Python for training and analysis.
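On the Python side that hand-off could look something like this -- again only a sketch, assuming librosa for the spectrogram and reusing the toy SpectrogramCNN class from the sketch above (the file name is a placeholder):

```python
# Sketch: audio file -> log-magnitude spectrogram "image" -> toy CNN.
# Assumes librosa; "cctv_audio.wav" is a placeholder file name.
import librosa
import numpy as np
import torch

signal, sr = librosa.load("cctv_audio.wav", sr=None, mono=True)
spec = np.abs(librosa.stft(signal, n_fft=1024, hop_length=512))    # (freq_bins, time_frames)
log_spec = np.log1p(spec).astype(np.float32)

x = torch.from_numpy(log_spec)[None, None, :, :]    # (1, 1, freq_bins, time_frames)
model = SpectrogramCNN()                            # defined in the earlier sketch
print(model(x).softmax(dim=1))                      # class probabilities (still untrained)
```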
That sounds a bit disappointing, but I understand your point. Thank you very much for the directions; I will look into it more. Cheers.