Block-based processing inside pfft~

Alexis Baskind:

Dear all,

I'm thinking about developing a series of low-latency FFT-based spectral processing tools (pitch shifting, frequency shifting, spectral warping). This requires being able to address whole FFT blocks at once rather than bin by bin: as far as I know, bin-by-bin processing does not allow, for instance, shifting frequency bins down without the cost of an extra block of latency (an overall latency of two block sizes instead of one).

As far as I know, the only way to do that is with an external (as gizmo~ does, for instance), since gen~, like traditional Max patching, is sample-based rather than block-based: there is no way I know of in gen~ to fetch the input data of the whole incoming block, only the most recent samples.

Am I right?

Best

Alexis

webe:

Dear Alexis,

I am also only aware of being able to process sample/bin-wise data within the [pfft~] object. Have you seen that you could use [capture~] to accumulate a buffer? But I guess this doubles your latency (two frames instead of one).

Did you look at the [fft~] object too? It at least gives you a sync signal as soon as one block has passed through.

One workaround might be to rephrase your algorithm to work recursively for each sample instead of requiring entire blocks - if that is possible.

I'm keeping an eye on this. It's a very interesting question, and the answer might well turn out to be: not natively possible, the user has to compile an external...

Alexis Baskind:

Dear Webe,

Thanks for your answer. Indeed, the problem with storing the FFT data in a buffer~ or using capture~ is the extra delay of one block, and for one simple reason: without it, it is impossible to use the high-frequency input to create low-frequency output, since the high-frequency bins have not yet arrived when the low-frequency output has to be produced. A simple example: shifting the frequencies down by 1000 Hz requires, when producing for instance the 500 Hz bin for fftout~, knowledge of the 1500 Hz bin coming in at fftin~.

This is why I think that block-based processing is the only way to overcome this, since then all FFT bins are processed at once, with an overview of all the bins arriving at the input.
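To make that concrete, here is a trivial sketch (plain C++, nothing Max-specific, and the names are just my own placeholders) of the block-based version of a downward bin shift. Producing output bin k needs input bin k + shift, which has not yet arrived if the bins come in one at a time:

#include <complex>
#include <vector>

// Shift a whole spectrum down by 'shiftBins' bins (block-based view).
// Producing output bin k requires input bin k + shiftBins, i.e. spectral
// data that is not yet available if the frame arrives bin by bin.
void shiftSpectrumDown(const std::vector<std::complex<float>>& in,
                       std::vector<std::complex<float>>& out,
                       int shiftBins)
{
    const int numBins = static_cast<int>(in.size());
    out.assign(numBins, std::complex<float>(0.0f, 0.0f));
    for (int k = 0; k + shiftBins < numBins; ++k)
        out[k] = in[k + shiftBins];   // low-frequency output from high-frequency input
}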

Best

Alexis

Graham Wakefield:

Unfortunately there are no MSP capabilities to process blocks of audio at once in that way. Since both pfft~ and fft~ output their spectra sample by sample, you will always end up with an extra block of delay as you accumulate the spectrum back into a buffer~ (or gen~ data, or whatever) for processing en masse. I guess that if you are using large FFT sizes in a real-time performance where latency is critical, this would be an issue.

The only way I can imagine avoiding the extra block of delay is to do the block-based processing in Jitter, via jit.fft, as this really should give you access to the spectrum as a block of data as soon as it is ready.

Graham

Graham Wakefield:

Well, the patch attached below is about the best I can do in this vein. jit.catch~/jit.release~ are definitely more useful here, and they open up a realm of Jitter-based block processing of audio signals that I at least hadn't considered before. Some quite interesting possibilities to explore.

For your purposes, I wonder whether this helps or not. With SIAI (Scheduler in Audio Interrupt) and jit.catch~, this is all happening in the audio thread between block boundaries, so in theory there is no latency added in the Jitter processing itself; but there is latency added in getting the audio back and forth between MSP and Jitter. For a 512-sample FFT, it works out to 512 samples of buffering in jit.catch~, plus 256 samples due to overlapping FFTs (I did 2x overlap in this patcher), plus the roughly 256 samples of latency set on jit.release~, which was the lowest I could get it before the audio started glitching. This might be slightly better than what pfft~ and buffer~ processing would get you; but it demands SIAI and may glitch out if there's a lot of processing going on elsewhere.
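(Adding those up: 512 + 256 + 256 = 1024 samples, i.e. two 512-sample blocks, or roughly 23 ms at 44.1 kHz.)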

Graham

Max Patch
Copy patch and select New From Clipboard in Max.

Alexis Baskind:

Dear Graham,

Thanks for your answers. I wouldn't have thought that a solution through Jitter would be that reliable for audio processing; this is very interesting. A latency of two block sizes is still too much for my purposes (I'm trying to get under 10 ms). pfft~ allows an overall latency of (block size minus vector size) for overlaps of 2 and 4, which is much closer to what I'm looking for.

I will investigate that further.

Best

alexis

Luigi Castelli:

Hi Alexis,

I have been thinking about something like what you describe for quite some time now.
According to my research, the only way to achieve low-latency FFT block processing that makes sense is to write externals in C.

Once that is established, there is an important design decision to be made.
Are you only thinking about a set of externals designed to be used inside pfft~ to do FFT processing the Max/MSP way, or would you also consider externals that manage the FFT processing parameters (FFT size, overlap, windowing, etc.) internally? Going the Max/MSP way is obviously easier and integrates better into the environment. However, one of the things I have been missing in Max is the ability to interact with the FFT analysis parameters (FFT size, hop size, window type, zero-padding, etc.) in real time, which is not currently possible using pfft~.
Last but not least, various opportunities for parallelization come to mind.
A 16x-overlap FFT process could be spread across many CPU cores or even GPU cores.
I am pretty sure that pfft~ does not use the GPU, but is pfft~ able to parallelize its processing load across CPU cores?

Any further thoughts?

Graham Wakefield:

Hi Alexis,

I did some latency measuring in the modified patcher below. First, play some kind of pure tone through it and wind the metro/latency duration down as low as it can safely go before it glitches. For me it works even at 4 ms; lower than that and I get dropouts in the output (and higher than 11 ms I get dropouts in the input).

Then switch the selector~ to choose the click~ input, and the gen~ patcher will measure the latency of the click from input to output. The roundtrip latency seems to depend on Max's IOVS (I/O vector size); at 64 samples the latency is under 20 ms (sometimes as low as 12 ms); at 512 samples the latency is 30 ms. This is at a 44.1 kHz sampling rate. (At 48 kHz I can get 11 ms with 256 IOVS.) It might also depend on some of Max's scheduler settings.

This is for a 512-point FFT. At 44.1kHz there is a minimum latency of 12ms just to get a 512-point FFT spectrum no matter what technique is used. So the Jitter route is adding less than 8ms to this minimum necessary latency.
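(Spelling that out: 512 / 44100 ≈ 11.6 ms just to collect one 512-sample frame, so a roundtrip of just under 20 ms leaves roughly 8 ms attributable to the Jitter path itself.)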

At least some of this is due to the [delay~] objects in there, which I used as a hack for quickly getting 2x overlap. It should be possible to do it without delay~, using instead two jit.catch~/jit.release~ pairs that are triggered alternately; potentially this might shave up to 6 ms off the roundtrip latency. I'm just not sure I can figure out how to get them to do it. The reason jit.catch~ works here is that it is driven by the metro, which is on the high-priority thread.

That said, you could also probably achieve all this using plain fft~, buffer~ and gen~ objects, which would be easier than writing an external.

Max Patch
Copy patch and select New From Clipboard in Max.

Alexis Baskind:

@Graham: your patch is a very interesting way to prototype: contrary to pfft~, it allows perfect reconstruction of the input signal if no transformation is applied, at least with a rectangular window. With a triangular window, perfect reconstruction no longer works, which is normal since there is no normalization like Griffin-Lim, and which should lead (I have to double-check) to a modulated output. With your patch and the IOVS set to 128, I get a latency of 1024 samples, which is not bad at all. The problems I encountered are instabilities and crashes, so I would rule it out for a live performance, at least until I better understand how to overcome them.

@Luigi: if I had to start writing an external I'd probably go directly for an embedded FFT using FFTW, for the reasons you mention, and also to be able to control the way the time signal is reconstructed: even though pfft~ has the advantage of simplicity and low latency (the latency actually equals block size minus vector size), I didn't find a way to ensure perfect reconstruction when the STFT frames are not modified, even with a rectangular window.
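Just to give an idea of the scale of it, a rough single-precision sketch of the per-frame core with FFTW might look like the following. The structure, the window handling and processSpectrum() are placeholders of my own; all the Max object plumbing and the input/output ring buffering are left out, and it links against libfftw3f:

#include <fftw3.h>

// One analysis / processing / resynthesis pass over a single frame.
// Hypothetical sketch only: frameSize, window and processSpectrum() are placeholders.
struct FrameProcessor {
    int N;                    // FFT size
    float* time;              // time-domain frame
    fftwf_complex* spec;      // N/2 + 1 complex bins
    fftwf_plan fwd, inv;

    explicit FrameProcessor(int frameSize) : N(frameSize) {
        time = (float*) fftwf_malloc(sizeof(float) * N);
        spec = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * (N / 2 + 1));
        fwd  = fftwf_plan_dft_r2c_1d(N, time, spec, FFTW_MEASURE);
        inv  = fftwf_plan_dft_c2r_1d(N, spec, time, FFTW_MEASURE);
    }
    FrameProcessor(const FrameProcessor&) = delete;
    FrameProcessor& operator=(const FrameProcessor&) = delete;
    ~FrameProcessor() {
        fftwf_destroy_plan(fwd);
        fftwf_destroy_plan(inv);
        fftwf_free(time);
        fftwf_free(spec);
    }

    // 'in' and 'out' hold one full frame; 'window' is the analysis/synthesis window.
    void process(const float* in, float* out, const float* window) {
        for (int n = 0; n < N; ++n)
            time[n] = in[n] * window[n];           // analysis window
        fftwf_execute(fwd);                        // whole spectrum available at once
        processSpectrum(spec, N / 2 + 1);          // shift bins, warp, etc.
        fftwf_execute(inv);                        // inverse FFT (unnormalized in FFTW)
        for (int n = 0; n < N; ++n)
            out[n] = time[n] * window[n] / N;      // synthesis window + 1/N scaling
    }

    void processSpectrum(fftwf_complex* bins, int numBins) {
        (void)bins; (void)numBins;                 // placeholder for the actual processing
    }
};

The overlap-add of the returned frames, one hop at a time, would then happen outside this struct.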

Best

Alexis

Graham Wakefield:

Yes, I see instabilities too -- it's definitely pushing Max in a direction it wasn't exactly designed for, so not that surprising. I had a different version of this working really solidly, then suddenly it crashed and the crash recovery didn't recover the gen~ code; now I can't seem to rebuild it, ugh.

Probably it doesn't matter. At least for low latency, pfft~ with 8x overlap is already pretty amazing; the overlap cancels out a lot of the latency incurred by buffering up spectra before processing.

For a 512-point FFT the roundtrip latency is 448 samples; with buffer~-based processing inside it, and with 8x overlap, it only grows to 512 samples of roundtrip latency, 11.6 ms at 44.1 kHz or 10.7 ms at 48 kHz. That's pretty good. That is, using buffer~-based processing can add less than 2 ms if you are using overlap.
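(Spelling those numbers out, assuming a 64-sample signal vector, which is what makes them line up: 512 - 64 = 448 samples of baseline pfft~ latency; accumulating into a buffer~ adds one hop of 512 / 8 = 64 samples at 8x overlap, giving 448 + 64 = 512 samples, i.e. 512 / 44100 ≈ 11.6 ms or 512 / 48000 ≈ 10.7 ms.)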

Be careful to make the buffers have unique names though...

iamnotnicola:

Hello here, what an insightful thread. Are there any updates, or some Max examples that I could build upon? I am trying to build a pitch detector with zero padding in order to support larger samples.

Graham Wakefield:

Something that was perhaps missed in the discussion above is that pfft~ is not the only option in Max -- there's also fft~, which in a way gives you lower-level access to the FFT process. There was a wonderful tutorial on using fft~ somewhere in the archives, but I can't find it now...

Alexis Baskind:

Dear Graham,

As far as I know, fft~ also presents the same problem with respect to my initial concern: for instance, in the case of shifting the spectrum by a negative number of bins, the whole block would first have to be stored in a buffer, then processed, and then converted back to the time domain. This implies a total delay of two block sizes (one for the FFT, and one for processing plus the inverse FFT). If the same thing were implemented in C or C++ code, only one block size of delay would be incurred, corresponding to the time the FFT needs to gather samples before it can be applied.

Am I right?

Alexis

Graham Wakefield:

Yes, the throughput latency is two blocks with fft~/pfft~, because you don't get the full frame in its entirety but instead sample by sample over time. I just wanted to make sure that the existence of fft~ wasn't missed in the thread (mainly because it gives you a more direct ability to design the overlap, zero padding, etc. yourself, rather than having pfft~ do that automatically -- responding to another comment further up).

The jit.fft approach is interesting but I wouldn't trust it in a live performance either. It is likely better suited to non-realtime duties -- for example, it might be a very useful way to be able to process very large Fourier transforms (e.g. Audiosculpt/Soundhack realms), which isn't possible via fft~/pfft~.

For the absolute minimum throughput latency, a C++ external seems like the only way. Still, one could prototype the processing using gen~ inside pfft~, or jit.gen with jit.fft, and then export the C++ code and wrap it around some FFT library inside an MSP external. You might also need to handle the buffering on input and output via a ring buffer, since the DSP vector size is unlikely to be the same as the FFT frame size, unless the FFT library of your choice does that for you.
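Something like this is what I mean by the input-side buffering -- a very rough sketch, where frameSize, hopSize and the processFrame callback are placeholders of mine, and the complementary overlap-add buffering on the output side is left out:

#include <algorithm>
#include <functional>
#include <vector>

// Accumulate arbitrary-sized DSP vectors until a full FFT frame is available,
// then hand the frame to a block-based processor and keep the overlapping tail.
class InputFrameBuffer {
public:
    InputFrameBuffer(int frameSize, int hopSize,
                     std::function<void(const float*, int)> processFrame)
        : frame_(frameSize, 0.0f), hop_(hopSize), filled_(0),
          processFrame_(std::move(processFrame)) {}

    // Called once per DSP vector; the vector size need not divide the frame size.
    void write(const float* in, int numSamples) {
        for (int i = 0; i < numSamples; ++i) {
            frame_[filled_++] = in[i];
            if (filled_ == static_cast<int>(frame_.size())) {
                processFrame_(frame_.data(), filled_);            // full frame ready
                std::copy(frame_.begin() + hop_, frame_.end(),    // drop one hop,
                          frame_.begin());                        // keep the overlap
                filled_ -= hop_;
            }
        }
    }

private:
    std::vector<float> frame_;
    int hop_;
    int filled_;
    std::function<void(const float*, int)> processFrame_;
};

In the perform routine you would just call write() with each incoming signal vector.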

Graham

Luigi Castelli:

Hi folks,

Yes, the only way to achieve minimum throughput latency is with a custom C/C++ external.
However, I wouldn't use gen~ in pfft~ for prototyping; I respectfully disagree with Graham here. I think there are far better resources online, and those would be my starting point if I were to undertake a project of this kind. (By the way, it's totally worth it and needed, if you ask me.)

For instance, Apple has an open-source CASpectralProcessor C++ class (part of their Core Audio Public Utility classes):
https://github.com/robovm/apple-ios-samples/tree/master/CoreAudioUtilityClasses/CoreAudio/PublicUtility

Simple, fast, clearly written and works very well. The nice thing is that it allows you to abstract away all the FFT analysis and resynthesis code, letting you focus on the actual frequency domain processing algorithms of your choice. Ring buffering on input and output is managed automatically by the class so nothing to worry about if your MSP and FFT vector sizes are different.
So it could work as a good starting point for whatever set of FFT externals one would like to build. The only disadvantage is that, being an Apple library, it uses vDSP under the hood and is therefore not cross-platform. If you want cross-platform compatibility you would have to adapt it to use something like FFTW, for example. Not a huge amount of work, but a fair amount nevertheless.
I have implemented an external using it and I have been very pleased with the results.
The only latency it ever incurs is one FFT vector size.

Another good resource is the [spectrumdraw~] external by Alex Harker (University of Huddersfield).
You can change the FFT size, window type and zero padding, all in real time. It's all open source.
Obviously it's only a spectrum analyzer, so there is no processing involved; however, Alex did what we are talking about here in terms of implementing a minimum-latency FFT framework independent of all the Max/MSP facilities.

I hope Alexis is still motivated to take on such a project.
I think there is a need for it, and it would certainly find its place in the Max package collection.

Best.

- Luigi

P.S.
I am in no way affiliated with either Apple or the University of Huddersfield.

Alexis Baskind:

Dear Luigi,

Thanks for the very informative insider's view. I'm still motivated to work on it, even if I have other balls to juggle at the moment.

Talking about Alex Harker's code: besides spectrumdraw~, Alex Harker has published two other things worth considering:
- as part of the HISSTools project (which contains spectrumdraw~), the FFT library "HISSTools_FFT", which is cross-platform (at least Windows and macOS) and seems to be very well written and efficient. No dynamic libraries are needed besides the standard ones from the OS (i.e. Accelerate on Apple platforms);
- even more interesting: I just noticed that Alex Harker has published another library called FrameLib (https://github.com/AlexHarker/FrameLib) that seems to be very powerful and specifically addresses the issue I mentioned.

best,

Alexis