Questions about The Phase Vocoder - Part 1 by Richard Dudas and Cort Lippe
Link to the tutorial and files:
or just a picture of a pfft~ subpatcher:
1. What is the role of the actual phase vocoder part in the [pfft~] subpatch? It seems to me that there is synchronous granular synthesis (SGS) going on inside the [pfft~] and that the FFT is just resynthesizing the sound for nothing. I know I am missing something important here, but I just can't understand what is going on. I tried moving the time-domain part of the [pfft~] subpatch into the main patch and erasing the [pfft~]. The result was an SGS with no overlap, or with overlap 2 if I manually set the hop size for the "previous" window (which is otherwise set by [fftinfo~]). Where and how does "proper" overlap happen, given that the sound is evidently smooth?
2. If I understand correctly, the sampling rate in the [pfft~] subpatch is 4 times higher than in the mother patch when the overlap is 4. Does this create a 4 × 2 = 8 overlap? Times 2 because we are reading 2 windows at the same time...?
3. Why is there a need for [frameaccum~] in this patch? I think I am really confused about the running phase and the [frameaccum~] object in general. As far as I understand, the phase difference is calculated between equivalent bins in successive frames, so if you jump from frame 1 to frame 10, the difference is calculated between frames 10 and 9 rather than between 10 and 1. Isn't that enough to calculate the frequency? Why [frameaccum~]? According to the formulas found in the article "A Tutorial on Spectral Sound Processing Using Max/MSP and Jitter" by Jean-François Charles, this data should be enough:
“center frequency fc (Hz) of the frequency bin m is
fc = m × (sr / FFTSize)
assuming no more than one frequency is present in each frequency bin in the analyzed signal, its value in Hz can be expressed as
f = fc + Δφ × (sr / (2π × WindowSize))
where Δφ is the phase difference, wrapped within the range [−π, π].”
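To make sure I am reading the formulas right, here is a quick numeric example in Python (the variable names are mine; I use the hop between the two compared frames as the divisor, which equals the window size when there is no overlap):

    import numpy as np

    sr = 44100.0      # sample rate (Hz)
    fft_size = 1024   # FFT size, so bins are sr/fft_size = ~43 Hz apart
    hop = 256         # samples between the two frames whose phases are compared
    m = 10            # bin index

    fc = m * sr / fft_size                  # center frequency of bin m: ~430.7 Hz
    dphi = 0.3                              # wrapped phase difference, in [-pi, pi]
    f = fc + dphi * sr / (2 * np.pi * hop)  # estimated frequency: ~438.9 Hz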
I am sorry for such a long post, but I am evidently very confused about some essential basics... Thanks for any answer!
Hello,
Quite a subtle tutorial, of course, and interesting questions!
1. The [fft~] object does the "analysis" part of the phase vocoder. The "re-synthesis" part, i.e. the inverse FFT, is applied to the output of the subpatch by [fftout~]. The windowing may remind you of granular synthesis, but here it is really the windowing function applied before the FFT process. This windowing is somewhat hidden when you use the inputs of [pfft~] (i.e. [fftin~]) to go from the time domain to the spectral domain.
2. Overlap 4 means that, in a way, you are processing 4 windows at the same time. I'm not sure I understand your question, sorry.
3. If you just wanted to know where the "one frequency in the frequency bin" is, you would use that formula (it's really just proportionality). But here, what you want to do is re-synthesize the sound, and what the inverse-FFT engine wants is an x and a y (Cartesian coordinates). You get them by giving polar coordinates to [poltocar~]. To do that, you need a phase value, not a phase difference. That is why you use [frameaccum~]: to translate the phase differences back into phases usable by [poltocar~].
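If a sketch outside Max helps, here is roughly that step in Python/NumPy. This is only an illustration of the idea with names of my own choosing, not the actual MSP internals: one running-phase accumulator per bin plays the role of [frameaccum~], and the complex exponential plays the role of [poltocar~].

    import numpy as np

    def resynth_frames(amps, dphis):
        # amps, dphis: arrays of shape (num_frames, num_bins) holding the
        # magnitude and phase difference read out for each spectral frame
        running_phase = np.zeros(dphis.shape[1])  # one accumulator per bin
        frames = []
        for amp, dphi in zip(amps, dphis):
            running_phase = running_phase + dphi  # frameaccum~: differences -> phase
            # poltocar~: polar (amp, phase) -> cartesian (x, y) for the inverse FFT
            frames.append(amp * np.exp(1j * running_phase))
        return np.array(frames)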
Hope that helps a little.
I think the most obvious thing you are missing is:
You take the differences of the phases and then move between frames at a different rate; the differences then sum to produce a set of running phases that *are not the same* as the input (unless the speed is 1, in which case that is exactly what we want). This makes the result quite different from SGS.
This running phase / phase difference business:
1 - is where all the problems of the phase vocoder start.
2 - is the essential difference between a kind of SGS and the phase vocoder. In theory the phase vocoder sounds smooth in a way that SGS will not, because we take phase into account in the reconstruction of the signal (and phase is a relative measure, so it's the differences that are important). By continuing each bin according to its phase changes over time, we hope to achieve something that SGS can't.
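A toy illustration of that point, with names of my own choosing and a crude nearest-frame lookup instead of the interpolation a real patch would use: re-reading the stored phase differences at a new rate and summing them gives running phases that match the originals only at speed 1.

    import numpy as np

    def running_phases(dphi, speed, num_out_frames):
        # dphi: (num_frames, num_bins) phase differences stored at analysis time
        phase = np.zeros(dphi.shape[1])
        out = []
        pos = 0.0
        for _ in range(num_out_frames):
            phase = phase + dphi[int(pos) % len(dphi)]  # keep summing at the new rate
            out.append(phase)
            pos += speed  # at speed 0.5 each difference is used twice, so the phase
                          # keeps advancing smoothly instead of jumping back and repeating
        return np.array(out)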
Maybe that makes things a little clearer?
A.
Thank you very much to both of you! After reading your posts 10 times and switching between various articles, I had a relieving "ahaaa" moment. I can now see the cleverness of that patch! Brilliant! Thanks again!
Dear T (or anybody else),
Thanks a lot for this post, but could you share or elaborate a bit on your "ahaa" moment?
I understand the concept of the phase difference (or phase derivative), and I see why it gives us more precise information on the "fine-tuning" of the frequency with the highest energy in the current bin, as opposed to the "basic, raw frequency" this bin represents.
I also see why, when isolating (as in storing) FFT frames, you could store the phase difference instead of the phase in order to reconstruct the phase later. When doing this, however, you would need to start from an absolute reference point (which might just be 0) at some moment, because the phase difference is relative. So, when NOT changing the speed or re-ordering frames, you could either store the phase, or store the phase difference and reconstruct the phase with [frameaccum~]. Have I understood this right?
But now, if you change the order of the FFT frames, your signal will get permuted anyway, right? Say, for a specific bin, frame 1 had a phase of 0.5, frame 2 a phase of 0.6 and frame 3 a phase of 0.7; then the phase differences for these bin frames would be 0.5 (starting from 0), 0.1 and 0.1. Accumulating would give the right result when playing back in the same order (0 + 0.5 = 0.5; 0.5 + 0.1 = 0.6; 0.6 + 0.1 = 0.7). But when playing back in the order 3 1 2, for example, the reconstructed phases would be 0.1 (referring back to 0), 0.6 and 0.7, is that right? Here, the original phases for those frames were not reconstructed (which, as I understand it, is not what we want anyhow), but "something else". How do I know this "something else" is right?
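(As a sanity check on my numbers, in Python:

    import numpy as np

    phases = np.array([0.5, 0.6, 0.7])   # one bin across frames 1, 2, 3
    dphi = np.diff(phases, prepend=0.0)  # stored differences: [0.5, 0.1, 0.1]
    print(np.cumsum(dphi))               # order 1 2 3 -> [0.5, 0.6, 0.7]
    print(np.cumsum(dphi[[2, 0, 1]]))    # order 3 1 2 -> [0.1, 0.6, 0.7]

so the accumulated phases for order 3 1 2 are indeed 0.1, 0.6 and 0.7.)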
Another question I have is: why do we need to do the slicing manually for the phase vocoder? I haven't quite thought it through completely, and it still puzzles me a bit, because in the Max examples (fft-fun > phase-vocoder-sampler by zoax, luke and _M) I've found a version that works and sounds similar, but makes use of both fftin~ and fftout~.
To understand this, I guess I would need to know in detail how pfft~ does the slicing.
What I think I understand (assuming windowsize = 1024, overlap = 4) is this: overlap, in general, does not mean doing additional slicing on the 1024 samples of a signal frame before the FFT; it means doing an individual FFT for samples 0-1023 of the input signal, then skipping 256 samples and doing an FFT for samples 256-1279, and so on. For correct overlapping, this means the distance between FFT frames is already specified precisely by the overlap factor (here: 256 samples), which would make it impossible to do time stretching or compression, even if you can access the overlapping frames individually, right?
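In code, the slicing I have in mind would look something like this (plain Python, just my understanding, not how pfft~ is actually implemented):

    import numpy as np

    window_size, overlap = 1024, 4
    hop = window_size // overlap     # 256 samples between successive FFTs
    signal = np.random.randn(44100)  # stand-in for the input signal

    # frame k covers samples k*hop .. k*hop + window_size - 1
    frames = [signal[i:i + window_size]
              for i in range(0, len(signal) - window_size + 1, hop)]
    spectra = [np.fft.fft(frame * np.hanning(window_size)) for frame in frames]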
Now, what I don't know is: does each frame in the pfft~ have a size of 1024 and get subdivided into overlapping subframes, or is each frame a single FFT? I don't know how to say this precisely... When I move through the frames of a buffer that has stored the successive frames from inside the pfft~ object, do I have to move 4 frames to get to the point where the original time signal would be one frame later, because successive overlapping frames are stored in the buffer? (That would mean the distance between frames inside the pfft~ is 256 samples of the time signal.)
So if you can access each overlapping frame individually, you could very well do compression or time-stretching, but when stretching, the window functions would be sized wrong and the overlap would be wrong, either leaving silent gaps between the frames or giving too much output.
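What I mean by silent gaps or too much output can be seen by summing the overlapped windows and checking whether the result is constant; a little Python sketch (names are mine):

    import numpy as np

    window_size = 1024
    win = np.hanning(window_size)

    def ola_gain(hop, num_windows=32):
        out = np.zeros(hop * num_windows + window_size)
        for k in range(num_windows):
            out[k * hop:k * hop + window_size] += win
        return out[window_size:-window_size]  # look at the steady-state region only

    print(ola_gain(256).std())  # hop = N/4: near-constant gain, smooth output
    print(ola_gain(384).std())  # mismatched hop: ripple in the level, i.e. bumps/gaps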
Does this mean that the Max example patch by zoax, luke and _M is wrong?
Hmm, yeah, I don't seem to understand why both approaches (custom fft~ vs. fftin~ with automatic overlap) seem to work...?
Also, wouldn't what I said above mean that when compressing with a speed factor of 2 we would need shorter windows and half the overlap in the reconstruction, and when stretching with a speed factor of 0.5 we would need longer windows and twice the overlap?
Sorry, I hope someone can help me out a bit. All the best, j.