The "phase" part of the phase vocoder
Hi all,
I'm trying to implement a phase vocoder with the ability to seamlessly transition between "live" and real-time sampled material.
At a playback speed of 1.0 the signal passes through unaffected. For slower speeds, as soon as the difference between the recording "frame index" (position in the record buffer) and the playback index reaches >= 1 (i.e. the playback is one or more frames "behind" the recorded input), the patch switches to the buffered sound from the phase vocoder.
So it essentially allows you to time stretch live sound.
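In pseudocode, the crossover condition is something like this (a minimal sketch; the names record_index, playback_index and speed are my own illustrative choices, not objects from the patch):

```python
def step(record_index, playback_index, speed):
    # The recorder always advances one frame; playback advances by 'speed'.
    record_index += 1
    playback_index += speed
    # Once playback has fallen a whole frame (or more) behind the live
    # input, switch from the live signal to the time-stretched buffer.
    use_buffer = (record_index - playback_index) >= 1
    return record_index, playback_index, use_buffer
```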
I'm having trouble with phase, however. I thought it was due to the phase of the live and sampled sound being switched midway through an FFT frame, but I used sah~ to avoid this.
I can tell there is a problem because of how it sounds. I also implemented a resettable frameaccum~, which improves the sound temporarily after a reset, but the sound gets uglier as the signal is slowed down.
I'd love some guidance, and perhaps an explanation of why the running phase seems to be so problematic. I would have thought that phase differences between frames would have a local effect on the sound, without any long-lasting or cumulative effect.
The 'framePosition' patch may look a little counter-intuitive but the main problem probably stems from the phaseVoc subpatch.
Ok, I've simplified everything to make the problem clearer. FFT data is written to a buffer, and the playback index runs 1 frame behind the record index.
Initially the timbre of the input is fine; then press the space bar to slow the audio down to a speed of 0.9999. Why has the timbre changed so drastically, and how do I fix it?
I haven't time to look at the patch in detail right now, but if you want to implement a phase vocoder for time stretching (I think you posted previously in relation to freezing) then you almost certainly want to phase lock the vocoder. I'd also advise working from time domain data (it'll be *much* smoother) and taking the FFT of the required bit of the time domain data only when you need it, using [fft~] and [index~]....
I'm pretty sure that the following tutorial (and possibly also the first part) will be handy in answering some of these questions, and in reaching a better implementation (I personally favour Cartesian implementations of the phase vocoder, as they are a lot cheaper). I haven't looked all the way through to check that everything is covered, but definitely check it out:
Over time the running phase is essentially problematic for two reasons: floating point inaccuracy is one, but more importantly it accumulates phase separately in each bin (or channel) of the FFT. Thus, phase relationships between bins are not maintained, and over time they become essentially arbitrary. Most simplified explanations of the FFT for audio are really vague about the meaning of phase and about many important properties of the FFT. To understand this you need to understand that (there's a small sketch after these two points):
1 - even a sinusoidal component will excite all bins of the FFT to some extent - the bins near the centre frequency will be phase-related in an important manner.
2 - transients are heavily phase-dependent - consider that a single-sample impulse gives a flat amplitude plot (like ideal white noise) - the difference is that the phase relationships concentrate the energy into a single sample. Transients normally suffer very badly in a phase vocoder.
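Here's a quick numpy illustration of both points (the FFT size and frequencies are arbitrary choices for the demo):

```python
import numpy as np

N = 1024

# (2) A single-sample impulse: perfectly flat magnitude spectrum, so
# all of its "transient-ness" lives in the phase relationships.
impulse = np.zeros(N)
impulse[0] = 1.0
print(np.allclose(np.abs(np.fft.rfft(impulse)), 1.0))  # True

# (1) A sinusoid between bin centres excites many bins; the energy is
# spread around the peak, and those neighbouring bins carry related phases.
x = np.sin(2 * np.pi * 10.3 * np.arange(N) / N)
mags = np.abs(np.fft.rfft(x * np.hanning(N)))
print(np.round(mags[7:14], 1))  # significant energy in several bins near 10

# A running phase accumulated independently per bin lets each bin's
# phase drift separately, which eventually destroys these relationships.
```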
The problem is well known, and various improvements of varying complexity have been proposed. Here's one (which works slightly differently to the Miller Puckette-style "phase locking"):
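For reference, the general idea behind the Puckette-style locking is to stop each bin accumulating phase in isolation by pulling its phase towards that of its neighbours. A rough numpy sketch of that idea (my own illustration of the general technique, in the spirit of what Laroche and Dolson call "loose phase locking" - not the approach linked above):

```python
import numpy as np

def loose_phase_lock(Y):
    """Given the accumulated output spectrum Y for one frame, replace
    each bin's phase with the phase of the complex sum of itself and
    its immediate neighbours. Around a spectral peak this pulls the
    neighbouring bins back into a coherent phase relationship."""
    Z = Y.copy()
    summed = Y[:-2] + Y[1:-1] + Y[2:]
    Z[1:-1] = np.abs(Y[1:-1]) * np.exp(1j * np.angle(summed))
    return Z  # edge bins left untouched for simplicity
```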
Alex
Thanks Alex, that's very much appreciated.
That tutorial doesn't address the phase issue, but is an interesting implementation. I'll have to research this in a little more depth it seems.
Looking at the polar implementation, it seems as though the time stretch is done before the fft and that the fft doesn't do much.
It almost looks like a constrained granular time stretch with an FFT at the end! Anyway, I'm sure it's a well-informed implementation; it's too bad that the results sound no better than my own attempts.
Yes - sorry - the phase locking confusion came from a conversation somewhere else - I thought it had been added to the patch, but I see now that it wasn't.
The fft~ is vital - this implementation *should* be better than the one you posted because, although the phase is still an issue, the window can slide over the input gradually, rather than in steps of the hop size (as in your implementation, because you only store data for each real-time FFT you take). Put simply, at slow speeds storing FFT data means the same frame may be read several times in a row, whereas a time domain + FFT-when-needed approach yields a fresh frame of data each time, better representing the changing sound over time.

When I first saw this kind of implementation it confused me, but any decent phase vocoder works this way - it's just not as conceptually straightforward. Actually, looking at the patches now, the window can theoretically slide by less than a sample, as play~ can read fractional sample positions, although personally I am not usually a fan of play~ (due to accuracy issues - I have an older implementation by dudas somewhere that uses index~, but that is at least equally limited in terms of accuracy, so I suppose play~ is probably better here).
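To make the difference concrete, here's a minimal offline numpy sketch of the time-domain approach: the analysis read position advances by speed * hop per output frame (so it can land anywhere in the buffer, not just on stored frame boundaries), and the FFT is taken on demand. The parameter values and the plain per-bin running phase are my own illustrative choices - there's no phase locking here, so it will still exhibit the drift discussed above:

```python
import numpy as np

def stretch(x, speed, n=1024, hop=256):
    win = np.hanning(n)
    k = np.arange(n // 2 + 1)
    expected = 2 * np.pi * k * hop / n       # per-bin phase advance for one synthesis hop
    out = np.zeros(int(len(x) / speed) + n)
    phase = np.zeros(n // 2 + 1)
    prev = None
    pos, opos = 0.0, 0
    while pos + n + 1 < len(x) and opos + n <= len(out):
        X = np.fft.rfft(win * x[int(pos):int(pos) + n])  # FFT on demand (play~ would interpolate)
        if prev is None:
            phase = np.angle(X)
        else:
            # Deviation from the expected phase advance over the (fractional) analysis hop...
            d = np.angle(X) - np.angle(prev) - expected * speed
            d -= 2 * np.pi * np.round(d / (2 * np.pi))   # wrap to [-pi, pi]
            # ...rescaled to the synthesis hop and accumulated per bin (the running phase).
            phase += expected + d / speed
        prev = X
        out[opos:opos + n] += win * np.fft.irfft(np.abs(X) * np.exp(1j * phase))
        pos += hop * speed                   # analysis hop shrinks as you slow down
        opos += hop                          # synthesis hop stays fixed
    return out

# e.g. y = stretch(x, 0.2) for a 5x stretch
```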
To hear this difference, test the patches at slower speeds (perhaps 5 times slower or more) - I would expect the difference to become fairly obvious.
Anyway, the paper covers phase locking of both types. It is worth saying, however, that pfft~ does a zero-phase fft~, so we expect consecutive bins representing a sinusoid to be in phase with one another, rather than 180 degrees out of phase, as is the case with the standard FFT (which is not zero phase).
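You can check the zero-phase point in a couple of lines of numpy (fftshift rotates the window centre to sample 0, which is one way of producing a zero-phase frame - pfft~ does the equivalent for you):

```python
import numpy as np

N, k0 = 1024, 100                      # sinusoid exactly on bin k0
x = np.cos(2 * np.pi * k0 * np.arange(N) / N) * np.hanning(N)

std = np.fft.rfft(x)                   # standard framing: window centred at N/2
zp = np.fft.rfft(np.fft.fftshift(x))   # zero-phase framing

print(np.round(np.angle(std[k0 - 1:k0 + 2]), 2))  # main-lobe bins alternate by ~pi
print(np.round(np.angle(zp[k0 - 1:k0 + 2]), 2))   # all three bins roughly in phase
```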
Alex