The Phase Vocoder – Part I


    The Phase Vocoder is a tool used to perform time-stretching and pitch-shifting on recorded sounds. Its name derives from the early "vocoders" (contraction from "voice encoders"), which used a set of bandpass filters in parallel over many frequency bands, to crudely analyze and reconstruct speech. The infamous hardware vocoders of the 1960s and 1970s (as used, for example, by Wendy Carlos in the soundtrack of Kubrick's film "A Clockwork Orange") were based on this technology, and allowed the spectrum of one sound (in the Carlos example a synthesizer) to be controlled by that of another (a voice). In the Max/MSP examples folder, there is an example by Marcel Wierckx called "classic_vocoder.pat" (located in the "effects" sub-folder) which shows how this traditional vocoder works. Unlike the classic vocoder, which is based on bandpass filters, the phase vocoder is based on a Short-Term Fourier Transform (STFT) – a Fourier Transform performed sequentially on short segments of a longer sound – and in practical use has little to do with the hardware vocoders of the 1960s and 1970s. The phase vocoder can, however, be considered a type of vocoder because the Fourier Transform returns a set of amplitude values for a set of frequency bands spaced evenly across the sound's spectrum, similar to the older vocoder's set of bandpass filters. Of course the phase vocoder, as its name suggests, not only takes into account the amplitude of these frequency bands, but also the phase of each band.
    If you are not familiar with the Fast Fourier Transform (FFT) as it is used in MSP, we suggest you review MSP Tutorials 25 and 26 , which deal with the fft~/ifft~ objects and the pfft~ object set, respectively.

    Our Starting Point

    In MSP Tutorial 26 on the pfft~ object, we are shown a simple phase vocoder patch which analyzes an incoming sound and records the FFT frames into a buffer~ object. The data in the buffer~ object is then reconstructed so that a basic sort of time-stretching (and compression) may be performed on the recorded data. Although it works as advertised and introduces the basic concepts, it lacks some of the flexibility of a more standardly-designed phase vocoder, which lets us transpose our original sound, as well as start at any point in the original sound.
    Let's review the STFT and basic phase vocoder design. A Short Term Fourier Transform (STFT) is a series of Fourier Transforms, usually spaced evenly in time:
    Fig. 1 – Diagram of the Short Term Fourier Transform (STFT)
    Fig. 1 – Diagram of the Short Term Fourier Transform (STFT)
    Whereas the old fft~ and ifft~ objects just perform the FFT, the pfft~ object actually goes one step further and performs an STFT, as shown above. The input signal is sliced into frames, which start at a regular offset known as the hop size. The hop size determines the overlap, which can be defined as the number of frames superimposed on each other at any given time. If the hop size is half the FFT frame size, the overlap is 2, and if it is one-fourth the frame size, the overlap is 4. Theoretically, an STFT needs a hop size of at least 4 and although a hop size of 2 can be made to work for many musical purposes, the phase vocoder will sound better with an overlap of 4.
    If our FFT size is 1024, our FFT will give us 1024 frequency bins (these would be called bands in filter terminology) from DC (0Hz) to the sampling frequency. Because the data above the Nyquist Frequency (half the sampling rate) is a mirrored copy of the data below the Nyquist Frequency, the pfft~ object by default eliminates this redundant data for efficiency's sake, and reconstructs it for us before performing the inverse FFT. (In our phase vocoder patch we will need to tell the pfft~ object to override this elimination of data for reasons that will become clear later in this article).
    Fig. 2 – Diagram of the FFT Spectrum
    Fig. 2 – Diagram of the FFT Spectrum
    For each bin, the FFT provides us with coordinates which can be converted to amplitude and phase values. (This is covered in MSP Tutorials 25 and 26). The difference between the phase values of successive FFT frames for a given bin determines the exact frequency of the energy centered in that bin. This is often known as the phase difference (and sometimes also referred to as phase derivative or instantaneous frequency if it's been subjected to a few additional calculations).
    If we record a succession of FFT frames, then play them out of order, the differences between the phase values in the bins will no longer produce the correct frequency content for each bin. Therefore, in order to "reconstruct" a plausible playback of re-ordered FFT frames, we need to calculate the phase difference between successive frames and use it to construct a "running phase" (by simply summing the successive differences) for the output FFT frames.

    Getting Sound into the Phase Vocoder

    The first thing we need to do is to be able to is to access our sound file directly in our pfft~ subpatch and perform an FFT directly on the sound, without sending it into any fftin~ inlets of the pfft~. Why do we need to do it this way? Since the whole point of a phase vocoder is to perform time stretching and compression on an existing sound, we need to be able to access that sound directly via a buffer~ object and perform the FFT at any given playback speed and overlap. In order to have independent transposition and playback speed, a phase vocoder needs independent playback of the sound to be transformed for each FFT overlap (4, in our case). Each playback frame needs to be synchronized with its respective FFT frame. Therefore we cannot send a single copy of the sound we wish to transform into the pfft~, but need to play a slice of the sound into each of the four overlap frames. Since we cannot send one slice of the sound into the pfft~ object (it keeps track of its input history and considers its input to be continuous sound, not individual slices of sound), we cannot use the fftin~ object inside the pfft~, but must do our FFT processing using the fft~ object. The fft~ object performs a full-spectrum FFT (i.e. mirrored), so we consequently need to make the pfft~ object behave the same way, so the fft~ can work in sync with the FFT frames processed inside the pfft~ object. We need to make a few changes to the default way that pfft~ deals with the STFT.
    We first need to tell the pfft~ object to process full-spectrum FFT frames, instead of the default "spectral frame" which is half the FFT size (up to half the Nyquist). This is easily accomplished by adding a non-zero fifth argument to the pfft~ object. Because the full-spectrum argument is the fifth argument, we must supply all the other arguments before it, including the fourth argument, the start onset, which will be set to the default value of zero.
    Fig. 3 – Additional Full-Spectrum Argument to the pfft~ Object
    Fig. 3 – Additional Full-Spectrum Argument to the pfft~ Object
    Next, because the fftin~ and fftout~ objects perform the FFT calculation at zero phase with respect to the FFT (the first sample in the windowed frame sent to the FFT is the middle of the window), and the traditional fft~ and ifft~ objects perform the FFT 180 degrees out of phase, we need to make sure any fftin~ and fftout~ objects in our patch have the same FFT phase offset used in the fft~ objects. We do this by specifying a phase offset to the fftin~ and fftout~ objects. A phase value of 0.5 means 180 degrees out of phase, so this is the value we want. While we do not need the fftin~ object in our pfft~, we can still make use of the convenience of the fftout~ object in order to get the properly windowed and overlap-added result out of our pfft~. The automatic windowing in the fftout~ object should behave like our manual windowing with the fft~ objects.
    Fig. 4 – Zero-Phase FFT Window
    Fig. 4 – Zero-Phase FFT Window
    Now we are ready to start constructing our phase vocoder sub-patch.

    Accessing our Pre-Recorded buffer~

    We always need to access our buffer~ in two places – at the current FFT frame location, and at the location of the previous FFT frame of the buffer. We can use the index~ object to access the buffer~, just as we might use the index~ object in a simple playback patch. (Note that we are accumulating samples at the "normal" playback rate for now.) And because we're manually designing the input part of our STFT ourselves, using the fft~ object, we need to window the signal we read from the buffer~ before sending it to the fft~ objects. The patch uses a Hanning window (the default window used by pfft~).
    Fig. 5 – Reading Frames from a buffer~ inside a pfft~
    Fig. 5 – Reading Frames from a buffer~ inside a pfft~
    Using two fft~ objects that are a quarter frame apart from each other, we calculate two FFT frames (the present frame and the previous). Using cartopol~, we convert from cartesian coordinates to polar coordinates to get amplitudes and phases, and then simply subtract the phases of the previous frame from the present frame to get phase differences for each bin. Just as in the simplified phase vocoder in MSP Tutorial 26 , we use the frameaccum~ object to accumulate the phase differences to construct the "running phase" for output.
    Additionally, at the top of the patch we no longer have a fixed playback rate (it was set to 0.25 in the previous example image), but have added the capability to time-stretch (or compress) the sound by accepting a (variable) sample increment from the main patch.
    Fig. 6 – The Phase Vocoder pfft~ Sub-Patch
    Fig. 6 – The Phase Vocoder pfft~ Sub-Patch
    Notice that now the "previous" frame that we read from the buffer~ might not actually be the frame that we previously read as the "current" frame! This is the whole point of the phase vocoder – we are able to read frames in any location in the buffer and at any speed, and by simultaneously reading the frame one hop-size before the "current" frame (regardless of the speed at which we're reading the buffer~) we can obtain the "correct" phase difference for the "current" FFT frame!

    Cartesian and Polar Coordinates

    The FFT algorithm outputs transformed data in cartesian (x, y) coordinates. These coordinates are often referred to as the real and imaginary parts of a complex number. The amplitude and phase values that we normally associate with the FFT are the polar coordinates of the (x,y) values. The polar coordinates of cartopol~ conveniently give us amplitude and phase information. While working in polar coordinates is convenient, Cartesian to polar conversion of phase information uses the trigonometric math function arctan~, which is computationally very expensive. Avoiding the arctan calculation, which must be calculated for each of the eight FFTs used in a single phase vocoder, by using less intuitive Cartesian math means working with complex math (the complex multiply and divide instead of the simple addition, subtraction, multiplication, and division needed in the polar domain). In addition to the issue of computational efficiency, for reasons of accuracy, converting to and from polar coordinates can introduce small amounts of error which slowly accumulate over time and probably should be avoided. Finally, there are some additional features that improve the quality of the phase vocoder which we will see in Part II of this article (scheduled in two months' time), for which it is preferable to use complex math on cartesian coordinates instead of calculating on the polar coordinates derived from them. So we need to learn a little complex math and how it relates to the polar calculations we're performing on the amplitude and phase values.
    A complex multiply multiplies the amplitudes and adds the phases of two complex signals (i.e. signals which have a real and imaginary part – such as the two outputs of the fft~ or fftin~ objects).
    Fig. 7. – Complex Multiplication
    Fig. 7. – Complex Multiplication
    A complex divide divides the amplitudes and subtracts the phases.
    Fig. 8. – Complex Division
    Fig. 8. – Complex Division
    What is important about these complex multiplication and division sub-patches is what it does to the phases of our two complex signals. Complex multiplication adds phases, whereas complex division subtracts phases. Therefore we can calculate both the phase difference as well as accumulate the running phase of our output signal using these complex math operators. Because we only care about the phase in our phase vocoder patch (remember in the polar version shown previously we did not modify the amplitudes), we can make a further optimization to the complex division and remove the denominator by which both real and imaginary parts are scaled:
    Fig. 9. – Complex Phase Subtraction Based on Division
    Fig. 9. – Complex Phase Subtraction Based on Division

    The Phase Vocoder Using Complex Math

    Now we're ready to use these to construct a phase vocoder which uses complex multiplication and division. The first part of the patch will remain the same – we are only changing what happens between the fft~ objects that read the buffer~, and the fftout~ at the bottom of the patch.
    Fig. 10. – Phase Vocoder using Complex Math
    Fig. 10. – Phase Vocoder using Complex Math
    Notice how we have to use the send~ and receive~ objects to manually feed-back the previous output frame so we can use it to accumulate the running phase. Remember that a receive~ object "receives" signal data from its corresponding send~ object with a delay of one signal vector in order to avoid signal loops. The signal vector of a pfft~ is the size of the FFT frame, and since we are running a full-spectrum pfft~, the delay is 1024 samples, or one FFT frame.
    Also notice how we invert the amplitude values so we can produce a frame which contains JUST the phases, with the amplitudes set to 1.0. (The inverted amplitude, or magnitude, multiplied by the real and imaginary values essentially cancels out the amplitude.) This is so we can use our complex divide and multiply and not affect the amplitude values of the "current" frame. In order to make sure we don't divide by zero, we add a very small offset to the complex number before calculating the inverse amplitude. This does not affect the sound perceptibly.
    At this point you might want to compare both the polar and cartesian versions of the phase vocoder patch. The polar version which we first showed you is conceptually clearer at first glance. However, note the difference in CPU usage of the two patches. You may well decide that the extra effort in designing the phase vocoder using complex arithmetic is worth the payoff!
    In listening to the phase vocoder patch, you may have noticed a somewhat metallic or slightly reverberated quality to the phase vocoder's sound as it runs. This is caused by phase inconsistencies from bin to bin. Since any given FFT bin might be affected by more than one component of a sound—for instance, a non-harmonic sound with two very close inharmonically related frequencies will struggle for precedence in a bin that they share—this can create a slightly reverberated quality to the sound. In Part II of the article we will learn some tricks to minimize this effect, as well as look at a buffer~ reader that allows for flexible and independent transposition and stretching at any location in the buffer~.


    To sum up what we've just covered, the basic structure of the phase vocoder requires four overlapping pairs of buffer~ reads and four overlapping pairs of FFTs. The buffer/FFT pairs are exactly one quarter frame apart (one hop). The FFT pairs allow us to calculate the phase differences between where we are in a sound and where we were a quarter frame ago (one hop). Then, for each of the four pairs, we simply add their respective phase difference to the previous phase from a full frame before, accumulating our running phase. We do all this with Cartesian coordinates (real and imaginary values making use of complex arithmetic) using fft~ objects inside a pfft~ that is running in full-spectrum mode with a 180-degree phase offset for the fftout~ object so that it runs in phase with the fft~ objects.


    The authors would like to thank Miller Puckette for developing what was more than likely the first real-time phase vocoder, and most certainly the first real-time phase vocoder using Cartesian coordinates!

    • Oct 19 2017 | 12:39 am
      With a transposition factor and speed factor each of 1, shouldn't the output be exactly identical to the sound in the buffer? For me, it starts out this way, but if I am retriggering the playback a bunch, each retrigger sounds as though it has a different FIR comb filter on it. Is this normal when doing a bunch of retriggers, or is there a way to fix it?
    • Nov 20 2018 | 6:53 pm
      Im new to this topic, but extreme time stretching(like paulstretch) is not possible with this, right? I tried it but the quality is not that great. I thought about iincreasing the FFT Size to 4048 for isntance, but that doenst work. Can anybody help?
    • Feb 22 2019 | 5:27 pm
      Peter, this patch would be a good basis for extreme time stretching alla paulstretch. It's not possible because the [fft~] object doesn't allow for a fft size greater than 4096.
    • Feb 22 2019 | 5:32 pm
      Check out the second part of this tutorial: in Figure 7, you'll see the noise~ (random numbers) added to the "phase information". To do "paulstretch", you would not add noise to the phases, but just replace the phase information with random values.
    • Oct 11 2019 | 6:35 pm
      hi everyone,
      two questions:
      1. does anyone have the patches for this that are not for max 4.6? I know that pt. II has them included next to the legacy versions, but I can't find them for pt. I 2. as asked by Chris: is there a possible way to perform Phase Vocoding on a live/real time input?
      I've been looking for real time timestretch/pitch shifting in Max for hours now and can't seem to find anything good. I'd very much like to write my own but if the tutorials aren't updated I don't really see how I can learn it ... :/
      any pointers are much appreciated
    • Oct 11 2019 | 7:55 pm
      if you add .mxb to the files (or of you use pcontrol or the "open" menu) they will open just fine on win-max-later-than-v4. phase vocoding works in realtime just as any other approach for stretching and pitching, just with even more latency.
    • Oct 11 2019 | 7:59 pm
      What problems do you encounter with these patches in Max 8? They work fine for me.
    • Oct 11 2019 | 9:08 pm
      @jfc no it's just because they're named .pat and my system (windows) didn't indicate that you can just open them like that
      @roman ah yes thanks that did the trick...
      "phase vocoding works in realtime just as any other approach for stretching and pitching, just with even more latency." mhhhh, ok.... what would be a good implementation for a real-time buffer recording/playback mechanism that I could use in combination with this? or more generally: what's a decent yet not too difficult to program real-time timestretch/pitchshift implementation?
    • Oct 11 2019 | 11:36 pm
      These patches allow you to do timestretching in real time: looks to me like if you add a [record~] object to record in the buffer (instead of using the 'replace' message), you will be able to play back the sound timestretched even while it's still recording in the buffer. Another approach consists in recording directly the spectral data instead of the waveform, like in this set of examples: Another solution would be to use a granulation approach (see Max -> Extras -> Examples Overview -> MSP -> Sampling -> granularized). Same thing, you'll need to use the [record~] object to record in the buffer~. You will find other granular engines that might serve you better for timestretching, depending on the kind of sound you're looking for. Another trick would be as follows, where you avoid dealing with the algorithms yourself, and just harness the power of [pitchshift~]:
      Max Patcher
      In Max, select New From Clipboard.
    • Oct 11 2019 | 11:39 pm
      difficulty is very subjective. i find all of them relatively complicated to build- and i think approaches other than "FFT" or "granular" doesnt make much sense to write in plain MSP.
    • Oct 13 2019 | 1:48 pm
      thanks to you both for getting back to me about this.
      @jfc: thanks for the patch and the considerate reply. using groove~ was my initial design but I abandoned it because the object's own timestrech attribute requires the filling of an internal buffer that makes it essentially useless for real-time applications - never thought of combining it with pitchshift~! using it this way is a really conventient solution but I feel it introduces more artifacts than a phase vocoder when slowing the sample down considerably, so I think I'll go for the pvoc. about time to dive into it anyways.
      your tutorials are on my to do list; for my current project I need a relatively quick fix so no time to look at them now but will start as soon as I'm done. they look really useful indeed. thanks again!
    • Jan 09 2022 | 9:53 am
      There is a small error in the subpatches: sah~ should sample when count == 0. Now it samples when count == 1. Insert ==~ 0 between count and the right inlet of sah~. No audible difference to me, but still... Thanks for this tutorial ! A good refresher :)
    • Jan 09 2022 | 9:56 am
      Another minor detail in the comments of complex division: it should be (ac + bd) and (ad - bc) i.