Extremely precise sonogram ?

Mar 20 2010 | 10:07 am

(posted on the jitter forum, but it’s for MaxMSP forum)

I want an extremely precise sonogram. Not just 2 or 3 times more precise than the current one: I want a million-pixel sonogram, precise in both time and frequency. I want to see the fine shape of each harmonic, not blurry paté…

I don’t care if it takes 95% of the CPU, or even if it needs 2 minutes to compute a 2-second visualization.

How can I achieve this? Jitter? 32x upsampling followed by an FFT? What about the wavelet transform* ?
Is there any software that already does this?

Note: I tried some sonogram software on Mac; the results weren’t much prettier than [sonogram] in Max…


Mar 20 2010 | 8:30 pm

A million-pixel sonogram. Let’s see. Suppose we work with a sampling rate of 44100 Hz. 1024 frequency bins over a range of 22050 Hz means an FFT size of 2048 samples, i.e. 46 ms.
If you want 1048576 frequency bins over 22050 Hz, that means an FFT size of 2097152 samples, i.e. about 47.5 seconds. Definitely possible, but not straightforward in Max.
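For anyone who wants to check this arithmetic, here is a minimal Python sketch. The helper name `fft_resolution` is made up for illustration, and the 44100 Hz default is simply the sampling rate assumed above.

```python
def fft_resolution(fft_size, sample_rate=44100.0):
    """Return (bins up to Nyquist, bin width in Hz, window length in seconds)."""
    n_bins = fft_size // 2
    bin_width_hz = sample_rate / fft_size
    window_s = fft_size / sample_rate
    return n_bins, bin_width_hz, window_s


for size in (2048, 2097152):
    n_bins, bin_hz, win_s = fft_resolution(size)
    print(f"FFT size {size}: {n_bins} bins, {bin_hz:.4f} Hz wide, window of {win_s:.3f} s")
```

Whatever the size, the bin width in Hz multiplied by the window length in seconds stays at 1: narrowing one widens the other.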

Oh, but you want both great frequency AND great time resolution. You want the butter and the money for the butter too, as the French saying goes. Well, this is simply impossible. It’s the audio / wave equivalent of Heisenberg’s uncertainty principle.

But by choosing a nice analysis window size, you can get really nice sonograms, within Max, or with free software like Raven Lite and others. If you keep a window size of 256 for instance, you might not get what you want.

Mar 21 2010 | 2:46 am

Didn’t SoundHack have something like 265,000 frames? But I don’t see how more frequency bands than twice your monitor height would be useful.

Mar 21 2010 | 3:24 pm

on a related note, I would absolutely love to know what all this is. Their sonograms look amazing.

Mar 21 2010 | 3:41 pm

izotope tools are wonderful. But even they didn’t break the uncertainty principle. For instance, from the page linked by AudioMatt:

"Auto-Adjustable STFT…if you zoom in horizontally (time) you’ll see that percussive sounds and transients will be more clearly defined. When you zoom in vertically (frequency), you’ll see individual musical notes and frequency events will appear more clearly defined."

Yes, that’s exactly the point, you can get either a good time or a good frequency resolution. You could work on replicating their idea (linking to zoom level) in Max; you could even work on making a "multi-resolution" analysis (from their page: "spectrogram with better frequency resolution at low frequencies and better time resolution at high frequencies"): you could calculate the spectra with two different FFT sizes, then use the data from one or the other when displaying the low or high frequencies…
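Outside Max, the two-FFT-size idea could be sketched roughly as follows with NumPy. This is only an illustration under assumptions: the function names, the FFT sizes (4096 and 256), the hop of 64 samples and the 2000 Hz crossover are all arbitrary choices, not anything taken from iZotope.

```python
import numpy as np


def stft_mag(x, fft_size, hop):
    """Magnitude STFT with a Hann window; frame i is centred on sample i * hop."""
    window = np.hanning(fft_size)
    padded = np.pad(x, fft_size // 2)            # zeros on both sides
    n_frames = 1 + len(x) // hop
    frames = np.stack([padded[i * hop:i * hop + fft_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) / window.sum()


def multires_sonogram(x, sr, big=4096, small=256, hop=64, crossover_hz=2000.0):
    """Use the big FFT below the crossover and the small FFT above it."""
    low = stft_mag(x, big, hop)                  # fine frequency, coarse time
    high = stft_mag(x, small, hop)               # coarse frequency, fine time
    freqs_big = np.fft.rfftfreq(big, 1 / sr)
    freqs_small = np.fft.rfftfreq(small, 1 / sr)
    # Put the small-FFT rows on the big-FFT frequency grid so both can share one image.
    high_on_big = np.stack([np.interp(freqs_big, freqs_small, row) for row in high])
    image = np.where(freqs_big[None, :] < crossover_hz, low, high_on_big)
    return image, freqs_big


sr = 8000
t = np.arange(sr // 2) / sr
image, freqs = multires_sonogram(np.sin(2 * np.pi * 440 * t), sr)
print(image.shape)                               # (time frames, frequency bins)
```

Each region of the image still obeys its own time/frequency tradeoff; the sketch only chooses a different tradeoff per frequency range.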

Mar 23 2010 | 3:14 am

Sorry for the late response, but I was looking a bit more at RavenPro and at your quite interesting FFT/Jitter tutorials on the share pages.

First I have to explain a little more why I want deep horizontal AND vertical resolution: I’m working on additive synthesis and would like to examine fine details in the sounds of acoustic instruments like bassoon, clarinet, contrabass, etc., to get inspiration on ways to reproduce them in an expressive 200-harmonics-with-blur-factors additive synthesis system for a two-pen Wacom screen. (see below**)

>> Oh, but you want both a great frequency AND a great time resolution.
>> You want the butter and the money for the butter too. Well, this is simply impossible.
>> It’s the audio / wave equivalent of Heisenberg’s uncertainty principle.

Of course I don’t agree with this.
It is only true for the standard FFT algorithm, but not necessarily true for every possible time/frequency view: listen to the attached mp3 below, played from your FFT patch "3-record-play-speed-control". OK, there is a Jitter FFT view, and we listen to the sound computed back from that view. This is a destructive transformation: the rhythmic fidelity is poor (like you said, we have the frequency precision, so we don’t have the time precision). This sound is just "a vague memory of my sound", thus the graphic view is also only a vague memory. At this point I think that, in spite of his uncertainty principle, Heisenberg would rather logically have agreed with me that if a "blurry paté" view is only a vague memory of a sound, then something is missing…

If building any sound by additive synthesis is virtually possible, then there must exist in the universe a way to decompose any sound, without anything missing.

option a :

"multi-resolution" in the way that Izotope explain, i’m not sure, but maybe with a 3D matrix, the third dimension representing 12 different fft sizes from 16 to 32768. Then, add or multiply (or something between) the 12 different planes of this third dimension.

option b :

Wavelet transform ? On the wikipedia page that i linked above, they say :
"[about fft:] A narrower window gives good time resolution but poor frequency resolution. (…) This is one of the reasons for the creation of the wavelet transform, which can give good time resolution for high-frequency events, and good frequency resolution for low-frequency events."
Thanks Vanille for the link to [wavelet~]. Well, I don’t understand how to use it to make a sonogram… the only thing I was able to make was a pitch-stretch (in attachment). Has anybody seen a nice sonogram made from a wavelet transform?

option c :

Playing with the sampling rate? Well, I feel that if you upsample, the frequency resolution should go down, and when you downsample, the frequency resolution goes up …for the preserved low frequencies. (Plus another idea for the high frequencies, not sure: highpass filter > freqshift~ (down) > downsampling => better frequency resolution for high frequencies too?)

I’m not sure, maybe a mix between option a and option c. Hmm, a bit of a gas factory… (from the French expression "une usine à gaz", meaning an overcomplicated contraption)
I was hoping that someone already had a nice solution. It is not that I’m lazy, but I have so many things to work on. I’m not sure I’ll start on this now, plus I’m not so experienced with Jitter.

Anyway, cool to read people interested by this topic,


** About additive synthesis for a good imitation of natural sounds: I want to understand exactly where in the period of the waveform the energy of each frequency lies, and how all this changes over time during the attack and the sustain of the sound. Trying with my example "funny_additive-synth" ( ), I see that the phase information is dramatically important for low-frequency instruments (less important for high-frequency instruments). Also, I’d like to see, for a soft violin for example, how blurred the harmonics are, and which ones, etc. (additive synthesis from resonators can make some interesting blurred harmonics: mp3 example in: )

P.S : "million pixels" view : it was only a way of speaking, i didn’t mean "one million frequency bins", i think 16384 or 32768 would be enough.

Mar 23 2010 | 2:01 pm

"Of course i don’t agree with this."

Well that’s a nice opinion to hold – but saying it doesn’t make a difference to whether or not you can have both.

There are some ways of getting better resolution (look up LORIS and time-frequency reassignment) – wavelets have their own problems – I’m not an expert on them though, but I know enough to know that they aren’t some kind of magic bullet solution to the tradeoff problem.

"If building any sound by additive synthesis is virtually possible, then there must exist in the universe a way to decompose any sound, without anything missing."

Yes – it’s called an FFT – if you take an FFT and then do the iFFT, you reconstruct the signal exactly (except for calculation error) – the problems occur when you try to extrapolate this data further into sinusoidal tracks. This is the difficult bit – the analysis of the FFT data.
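The exact-reconstruction claim is easy to try outside Max. A minimal NumPy sketch, where the test signal (random noise) and its length are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)                 # arbitrary test signal

spectrum = np.fft.rfft(signal)                     # forward FFT of a real signal
rebuilt = np.fft.irfft(spectrum, n=len(signal))    # inverse FFT

max_error = np.max(np.abs(signal - rebuilt))
print(f"largest reconstruction error: {max_error:.2e}")
```

The remaining error is floating-point rounding only; nothing about the signal is lost until the FFT data is interpreted or modified.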

However – sinusoidal tracks aren’t going to be enough to get you realistic sound – you probably need to do some kind of noise synthesis too (a la LORIS or ATS.)

You probably need clever tracking algorithms too, like the one used in Miller Puckette’s sigmund~ or the stuff explored in Jez Wells’s PhD – a good peak detection algorithm (possibly with time/frequency correction or reassignment), then some kind of noise analysis/reassignment – a good peak tracker and THEN an additive synth module.

So the long and short of it is this stuff is complicated and more or less at the forefront of what is going on. There isn’t anything I’m aware of that is good to go for MaxMSP and of the kind of quality I would be interested in – I’m sure there are guys out there who have their own stuff, or are using LORIS in max or whatever, but that is a big coding project to take on in a lower level language like C. A year ago I started developing some tools that were intended to eventually allow me to do some really good additive synthesis in MaxMSP, but it got too complicated and I’ve put the project on hold.

I wasn’t even at the stage of writing the peak detection algorithm (although I’ve written simple ones before), or the peak tracker, or look at noise analysis – I was building a framework to allow me to do this kind of processing.

Anyway, it’s not really clear how you’re going to use the analysis data you get. I’d say you probably don’t want a visual anyway, but rather you should use a numeric readout. You might want to check out sigmund~ for a start (it’s got some neat tricks for sinusoid detection in it) and start googling the other stuff.

As for ideas about changing sampling rates, I’m just doing this quickly in my head, but I think in theory the resolution problem doesn’t change – remember that you are representing twice the frequency range:

For 4096 sample FFT @ 48kHz

Bin width = 24000 / 2048 = 11.71… Hz
Window Length = 4096 / 48000 = 85.3… ms

For 8192 sample FFT @ 96kHz

Bin Width = 48000 / 4096 = 11.71… Hz
Window Length = 8192 / 96000 = 85.3… ms

You’ll see from the above that the two situations are equivalent, just in the second you have twice the representable frequency range…..
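The same comparison as a few lines of Python, in case someone wants to try other combinations. The function name is hypothetical; it only restates the two formulas used above, with the window length expressed in milliseconds.

```python
def bin_width_and_window(fft_size, sample_rate):
    """Return (bin width in Hz, window length in ms) for a given FFT size and rate."""
    nyquist = sample_rate / 2
    bin_width_hz = nyquist / (fft_size / 2)
    window_ms = 1000.0 * fft_size / sample_rate
    return bin_width_hz, window_ms


a = bin_width_and_window(4096, 48000)
b = bin_width_and_window(8192, 96000)
print(a, b, a == b)
```

Doubling the sampling rate and the FFT size together leaves both numbers unchanged; the extra bins all sit above the old Nyquist frequency.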


Mar 23 2010 | 5:39 pm

Pointing again to another crazy FFT thing mentioned by AlexHarker: try a very, very small FFT size. Not even 256, maybe just 64. What do you think you will get from a signal analyzed with an FFT, then re-synthesized with an IFFT (like going in and out of a pfft~)? Very poor quality? No! It will be almost perfectly the same. That’s the never-ending power of Fourier. But, as AlexHarker says, when you want to play with the internal data before re-synthesis, you’ll have problems. Also, although the totality of the information is there, even with a very small window size, it doesn’t mean that you can OBSERVE both time and frequency at high resolution.
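Here is a rough NumPy sketch of that experiment, assuming a Hann window applied on the way in and on the way out with an overlap of 4. It only loosely imitates what happens inside a pfft~; the function name and the test signal are made up for illustration.

```python
import numpy as np


def stft_roundtrip(x, fft_size=64, overlap=4):
    """Windowed FFT -> IFFT -> windowed overlap-add, with no spectral processing."""
    hop = fft_size // overlap
    n = np.arange(fft_size)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / fft_size)    # periodic Hann
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for start in range(0, len(x) - fft_size + 1, hop):
        spectrum = np.fft.rfft(x[start:start + fft_size] * window)
        # This is the point where one would edit the spectral data.
        frame = np.fft.irfft(spectrum, n=fft_size) * window
        out[start:start + fft_size] += frame
        norm[start:start + fft_size] += window ** 2           # overlap-add gain
    return out, norm


sr = 44100
t = np.arange(sr // 10) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)

y, norm = stft_roundtrip(x)
valid = slice(64, len(x) - 128)             # skip the partially covered edges
error = np.max(np.abs(x[valid] - y[valid] / norm[valid]))
print(f"largest error away from the edges: {error:.2e}")
```

Even with 64-sample frames the resynthesis matches the input to rounding error, which is a separate matter from what a picture built from those frames can show.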

A way to formulate the uncertainty principle is to say that the more we locate a signal in the time domain, the less we can locate it in the frequency domain, and vice versa.

Since the Uncertainty Principle is so recognized by scientists, if you manage to prove that wrong, you might be eligible for the Nobel prize, no kidding.

Mar 23 2010 | 9:30 pm

>> "the more we locate a signal in the time domain,
>> the less we can locate it in the frequency domain, and vice versa."

OK, I can agree with this sentence, but at a level far beyond the simple FFT. Of course, in a 5-sample sound it will be hard to find lots of frequencies. There is a point where this is right, but you cannot use that fact to object that I’ll just get your blurry paté for dinner.

I think you’re kind of contradicting yourself in the two following sentences you wrote: if the point of the FFT is to "play with the internal data before re-synthesis", and as you say we "have problems" doing so, then where is the "never ending power of Fourier"?

>> try a FFT size very very small. (…) It will be almost perfectly the same.

I noticed this too, so what? Just because "the totality of the information is there" doesn’t mean that the data, especially the phase data, means something to humans, or that we have to grovel to the FFT as the ultimate possible sonogram.
By the way, I don’t get why my "play-fft-size-4096.mp3" above, from your FFT patch "3-record-play-speed-control", sounds so different from a simple fftin>cartopol>poltocar>fftout, even though I played it at normal speed. Do you know?

From the AlexHarker remarks :

>> sinusoidal tracks aren’t going to be enough to get you realistic sound –
>> you probably need to do some kind of noise synthesis too

True, we would need an infinite number of sinusoids to make really nice noise. In fact the better way to do additive synthesis for noisy instruments that have blurred harmonics, like flute or violin, is to do the opposite: start from white noise, then filter it with resonators~ (like I did in )
…Then I’ve got this intuition:
To build the ultimate sonogram where we’ll be able to FULLY CLEARLY distinguish noise from pitched content, I guess we would virtually need to mix the data from an infinite number of FFTs, each one using a different FFT size, not only powers of 2. An equivalent, and maybe more efficient, way to go would be to resample the sound at different speeds using groove~, and then blend (or multiply?) all the "jitter-FFTs" from each sound.

>> …sampling rates I’m just doing this quickly in my head but
>> I think in theory the resolution problem doesn’t change

You’re right! I was wrong about my "option c" above. The sampling rate doesn’t change much.

Thanks for pointing me to sigmund~! It is a pretty nice object. I’m not sure how I could use it to draw a sonogram, but I’ll need it for another application! About LORIS: it looks interesting, but I’m not into C++ and so I’m not able to try it.

>> wavelets have their own problems

You’re probably right here, so if my options b and c are gone, I should try "option a" (and not only with powers of 2, as my intuition tells me).

Jean-Francois, if I manage to master these damned Jitter objects, I don’t give your damned uncertainty principle long to live…


Mar 24 2010 | 10:39 am

"As the point of FFT is to "play with the internal data before re-synthesis""

Not necessarily. The FFT can be used for that, but its purposes are far more general – if you start googling you’ll see that many engineers use FFTs as analysis tools in fields that have nothing to do with sound.

"To build the ultimate sonogram where we’ll be able to FULLY CLEARLY distinguish noises from pitched content, i guess we would virtually need to mix the data from an infinite number of FFT each one using a different FFT size, not only powers of 2"

Not really – there are other ways to tell noise from deterministic content (by phase calculation, lobe width, median filtering etc.) – The problem has nothing to do with the FFT size – the problem is that in a noisy signal, peaks appear in single FFT frames that often look almost identical to sinusoidal components, but aren’t.

Another problem that no-one has mentioned yet is that a single sine wave will excite all the bins in the FFT to some extent (assuming it is not EXACTLY on a bin frequency) – windowing improves the situation to some extent, by suppressing sidelobes, but it widens the main lobe (which will look like blurring in a sonogram). One of the good things about sigmund~ is that it takes account of this and attempts to correct for it, which leads to more accurate frequency and amplitude values.
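To make the leakage and main-lobe point concrete, here is a small NumPy sketch. The FFT size, the choice of bin 100 and the dB thresholds are arbitrary assumptions for illustration; it says nothing about how sigmund~ itself does its correction.

```python
import numpy as np

N = 1024
n = np.arange(N)
rect = np.ones(N)                                  # no window
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)       # periodic Hann window


def spectrum_db(cycles_per_window, window):
    """dB magnitude spectrum (relative to its own peak) of a windowed sine."""
    x = np.sin(2 * np.pi * cycles_per_window * n / N)
    mag = np.abs(np.fft.rfft(x * window))
    return 20 * np.log10(mag / mag.max() + 1e-12)


def bins_above(db, threshold_db):
    """Number of bins whose level exceeds the threshold."""
    return int(np.sum(db > threshold_db))


on_bin_rect = spectrum_db(100.0, rect)     # exactly on bin 100
off_bin_rect = spectrum_db(100.5, rect)    # halfway between bins 100 and 101
off_bin_hann = spectrum_db(100.5, hann)
on_bin_hann = spectrum_db(100.0, hann)

print("bins above -60 dB (on-bin, off-bin, off-bin with Hann):",
      bins_above(on_bin_rect, -60), bins_above(off_bin_rect, -60),
      bins_above(off_bin_hann, -60))
print("bins above -12 dB for the on-bin sine (no window, Hann):",
      bins_above(on_bin_rect, -12), bins_above(on_bin_hann, -12))
```

The off-bin sine without a window spreads over a large part of the spectrum, the Hann window confines that spread to a handful of bins, and the price is a wider main lobe even when the sine sits exactly on a bin.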

"Thanks for pointing on sigmund~ I’m not sure how i could use it to draw a sonogram"

Well I’m not sure you actually want a sonogram, which will almost certainly be blurry to some extent – if you want to know what the sinusoidal components are doing you should be plotting points or lines rather than spectral data directly (which seems to be too blurry for your tastes). You could build something like this with sigmund~ (in track mode) and Jitter. Alternatively you could download gabor and FTM and check out the drawing of spectral data they do, which is a bit like this. It sounds like you want to plot sinusoidal peaks, not FFT data (like a sonogram) – which will give you precise points, but will ignore any noise components.

Multi-resolution sonograms will only give you a different tradeoff between frequency and time resolution in different frequency ranges. Remember the FFT is linear, so we generally have not enough frequency resolution at the lower end, and far too much at the top. A better choice is therefore bigger FFTs for low frequencies (still with poor time resolution), and smaller ones for high frequencies (where we don’t need the same linear resolution, so better time resolution is preferable).


Mar 24 2010 | 1:02 pm

Smart people: Can I ask a really really stupid question that falls under the category of "someone must have thought of this?"

filter everything below (Nyquist/2) (take the bottom half of the spectrum)
Ring modulate at (Nyquist/2) (flip over the spectrum)
Take an FFT
Flip the top half of the FFT to the bass

Instant awesome resolution in bass?

Mar 24 2010 | 2:51 pm

Nice idea AudioMatt, but the resolution of the FFT is the same across the frequency range in the linear domain. The point is that we perceive frequency in a logarithmic way, so the same resolution down low seems like less resolution…

What you are suggesting wouldn’t change the resolution at all so we’d get the same results, just in different bins….


Mar 24 2010 | 3:04 pm

oooohhhhh. yeeaah…
*hangs head in shame*


Mar 24 2010 | 11:12 pm

AudioMatt, of course this doesn’t work, but i think you were right pointing on ring modulation! :

In fact, i found Izotope RX is far better than RavenPro. In Izotope RX, ok there is this Multi-resolution option in the "spectrogram setting" but it’s not the most important, there is some other great stuff like : Time Overlap and Frequency Overlap.

In the image below, from Izotope RX, i compared the same sonogram from "Talk.aiff" using "overlap" :

While time overlap is made by shifting the sound slightly in time in front of the FFT window, I think the frequency overlap in Izotope RX must be made by shifting the frequencies slightly in front of the FFT bins… I think it must use some kind of small ring modulation (like freqshift~ does, I think), just moving the sound’s frequencies by a few hertz before doing the FFTs. Then, blending all the FFTs sharpens the frequencies of the harmonics. (A bit like what I imagined using different FFT sizes.) I’m not yet satisfied but that’s the beginning of something.

>> "Well I’m not sure you actually want a sonogram"

Yes, it IS what I want. I want the sinusoidal AND the noise content. I’d like to see with my eyes EVERYTHING that my brain can hear with my ears. I don’t feel this is utopian.
If a 256 FFT window has good time resolution, and an 8192 FFT window has good frequency resolution, I don’t see why you guys don’t agree that I could have both by intelligently blending them, playing with contrast.


wow.. i’m wondering again about wavelet seeing this :

This guy is showing more interesting images made with wavelets than what i had seen before in and in the link that Vanille pointed.

Argh, the software is for Windows. I’m going to borrow my girlfriend’s PC and have a look at it.

Any soft like that for mac ?

Any good example patch using the [wavelet~] object from cnmat, somewhere ?


  1. overlap-in-Izotope-RX.png


Mar 25 2010 | 11:25 am

A nice tutorial on wavelets:

Even this expert can’t escape the uncertainty, and he explains it well.
And don’t get fooled by pictures: they sometimes show precision which does not exist and doesn’t mean anything either. Our ears are limited by these rules as well, and we can fake sound pretty easily.
If you make the correct assumptions, you can get pretty amazing results. Mp3 works pretty well, even though a lot of the information in the original signal is simply dropped. If you drop only what is irrelevant, you won’t notice…

Your picture of resynthesis for example assumes tonal content as the most important. The old dream of additive synthesis…
I got good results by separating tonal from noise components, and only processing the (simplified) tonal part. It was necessary to ignore the noise part and simply mix it in again after processing. But as the noise part carries most of the perceivable time structure, the results were promising. Though the tonal aspects were blurred by the processing, the noise part still carried the time structure…


Mar 26 2010 | 3:09 am

That’s a very nice tutorial on wavelets, Stefan! Thanks for the link — I hadn’t seen that before.

Mar 29 2010 | 9:54 pm

There’s something more than time and frequency overlap :

Time-frequency "reassignment" that AlexHarker pointed :

"Compared with the classic spectrogram (aka ‘waterfall’) display, reassigned spectrograms can offer better resolution in the time- as well as in the frequency domain. (…) by comparing the phase between two neighbouring frequency bins (within the same STFT) it is possible to relocate the energy from that cell along the time(!) axis. By comparing the phase in a frequency bin (between two neighbouring STFTs), it is possible to relocate the energy from that cell along the frequency(!) axis."

"the method of reassignment sharpens blurry time-frequency data by relocating the data according to local estimates of instantaneous frequency and group delay."

By checking the "Enable reassignment" box in Izotope RX while using time&frequency overlaps, you can get fine pitch tracking like in the first image below from a singing female voice ("shafqat.aif" cnmat audio example), far better than standard FFT without overlap (2nd image).

The Windows wavelet software didn’t really convince me about wavelets in the end, plus it is damned slow. I find FFT with reassignment and overlaps more precise than wavelets.

(By the way, RavenPro also has time and frequency overlap options, but they are lost behind hundreds of other options; I just found them in "configure spectrogram".)


  1. StandardFFT-without-overlaps.jpg


Mar 29 2010 | 10:20 pm

>> "don't get fooled by pictures, they sometimes show precision which does not exist and doesn't mean anything either.

…I laughed when i saw the following image, wondering about this advice from Stefan. Hmm… looks like some ufo entered my signal :
This is a very deep zoom with *128 overlap & reassignment in the spectrogram of a simple [saw~], i'm not kidding!


  1. tiny-ufos-in-saw-signal.jpg


Mar 29 2010 | 10:22 pm

Here's a wavelet image of the same soundfile, shafqat.aif.

Wavelet analysis is pretty good, but indeed slow. Comparing the images, the FFT with reassignment and overlaps appears to offer better precision, espec. in the high frequencies.

I'm curious which FFT software allows for reassignment. Izotope RX, any others?


  1. shafqat_wavelet.jpg


Mar 29 2010 | 10:40 pm

Is there any way to get a higher-resolution image of that saw~ analysis? I think it’d be a pretty ace background :D:D

The frequencies arcing off the main bulk of energy like a magnetic field are interesting… anyone able to explain this?

Mar 30 2010 | 12:04 am

But seriously, I’m sure this reassignment method is only 3/5 of the way to the best sonogram that can be made.

These artifacts produced by the reassignment method could be almost entirely cleared by multiplying several reassigned sonograms made with different FFT sizes… Because from one FFT size to another, these artifacts move… but the pitched content does not… (also, this should show a better distinction between pitched and noise content, you see?)

I’m lost in front of the math behind this reassignment method. Plus it’s rather slow to compute…

Any interested C-external developers, to make an efficient [jit.reassignedFFT] object from LORIS C libraries ?

I’m also wondering if any voronoi jitter effect could approach this in some way, i asked this on the jitter forum :

This reassignment method, associated with freq&time overlaps, and associated with the idea of blending different FFT-sizes, could get really cruel with this "Heisenberg’s uncertainty principle", and open useful new possibilities, like :

– Very fine polyphonic pitch tracking… There is already the [transcribe~] external, a pioneer, but it works really badly. But if you have fine harmonic pitch tracking (see another pitch tracking example in the image/mp3 below), then, even in a messy polyphonic sonogram, one could do great things: imagine that you divide the vertical size of the 16384*-line Jitter sonogram by 2, by 3, by 4, by 5, etc. for the harmonics (considering that most musical sounds have true harmonics with negligible inharmonicity), then intelligently add*multiply all these (antialiased) Jitter matrices ( (H1+H2+H3+H4…) * ((H1+f)*(H2+f)*(H3+f)*(H4+f)*…) where f is a kind of "noise factor" to adjust ) …and get a damned cool, understandable, polyphonic pitch tracking sonogram!

– And maybe – only when we’ll have 20 Ghz laptops… – start to dream about the mythical "demixer"…

* (let’s say 1024 * 32 frequency overlap = 16384)


Mar 30 2010 | 3:48 pm

> "By comparing the phase in a frequency bin (between two neighbouring STFTs), it is possible to relocate the energy from that cell along the frequency(!) axis."

you might get some ideas of how to approach this, from here:

near the end of the thread there is an example on how to calculate the "true frequency".
for the sake of simplicity in this example i’ve only used a single overlap. better results can be achieved with an overlap factor of 4 or higher.

Apr 03 2010 | 3:33 pm

Thanks volker!
i didn’t find the time to go further in this yet, but later i will. thanks for your algorithm.

Apr 03 2010 | 6:38 pm

By the way, when you do this, you assume that the energy in a frequency bin is due to only one sinusoidal component.

Apr 03 2010 | 8:07 pm

>> By the way, when you do this, you assume that the energy in a frequency bin is due to only one sinusoidal component.

Hi Jean-Francois,
sorry, I’m not sure I understand. Could you explain a bit more what you mean?

Apr 04 2010 | 9:22 pm

Well, the FFT (or STFT) gives you, in each analysis window, for each frequency bin, first how much energy there is, and second a phase difference. But what it does not give you is the piece of information "how many partials are actually in the original signal in this frequency bin".
For instance, with a FFT size of 512 and a sampling rate of 44100Hz, a frequency bin is about 86Hz wide. You can know that in the frequency bin from 43 to 129 Hz, there is a certain amount of energy. With the formula mentioned, using the phase difference, you can relocate the energy of that cell in the frequency space, and you will maybe find that "the value" is 93 Hz. But what you don’t know is IF there is one and only one sinusoidal component. Meaning, that maybe in your original signal, the energy in this frequency bin is made from a component at 67 Hz, and another one at 104 Hz. Or maybe comes from 12 different components. That would be pretty different. When you use the phase difference to calculate a unique frequency, you first assume this frequency is unique.
Hope that makes sense.
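A small NumPy sketch of this point, using the usual phase-difference ("true frequency") estimate between two overlapping frames. To keep the numbers clean it uses bin 12 (centred near 1033.6 Hz) rather than the low bin in the example above; the partial frequencies, the hop size and the function name are assumptions made for illustration.

```python
import numpy as np

SR = 44100
N = 512
HOP = 128                                   # overlap factor of 4
WINDOW = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)   # periodic Hann


def bin_frequency(x, k, start=0):
    """Estimate ONE frequency for bin k from the phase change between two frames."""
    frame1 = np.fft.rfft(x[start:start + N] * WINDOW)
    frame2 = np.fft.rfft(x[start + HOP:start + HOP + N] * WINDOW)
    expected = 2 * np.pi * k * HOP / N       # phase advance of the bin centre over one hop
    deviation = np.angle(frame2[k]) - np.angle(frame1[k]) - expected
    deviation = (deviation + np.pi) % (2 * np.pi) - np.pi    # wrap to [-pi, pi)
    return (k + deviation * N / (2 * np.pi * HOP)) * SR / N


t = np.arange(4 * N) / SR
k = 12                                       # bin covering roughly 990 to 1077 Hz

one_partial = np.sin(2 * np.pi * 1050 * t)
two_partials = np.sin(2 * np.pi * 1010 * t) + np.sin(2 * np.pi * 1060 * t)

single_estimate = bin_frequency(one_partial, k)
mixed_estimate = bin_frequency(two_partials, k)
print(f"one partial at 1050 Hz        -> {single_estimate:.1f} Hz")
print(f"partials at 1010 and 1060 Hz  -> {mixed_estimate:.1f} Hz")
```

With a single partial the estimate should land very close to the true frequency. With two partials in the same bin the formula still returns a single number, which need not match either of them and can change with the position of the frames.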

Apr 04 2010 | 10:56 pm

Thanks for your explanation.
I understand what you mean, but as I tried to explain before about "artifacts" (well, I’m not sure I was clear): if you look, in Izotope RX with reassignment enabled, at a sound with some noisy areas and some pitched areas, you will see, when changing the FFT size manually (512, 1024, 2048…), that the pitched areas hardly move, while the "relocated energy" in the noisy areas moves a lot from one FFT size to another!

Then (I hope I will find the time to master the Jitter objects and show you a demonstration of this; right now I don’t) if you BLEND the results of all these FFT sizes, you will, I think, get past the disadvantage of the FFT that you are talking about, and see noise as more or less filled areas, and see clear "lines" that really are pitched content.

Demo version of Izotope RX is here : (all sonogram options are in "spectrogram setting")

Aug 07 2010 | 6:59 pm

I just began to read this very interesting topic!

0/ Is there some modified version of FFT, in that instead of dividing the frequency axis in a "linear" way into "bins", the frequency axis is divided in "bins" with a log scale ??

1/ If I understand well, Multi-resolution-Processing is "only"
adding a third "dimension", ie doing the analysis with FFTSIZE=512, 1024, 2048, .., 16384 and taking the best of each one ?

2/ Are you really sure that if we upsample, we cannot get more precise resolution??? I’m really not sure!!! Someone gave this example:
>For 4096 sample FFT @ 48kHz
>Bin width = 24000 / 2048 = 11.71… Hz
>Window Length = 4096 / 48000 = 85.3… ms
>For 8192 sample FFT @ 96kHz
>Bin Width = 48000 / 4096 = 11.71… Hz
>Window Length = 8192 / 96000 = 85.3… ms

I agree, but what about this:
*For 8192 sample FFT @ 96kHz
*Window Length = 8192 / 96000 = 85.3… ms
*Bin Width = 24000 / 4096 = *****5.86… Hz***** (why would we need to cover the frequencies from 0 to 48000 with bins? We only need to cover 0 to 24000, because above 24000 we don’t hear!!)

So by upsampling by 16, we can have a 16 times better frequency resolution with the same window length!!! (But this trick makes the assumption that we FORGET what’s in the signal above 24 kHz):

*For 65536 sample FFT @ 768kHz
*Window Length = 65536 / 768000 = 85.3… ms
*Bin Width = 24000 / 32768 = 0.73… Hz
Here again we make the assumption that we forget everything above 24 kHz (we don’t hear it anyway…)

3/ Alexandre, did you find some solutions for your problem ? Is it possible to find such a High Resolution sonogram ??

Thanks, Jebb

Aug 07 2010 | 7:27 pm

Hi Jebb,

That was me, and the maths is correct.

Unfortunately the FFT does not care what you can and can’t hear. What is important is the nyquist frequency – this has nothing to do with hearing – the FFT has applications far outside of audio – it is in itself a mathematical tool for doing certain things.

The wavelet transform is more or less a specialised constant q filterbank – it works somewhat like the log scale FFT you propose above. It’s already been mentioned here and it has issues of its own.

There are many papers available on the web about up-to-date spectral analysis and processing techniques. They are written by engineers with a very good grasp of the maths and theory behind these things and some of them get very complicated. My take on this thread now (as when it started) is that it seems unlikely that posters here who do not have a firm understanding of the basics of spectral analysis and techniques will come up with tools or techniques that are better than those created by experts in the field.

If you guys really want to get into this in detail then you are probably going to need to read a *LOT*, get very good at maths and roll your own externals in C or java. The time input to get serious with this is going to be very large indeed.

Good luck,


Aug 07 2010 | 8:15 pm

Hi AlexHarker,

Of course your math is correct! I did not say the contrary at all ;)
(and by the way, I know about math, I know about FFT, so no problem…)

I just asked: what happens if we make a further hypothesis, i.e. "forgetting" about the frequencies above 24 kHz?

It is always possible to do an FFT with:
FFT samples = 65536, Sampling rate = 768kHz
*Window Length = 65536 / 768000 = 85.3… ms
*Bin Width = 24000 / 32768 = 0.73… Hz

This is possible: we only store in the matrix what interests us, and I make the *assumption* that we don’t want to store anything above 24 kHz.
So we can have bins of 0.73 Hz.

The *right* question is rather: is it clever or not? I.e.: what will happen when we do the ****inverse FFT****?? Will the *lost data* (we have made the assumption to forget about it) above 24 kHz (with a sampling rate of 768 kHz) cause enormous distortion?

How will it sound ? Far from the original signal or not ?

This is the question ! Do you have an idea ?


Aug 07 2010 | 9:32 pm

I am sorry – I do not believe you are correct.

In order to carry out your method I have to "only store in the matrix what interests us".

Please explain to me how you do this?

At best you have proposed some kind of alternative way of thinking about zero-padding – however, this leads to spectral interpolation, which is not at all the same as true resolution…
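
For anyone who wants to check the numbers in this exchange, here is a minimal Python sketch of the arithmetic. It assumes only the usual relation bin width = sampling rate / FFT size; the function names are made up for illustration.

```python
def bin_width_hz(sample_rate_hz: float, fft_size: int) -> float:
    """Spacing between adjacent FFT bins, in Hz."""
    return sample_rate_hz / fft_size


def window_length_ms(sample_rate_hz: float, fft_size: int) -> float:
    """Duration of the analysed block, in milliseconds."""
    return 1000.0 * fft_size / sample_rate_hz


def bins_below(cutoff_hz: float, sample_rate_hz: float, fft_size: int) -> int:
    """Number of bins whose centre frequency lies below cutoff_hz."""
    width = bin_width_hz(sample_rate_hz, fft_size)
    return sum(1 for k in range(fft_size // 2 + 1) if k * width < cutoff_hz)


if __name__ == "__main__":
    fs, n = 768_000, 65_536
    print(f"bin width     : {bin_width_hz(fs, n):.2f} Hz")
    print(f"window length : {window_length_ms(fs, n):.1f} ms")
    print(f"bins < 24 kHz : {bins_below(24_000, fs, n)}")
```

Under these assumptions the 65536-point FFT at 768 kHz has bins of about 11.72 Hz, and 2048 of them lie below 24 kHz. Throwing away the bins above 24 kHz does not make the remaining ones any narrower.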


Aug 07 2010 | 11:01 pm

It depends what you want to do. About this topic : you only want a very precise sonogram.

Let x be the signal, w the window function.
S(m,omega):= sum_n (x(n) w(n-m) exp(-I omega n) )

Is it possible to calculate the values |S(m,omega)|^2 for all values of omega, even if they are very close to each other ?

By increasing the sample rate (the number of n’s increases, and the window w(n) changes according to the sample rate, in order to keep a constant window length in milliseconds), we can compute many more values of |S(m,omega)|^2. Doesn’t this allow us to have more points on the frequency axis, not only one per STFT bin ?

(I’m speaking just about plotting a sonogram.)
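
The definition above can be evaluated directly at any omega, which may make the question easier to experiment with. Below is a minimal Python sketch; the Hann window, the function names and the test tone are assumptions chosen for illustration, and the window is taken to start at sample m.

```python
import cmath
import math


def hann(length: int) -> list[float]:
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def stft_power(x, window, m, freq_hz, sample_rate_hz):
    """|S(m, omega)|^2 with the window starting at sample m.

    omega = 2*pi*freq_hz/sample_rate_hz is in radians per sample and may be
    any real value, not only a multiple of 2*pi/len(window).
    """
    omega = 2 * math.pi * freq_hz / sample_rate_hz
    total = 0j
    for k, w in enumerate(window):
        n = m + k                      # absolute sample index
        total += x[n] * w * cmath.exp(-1j * omega * n)
    return abs(total) ** 2


if __name__ == "__main__":
    fs = 8000.0
    x = [math.sin(2 * math.pi * 1000.0 * n / fs) for n in range(512)]
    w = hann(256)
    # Frequencies much closer together than the 31.25 Hz bin spacing.
    for f in (980.0, 990.0, 995.0, 1000.0, 1005.0, 1010.0, 1020.0):
        print(f"{f:7.1f} Hz  {stft_power(x, w, 128, f, fs):10.3f}")
```

The values between the usual bin frequencies come out as a smooth curve through the bin values, computed from the same 256 samples.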

Aug 07 2010 | 11:39 pm

You describe two scenarios there (from a quick reading).

1 – calculating values for frequencies that do not fit into the sample length an integer number of times.

This equates to zero-padding (at least if we start by calculating halfway between the bins, then halfway again and so on). If I zero-pad the data to twice the size, for instance, I will be calculating the halfway points between the bins of the previous size.

This gives ideal interpolation, which may result in a clearer/nicer sonogram (yes in this way you can keep reducing the bin width), *but* it is not the same as true resolution, because you cannot use this method to resolve closely spaced sinusoids.

So – you have interpolated your data nicely, but you have not gained additional information about the signal. In certain ways this will be "more precise", at least to the eye, but the critical information is already encoded in the FFT data, and for many applications (such as more accurate location of spectral peaks) there are well known techniques for deriving the information (such as parabolic interpolation).
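
The link between zero-padding and "halfway between the bins" can be checked numerically. The following is a minimal Python sketch with a naive DFT (slow, but it needs no libraries); the helper name `dft` and the test signal are assumptions for illustration only.

```python
import cmath
import math


def dft(x: list[float], size: int) -> list[complex]:
    """DFT of x zero-padded to `size` points (naive O(N^2), fine for a demo)."""
    padded = list(x) + [0.0] * (size - len(x))
    return [
        sum(v * cmath.exp(-2j * math.pi * k * n / size) for n, v in enumerate(padded))
        for k in range(size)
    ]


if __name__ == "__main__":
    n = 32
    # A sine that does not sit on a bin centre (5.3 cycles per block).
    x = [math.sin(2 * math.pi * 5.3 * i / n) for i in range(n)]
    plain = dft(x, n)
    padded = dft(x, 2 * n)
    # Every second bin of the padded DFT is one of the original bins.
    worst = max(abs(padded[2 * k] - plain[k]) for k in range(n))
    print(f"largest difference at shared bins: {worst:.2e}")
```

The even-numbered bins of the padded transform reproduce the original bins, and the odd-numbered ones are the in-between points. All of them are computed from the same 32 input samples, which is why the extra points interpolate rather than add resolution.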

2 – When you increase the sample rate you also increase the representable range (the Nyquist frequency increases), so the bin width for the same size fft increases.

Double the sampling rate – Nyquist doubles
For the same fft size – each bin is now twice the width because of the increase in the Nyquist frequency
We can then double the FFT size (covering the same duration of input as before) and we return to the same bin width as we started with, so we don’t gain anything there.

Zero-padding doesn’t seem to have come up in this thread so that’s a useful addition, although it’s a very well known technique – implementing it in msp is pretty straightforward. With regard to oversampling I repeat what I have said earlier – you cannot gain frequency resolution by oversampling.


Aug 08 2010 | 2:04 am

1/ Thanks Alex, I begin to see what you mean now… :)
But this oversampling technique may increase the quality of the sonogram, anyway, don’t you think so ? That was Alexandre’s goal I think.

2/ Is there a MAX/MSP patch that plots *beautiful* color sonograms ?
(Like IzotopeRX ones) Or a MatLab code ? If so, I really would like to test oversampling, in order to see what it will do !

3/ A general question about FFT. Let’s say I have a 1-second-long mono 96khz wav file containing a pure 440hz sine.
I now have the feeling that NO "FFT-based" sonogram will be able to show just a 1-second-long line, 1 pixel wide, at 440hz; instead, there will always be artefacts.
Am I wrong or not ? If I am right, this is quite strange : with such a simple example, we already get some artefacts ??? Wow….

Aug 08 2010 | 7:39 am

1 – Unfortunately I don’t think you’ll see any improvement at all

3 – The FFT data will be spread across more than one bin, except in the situation in which the sine wave is tuned *exactly* to the centre of a bin, and no windowing is applied (a rectangular window is used). This situation results in only one FFT bin being excited, but is a fairly useless one, because when the sine wave is not tuned to the centre of a bin, many other issues occur and the leakage is bad – windowing is a way of compromising size of the central lobe with the amplitude of sidelobes, resulting in a slightly enlarged central lobe, but suppressed sidelobes.

Whether this is visible as more than a single pixel depends on the exact method of drawing the data, but that is indeed why you see a hazy cloud around the partials in the plots above.
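
The on-bin / off-bin behaviour described above is easy to reproduce with a tiny experiment. This is a minimal Python sketch using a naive DFT; the 64-sample block, the -60 dB threshold and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def magnitude_spectrum(x):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT, fine for a demo)."""
    n = len(x)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x)))
        for k in range(n // 2 + 1)
    ]


def excited_bins(x, floor_db=-60.0):
    """Count bins whose magnitude is within `floor_db` of the strongest bin."""
    mags = magnitude_spectrum(x)
    threshold = max(mags) * 10 ** (floor_db / 20)
    return sum(1 for m in mags if m > threshold)


def sine(cycles, n):
    """`cycles` periods of a sine across n samples (cycles may be fractional)."""
    return [math.sin(2 * math.pi * cycles * i / n) for i in range(n)]


def hann(n):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


if __name__ == "__main__":
    n = 64
    for label, cycles in (("on-bin ", 8.0), ("off-bin", 8.5)):
        rect = sine(cycles, n)                              # rectangular window
        windowed = [s * w for s, w in zip(rect, hann(n))]   # Hann window
        print(label, "rectangular:", excited_bins(rect),
              " hann:", excited_bins(windowed))
```

In this setup the on-bin sine with a rectangular window lights up a single bin, while the same sine half a bin higher spreads across the whole spectrum. The Hann window widens the on-bin case to a few bins but keeps the off-bin leakage confined to far fewer bins than the rectangular window.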

If the display is not a sonogram, but rather the plot of a partial tracking algorithm (such as in SPEAR) then it is possible for the sine wave to appear as a single line – however, this type of display will not deal with noise content well.

Aug 08 2010 | 10:13 am

Thanks AlexHarker for your answer.

1- and 2- Yes I begin to understand now… But in order to convince myself / do some tests / learn more about that, I’d like to run tests myself. Would someone have some code that draws a nice sonogram (in Matlab or MAX or something like that) ? (Of course, code that does not just call a "Sonogram" or fft routine) …

3- So if I understand well, "in real life" (we can forget about the special case of a sine freq which is in the centre of a bin), the STFT will almost NEVER give a nice ‘line’, even if the signal is a constant sine ! There will always be some pâté ! It is very interesting to notice that, once and for all! Thanks a lot !

4- So the good solution would be :
* partial tracking for the sinusoidal parts
* standard "sonogram" for the rest of the signal (= original signal minus the harmonic part)
How to do this with Max/MSP ? I haven’t found sigmund~ or fiddle~ working on windows… Any hint ?

Aug 09 2010 | 1:16 am

Great thread, lots to think about. Love the images too, especially the saw~ one. Would love a high-res version of that too, or a collection of similar ones (hint, hint…)

Was wondering if there’s any way to put poly~ to work on this, maybe some way to divide up the calculations across multiple poly~ patches? Could each one do (say) a fourth of the bins, or do some other trick to speed up the processing? If you had a quad-core processor and could split up the work, that would really move things along.

Though maybe it’s not feasible, and you need to do everything in one process. Just a thought, am curious if I’m on the right track or way off…

Aug 09 2010 | 7:23 pm

>> jebb said:
>> Alexandre, did you find some solutions for your problem ?

Sorry i didn’t find the time to go over all this, plus AlexHarker is right that it should better involve some C or Java (at least for the REASSIGNMENT algorithm.)

>>But this oversampling technique … That was Alexandre’s goal I think.

No! it was just a useless idea i thought about at the beginning of this thread. As AlexHarker pointed out many times in the thread – Thanks for your patience, Alex :-) – and explained to us, beginners in fft, OVERSAMPLING in itself will NOT increase resolution in the FFT.

>> Is it possible to find such a High Resolution sonogram ??

I’m sure it is. When i said above: "this reassignment method is only the 3/5 in the way to the best sonogram that can be done." i should have said the 1/5…

We should think about pixels in sonograms as "probabilities" for a frequency to be there. As many guys pointed out above, at a given instant of sound, at a single sample, there are no frequencies at all; we can only guess a probability for that frequency to be around. A sonogram is nothing real or physical. It is just an imagination of a sound, exactly like our brain’s treatment when we listen to music and sounds. So, looking for "the deep truth of the signal" is not the right way to think about sonograms. The only physical truth is the signal itself. So what i dream of as an "extremely precise sonogram" is just something that approaches the "special treatment of probabilities" that my brain is doing when i listen to sounds and music. The "REASSIGNMENT method" of Izotope RX, used with *32 X and Y overlaps, is far better than standard FFT, but it’s still far from ideal because of the spiderweb-like artifacts produced everywhere…
…How do we know that they are artifacts, and not real pitched sounds ? Because they change their positions completely every time you change the FFT window size! They are MOVING! And there are some areas in the sonogram full of these spiderweb artifacts. What does this mean ? A potentially infinite number of frequencies in this area, or, put differently, an equal probability for all the frequencies in the area: this is what is called NOISE. Now take 10, 20 or even 60 different sonograms like that, each one made using a different Fourier window size, then mix them all, and you’ll approach – i think – the holy grail: the spiderwebs should then disappear to show some rather soft "fields of noise", while "very-high-probability-paths-of-pitched-content" will still look like really pitched content. You will SEE WHAT YOU HEAR. Not blurry pâtés nor spiderwebs.

>>…be able to show just a 1-second-long line of 1 pixel of width at 440hz, but instead, there will always be artefacts.

With a bunch of mixed sonograms made using the "reassignment method", it will be possible, because, again, the artefacts are MOVING when you change the fft size :
Take 5 Izotope RX sonograms (with the reassignment option checked and *32 x and y overlaps) from 5 different fft sizes. Now do it again after resampling your sound at 8 different sample rates between 32kHz and 48kHz.
(Because FFT works only with powers of 2 – from what i understood, for efficiency reasons – resampling the signal can be a workaround: it will be equivalent to non-power-of-2 Fourier window sizes.)
So now you get 5*8 = 40 screen-shots… then mix them all together using jitter or even photoshop: what should happen is that the artifacts produced by the "reassignment method" would just disappear, because they are different for each screen-shot, but not the 440Hz tone.
Do you guys start to see what i mean ?

> seejayjames said:
> ..especially the saw~ one. Would love a high-res version of that too

hehe, you americans like ufos! Here below is a not-so-much-more high-res of the ufos. You can make your own using the free trial of RX: This is not a real "saw" but a kind of dephased saw made from cosines instead of sines* (made using that : ) (*the first second of the sound attached below)

> Was wondering if there’s any way to put poly~ to work on this

Go ahead!
I see the steps like this:

1- Time-overlap is already an option in the fft objects in max, but not Frequency-overlap : Creating it by shifting the sound a bit, 16 or 32 times (using [freqshift~], maybe) should work, i think. (shifting amount = Bin width divided by 16 or 32) Then interleave all the 16 or 32 FFTs in one jitter matrix.

2- Find the way to apply the "Reassignment method" on this, either writing a C external object using LORIS: (maybe even swap the first step and process all the FFT inside the external), or using the example from volker: (patch near the end of the thread)

3- At this point, i think each poly~ should be used to compute a different sonogram using a different FFT window size. And some of the poly~ should also use a resampled sound, like i explained above: it will be equivalent to non-power-of-2 fourier windows sizes: Examining reassigned sonograms in RX, my feeling is that powers of 2 for window sizes are not enough to clear completely the artifacts.

4- Simply mix all the sonograms together. The more sonograms you compute from different FFT sizes, the cleaner the result, i think.
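
The mixing step of the plan above could be prototyped outside Max before building anything. Here is a rough Python sketch for a single time frame: it uses a plain Hann-windowed transform (not the reassignment method), evaluates every window length on one shared frequency grid so that non-power-of-2 lengths are allowed, and averages the results. All names, the window lengths and the test tone are assumptions, and whether the averaged picture is actually better is exactly the open question.

```python
import cmath
import math


def hann(length):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def frame_power(x, centre, length, freqs_hz, sample_rate_hz):
    """Normalised power of one Hann-windowed frame on an arbitrary frequency grid."""
    start = centre - length // 2
    w = hann(length)
    gain = sum(w)  # normalise so different window lengths are comparable
    out = []
    for f in freqs_hz:
        omega = 2 * math.pi * f / sample_rate_hz
        s = sum(x[start + k] * w[k] * cmath.exp(-1j * omega * k) for k in range(length))
        out.append((abs(s) / gain) ** 2)
    return out


def mixed_power(x, centre, lengths, freqs_hz, sample_rate_hz):
    """Average the per-length spectra point by point."""
    spectra = [frame_power(x, centre, n, freqs_hz, sample_rate_hz) for n in lengths]
    return [sum(col) / len(spectra) for col in zip(*spectra)]


if __name__ == "__main__":
    fs = 8000.0
    x = [math.sin(2 * math.pi * 440.0 * n / fs) for n in range(2048)]
    freqs = [400.0 + 10.0 * i for i in range(9)]
    for f, p in zip(freqs, mixed_power(x, 1024, (192, 256, 320), freqs, fs)):
        print(f"{f:6.1f} Hz  {p:.4f}")
```

Evaluating the transform directly on a chosen grid sidesteps the power-of-2 restriction without resampling, at the cost of speed. A full sonogram would repeat this for every frame, and each window length could be handed to a separate worker.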

Note: Even if you have a 24-core cpu, i think you’ll stay far from a real time sonogram from adc~…
(how many TeraHertz are our brains analyzing music ?)


Aug 09 2010 | 11:36 pm

1 – The so called frequency overlapping in the izotope RX package seems to me a misnomer, as it is not as far as I can see really overlapping anything – it is clearly explained as zero padding by the hint, and as such results in spectral interpolation as outlined above. If you want to emulate the effect then do zero padding – don’t shift the sound around – that’s just reducing accuracy, and if you want to do averaging to improve the look of things, do it directly on the power spectrum by convolving with a small kernel.

2 – Multi-resolution is generally used to use more appropriate fft sizes for different frequency ranges, not to average spectra – averaging spectra from different FFT sizes will in my view *reduce* accuracy, because the frequency resolution of the lower fft sizes will be poorer and hence spectral peaks will be poorly located.

3 – The reassignment thing definitely looks dramatic. Personally, however, I think that if you really want sine waves as nice lines, you should track the peaks using a bit of zero-padding and parabolic peak interpolation (easier to do than the reassignment method) and plot them as lines – then subtract them from the fft and plot the noise residual as a standard sonogram if you want that too. If you want to be really clever then you can increase the accuracy of the peak finding by subtracting peaks from the entire spectrum as you go, in order of highest magnitude first, so that the effect of nearby peaks on one another is reduced (ie – locate largest peak – remove – then repeat process). That’s what happens inside sigmund~ – it’s very neat, very clever and a little bit complicated, but the source is available. In order for it to work you have to subtract the correct shape, which Miller has calculated according to the way he does his fft / windowing. He also does some other neat stuff, like windowing in the frequency domain using convolution so that he can examine the raw fft alongside the windowed one, without needing to do 2 separate ffts….

4 – Part of the problem with moving noise with different settings is probably to do with the high variance of the power spectrum estimate from a direct or singly windowed fft. You could try to improve this using this technique:

I wouldn’t expect this to tighten up your spectral peaks at all, but it might allow tighter timing resolution, or reduce noisy artifacts to some extent – although in terms of a sonogram, the visibility of these is highly dependent on the scaling of the values (I can get a lot of the background noise to disappear in the above organ example, simply by adjusting sonogram scaling settings in izotope RX).
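
For readers who have not met parabolic peak interpolation (mentioned in point 3), here is a minimal Python sketch of the common three-point version on log magnitudes. It is not taken from sigmund~ or any other implementation; the Hann window, the function names and the 1010 Hz test tone are assumptions for illustration, and no zero-padding is used.

```python
import cmath
import math


def hann_magnitudes(x):
    """Magnitudes of DFT bins 0..N/2 of a Hann-windowed block (naive DFT)."""
    n = len(x)
    xw = [v * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, v in enumerate(x)]
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(xw)))
        for k in range(n // 2 + 1)
    ]


def parabolic_peak_hz(mags, sample_rate_hz, fft_size):
    """Refine the largest bin by fitting a parabola to three log-magnitudes."""
    # Largest interior bin, so that both neighbours exist.
    k = max(range(1, len(mags) - 1), key=mags.__getitem__)
    # The tiny floor only guards against log(0).
    a, b, c = (math.log(max(mags[k + d], 1e-300)) for d in (-1, 0, 1))
    offset = 0.5 * (a - c) / (a - 2 * b + c)  # vertex position in bins
    return (k + offset) * sample_rate_hz / fft_size


if __name__ == "__main__":
    fs, n = 8000.0, 256
    x = [math.sin(2 * math.pi * 1010.0 * i / fs) for i in range(n)]
    est = parabolic_peak_hz(hann_magnitudes(x), fs, n)
    print(f"bin spacing {fs / n:.2f} Hz, estimated peak {est:.2f} Hz")
```

With a 31.25 Hz bin spacing the nearest bin centre is 10 Hz away from the tone, while the interpolated estimate should land well within a tenth of a bin of it in this setup.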


Aug 10 2010 | 3:41 pm

Well, thanks again for your comments and tips.
From your points :

> 1 –
you're right Izotope RX is in fact using this "zero padding" method as written in the hint as i just read it. Attached below is another example of the power of this misnamed "frequency overlap".

> 2 –
you're right that averaging spectra from VERY different FFT sizes will *reduce* accuracy. It is clear that extreme window sizes (like 256 or 8192) will dramatically reduce accuracy (in y for 256 or in x for 8192). In fact i was looking more at "shafqat.aif" in RX and, though it may depend on the sound analyzed, the more i look at it, the more i feel that a mixed sonogram from different Fourier window sizes should stay around 1024. Then perhaps a direct DFT (Discrete Fourier Transform) instead of the FFT should be used to mix the results of non-power-of-2 window sizes between 800 and 1600 samples.
(But, wow, in wikipedia, they say that DFT is 100 times slower than FFT…)

> 3 –
Though i realize, while googling "parabolic peak interpolation", that sometimes the math starts to go over my head in this discussion, i want to point out that this idea, about using one method for pitched content and another one for noise content, does not convince me at all. My goal has nothing to do with aesthetic sonogram images (except the ufo joke above). If you use different methods for pitched and noise content, you assume that pitched and noise content are like black and white in your sound, without any greyscale. But it's never like this (or only for ugly electronic sounds). At the very beginning of a soft violin note, you have only noise, then the pitches of the harmonics start to appear gradually from the noise, until they get really clear when the bow is pressed harder on the string.
You will never be able to say "this is pitched content, and that is noise content". Again, all that you can have are probabilities. Ok, somewhere, a fine line of 97% probability, surrounded by an area of 1% probability, can be called "pitched content", and a clean big area of 10% probability can be called "noise", but between these 2 extremes, listening to the sample of violin synthesis* attached below, it is clear that it's not always black or white.

> 4 –
Reading this, i also feel a bit lost with the math…

A bit out of the subject, i wanted to say again that sigmund~ is a really nice object. Perhaps, the polyphonic pitch tracking idea that i tried to describe above ("harmonically multiply" everything together) could be better achieved using sigmund~.

* these "blurred" harmonics are made using that:


  1. left-16xpadding-right-nopadding.png


Aug 21 2010 | 4:40 pm

What about this method for obtaining an extremely precise sonogram of a constant pitch note :

1/ We want to determine an extremely precise sonogram at time t0 ? (I agree this makes no sense in general : as someone said, at a given "frozen" time, there is no sense in trying to know which frequencies there are)
No problem !

2/ An algorithm can find a "full period" around time t0. Just one period. Then we copy this period in order to make a fully periodic function (infinite in time).

3/ Then we can do a FFT with infinite precision, because we can take a window as large as we want (as the signal is now replicated infinitely in time).

This could be useful to study sounds where we easily see they are rather periodic…

What do you think ?
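
One way to see what the replicated signal would give is to try it on a toy period. This is a minimal Python sketch with a naive DFT; the 16-sample period, the 8 copies and the helper name are assumptions for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) DFT, fine for a small demo."""
    n = len(x)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
        for k in range(n)
    ]


if __name__ == "__main__":
    # One "period" of 16 samples containing a fundamental and a third harmonic.
    period = [
        math.sin(2 * math.pi * i / 16) + 0.5 * math.sin(2 * math.pi * 3 * i / 16 + 1.0)
        for i in range(16)
    ]
    copies = 8
    short = dft(period)
    long = dft(period * copies)  # the same period repeated 8 times

    on_harmonics = max(abs(long[k * copies] - copies * short[k]) for k in range(16))
    elsewhere = max(abs(long[k]) for k in range(16 * copies) if k % copies)
    print(f"mismatch at harmonic bins : {on_harmonics:.2e}")
    print(f"largest value in between  : {elsewhere:.2e}")
```

The long transform has many more bins, but the bins between the harmonics of the copied period stay empty, and the harmonic bins are just scaled copies of the one-period DFT. So in this sketch the longer window does not reveal anything that was not already in the single period.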


Aug 24 2010 | 11:06 pm

I’ve often thought about things you might be able to do with the new autotune patch. If you retune the signal to the FFT period, then retranspose the display back up, I wonder if it looks any better
