Rich onset detection (audio descriptors/features from percussion)
So I've been tweaking and improving the onset detection algorithm I use in my patches (code below) over the last few years, and presently I have something that I think works quite well, and optionally gives velocity (at the cost of a small amount of latency).
But more recently as I'm delving into modular stuff, as well as more complex sample-based resynthesis, I want to explore richer information coming from percussion-based music.
Obviously the big tradeoff is latency, as certain audio descriptors require time to pass in order to be meaningful, but beyond that, I'm not sure which audio descriptors would be most useful under short time frames and short/attack-based sound sources.
What I'm leaning towards is loudness, spectral centroid, and spectral flatness. But I'm also thinking of having some kind of 'sustain' flag, where if a sound is deemed to be periodic and lasts longer than the given analysis window, it gets classed as a "long" sound and perhaps some gestural (time-based) analysis is applied.
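Roughly what I mean, as a minimal sketch in Python/NumPy (not my actual patch, and the window length/definitions are just illustrative assumptions): those three descriptors computed over a single short window grabbed right after an onset.

```python
# Minimal sketch (NumPy, not the Max patch) of loudness, spectral centroid, and
# spectral flatness computed on one short post-onset analysis window.
import numpy as np

def short_window_descriptors(window, sr=44100):
    """Loudness (RMS in dB), spectral centroid (Hz), and spectral flatness of one window."""
    spectrum = np.abs(np.fft.rfft(window * np.hanning(len(window)))) + 1e-12
    freqs = np.fft.rfftfreq(len(window), 1.0 / sr)
    loudness = 20.0 * np.log10(np.sqrt(np.mean(window ** 2)) + 1e-12)   # crude RMS loudness
    centroid = float((freqs * spectrum).sum() / spectrum.sum())          # spectral "brightness"
    flatness = float(np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum))  # noisiness, 0..1
    return loudness, centroid, flatness

# e.g. a 512-sample window (~11.6 ms at 44.1 kHz) grabbed at an onset
noise_burst = np.random.randn(512) * np.hanning(512)
print(short_window_descriptors(noise_burst))
```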
Has anyone worked with this kind of short time-framed audio analysis coupled with onset detection? What/how/etc...?
Here's my current onset detection (the core germ comes from a Peter McCullough patch, with lots of help and measurement/tweaking from PA Tremblay too):
(here's a messy patch with some of my tests/experiments (uses Alex Harker's descriptors~ external)):
Thanks for sharing Rodrigo,
Found this page, which clarifies some of the complexity involved in impulse detection:
http://www.katjaas.nl/beatdetection/beatdetection.html
I'm quite surprised how hard it is to do proper detection of impulses, since it is quite easy for the ear to differentiate between percussive sounds...
Would love to see a big thread on this!
I've just started investigating this too. Have you had any further developments?
This might be some useful research that discusses onset detection and timbre classification.
https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/54064/Barthet%20Real-Time%20Hit%202018%20Published.pdf?sequence=2
http://www.nime.org/proceedings/2016/nime2016_paper0044.pdf
I'm writing this now before I've looked at your patches. :)
@Daanbr, the ear has the luxury of introducing latency.
Ha! True.
Indeed, one major hurdle is the realtime aspect; I'm guessing one can only approximate it, but never achieve perfect detection.
Regarding impulse detection, it would perhaps be interesting to look for hidden gems in the field of psychoacoustics. Perhaps some principles can be reverse-engineered and implemented.
I know Erkki Kurenniemi was approaching stuff that way:
"Simple circuits of neurons can act as frequency dividers or phase locked loop (PLL) frequency multipliers. "
https://books.google.nl/books?id=fsqNCgAAQBAJ page 231
@Jay Thomas, thanks for sharing those papers.
Just getting my thoughts out on those two papers before I forget.
Latency - it seems that around 20 milliseconds is where latency becomes "noticeable" or starts to interfere when playing percussion.
Gesture recognition - what's interesting in the Handsolo paper is that the better the percussionist, the more accurately the machine learning algorithm performed. In some cases 100%.
Timbre recognition - too many questions in this one, but the most important for me is: which timbre descriptor do you use, considering you only have 20ms to do the analysis? And how accurate is it going to be?
Sensors - which sensor to use for input? Piezo, condenser mic, or some combination? I imagine it depends on which timbre descriptor you want to use.
Daanbr, what's your application? In all honesty, the best thing you can do for this is reduce the size of the problem, because getting near-realtime impulse detection on all signals is actually sort of tough. Even commercial algorithms like Ableton's and Avid's fuck it up quite a bit.
Awesome, some great links in there! I'll have a look through them in detail in a few days.
For now, yes, I did get somewhere further with this. I have something that works really well, except some minor inconsistencies in the loudness descriptor.
A cleaned up version of the patch is pasted below.
Where the previous version was going wrong was that I was detecting an onset very quickly, then waiting 15ms (in Max land), and then analyzing the last 20ms of audio using descriptors~, with lots of Max slop in the mix. So not only did I have to wait a fair amount of time, but I wasn't sure that the section of audio I was analyzing correctly corresponded to when the onset was detected.
So at the moment the onset detection outputs a sample-accurate moment in time, which is then analyzed by descriptors~. The output of that happens with Max slop but, critically, it is analyzing exactly the right slice of time.
At the moment I'm using FFT settings of 256 64 for descriptors~, and to keep the overall analysis window a multiple of the FFT size (as per @a.harker's suggestion a while back), I'm analyzing a 512-sample chunk of time, which equals 11.61ms, which is also the latency of the process now.
This is a bit faster than the previous version (12ms vs 15ms), but it is significantly more accurate in its results.
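Just to spell out the idea outside of Max (a rough Python sketch, not the patch itself): the window stays a multiple of the FFT size, and the slice is cut exactly at the sample-accurate onset index rather than at whatever audio happens to arrive after scheduler delay.

```python
# Window/latency arithmetic for the settings mentioned above, plus the slicing idea.
SR = 44100
FFT_SIZE, HOP = 256, 64            # the "256 64" descriptors~ settings
WINDOW = 2 * FFT_SIZE              # 512 samples, a multiple of the FFT size
print(1000.0 * WINDOW / SR)        # ~11.61 ms -> the analysis latency quoted above

def slice_at_onset(signal, onset_sample):
    """Return exactly the WINDOW samples following the sample-accurate onset position,
    so the analyzed audio corresponds to the detected attack."""
    return signal[onset_sample:onset_sample + WINDOW]
```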
Give it a spin, particularly with live sound input (mouth/voice sounds are great for testing). The red subpatch p streamView is really useful for seeing the values over time. You just have to make 50 onsets before the zl stream starts displaying on the multisliders.
Also attaching two images comparing a literal loop of the same snare hit, with one image showing the old (first) version of this patch, and the other showing the tighter `sah~` version of the patch.


Also, more critically I am approaching the same problem in a different way on a research project I'm working on, so that will hopefully lead to a significantly more sophisticated version of this too. So more on that in the future.
@audiomatt: I needed this thing for a museum project. The input was an e-drum, for which I needed trigger detection (basically just a piezo output). The patch was controlling a solenoid-controlled snare drum. The idea was a human-robot improv session; the robot would echo the human input but come up with its own fill-ins etc. I ended up using a slightly modified version of Rodrigo's patch (thanks!). Anyway, I'm still interested in the subject matter. I have a feeling this MIDI paradigm of having binary notes with some kind of velocity value is pretty limiting in a way... (But challenging enough though!)
I can remember playing with some analog e-drum modules, which gave me the impression there was no discrete trigger (I believe it was a Simmons SDS). It was more as if it worked with envelope followers which directly controlled oscillators using the volume of the input. I would love to expand on this idea, staying away from any thresholding and trying to solve as much as possible in the floating-point sig~ realm: no latency except for the adc/dac.
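Something like this, sketched in Python/NumPy rather than sig~ objects (just an illustration of the threshold-free idea, not a working module): an envelope follower whose output directly scales an oscillator, so there is no discrete trigger at all.

```python
# Sketch of the threshold-free approach: a one-pole envelope follower with separate
# attack/release times, whose output directly drives an oscillator's level.
import numpy as np

def envelope_follower(x, sr, attack_ms=1.0, release_ms=80.0):
    """One-pole envelope follower with separate attack/release coefficients."""
    a_att = np.exp(-1.0 / (sr * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
    env = np.zeros_like(x)
    prev = 0.0
    for n, s in enumerate(np.abs(x)):
        coeff = a_att if s > prev else a_rel   # fast rise, slow fall
        prev = coeff * prev + (1.0 - coeff) * s
        env[n] = prev
    return env

sr = 44100
t = np.arange(sr) / sr
hit = np.random.randn(sr) * np.exp(-t * 40.0)          # stand-in for a piezo hit
osc = np.sin(2 * np.pi * 220.0 * t) * envelope_follower(hit, sr)  # envelope drives the oscillator
```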
How about yourselves? What kind of application are you using it for?
@Rodrigo, thanks for the update, will check it out a.s.a.p!
@Jay, 20ms seems a bit high to me for perception.
I've set myself a goal of (around) 10ms, with that already being 'feel-able' but 'pass-able'. For example, I can totally tell if I'm in a 256 I/O vector size which is 6ms (I guess 12ms round trip), and prefer to run at 128 as a maximum usable value, with 64 being my preferred setting.
That being said, I'm talking about when you can also hear the source (acoustic) sound. If you are triggering just a pad, then less so.
In terms of analysis, there is a surprising amount of information, and variations of information within the transient of percussive sounds, as most tend to have a fairly sharp (and distinctive) attack which then fades away quickly. So my focus has been on analyzing and differentiating that tiny sliver of time. You kind of sacrifice pitch accuracy, but that's not a huge concern for me.
(this is something I am working on improving using some different tools where I can keep a tiny window this way, but use only the appropriate frames for each given descriptor type)
My main interest in this is just having a more general "onset detection" algorithm that I can use elsewhere, so instead of getting a bang, or bang + velocity, I can instead get centroid and noisiness too.
But a more specific and immediate application would be for database navigation purposes, where I have a library of pre-analyzed samples which I can then trigger/play based on the descriptor information of a given attack (so if I have a loud/bright sound, select a loud/bright sound to play in its entirety from the database).
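Roughly the idea, as a hedged Python sketch (the descriptor values, filenames, and scaling here are all made up for illustration): each library sample carries a pre-analyzed descriptor vector, and an incoming attack picks its nearest neighbour.

```python
# Sketch of descriptor-based database navigation: nearest-neighbour lookup on
# pre-analyzed [loudness, centroid, flatness] vectors. All values are placeholders.
import numpy as np

library = np.array([
    [-12.0, 3500.0, 0.42],   # bright, loud hit
    [-30.0,  900.0, 0.10],   # quiet, dark hit
    [-18.0, 2100.0, 0.25],
])
filenames = ["snare_bright.wav", "tom_soft.wav", "rim_mid.wav"]

def nearest_sample(query, library, filenames):
    # normalise each column so loudness (dB) and centroid (Hz) weigh equally
    lo, hi = library.min(axis=0), library.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    q, lib = (query - lo) / scale, (library - lo) / scale
    return filenames[int(np.argmin(np.linalg.norm(lib - q, axis=1)))]

# loud/bright incoming attack -> loud/bright sample from the database
print(nearest_sample(np.array([-10.0, 3800.0, 0.5]), library, filenames))
```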
My comments about latency and perception relate to this approach in that if the sample playback is >20ms later than the (in my case) acoustic attack, it is definitely audible.
Yeah it's high, especially taking your point about sample playback being 20ms later than an acoustic attack. I was just going off what was in those two research papers. Both of them ended up with a system that had an overall latency of between 22ms and 33ms.
Here’s the table for the Handsolo project:

I haven’t made anything yet in the way of onset detection. I’ll be sure to check out your patch this weekend. Thanks for sharing!!
I'm actually trying to achieve something similar to you and what's in those papers. I'm trying to create a patch for conga/djembe to recognise different gestures, with those gestures triggering a MIDI note/event. So it would need to recognise open tone, slap, muted/closed slap, etc., and they would trigger an assigned MIDI note.
I’d also like to extract other timbral descriptors for control messages, that part seems more straightforward though.
It’s not as complex as what you’re trying to achieve. Which actually sounds like a really interesting idea.
BTW: What version of Max are you running? I don't think the descriptors~ object runs on 64-bit.
I'm in Max 8. There are 64-bit versions of his objects floating around, but I don't think they are fully nailed down and properly release-ready; it would be worthwhile asking him to see where they're at.
In principle the patch should work with any descriptor object there, as what's much improved here is being quite accurate about what frame of time is being analyzed.
This is a great thread and has really helped my understanding of onset detection - thank you! What strategies would you employ if you were building a generative system to extract descriptor information from field recordings? In this context, the field recordings are fixed media and latency is not a problem. Any advice is greatly appreciated.
These days I would do everything using FluCoMa and it's super straightforward to do stuff like that once you figure out which descriptors you want to analyze (e.g. do you care about pitch? (fluid.bufpitch~) how important is loudness? (all the stats from fluid.bufloudness~ + fluid.bufstats~), do you want comprehensive timbral information? (fluid.bufmfcc~) etc...)
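For what it's worth, here's a very rough offline analogue in Python using librosa (not the FluCoMa objects themselves, and the filename is just a placeholder) to show the shape of the fixed-media workflow: slice the recording at onsets, then summarize descriptors per slice, since latency doesn't matter at all here.

```python
# Rough offline sketch with librosa (a stand-in for the FluCoMa workflow, not equivalent to it):
# detect onsets in a field recording, then summarize descriptors for each slice.
import librosa
import numpy as np

y, sr = librosa.load("field_recording.wav", sr=None, mono=True)   # placeholder filename
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")

for start, end in zip(onsets, np.append(onsets[1:], len(y))):
    seg = y[start:end]
    if len(seg) < 2048:                  # skip slices shorter than one analysis frame
        continue
    rms = librosa.feature.rms(y=seg).mean()                            # crude loudness stand-in
    centroid = librosa.feature.spectral_centroid(y=seg, sr=sr).mean()
    flatness = librosa.feature.spectral_flatness(y=seg).mean()
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1)  # timbral summary
    print(start / sr, rms, centroid, flatness, mfcc[:3])
```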
Hi Rodrigo, thanks for getting back to me. I have been using your onset detection alongside the ZSA descriptors (couldn't find a 64-bit version of Alex Harker's descriptors~). The idea is to extract as many values from a field recording as possible and to use that information to control the parameters of electronic/electroacoustic music. Some field recordings are more percussive (footsteps, stones etc.) and some are more pitch-based (bird song, sirens etc.). I have also been extracting partials and creating harmonic/spectral systems from signal input. I have not used FluCoMa before - have you moved to FluCoMa from 'descriptors~'?
One of the things I am struggling with is detecting onsets from subtle information. I have attached a file for which I have struggled to detect onsets reliably.
I'm actually getting really good results with fluid.ampslice. Do you know the best way to extract velocity and synchronise with the onsets?
For descriptor stuff I've moved completely over to FluCoMa. I've built a whole toolbox on top of it:
http://rodrigoconstanzo.com/sp-tools/
Getting reliable onsets is tricky. You have to fine tune the parameters a bit around the material you're using.
There's also a pretty active forum/community (for FluCoMa). There's some relevant discussion in here for you I think:
https://discourse.flucoma.org/t/another-noob-question-on-scaling-non-overlaying-corpora/1648
Your toolbox is really great! I'll give FluCoMa some time investment. Thanks so much for the responses.