yagodequay

I'm trying to create a patch that detects when someone is speaking. WHEN, not WHAT. I'm NOT concerned with what they are saying (e.g. commands, words...). I just need an algorithm that detects a human voice and outputs yes, that is a human voice, or no (e.g. a toggle?). Rather simple, yet I haven't seen anything like this.

need-voice-detection-not-speech-recognition

Do you mean something that can tell the difference between a speaking voice and, say, an oboe *or* something that merely detects the presence of a signal? In the former case, there's a very good reason you haven't seen anything like it (it'd be very complicated to do). In the latter case, you'd be using thresh~ or something....

Thanks for the reply. Yes, I need something to detect the presence of a voice, not just any signal.

I don't need it to tell the difference, only yes or no, 1 or 0. If the oboe and the human voice have the same characteristics, then it would be fine to confuse them... I'm not planning on presenting this anywhere musical; actually it's supposed to be outside a building.

It just needs to output YES to a human voice, and NO to a car, pedestrian, dog, bird, construction, whatever... No need for fine algorithms, mistakes can be made (false negatives = ok).

Spectrum analysis, centroid, harmonics...? No ideas, people?

It sounds to me like the sort of thing that would only really work well if you "trained/calibrated" the patch for a specific person's voice. That way you would know the formant structure you needed to detect.

I'm not 100% sure about this, but aren't formants of the human voice very similar person to person? If so then FFT analysis for patterns of formants over a short period of time might be a place to start.

Just some ideas, I've never attempted anything remotely like that before though. It will be very complex as GT pointed out.

That said, speech recognition software does exist, so it's possible!

Yep, its more complicated cousin, speech recognition, is out there. I just need the underlying structure. Thought there were tools out there for this already :S

Good idea about the formants, I'll check that out... I'm sure you're on the right track.

First thing to do would be to filter out non-speech frequencies, i.e. below 200 Hz and above 2000 Hz (maybe even tighter than that? You'll have to experiment). Then set a threshold so that when the audio coming in on those frequencies is at a high enough level, you trigger a YES. That would be the simplest way to go. You could also add a time threshold, so that the audio coming in on those frequencies has to exceed the level threshold for a certain number of milliseconds before a YES is triggered.
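The level threshold plus time threshold can be sketched in a few lines. This is only an illustration in Python, not a Max patch: it assumes the signal has already been band-passed to roughly 200 - 2000 Hz and reduced to one level value per analysis frame, and the name `voice_gate` and its parameters are made up for the example.

```python
def voice_gate(band_levels, threshold, min_frames):
    """Return one 0/1 flag per frame.

    band_levels: level of the band-passed signal, one value per frame
                 (assumed to be measured elsewhere, e.g. after a 200-2000 Hz filter).
    threshold:   level a frame must exceed to count as "loud enough".
    min_frames:  how many consecutive loud frames are needed before a YES (1).
    """
    flags = []
    run = 0  # number of consecutive frames above the threshold so far
    for level in band_levels:
        run = run + 1 if level > threshold else 0
        flags.append(1 if run >= min_frames else 0)
    return flags


levels = [0.1, 0.6, 0.7, 0.2, 0.6, 0.6, 0.6, 0.6]
print(voice_gate(levels, threshold=0.5, min_frames=3))
```

With 10 ms frames, `min_frames=3` would correspond to a 30 ms time threshold; both numbers are placeholders to tune by ear.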

I also know that commercial speech recognition algorithms incorporate some kind of periodicity sensing. In other words, they look for periodic spikes that are indicative of continuous speech. That is much more complicated and I can't say I would know where to start on that front.
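One common way to get a rough periodicity measure is the normalized autocorrelation of a short frame. The sketch below is an assumption about how such sensing could be approached, not a description of what any commercial recognizer does; the function name and the 80 - 400 Hz search range are illustrative choices.

```python
import math


def periodicity(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Return the largest normalized autocorrelation found for lags that
    correspond to fundamentals between f_min and f_max.

    Values near 1 suggest a strongly periodic frame (e.g. a vowel),
    values near 0 suggest noise. Longer lags are slightly understated
    because fewer samples overlap.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        acc = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, acc / energy)
    return best


# A 200 Hz tone sampled at 8 kHz should score high.
tone = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(400)]
print(periodicity(tone, 8000))
```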

I would probably use one of the speech recognition externals, and if it recognises a word, even wrongly, then someone is probably speaking...

I'd probably experiment with doing a "statistical" analysis of how much harmonic content (i.e. vowels) and how much noise components (consonants) are present in the sound and how they are spread, while limiting the detection of harmonic sounds to a typical speaking range and also taking note of the frequency ranges of the noise components.

If you play an oboe, you will have mostly harmonic content; if you blow into the microphone, it will be mostly noise. Speech in a specific language, however, will probably have a certain, rather typical balance of noise and harmonic content, and both will have frequency ranges with a certain statistical setup.

This will hardly work well if just a single word is spoken, but may work well enough for longer sentences.
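One possible way to put a number on "harmonic vs. noise" is spectral flatness (geometric mean of the power spectrum divided by its arithmetic mean): close to 0 for a tonal frame, much higher for a noisy one. This is a swapped-in measure chosen for illustration, not something proposed in the thread; the function names and the 0.2 cutoff are assumptions, and the slow direct DFT is only there to keep the sketch dependency-free.

```python
import cmath
import math


def power_spectrum(frame):
    """Power of DFT bins 1 .. n/2 - 1 (DC excluded), computed directly."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(1, n // 2)
    ]


def spectral_flatness(power):
    """Geometric mean / arithmetic mean: near 0 = tonal, higher = noisy."""
    eps = 1e-12  # avoids log(0) on empty bins
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power) + eps)


def tonal_fraction(flatness_per_frame, cutoff=0.2):
    """Share of frames that look harmonic; the cutoff is a guess to tune."""
    tonal = sum(1 for f in flatness_per_frame if f < cutoff)
    return tonal / len(flatness_per_frame)
```

Collecting `tonal_fraction` over a second or two of frames gives the kind of harmonic/noise balance described above, which could then be compared against a range measured from real speech.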

P.S. This technique might even go so far as allowing you to detect the language someone is speaking, statistically, if you also measure -which- vowels are spoken how often (by tracking the formants) etc., since every language has a typical setup of how often which sounds appear.

That typical setup might not be as easy to tease out as you think - think about how working with stress-timed vs. syllable-timed languages would potentially affect this (just to get all linguist nerdy on you), for example....

Oh, I'm sure it wouldn't be easy at all. And yeah, it may be a lot easier to concentrate on recognizing speech in a single language like this, than any speech.

Of course the task would be a lot easier if it wasn't about recognizing speech vs. anything else, but recognizing speech vs something specific (or at least a certain range of other sounds).

I guess the most foolproof way would still be to lock a human listener in a cardboard box, write "computer" on that box and have this person determine whether someone is speaking or not :P

These are all great ideas. I implemented @swieser1's suggestion with these constraints:

1) Centroid has to be within a determined window (e.g. 200 - 2000 Hz)
2) Stable (measured by the standard deviation of a list of centroid values)
3) Above a threshold
4) All within a few hundred ms

However, I noticed, like many have mentioned, that the human voice makes some beautiful periodic spikes in the spectrum... The next step is to use that. Maybe check whether the spikes are spread out at regular intervals... zsa.freqpeak~ does this...

It's totally fine if this system mistakes a trumpet, oboe, or sax for a human voice. Though NOT a car horn, hammer, fridge hum, etc...

I wish I could send my whole patch to you guys, but I have so many externals and I don't know how to compile the whole thing to send :S

Anyway, a must-have is "zsa.descriptors": http://www.e--j.com/?page_id=85
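For anyone reading without the patch, the centroid window and stability checks can be sketched roughly like this. It is a Python illustration only, not the zsa.descriptors implementation; the function names and the 150 Hz stability limit are assumptions.

```python
import statistics


def spectral_centroid(magnitudes, sample_rate, frame_size):
    """Magnitude-weighted mean frequency in Hz.

    magnitudes[k] is the magnitude of FFT bin k (0 .. frame_size // 2).
    """
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    bin_hz = sample_rate / frame_size
    return sum(k * bin_hz * m for k, m in enumerate(magnitudes)) / total


def centroid_looks_like_voice(centroids, low_hz=200.0, high_hz=2000.0, max_std_hz=150.0):
    """True if every recent centroid is inside the window and they are stable.

    centroids: centroid values from the last few hundred ms of frames.
    max_std_hz: allowed standard deviation, a placeholder to tune.
    """
    in_window = all(low_hz <= c <= high_hz for c in centroids)
    stable = statistics.pstdev(centroids) <= max_std_hz
    return in_window and stable


print(centroid_looks_like_voice([500, 520, 510, 505]))
print(centroid_looks_like_voice([500, 1900, 300, 1500]))
```

The level threshold would be checked separately, and the length of the `centroids` list sets the "few hundred ms" window.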


Looking at "singing-voice~.help" from CNMAT will perhaps help you better understand how voice sounds work.

I feel that for low-frequency (male) human voices, it would be rather easy to distinguish them from other animal sounds by analysing the spectrum shape between 0 Hz and 4000 Hz, because human vowels act like "sharp" filters compared to the cries of cats and dogs.

But for high-frequency female human voices, it might be more difficult... (only a few harmonics, so it is harder to see a "spectrum shape"; even for our ears it is sometimes difficult to clearly distinguish vowels in very high-pitched female voices)

In the next version of Zsa.descriptors, there will be a zsa.mfcc~ object. Hopefully it will be released in the next couple of weeks or so.


For a signal to be considered voice it has to...

1) Centroid has to be within a determined window (e.g. 200 - 2000 Hz)
2) Stable (measured by the standard deviation of a list of centroid values)
3) Be above a volume threshold
4) Harmonic - formants have to be evenly spaced
5) Longer than a few hundred ms

Back to your point @Alexandre, is there a way to measure a match between the spectrum created in singing-voice~ (all possible permutations in pitch and vowels!) and the input?


> all possible permutations in pitch and vowels

I don't think this will be useful for you... "all possible permutations in pitch and vowels": you'll simply get all the frequencies between 200 Hz and 2000 Hz.

What is needed, I think, is to measure how "sharp", like a comb, the spectrum of your periodic sound is: very sharp = male human voice, not sharp (all harmonics are present) = animal cry or acoustic instrument.

>> It's totally fine if this system mistakes a trumpet, oboe, sax, with a human voice.
>> Though, NOT with a car horn, hammer, fridge hum, etc...

Then, in this simple case, all you need to know is whether the sound is periodic enough or not. Maybe use sigmund~ and look at how evenly spaced the harmonics/formants are...
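A rough idea of what "evenly spaced" could mean in practice: take the peak frequencies reported by a peak tracker, estimate the spacing, and count how many peaks sit near a whole multiple of it. The sketch below is plain Python and makes no claim about sigmund~'s actual output format; `harmonic_fraction` and the 5% tolerance are made up for the example.

```python
import statistics


def harmonic_fraction(peaks_hz, tolerance=0.05):
    """Fraction of peaks lying near an integer multiple of the median
    spacing between neighbouring peaks (a crude fundamental estimate).

    Returns a value in 0..1; close to 1 means the peaks form an even comb.
    Missing harmonics can make the spacing estimate too large, so treat
    the result as a hint rather than a verdict.
    """
    peaks = sorted(peaks_hz)
    if len(peaks) < 3:
        return 0.0
    spacing = statistics.median(b - a for a, b in zip(peaks, peaks[1:]))
    if spacing <= 0:
        return 0.0
    hits = 0
    for p in peaks:
        ratio = p / spacing
        nearest = round(ratio)
        if nearest >= 1 and abs(ratio - nearest) <= tolerance:
            hits += 1
    return hits / len(peaks)


print(harmonic_fraction([150, 300, 450, 600, 750]))
```

A car horn would probably also pass this test, so it would need to be combined with the other checks rather than used alone.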

I think it may be misleading to look too much at singing voices. Singing creates a great emphasis on vowels/harmonic sounds, whereas we don't tend to stretch our vowels that much when we speak, so consonants make up a great portion of spoken words. Many words, when spoken normally, will only have some short harmonic components and consist to a great part of differently filtered noise. And even the harmonic components that will be present won't be extremely stable when spoken quickly and will contain many fluctuations.

That's why I'm a bit sceptical about restricting it to harmonic sounds and ignoring noise altogether. Although the same approach may work fine for sung words or words spoken very slowly and relatively loudly (since the quieter we speak and the closer it gets to whispering, the more the consonants start to overshadow the vowels).

@Szrp: So true... I'm noticing that now :S

If one speaks very fast, my system doesn't respond. However, if one enunciates well, I get pretty decent results, especially because I made it respond to very fast cues, around 250 ms. Not fast enough to be 100% right. Anyone with insight on this topic!?

What about doing video tracking first, like face tracking? Then, if you detect a face, you do the sound analysis.

I think it needs to be a mixture of time-domain and spectral analysis. In speech there is simply a lot of change in a short time. You could track the amount of variation within the speech spectrum, and just assume it's not speech if there is too much signal outside this range.

Maybe it's even enough to only track the amplitude changes within that spectral range... It would have some latency, of course...
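Tracking only the amplitude changes could look something like the sketch below: measure how much the band-limited level fluctuates over a window of frames and accept only a middle range of variation. This is a guess at one way to do it, in Python for illustration; the function names and the 0.3 - 1.5 limits are assumptions to be tuned against real recordings.

```python
import statistics


def envelope_variation(levels):
    """Coefficient of variation (std / mean) of per-frame levels.

    A steady hum gives a value near 0; speech, with its syllable-rate
    bursts, gives a much larger value.
    """
    mean = statistics.fmean(levels)
    if mean == 0:
        return 0.0
    return statistics.pstdev(levels) / mean


def varies_like_speech(levels, low=0.3, high=1.5):
    """True if the fluctuation falls inside an assumed speech-like range."""
    return low <= envelope_variation(levels) <= high


print(varies_like_speech([0.5] * 10))
print(varies_like_speech([0.1, 0.8, 0.2, 0.9, 0.1, 0.7]))
```

The latency mentioned above comes from the window length: the decision can only be made once enough frames have been collected.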

Keep us posted, it's an interesting topic...

Trying to revive this thread again... Maybe new developments were made?

Why not "simply" employ speech recognition? And any output from it is translated back into "1", essentially. That's what I'd do, anyway.

http://www.eurasip.org/Proceedings/Eusipco/Eusipco2009/contents/papers/1569192958.pdf
