Detecting Speech

June 11, 2008 | 10:02 pm

Hi,

I’m thinking of trying to do some basic speaker detection in Max/MSP. More specifically, assuming I have a bunch of people sitting in a room each with their own mic, I would like to know when someone starts speaking, and which mic the speaker is using. I’m also assuming that I will have access to each of the mic signals independently in Max/MSP. Has anyone done something like this? Or could you suggest some techniques / algorithms / MSP objects that would help?

Thanks,
Miller Peterson


June 12, 2008 | 3:18 am

speech detection and determining which mic someone is speaking into are two different things.

assuming you are on a mac, you can use the aka.speech object. i had pretty good success with that in the past.

assuming you are using some kind of multi-channel interface, with a mixer that has multiple outputs, it’s fairly trivial to determine which mic someone is speaking into.

you’ve not really given us enough info to help much.


June 12, 2008 | 3:44 am

The multichannel aspect is straightforward: just use [adc~ 1 2 3 4] for four channels, and so on. For speech recognition I haven’t tried aka.speech (it utilizes the built-in speech recognition on the Mac) but apparently it works OK.

I have Dragon NaturallySpeaking for Windows, and if you put the focus/cursor in a textedit box or text file in Max, it splats the words in there :) With a bit of patching you can have a textedit that captures words, a [metro]-driven bang to report it at whatever interval (you can’t use a keystroke, as it’ll go into the textedit, though you can filter for certain ones and use them to report the text), then [route text] -> [route word1 word2 word3 etc.] to see if the word matches any that you have it looking for.
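Spelled out as rough Python pseudocode (the names and watch-words here are made up, just to show the poll-and-route flow; in Max the objects above do this work):

# Hypothetical sketch of the capture/poll/dispatch idea, not a real Max patch.
WATCH_WORDS = {"spatula": "bang_1", "aardvark": "bang_2"}   # word -> event, like [route ...]

captured_text = []   # stands in for the textedit that Dragon types into

def report():        # imagine this fired by a [metro] at some interval
    while captured_text:
        word = captured_text.pop(0).lower().strip(".,!?")
        event = WATCH_WORDS.get(word)
        if event:
            print(f"matched '{word}' -> {event}")

captured_text.extend("please hand me the spatula".split())   # Dragon splats words in
report()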

Single words are best, and actually, complicated words (like "spatula") work better than simple ones (like "and"). With enough routes, you could have hundreds of recognizable words trigger events or whatever. The only downside is that Dragon takes a fair amount of resources to run, and it takes about a half second or so to capture the word, but hey.

There’s a demo of Dragon available which lets you run it 5 (?) times, to see if it’s what you want. However, note that it trains to specific voices to work well, so if you have strangers sitting down for the first time it probably won’t do the trick. The training is pretty fast, like 10 mins, but Dragon only runs in one "user mode" at a time. Maybe aka.speech does better with new, untrained voices?

–CJ


June 12, 2008 | 5:19 pm

I should have been clearer that what I am interested in is *not* speech recognition – i.e. determining what a person said – but rather simply detecting when people start / stop speaking. aka.speech is for speech synthesis, while aka.listen is for recognizing words, so neither of those helps me.

So let me try to spell this out in more detail. I have a 4-channel soundcard, and a mixer with enough outputs so that I can have 4 mics, each coming into a different audio channel in Max/MSP. Suppose there are four people sitting in a room, each with a lapel mic on. When someone starts speaking, I want to be able to know which mic the speaker is using.

I guess there might be two parts to this. The first is distinguishing speaking from background noise. The second is that when a person speaks, maybe all mics will pick it up to a certain extent? So then you have to decide which mic had the "strongest" speech signal coming into it?
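In very rough Python pseudocode, the logic I have in mind might look like this (all names and numbers made up, just to illustrate the two parts):

# Hypothetical sketch: gate out background noise, then pick the mic
# with the strongest energy. blocks = one short audio buffer per mic.
def detect_speaker(blocks, threshold):
    energies = [sum(s * s for s in block) / len(block) for block in blocks]  # mean power
    loudest = max(range(len(energies)), key=lambda i: energies[i])
    if energies[loudest] > threshold:   # part 1: is anyone speaking at all?
        return loudest                  # part 2: which mic is strongest?
    return None                         # just background noise

# e.g. detect_speaker([[0.0, 0.01], [0.4, -0.5], [0.0, 0.02], [0.0, 0.0]], 0.01) -> 1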

Hopefully that clarifies things a bit.

Miller


June 12, 2008 | 6:35 pm

Quote: Miller wrote on Thu, 12 June 2008 10:19
—————————————————-
> I guess there might be two parts to this. The first is distinguishing speaking from background noise. The second is that when a person speaks, maybe all mics will pick it up to a certain extent? So then you have to decide which mic had the "strongest" speech signal coming into it?

I’m assuming you only want to process signals where someone is speaking. The bleed-in from the other mics is also background noise, so you just need to cut out the noise with a noise gate.

Here’s a simple patch that shows whether the signal is, on average, above some threshold. From this you could build a signal router or something to display which mics are active. The numbers in my patch would need a lot of tweaking to work in practice (I have no idea if they are appropriate). Perhaps someone with more sound recording experience could shed some light.

There are lots of possible improvements. Using a multiband expander before the noise gate could probably improve accuracy by boosting the frequency range of human speech relative to everything else. I think the built-in omx objects can do this.

[pasted Max patch omitted]
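Since the pasted patch doesn’t come through here, here’s the same idea as a rough Python sketch (assuming numpy and scipy; the threshold and band edges are placeholder guesses, and a plain band-pass filter stands in for the fancier multiband-expander idea above):

import numpy as np
from scipy.signal import butter, lfilter

SR = 44100
THRESHOLD = 0.02                # placeholder gate level; needs real-world tweaking
LO_HZ, HI_HZ = 300.0, 3400.0    # rough human-speech band

b, a = butter(2, [LO_HZ / (SR / 2), HI_HZ / (SR / 2)], btype="band")

def mic_is_active(block):
    # Band-limit to speech frequencies, then gate on short-term RMS.
    speech = lfilter(b, a, block)
    rms = np.sqrt(np.mean(speech ** 2))
    return rms > THRESHOLD

# per analysis block: active = [mic_is_active(ch) for ch in four_channels]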


June 12, 2008 | 6:47 pm

You mean 4 inputs? Outputs won't help you with microphones.

look at [thresh~]; that is likely all you need, though you might need to add
[edge~] and/or [onebang]
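
as i understand [thresh~], it reports 1 once the signal climbs above a high level and 0 once it falls below a lower one; here is that hysteresis, plus an [edge~]-style start/stop report, as a rough python sketch (levels are made up):

HI, LO = 0.05, 0.02   # hypothetical open/close levels for the gate

class SpeechGate:
    # hysteresis gate per mic, in the spirit of [thresh~] -> [edge~]
    def __init__(self, mic):
        self.mic = mic
        self.open = False

    def process(self, level):   # 'level' = current envelope of that mic
        if not self.open and level > HI:
            self.open = True
            print(f"mic {self.mic}: started speaking")  # like a rising-edge bang
        elif self.open and level < LO:
            self.open = False
            print(f"mic {self.mic}: stopped speaking")

gates = [SpeechGate(i) for i in range(4)]
# call gates[i].process(current_level) once per analysis frame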

good luck,
jonny5


