Re: How does vad work?

by Damon Casale

Oops, that should read: there *may* be consonant sounds before the trigger point.

 

Also, I turned the -t option all the way down to 1 and I’m *still* losing the first half of “guess” in “guess what”.  That’s just sad.

 

Damon

 

From: Damon Casale [mailto:damonjcasale@...]
Sent: Sunday, June 03, 2012 11:02 AM
To: 'robs'; sox-users@...
Subject: Re: [SoX-users] How does vad work?

 

I’m already using norm along with vad.

 

What’s frustrating is that vad seems to perform *less* effectively at stripping silence than the “silence” effect does.  The -p option for vad seems to be just a shot in the dark that there *may* be vowel sounds before the trigger point, and trying to capture “gue” out of “guess what” with -p seems rather silly, given how long that missing fragment is.  (Yes, that’s how much is being stripped by vad!)

 

My goal is to tightly constrain the audio clip to the detected audio, and remove the silence from both ends.  Having -s and -p options for “silence” would help enormously, since the only time “silence” seems to lose anything is in a few edge cases: certain initial consonants (c, h, k), certain final consonants (-t, -th, -d), and words ending in a nasal consonant (m or n) that’s quieter than the rest of the audio.

 

Damon
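
To make the request above concrete, here is a toy Python sketch of threshold trimming with a pre-trigger allowance. It is only an illustration under stated assumptions: it uses a plain amplitude threshold, not the cepstral measurement vad uses, and the function names, sample values, and units (samples rather than seconds) are all hypothetical.

```python
def trim_front(samples, threshold, pre):
    """Drop leading samples until one reaches the threshold,
    but keep `pre` samples before that trigger point."""
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            return samples[max(0, i - pre):]
    return []  # nothing reached the threshold


def trim_both_ends(samples, threshold, pre):
    """Trim the front, then reverse and trim again to handle the tail."""
    front = trim_front(samples, threshold, pre)
    back = trim_front(front[::-1], threshold, pre)
    return back[::-1]


# Hypothetical clip: silence, a quiet onset, a loud vowel,
# a quiet nasal tail, then silence again.
clip = [0.0, 0.0, 0.05, 0.06, 0.8, 0.9, 0.7, 0.04, 0.0, 0.0]

tight = trim_both_ends(clip, threshold=0.5, pre=0)
padded = trim_both_ends(clip, threshold=0.5, pre=2)

print(tight)   # only the loud vowel survives
print(padded)  # the quiet onset and tail are kept as well
```

With no pre-trigger allowance the quiet onset and tail are lost, which is roughly the edge case described above; a small allowance on each side keeps them at the cost of a little extra silence.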

 

From: robs [aquegg@...]
Sent: Sunday, June 03, 2012 9:56 AM
To: sox-users@...
Subject: Re: [SoX-users] How does vad work?

 

From: Damon Casale <damonjcasale@...>
To: sox-users@...
Sent: Friday, 1 June 2012, 17:36
Subject: [SoX-users] How does vad work?

 

The online documentation isn’t completely clear on how it works:

 

Voice Activity Detector. Attempts to trim silence and quiet background sounds from the ends of (fairly high resolution i.e. 16-bit, 44-48kHz) recordings of speech.

 

Options:
Default values are shown in parentheses.

-t num (7)
The measurement level used to trigger activity detection. This might need to be changed depending on the noise level, signal level and other characteristics of the input audio.

 

What is a “measurement level”?  What does the default number 7 represent?

Hello Damon, per the manual, this is a measurement of cepstral power. Basically, this is the volume (in some arbitrary units) of a vowel sound needed to trigger detection of voice, so a lower number will trigger more readily than a higher number. The -p option may help in preserving initial consonant sounds.  Remember also, that vad performs best with normalised audio (norm effect).

 

HTH,

Rob 
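
For reference, the advice above (norm first, a lower -t, some -p) can be combined into one effects chain. The Python sketch below only builds and prints the command; it does not run SoX. It assumes, as the SoX manual describes, that vad trims only from the front, so the chain reverses the audio to trim the tail. The file names and the helper name are hypothetical, and the values 1 and 0.25 are just starting points to experiment with.

```python
import shlex


def vad_trim_command(src, dst, trigger=7, pretrigger=0.0):
    """Build a sox argument list that normalises, then trims both ends with vad."""
    vad = ["vad", "-t", str(trigger), "-p", str(pretrigger)]
    # Trim the front, reverse, trim the (former) tail, reverse back.
    return ["sox", src, dst, "norm", *vad, "reverse", *vad, "reverse"]


cmd = vad_trim_command("speech.wav", "trimmed.wav", trigger=1, pretrigger=0.25)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list passed to subprocess.run, on a machine where sox is installed.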

