How to detect oral activity

Hi all, this is Chris Sun from team VR Rehearsal. Oral activity is an important factor that VR Rehearsal looks into to assess the user’s fluency of speech and, based on it, to define the behavior of the virtual audiences.

Voice Activity Detection

To detect oral activity, i.e. whether the user is speaking or not, we examine the audio data the application records every frame. We rely on a volume threshold to determine whether any voice activity takes place in a given frame.
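As a rough sketch of what that per-frame check might look like, here is some illustrative Python; the RMS volume measure, the frame layout, and the threshold value are assumptions for this example, not our actual code:

```python
import numpy as np

# Illustrative value only; the real threshold has to be tuned (see below).
VOLUME_THRESHOLD = 0.02  # normalized RMS amplitude

def frame_has_voice(samples: np.ndarray) -> bool:
    """Return True if this frame's volume exceeds the threshold.

    `samples` is assumed to be one frame of mono audio samples
    normalized to the range [-1.0, 1.0].
    """
    rms = np.sqrt(np.mean(np.square(samples)))
    return rms > VOLUME_THRESHOLD
```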

Threshold Check

Based on this, how can we further determine whether the user is speaking? A frame of fairly high volume could be irrelevant noise during a silent interval, while a frame of fairly low volume could be a regular pause between words.

The trick to deal with this is to step back from any single frame, look at the whole series of frames, and concentrate on “state change” rather than the state of one frame. We first define a “current state” of the user: the user is either speaking or not, never both.

Then, whenever a frame of the opposite activity appears, we add that frame’s duration to a timer; the timer shows how long the user has “suspiciously” been in the other state, and, if it grows long enough, we decide that the user’s state has changed.

State Transition

If, instead, a frame of the current state appears before the “opposite state” has persisted long enough, we assume the suspicious “opposite state” was just noise and discard it, resetting the timer.
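Putting the two rules together, the whole detector can be sketched as a small state machine. Again this is illustrative Python under assumptions: the class name, the per-frame duration argument, and the 0.3-second default are all made up for the example:

```python
class VoiceActivityDetector:
    """Tracks whether the user is speaking, with hysteresis."""

    def __init__(self, transition_threshold: float = 0.3):
        # How long the opposite state must persist before we flip (seconds).
        self.transition_threshold = transition_threshold
        self.speaking = False      # current state
        self.opposite_timer = 0.0  # time the opposite state has persisted

    def update(self, frame_has_voice: bool, frame_duration: float) -> bool:
        if frame_has_voice != self.speaking:
            # The frame disagrees with the current state: grow the timer.
            self.opposite_timer += frame_duration
            if self.opposite_timer >= self.transition_threshold:
                # The opposite state has lasted long enough: change state.
                self.speaking = frame_has_voice
                self.opposite_timer = 0.0
        else:
            # The frame agrees with the current state: the suspicious
            # opposite state was just noise, so discard it.
            self.opposite_timer = 0.0
        return self.speaking
```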

The final trick is to test widely and find the best volume threshold and state-transition threshold for the algorithm.
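In practice, that tuning could be as simple as a grid search over recordings whose frames have been labeled by hand. The sketch below assumes the `VoiceActivityDetector` above and a hypothetical `labeled_frames` dataset of `(rms_volume, frame_duration, is_speech)` tuples:

```python
import itertools

def accuracy(volume_threshold, transition_threshold, labeled_frames):
    """Fraction of frames where the detector agrees with the hand labels."""
    detector = VoiceActivityDetector(transition_threshold)
    correct = 0
    for rms, duration, is_speech in labeled_frames:
        predicted = detector.update(rms > volume_threshold, duration)
        correct += (predicted == is_speech)
    return correct / len(labeled_frames)

def best_parameters(labeled_frames):
    candidates = itertools.product(
        [0.01, 0.02, 0.05, 0.10],  # candidate volume thresholds (guesses)
        [0.1, 0.2, 0.3, 0.5],      # candidate transition thresholds, seconds
    )
    return max(candidates,
               key=lambda p: accuracy(p[0], p[1], labeled_frames))
```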