Calculates boundaries between speech and music.
Outputs
- Segmentation
  - Impulses at the boundary points.
- Detection function
  - Function used to find boundaries.
Parameters
- Resolution
  - The number of frames in the window over which candidate changes are searched for (default = 256)
- Change threshold
  - The threshold of skewness difference at which a candidate change will be marked (default = 0.0781)
- Decision threshold
  - The threshold used to classify segments as speech or music (default = 0.2734)
- Margin
  - The margin around the mean zero-crossing rate (ZCR) within which ZCR samples are ignored when computing the ZCR skewness (default = 14)
- Minimum music segment length
  - Music segments shorter than this minimum length will be dismissed (default = 0)
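To make the Margin parameter concrete, here is a minimal Python sketch of a margin-based ZCR skewness. It is not taken from the plugin source. The function names, the use of a raw per-frame crossing count, and the exact formula (frames above the dead zone minus frames below it, divided by the number of frames) are assumptions made for illustration only.

```python
from typing import Sequence


def zero_crossing_count(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples in one frame."""
    return sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )


def margin_skewness(zcr: Sequence[float], margin: float = 14.0) -> float:
    """Asymmetry of ZCR values around their mean, ignoring a dead zone.

    ASSUMPTION: values within `margin` of the mean are not counted, and
    the result is (count above - count below) / number of values.
    """
    if not zcr:
        return 0.0
    mean = sum(zcr) / len(zcr)
    above = sum(1 for z in zcr if z > mean + margin)
    below = sum(1 for z in zcr if z < mean - margin)
    return (above - below) / len(zcr)
```

Under this assumed definition, a window with a few frames of very high ZCR and many frames just below the mean yields a positive value, because the frames near the mean fall inside the margin and are not counted.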
Description
This Vamp plugin is heavily inspired by the approach described in [1].
The algorithm works as follows:
- Measure the skewness of the distribution of zero-crossing rate across the audio file;
- Find points at which this distribution changes drastically;
- For each candidate change point found, classify the corresponding segment as follows:
  - Mean skewness > threshold: speech
  - Mean skewness < threshold: music
  - If the segment has the same type as the previous one, merge it with the previous one.
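The steps above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the plugin's actual implementation. It takes a precomputed per-frame skewness sequence as input. The function names, the comparison of adjacent windows to detect changes, and the treatment of a mean exactly equal to the threshold as music are all hypothetical choices. The minimum music segment length is left out for brevity.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start frame, end frame (exclusive), label)


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def find_change_points(
    skew: Sequence[float],
    resolution: int = 256,
    change_threshold: float = 0.0781,
) -> List[int]:
    """Mark a candidate change wherever the mean skewness of two adjacent
    windows of `resolution` frames differs by more than the threshold.

    ASSUMPTION: adjacent-window comparison is one plausible reading of
    "points at which this distribution changes drastically".
    """
    points = []
    for start in range(resolution, len(skew) - resolution + 1, resolution):
        left = _mean(skew[start - resolution:start])
        right = _mean(skew[start:start + resolution])
        if abs(right - left) > change_threshold:
            points.append(start)
    return points


def classify_and_merge(
    skew: Sequence[float],
    change_points: Sequence[int],
    decision_threshold: float = 0.2734,
) -> List[Segment]:
    """Label each segment between change points and merge equal neighbours."""
    bounds = [0, *change_points, len(skew)]
    segments: List[Segment] = []
    for start, end in zip(bounds, bounds[1:]):
        if end <= start:
            continue
        # ASSUMPTION: a mean exactly equal to the threshold counts as music.
        is_speech = _mean(skew[start:end]) > decision_threshold
        label = "speech" if is_speech else "music"
        if segments and segments[-1][2] == label:
            # Same type as the previous segment: extend it instead.
            segments[-1] = (segments[-1][0], end, label)
        else:
            segments.append((start, end, label))
    return segments
```

For example, with `skew = [0.5] * 4 + [0.0] * 4 + [0.1] * 4` and `resolution=2`, candidate changes are found at frames 4 and 8, and the two trailing music segments are merged into one.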
This is a very early prototype, so it is not very accurate. It is relatively fast, taking around 1 s to process a 20-minute file.
References
[1] J. Saunders, "Real-time discrimination of broadcast speech/music," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 993-999, 7-10 May 1996.