Voice Extraction


Also called speaker separation/segmentation (speaker diarization).
Goal: separate the voices of different speakers using voiceprint features.

I. Basic approach

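1. Speech segmentation: find the points in the audio where the speaker changes.
   The simplest approach is:
   - Slice the speech into segments.
     For slicing, the early mainstream method was a sliding window (1-2 s long).
   - Extract voiceprint features from each speech segment.
     To decide whether a segment contains only one speaker, the early mainstream method was BIC (Bayesian Information Criterion).
2. Cluster by rule to obtain each speaker's segments; this also yields the number of speakers.
3. Extract features for clustering (clustering, GMM, SVM; k-means, spectral clustering; RNN); concatenating the clustering results gives each person's speech segments.
   The features used in steps 2 and 3 are MFCCs, plus simple energy, zero-crossing rate, and speech-related formants.
4. Re-segmentation: refine the clustering result to improve speaker classification accuracy.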
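
The steps above can be sketched end to end on toy data. This is a minimal illustration, not a real diarization system: the two "speakers" are synthetic tones, short-time energy and zero-crossing rate stand in for MFCC voiceprint features, and the helper names (`sliding_windows`, `frame_features`, `kmeans_two`) are hypothetical. It assumes the number of speakers is known to be two.

```python
import math

SR = 8000  # assumed sample rate in Hz

def sliding_windows(num_samples, win, hop):
    """Return (start, end) sample indices of fixed-length sliding windows."""
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]

def frame_features(samples):
    """Short-time energy and zero-crossing rate of one window."""
    energy = sum(x * x for x in samples) / len(samples)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return [energy, crossings / (len(samples) - 1)]

def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_two(points, iters=20):
    """Tiny 2-cluster k-means; starts from the first point and the point farthest from it."""
    centroids = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid for every window.
        labels = [min((0, 1), key=lambda k: dist2(p, centroids[k])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def tone(freq, amp, seconds):
    """Synthetic stand-in for one speaker: a sine tone with a small phase offset."""
    return [amp * math.sin(2 * math.pi * freq * n / SR + 0.1) for n in range(int(seconds * SR))]

# Two seconds of "speaker A" followed by two seconds of "speaker B".
signal = tone(100, 1.0, 2.0) + tone(1000, 0.5, 2.0)
windows = sliding_windows(len(signal), win=SR, hop=SR // 2)  # 1 s windows, 0.5 s hop
features = [frame_features(signal[s:e]) for s, e in windows]
labels = kmeans_two(features)

for (s, e), lab in zip(windows, labels):
    print(f"{s / SR:.1f}-{e / SR:.1f} s -> cluster {lab}")
```

The window that straddles the change at 2.0 s contains both tones, so its cluster assignment is essentially arbitrary; that is the kind of error the re-segmentation step is meant to clean up.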

Here, "early" refers to 2008.
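
The BIC test mentioned for the segmentation step compares modelling a window with one Gaussian against modelling it with two Gaussians split at a candidate point. A positive difference (delta-BIC) favours a speaker change. The sketch below assumes a 1-D feature sequence and toy data for brevity; real systems apply the same idea to multi-dimensional features with full covariance matrices, and the penalty weight `lam` is a tunable assumption.

```python
import math

def variance(xs):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    m = sum(xs) / len(xs)
    return max(sum((x - m) ** 2 for x in xs) / len(xs), 1e-12)

def delta_bic(xs, split, lam=1.0):
    """Delta-BIC for splitting a 1-D sequence at index `split`.

    delta = N/2 * log(var) - N1/2 * log(var1) - N2/2 * log(var2) - penalty
    Positive values favour two Gaussians (a speaker change) over one.
    """
    n, n1, n2 = len(xs), split, len(xs) - split
    gain = 0.5 * (n * math.log(variance(xs))
                  - n1 * math.log(variance(xs[:split]))
                  - n2 * math.log(variance(xs[split:])))
    # With d = 1 the second model has d + d(d+1)/2 = 2 extra parameters.
    penalty = lam * 0.5 * 2 * math.log(n)
    return gain - penalty

# Toy sequences: one with a mean shift halfway through, one without.
changed = [-1.0, 1.0] * 100 + [4.0, 6.0] * 100
same = [-1.0, 1.0] * 200

print(delta_bic(changed, 200) > 0)  # True: a change point is detected
print(delta_bic(same, 200) > 0)     # False: a single model is enough
```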

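:o: If several people speak at the same time, their frequencies superimpose and the result is taken to be the voice of a new speaker.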

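:o: Unlike a standard supervised classification task, a speaker diarization model must stay robust when recognizing and classifying speakers it has never encountered, yet training cannot cover the full variety of real-world speakers. This largely limits the real-time capability of speech recognition systems, online systems in particular.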

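:o: Typical unsupervised clustering methods such as k-means and spectral clustering struggle to cluster effectively in online speaker recognition, where audio arrives as a continuous stream.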
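
To make the streaming difficulty concrete, here is a naive online baseline, offered only as an illustration and not as a method from this post: each incoming segment embedding joins the most similar existing speaker if the cosine similarity clears a threshold, otherwise it opens a new speaker. The class name `OnlineClusterer`, the threshold value, and the 2-D embeddings are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class OnlineClusterer:
    """Greedy one-pass clustering: decisions are made per segment and never revised."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.centroids = []  # running mean embedding per speaker
        self.counts = []     # number of segments assigned to each speaker

    def add(self, emb):
        best, best_sim = None, -1.0
        for k, c in enumerate(self.centroids):
            sim = cosine(emb, c)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is None or best_sim < self.threshold:
            # Nothing similar enough: treat this segment as a new speaker.
            self.centroids.append(list(emb))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Fold the segment into the running mean of the matched speaker.
        n = self.counts[best]
        self.centroids[best] = [(c * n + e) / (n + 1) for c, e in zip(self.centroids[best], emb)]
        self.counts[best] = n + 1
        return best

clusterer = OnlineClusterer(threshold=0.8)
stream = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [1.0, 0.0], [0.2, 0.9]]
labels = [clusterer.add(e) for e in stream]
print(labels)  # → [0, 0, 1, 0, 1]
```

Because each decision is final and depends on the threshold and on the order in which segments arrive, an early mistake cannot be corrected later. Batch methods such as k-means avoid this only by seeing all the data at once.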

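:o: Clustering quality plays an important role in speaker diarization as a whole, but as long as unsupervised methods remain mainstream, these algorithms cannot be improved through supervised learning on speech samples.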

ALIZE Speaker Diarization (last repository update: July 2016; last release: February 2013, version: 3.0): ALIZE Diarization System, developed at the University of Avignon; a release 2.0 is available.
SpkDiarization (last release: September 2013, version: 8.4.1): LIUM_SpkDiarization tool.
Audioseg (last repository update: May 2014; last release: January 2010, version: 1.2): AudioSeg is a toolkit dedicated to audio segmentation and classification of audio streams.
SHoUT (last update: December 2010; version: 0.3): SHoUT is a software package developed at the University of Twente to aid speech recognition research. SHoUT is a Dutch acronym for Speech Recognition Research at the University of Twente.
pyAudioAnalysis (last repository update: August 2018): Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.
LIUM Speaker Diarization (http://www-lium.univ-lemans.fr/diarization/doku.php/welcome) pyAudioAnalysis.
https://github.com/hcook/gmm (based on http://digitalassets.lib.berkeley.edu/techreports/ucb/text/EECS-2011-128.pdf).
SIDEKIT.
https://projets-lium.univ-lemans.fr/s4d/.
The Future of Lead Qualification https://www.squadvoice.co/.
Powering The Future Of Work https://www.squadplatform.com/.