Applications covered:
speechbrain – PyTorch powered speech toolkit, https://speechbrain.github.io/
espnet – end-to-end speech processing toolkit, https://github.com/espnet/espnet
Pretrained models are mostly downloaded from Hugging Face.
Transformer-based neural network for speech separation
C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi and J. Zhong, "Attention Is All You Need In Speech Separation," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 21-25. PDF
We will use the SepformerSeparation class from the SpeechBrain library with the pretrained model speechbrain/sepformer-wsj02mix from Hugging Face (https://huggingface.co/speechbrain/sepformer-wsj02mix).
# Import
from speechbrain.pretrained import SepformerSeparation
# Create and download pretrained model
speech_separator = SepformerSeparation.from_hparams(
source='speechbrain/sepformer-wsj02mix', # Model name
savedir='pretrained_models/sepformer-wsj02mix',
run_opts={'device':'cuda'} # Use GPU
)
# Apply model to example audio
estimated_sources = speech_separator.separate_file(
path='speechbrain/sepformer-wsj02mix/test_mixture.wav'
)
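separate_file returns a tensor of shape (batch, time, number of sources). One way to listen to the two estimates is to write them to disk; a minimal sketch, assuming torchaudio is available and using the 8 kHz rate of the wsj02mix model:
# Save each estimated source as an 8 kHz wav file (output file names are just examples)
import torchaudio
torchaudio.save('estimated_source1.wav', estimated_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save('estimated_source2.wav', estimated_sources[:, :, 1].detach().cpu(), 8000)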
Signal with two male speakers talking in different languages (English and French) at the same time
import torch

# Estimate sources from the mixture signal
estimated_sources = speech_separator.separate_batch(
    torch.from_numpy(mixture.data).float()[None, :]
)
# Create audio containers for source1 and source2
source1 = AudioContainer(
data=tensor_to_numpy(estimated_sources[:, :, 0]),
fs=8000
)
source2 = AudioContainer(
data=tensor_to_numpy(estimated_sources[:, :, 1]),
fs=8000
)
Transformer-based neural network, here applied to speech enhancement
C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi and J. Zhong, "Attention Is All You Need In Speech Separation," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 21-25. PDF
We will again use the SepformerSeparation class from the SpeechBrain library, this time with the pretrained model speechbrain/sepformer-whamr-enhancement from Hugging Face (https://huggingface.co/speechbrain/sepformer-whamr-enhancement).
# Import
from speechbrain.pretrained import SepformerSeparation

# Create and download pretrained model
speech_enhancer = SepformerSeparation.from_hparams(
    source='speechbrain/sepformer-whamr-enhancement',
    savedir='pretrained_models/sepformer-whamr-enhancement'
)

# Apply model to example noisy audio
est_sources = speech_enhancer.separate_file(
    path='speechbrain/sepformer-whamr-enhancement/example_whamr.wav'
)
Mixture signal with two male speakers talking in different languages (English and French) at the same time in a metro station.
# Separate the noisy mixture into two (still noisy) sources
noisy_estimated_sources = speech_separator.separate_batch(
    torch.from_numpy(noisy_mixture.data).float()[None, :]
)
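The noisy_source1 and noisy_source2 containers used in the enhancement step below are built from this separator output in the same way as source1 and source2 earlier; a minimal sketch, reusing the same AudioContainer and tensor_to_numpy helpers:
# Wrap the two separated (still noisy) sources into audio containers (8 kHz output)
noisy_source1 = AudioContainer(
    data=tensor_to_numpy(noisy_estimated_sources[:, :, 0]),
    fs=8000
)
noisy_source2 = AudioContainer(
    data=tensor_to_numpy(noisy_estimated_sources[:, :, 1]),
    fs=8000
)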
# Enhance each separated source with the speech enhancement model
estimated_source1 = speech_enhancer.separate_batch(
    torch.from_numpy(noisy_source1.data).float()[None, :]
)
estimated_source2 = speech_enhancer.separate_batch(
    torch.from_numpy(noisy_source2.data).float()[None, :]
)
Language identification is the task of automatically identifying the language of a given spoken utterance.
Desplanques, B., Thienpondt, J., Demuynck, K., ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Proc. Interspeech 2020, pp. 3830-3834. PDF
We will use the EncoderClassifier class from the SpeechBrain library with the pretrained model speechbrain/lang-id-voxlingua107-ecapa from Hugging Face (https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa).
# Import
from speechbrain.pretrained import EncoderClassifier
# Create and download pretrained model
language_id = EncoderClassifier.from_hparams(
source="speechbrain/lang-id-voxlingua107-ecapa",
savedir='pretrained_models/lang-id-voxlingua107-ecapa',
run_opts={"device":"cuda"}
)
# Identify language
signal = language_id.load_audio("https://omniglot.com/soundfiles/udhr/udhr_fi.mp3")
lang = language_id.classify_batch(signal)[-1][0]
Identified language: Finnish
Let's estimate languages for the sound sources separated earlier in the speech separation section.
# Identify languages for both sources
prediction1 = language_id.classify_batch(
wavs=torch.from_numpy(source1.data).float()
)
prediction2 = language_id.classify_batch(
wavs=torch.from_numpy(source2.data).float()
)
Identified language for source 1: French
Identified language for source 2: English
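The language codes lang1 and lang2 used to select ASR models in the next section can be derived from these predictions; a sketch, assuming the VoxLingua107 labels have the form 'fr: French' (if your model version returns bare ISO codes, drop the split):
# classify_batch returns (posterior probabilities, score, index, text labels);
# keep only the ISO code part of the first label in each batch
lang1 = prediction1[-1][0].split(':')[0].strip()  # e.g. 'fr'
lang2 = prediction2[-1][0].split(':')[0].strip()  # e.g. 'en'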
S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, Hybrid CTC/Attention Architecture for End-to-End Speech Recognition, in IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, Dec. 2017. PDF
We will use the EncoderASR or EncoderDecoderASR classes from the SpeechBrain library with pretrained models of the form speechbrain/asr-wav2vec2-commonvoice-LANG from Hugging Face (https://huggingface.co/speechbrain/asr-wav2vec2-commonvoice-en).
# Import
from speechbrain.pretrained import EncoderDecoderASR
# Create and download pretrained model
ASR = EncoderDecoderASR.from_hparams(
source='speechbrain/asr-wav2vec2-commonvoice-en',
savedir='pretrained_models/asr-wav2vec2-commonvoice-en',
run_opts={'device':'cuda'} # for GPU
)
# Apply speech recognition
text = ASR.transcribe_file('speechbrain/asr-wav2vec2-commonvoice-en/example.wav')
Recognized speech: THE BIRCH CANOE SLID ON SMOOTH PLANKS
Let's estimate the content for both speech signals separated earlier in the speech separation section.
from speechbrain.pretrained import EncoderASR, EncoderDecoderASR
# Create models with languages (English and French)
asr_models = {
'fr': EncoderASR.from_hparams(
source='speechbrain/asr-wav2vec2-commonvoice-fr',
savedir='pretrained_models/asr-wav2vec2-commonvoice-fr',
run_opts={"device":"cuda"} # for GPU
),
'en': EncoderDecoderASR.from_hparams(
source='speechbrain/asr-wav2vec2-commonvoice-en',
savedir='pretrained_models/asr-wav2vec2-commonvoice-en',
run_opts={"device":"cuda"} # for GPU
),
}
# Transcribe source 1 with the ASR model matching its identified language
pred_str1, pred_tokens1 = asr_models[lang1].transcribe_batch(
    wavs=torch.from_numpy(source1.resample(16000).data).float().unsqueeze(0),
    wav_lens=torch.tensor([1.0])
)
text1 = pred_str1[0]
Recognized content for source 1 [fr]: TOUS LES ÊTRES HUMAINS NAISSENT LIBRES ET ÉGAUX EN DIGNITÉ ET EN DROIT ILS SONT DOUÉS DE RAISONS ET DE CONSCIENCE ET DOIVENT AGIR LES UNS ENVERS LES AUTRES DANS UN ESPRIT DE FRATERNITÉ
# Transcribe source 2 with the ASR model matching its identified language
pred_str2, pred_tokens2 = asr_models[lang2].transcribe_batch(
    wavs=torch.from_numpy(source2.resample(16000).data).float().unsqueeze(0),
    wav_lens=torch.tensor([1.0])
)
text2 = pred_str2[0]
Recognized content for source 2 [en]: ALL HUMAN BEINGS ON FREE AND EQUAL IN DIGNITY AND RIGHT THEY ARE ENDOWED WITH REASON AND CONSCIENCE AND SHOULD ACT TOWARDS ONE ANOTHER IN A SPIRIT OF BROTHERHOOD
An end-to-end speech synthesis system by Google
Wang, Yuxuan, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Z. Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Robert A. J. Clark and Rif A. Saurous. Tacotron: Towards End-to-End Speech Synthesis, INTERSPEECH (2017) PDF
We will use the Text2Speech class from the espnet2 library with a pretrained model to convert the recognized content of the speech from sound source 2 into audio.
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.tts_inference import Text2Speech
d = ModelDownloader('pretrained_models') # Model downloader
# Create and download pretrained model
text2speech = Text2Speech(
**d.download_and_unpack(
'kan-bayashi/ljspeech_tts_train_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave'
), device="cuda"
)
# Apply model to text
speech = text2speech(text2)["wav"]
ALL HUMAN BEINGS ON FREE AND EQUAL IN DIGNITY AND RIGHT THEY ARE ENDOWED WITH REASON AND CONSCIENCE AND SHOULD ACT TOWARDS ONE ANOTHER IN A SPIRIT OF BROTHERHOOD
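To listen to the synthesized speech outside the notebook, the waveform can be written to disk; a sketch assuming the soundfile package is installed and using the model's output sampling rate exposed as text2speech.fs:
# Save synthesized speech to a wav file
import soundfile as sf
sf.write(
    'synthesized_source2.wav',       # example output file name
    speech.detach().cpu().numpy(),   # move the waveform tensor to CPU / numpy
    text2speech.fs                    # output sampling rate of the TTS model
)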
We will use the VAD class from the SpeechBrain library with the pretrained model speechbrain/vad-crdnn-libriparty from Hugging Face (https://huggingface.co/speechbrain/vad-crdnn-libriparty).
# Import
from speechbrain.pretrained import VAD
# Create and download pretrained model
VAD = VAD.from_hparams(
source='speechbrain/vad-crdnn-libriparty', # Model name
savedir='pretrained_models/vad-crdnn-libriparty'
)
# Detect segments with speech, segment boundaries in seconds
speech_segments = VAD.get_speech_segments(
audio_file='speechbrain/vad-crdnn-libriparty/example_vad.wav'
)
Start    Stop
-----    ----
14.3s    17.3s
18.1s    21.6s
28.6s    36.9s
Let's run voice activity detection on a long speech signal with environmental sounds in the background.
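A possible sketch of this step, assuming the long recording is available as an AudioContainer named long_mixture (a hypothetical name); the VAD model expects 16 kHz audio, and get_speech_segments reads from a file path:
# Resample the (hypothetical) long recording to 16 kHz and write it to disk,
# since get_speech_segments operates on an audio file path
import numpy as np
import soundfile as sf
sf.write('long_mixture_16k.wav', np.squeeze(long_mixture.resample(16000).data), 16000)

# Detect speech segments, boundaries in seconds
speech_segments = VAD.get_speech_segments(audio_file='long_mixture_16k.wav')
print(speech_segments)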