{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Forced Alignment with Wav2Vec2\n\n**Author**: [Moto Hira](moto@meta.com)_\n\nThis tutorial shows how to align transcript to speech with\n``torchaudio``, using CTC segmentation algorithm described in\n[CTC-Segmentation of Large Corpora for German End-to-end Speech\nRecognition](https://arxiv.org/abs/2007.09127)_.\n\n
This tutorial was originally written to illustrate a use case\n for the Wav2Vec2 pretrained model.\n\n TorchAudio now has a set of APIs designed for forced alignment.\n The [CTC forced alignment API tutorial](./ctc_forced_alignment_api_tutorial.html)_ illustrates the\n usage of :py:func:`torchaudio.functional.forced_align`, which is\n the core API.\n\n If you are looking to align your corpus, we recommend using\n :py:class:`torchaudio.pipelines.Wav2Vec2FABundle`, which combines\n :py:func:`~torchaudio.functional.forced_align` and other support\n functions with a pre-trained model specifically trained for\n forced alignment. Please refer to the\n [Forced alignment for multilingual data](forced_alignment_for_multilingual_data_tutorial.html)_ tutorial, which\n illustrates its usage.
In the subsequent sections, we will compute probabilities in the\n log domain to avoid numerical instability. For this purpose, we\n normalize the ``emission`` with :py:func:`torch.log_softmax`.
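\n\nAs a concrete illustration, below is a minimal sketch of this normalization\nstep. The choice of bundle and the audio file path (``speech.wav``) are\nillustrative assumptions, not fixed by this tutorial.\n\n```python\nimport torch\nimport torchaudio\n\n# Load a pre-trained Wav2Vec2 ASR model (illustrative choice of bundle).\nbundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H\nmodel = bundle.get_model()\n\n# \"speech.wav\" is a hypothetical input file.\nwaveform, sample_rate = torchaudio.load(\"speech.wav\")\n\nwith torch.inference_mode():\n    # Raw model output has shape (batch, frames, num_labels).\n    emissions, _ = model(waveform)\n    # Normalize over the label dimension so that each frame\n    # holds log-probabilities instead of raw scores.\n    emissions = torch.log_softmax(emissions, dim=-1)\n\nemission = emissions[0]  # (frames, num_labels) for the first utterance\n```\n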