{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Forced alignment for multilingual data\n\n**Authors**: [Xiaohui Zhang](mailto:xiaohuizhang@meta.com), [Moto Hira](mailto:moto@meta.com).\n\nThis tutorial shows how to align transcripts to speech for non-English languages.\n\nThe process of aligning a non-English (normalized) transcript is identical to aligning\nan English (normalized) transcript; the process for English is covered in detail in the\n[CTC forced alignment tutorial](./ctc_forced_alignment_api_tutorial.html).\nIn this tutorial, we use TorchAudio's high-level API,\n:py:class:`torchaudio.pipelines.Wav2Vec2FABundle`, which packages the pre-trained\nmodel, tokenizer and aligner, to perform the forced alignment with less code.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nprint(device)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from typing import List\n\nimport IPython\nimport matplotlib.pyplot as plt"
]
},
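{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick preview, the alignment pipeline built in the next section consists of three components. The following is a minimal sketch, assuming the :py:data:`~torchaudio.pipelines.MMS_FA` bundle that this tutorial uses later:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A minimal sketch (assuming the MMS_FA bundle): the three components\n# of the alignment pipeline. Fetching the model requires network access.\nbundle = torchaudio.pipelines.MMS_FA\n\nmodel = bundle.get_model()  # acoustic model: waveform -> token probabilities\nmodel.to(device)\n\ntokenizer = bundle.get_tokenizer()  # transcript -> sequence of tokens\naligner = bundle.get_aligner()  # emissions + tokens -> timestamps"
]
},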
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the pipeline\n\nFirst, we instantiate the model and pre/post-processing pipelines.\n\nThe following diagram illustrates the process of alignment.\n\n\n\nThe waveform is passed to an acoustic model, which produces a sequence of\nprobability distributions over tokens.\nThe transcript is passed to the tokenizer, which converts it into a\nsequence of tokens.\nThe aligner takes the results from the acoustic model and the tokenizer and generates\ntimestamps for each token.\n\n
This process expects that the input transcript is already normalized.\n The process of normalization, which involves romanization of non-English\n languages, is language-dependent, so it is not covered in this tutorial,\n but we will briefly look into it.
The model instantiated by :py:data:`~torchaudio.pipelines.MMS_FA`'s\n :py:meth:`~torchaudio.pipelines.Wav2Vec2FABundle.get_model`\n method by default includes the feature dimension for ``