{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n# Forced alignment for multilingual data\n\n**Authors**: [Xiaohui Zhang](mailto:xiaohuizhang@meta.com), [Moto Hira](mailto:moto@meta.com).\n\nThis tutorial shows how to align transcripts to speech for non-English languages.\n\nThe process of aligning a non-English (normalized) transcript is identical to aligning\nan English (normalized) transcript; the process for English is covered in detail in the\n[CTC forced alignment tutorial](./ctc_forced_alignment_api_tutorial.html).\nIn this tutorial, we use TorchAudio's high-level API,\n:py:class:`torchaudio.pipelines.Wav2Vec2FABundle`, which packages the pre-trained\nmodel, tokenizer and aligner, to perform the forced alignment with less code.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nprint(device)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from typing import List\n\nimport IPython\nimport matplotlib.pyplot as plt"
]
},
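{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick preview, the alignment pipeline built in the next section consists of three components. The following is a minimal sketch, assuming the :py:data:`~torchaudio.pipelines.MMS_FA` bundle that this tutorial uses later:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A minimal sketch (assuming the MMS_FA bundle): the three components\n# of the alignment pipeline. Fetching the model requires network access.\nbundle = torchaudio.pipelines.MMS_FA\n\nmodel = bundle.get_model()  # acoustic model: waveform -> token probabilities\nmodel.to(device)\n\ntokenizer = bundle.get_tokenizer()  # transcript -> sequence of tokens\naligner = bundle.get_aligner()  # emissions + tokens -> timestamps"
]
},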
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the pipeline\n\nFirst, we instantiate the model and pre/post-processing pipelines.\n\nThe following diagram illustrates the process of alignment.\n\n\n\nThe waveform is passed to an acoustic model, which produces a sequence of\nprobability distributions over tokens.\nThe transcript is passed to the tokenizer, which converts it into a\nsequence of tokens.\nThe aligner takes the results from the acoustic model and the tokenizer and generates\ntimestamps for each token.\n\n
This process expects that the input transcript is already normalized.\n The process of normalization, which involves romanization of non-English\n languages, is language-dependent, so it is not covered in this tutorial,\n but we will briefly look into it.
The model instantiated by :py:data:`~torchaudio.pipelines.MMS_FA`'s\n :py:meth:`~torchaudio.pipelines.Wav2Vec2FABundle.get_model`\n method by default includes the feature dimension for ``