{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Text-to-Speech with Tacotron2\n\n**Author**: [Yao-Yuan Yang](https://github.com/yangarbiter)_,\n[Moto Hira](moto@meta.com)_\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n\nThis tutorial shows how to build a text-to-speech pipeline, using the\npretrained Tacotron2 in torchaudio.\n\nThe text-to-speech pipeline goes as follows:\n\n1. Text preprocessing\n\n   First, the input text is encoded into a list of symbols. In this\n   tutorial, we will use English characters and phonemes as the symbols.\n\n2. Spectrogram generation\n\n   From the encoded text, a spectrogram is generated. We use the ``Tacotron2``\n   model for this.\n\n3. Time-domain conversion\n\n   The last step is converting the spectrogram into the waveform. The\n   model that generates speech from the spectrogram is called a vocoder.\n   In this tutorial, three different vocoders are used:\n   :py:class:`~torchaudio.models.WaveRNN`,\n   :py:class:`~torchaudio.transforms.GriffinLim`, and\n   [Nvidia's WaveGlow](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/)_.\n\n\nThe following figure illustrates the whole process.\n\n\n\nAll the related components are bundled in :py:class:`torchaudio.pipelines.Tacotron2TTSBundle`,\nbut this tutorial will also cover the process under the hood.\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n\nFirst, we install the necessary dependencies. In addition to\n``torchaudio``, ``DeepPhonemizer`` is required to perform phoneme-based\nencoding.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\npip3 install deep_phonemizer" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\ntorch.random.manual_seed(0)\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\nprint(device)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import IPython\nimport matplotlib.pyplot as plt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Text Processing\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Character-based encoding\n\nIn this section, we will go through how the character-based encoding\nworks.\n\nSince the pretrained Tacotron2 model expects a specific set of symbol\ntables, the same functionality is available in ``torchaudio``. However,\nwe will first manually implement the encoding to aid in understanding.\n\nFirst, we define the set of symbols\n``'_-!\\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we will map\neach character of the input text to the index of the corresponding\nsymbol in the table. Symbols that are not in the table are ignored.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "symbols = \"_-!'(),.:;? abcdefghijklmnopqrstuvwxyz\"\nlook_up = {s: i for i, s in enumerate(symbols)}\nsymbols = set(symbols)\n\n\ndef text_to_sequence(text):\n    text = text.lower()\n    return [look_up[s] for s in text if s in symbols]\n\n\ntext = \"Hello world! Text to speech!\"\nprint(text_to_sequence(text))" ] },
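{ "cell_type": "markdown", "metadata": {}, "source": [ "The pretrained text processors in ``torchaudio`` (used below) return a padded index\ntensor together with per-item lengths. As a rough preview, the next cell sketches how\nseveral manually encoded texts could be batched in the same way. This is only an\nillustration, not part of the ``torchaudio`` API, and padding with ``0`` (the ``'_'``\nsymbol) is an assumption made for this sketch.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch (not part of the torchaudio API): pad several manually\n# encoded texts into one batch tensor plus per-item lengths.\n# Padding with 0 (the '_' symbol) is an assumption for illustration.\ntexts = [\"Hello world!\", \"Text to speech!\"]\nencoded = [torch.tensor(text_to_sequence(t)) for t in texts]\nbatch_lengths = torch.tensor([len(e) for e in encoded])\nbatch = torch.nn.utils.rnn.pad_sequence(encoded, batch_first=True, padding_value=0)\nprint(batch)\nprint(batch_lengths)" ] },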
{ "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned above, the symbol table and indices must match\nwhat the pretrained Tacotron2 model expects. ``torchaudio`` provides the same\ntransform along with the pretrained model. You can\ninstantiate and use such a transform as follows.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()\n\ntext = \"Hello world! Text to speech!\"\nprocessed, lengths = processor(text)\n\nprint(processed)\nprint(lengths)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note: The output of our manual encoding matches that of the ``torchaudio`` ``text_processor`` (meaning we correctly re-implemented what the library does internally). The processor takes either a single text or a list of texts as input.\nWhen a list of texts is provided, the returned ``lengths`` variable\nrepresents the valid length of each processed token sequence in the output\nbatch.\n\nThe intermediate representation can be retrieved as follows:\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print([processor.tokens[i] for i in processed[0, : lengths[0]]])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Phoneme-based encoding\n\nPhoneme-based encoding is similar to character-based encoding, but it\nuses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)\nmodel.\n\nThe details of the G2P model are out of the scope of this tutorial; we will\njust look at what the conversion looks like.\n\nSimilar to the case of character-based encoding, the encoding process is\nexpected to match what a pretrained Tacotron2 model is trained on.\n``torchaudio`` has an interface to create the process.\n\nThe following code illustrates how to make and use the process. Behind\nthe scenes, a G2P model is created using the ``DeepPhonemizer`` package, and\nthe pretrained weights published by the author of ``DeepPhonemizer`` are\nfetched.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH\n\nprocessor = bundle.get_text_processor()\n\ntext = \"Hello world! Text to speech!\"\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n\nprint(processed)\nprint(lengths)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the encoded values are different from the example of\ncharacter-based encoding.\n\nThe intermediate representation looks like the following.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print([processor.tokens[i] for i in processed[0, : lengths[0]]])" ] },
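{ "cell_type": "markdown", "metadata": {}, "source": [ "As noted above, the text processors also accept a list of texts. The next cell is a\nsmall sketch using the phoneme-based processor from the previous cells; it shows the\nshape of the padded batch and the per-item lengths. The variable names are chosen just\nfor this example.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Passing a list of texts: the processor returns a padded batch and the\n# valid length of each processed token sequence in that batch.\ntexts = [\"Hello world!\", \"Text to speech!\"]\nwith torch.inference_mode():\n    processed_batch, batch_lengths = processor(texts)\n\nprint(processed_batch.shape)\nprint(batch_lengths)" ] },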
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Spectrogram Generation\n\n``Tacotron2`` is the model we use to generate a spectrogram from the\nencoded text. For the details of the model, please refer to [the\npaper](https://arxiv.org/abs/1712.05884)_.\n\nIt is easy to instantiate a Tacotron2 model with pretrained weights;\nhowever, note that the input to Tacotron2 models needs to be processed\nby the matching text processor.\n\n:py:class:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching\nmodels and processors together so that it is easy to create the pipeline.\n\nFor the available bundles and their usage, please refer to\n:py:class:`~torchaudio.pipelines.Tacotron2TTSBundle`.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH\nprocessor = bundle.get_text_processor()\ntacotron2 = bundle.get_tacotron2().to(device)\n\ntext = \"Hello world! Text to speech!\"\n\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n    processed = processed.to(device)\n    lengths = lengths.to(device)\n    spec, _, _ = tacotron2.infer(processed, lengths)\n\n\n_ = plt.imshow(spec[0].cpu().detach(), origin=\"lower\", aspect=\"auto\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that the ``Tacotron2.infer`` method performs multinomial sampling;\ntherefore, the process of generating the spectrogram involves randomness.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n    fig, ax = plt.subplots(3, 1)\n    for i in range(3):\n        with torch.inference_mode():\n            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)\n        print(spec[0].shape)\n        ax[i].imshow(spec[0].cpu().detach(), origin=\"lower\", aspect=\"auto\")\n\n\nplot()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Waveform Generation\n\nOnce the spectrogram is generated, the last step is to recover the\nwaveform from the spectrogram using a vocoder.\n\n``torchaudio`` provides vocoders based on ``GriffinLim`` and\n``WaveRNN``.\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### WaveRNN Vocoder\n\nContinuing from the previous section, we can instantiate the matching\nWaveRNN model from the same bundle.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH\n\nprocessor = bundle.get_text_processor()\ntacotron2 = bundle.get_tacotron2().to(device)\nvocoder = bundle.get_vocoder().to(device)\n\ntext = \"Hello world! Text to speech!\"\n\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n    processed = processed.to(device)\n    lengths = lengths.to(device)\n    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)\n    waveforms, lengths = vocoder(spec, spec_lengths)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot(waveforms, spec, sample_rate):\n    waveforms = waveforms.cpu().detach()\n\n    fig, [ax1, ax2] = plt.subplots(2, 1)\n    ax1.plot(waveforms[0])\n    ax1.set_xlim(0, waveforms.size(-1))\n    ax1.grid(True)\n    ax2.imshow(spec[0].cpu().detach(), origin=\"lower\", aspect=\"auto\")\n    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)\n\n\nplot(waveforms, spec, vocoder.sample_rate)" ] },
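{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to keep the result, the generated waveform can also be written to disk\nwith ``torchaudio.save``. The next cell is a minimal sketch; the output file name\n``output_wavernn.wav`` is an arbitrary choice for this example.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch of saving the WaveRNN output; the file name is arbitrary.\n# torchaudio.save expects a 2D (channel, time) tensor on the CPU.\ntorchaudio.save(\"output_wavernn.wav\", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)" ] },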
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Griffin-Lim Vocoder\n\nUsing the Griffin-Lim vocoder is the same as WaveRNN. You can instantiate\nthe vocoder object with the\n:py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`\nmethod and pass the spectrogram.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH\n\nprocessor = bundle.get_text_processor()\ntacotron2 = bundle.get_tacotron2().to(device)\nvocoder = bundle.get_vocoder().to(device)\n\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n    processed = processed.to(device)\n    lengths = lengths.to(device)\n    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)\nwaveforms, lengths = vocoder(spec, spec_lengths)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot(waveforms, spec, vocoder.sample_rate)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Waveglow Vocoder\n\nWaveglow is a vocoder published by Nvidia. The pretrained weights are\npublished on Torch Hub. One can instantiate the model using the ``torch.hub``\nmodule.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Workaround to load model mapped on GPU\n# https://stackoverflow.com/a/61840832\nwaveglow = torch.hub.load(\n    \"NVIDIA/DeepLearningExamples:torchhub\",\n    \"nvidia_waveglow\",\n    model_math=\"fp32\",\n    pretrained=False,\n)\ncheckpoint = torch.hub.load_state_dict_from_url(\n    \"https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth\",  # noqa: E501\n    progress=False,\n    map_location=device,\n)\nstate_dict = {key.replace(\"module.\", \"\"): value for key, value in checkpoint[\"state_dict\"].items()}\n\nwaveglow.load_state_dict(state_dict)\nwaveglow = waveglow.remove_weightnorm(waveglow)\nwaveglow = waveglow.to(device)\nwaveglow.eval()\n\nwith torch.no_grad():\n    waveforms = waveglow.infer(spec)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot(waveforms, spec, 22050)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }