{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Text-to-Speech with Tacotron2\n\n**Author**: [Yao-Yuan Yang](https://github.com/yangarbiter)_,\n[Moto Hira](moto@meta.com)_\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n\nThis tutorial shows how to build a text-to-speech pipeline, using the\npretrained Tacotron2 in torchaudio.\n\nThe text-to-speech pipeline goes as follows:\n\n1. Text preprocessing\n\n   First, the input text is encoded into a list of symbols. In this\n   tutorial, we will use English characters and phonemes as the symbols.\n\n2. Spectrogram generation\n\n   From the encoded text, a spectrogram is generated. We use the ``Tacotron2``\n   model for this.\n\n3. Time-domain conversion\n\n   The last step is converting the spectrogram into the waveform. The\n   model that generates speech from the spectrogram is called a vocoder.\n   In this tutorial, three different vocoders are used:\n   :py:class:`~torchaudio.models.WaveRNN`,\n   :py:class:`~torchaudio.transforms.GriffinLim`, and\n   [Nvidia's WaveGlow](https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/)_.\n\n\nThe following figure illustrates the whole process.\n\n\n\nAll the related components are bundled in :py:class:`torchaudio.pipelines.Tacotron2TTSBundle`,\nbut this tutorial will also cover the process under the hood.\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n\nFirst, we install the necessary dependencies. In addition to\n``torchaudio``, ``DeepPhonemizer`` is required to perform phoneme-based\nencoding.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%bash\npip3 install deep_phonemizer" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\ntorch.random.manual_seed(0)\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\nprint(device)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import IPython\nimport matplotlib.pyplot as plt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Text Processing\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Character-based encoding\n\nIn this section, we will go through how the character-based encoding\nworks.\n\nSince the pretrained Tacotron2 model expects a specific set of symbol\ntables, the same functionality is available in ``torchaudio``. However,\nwe will first manually implement the encoding to aid in understanding.\n\nFirst, we define the set of symbols\n``'_-!\\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we will map\neach character of the input text to the index of the corresponding\nsymbol in the table. Symbols that are not in the table are ignored.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "symbols = \"_-!'(),.:;? abcdefghijklmnopqrstuvwxyz\"\nlook_up = {s: i for i, s in enumerate(symbols)}\nsymbols = set(symbols)\n\n\ndef text_to_sequence(text):\n    text = text.lower()\n    return [look_up[s] for s in text if s in symbols]\n\n\ntext = \"Hello world! Text to speech!\"\nprint(text_to_sequence(text))" ] },
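{ "cell_type": "markdown", "metadata": {}, "source": [ "The pretrained text processors in ``torchaudio`` (used below) return a padded index\ntensor together with per-item lengths. As a rough preview, the next cell sketches how\nseveral manually encoded texts could be batched in the same way. This is only an\nillustration, not part of the ``torchaudio`` API, and padding with ``0`` (the ``'_'``\nsymbol) is an assumption made for this sketch.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch (not part of the torchaudio API): pad several manually\n# encoded texts into one batch tensor plus per-item lengths.\n# Padding with 0 (the '_' symbol) is an assumption for illustration.\ntexts = [\"Hello world!\", \"Text to speech!\"]\nencoded = [torch.tensor(text_to_sequence(t)) for t in texts]\nbatch_lengths = torch.tensor([len(e) for e in encoded])\nbatch = torch.nn.utils.rnn.pad_sequence(encoded, batch_first=True, padding_value=0)\nprint(batch)\nprint(batch_lengths)" ] },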
{ "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned above, the symbol table and indices must match\nwhat the pretrained Tacotron2 model expects. ``torchaudio`` provides the same\ntransform along with the pretrained model. You can\ninstantiate and use such a transform as follows.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()\n\ntext = \"Hello world! Text to speech!\"\nprocessed, lengths = processor(text)\n\nprint(processed)\nprint(lengths)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note: The output of our manual encoding matches that of the ``torchaudio`` ``text_processor`` (meaning we correctly re-implemented what the library does internally). The processor takes either a single text or a list of texts as input.\nWhen a list of texts is provided, the returned ``lengths`` variable\nrepresents the valid length of each processed token sequence in the output\nbatch.\n\nThe intermediate representation can be retrieved as follows:\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print([processor.tokens[i] for i in processed[0, : lengths[0]]])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Phoneme-based encoding\n\nPhoneme-based encoding is similar to character-based encoding, but it\nuses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)\nmodel.\n\nThe details of the G2P model are out of the scope of this tutorial; we will\njust look at what the conversion looks like.\n\nSimilar to the case of character-based encoding, the encoding process is\nexpected to match what a pretrained Tacotron2 model is trained on.\n``torchaudio`` has an interface to create the process.\n\nThe following code illustrates how to make and use the process. Behind\nthe scenes, a G2P model is created using the ``DeepPhonemizer`` package, and\nthe pretrained weights published by the author of ``DeepPhonemizer`` are\nfetched.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH\n\nprocessor = bundle.get_text_processor()\n\ntext = \"Hello world! Text to speech!\"\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n\nprint(processed)\nprint(lengths)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the encoded values are different from the example of\ncharacter-based encoding.\n\nThe intermediate representation looks like the following.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print([processor.tokens[i] for i in processed[0, : lengths[0]]])" ] },
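{ "cell_type": "markdown", "metadata": {}, "source": [ "As noted above, the text processors also accept a list of texts. The next cell is a\nsmall sketch using the phoneme-based processor from the previous cells; it shows the\nshape of the padded batch and the per-item lengths. The variable names are chosen just\nfor this example.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Passing a list of texts: the processor returns a padded batch and the\n# valid length of each processed token sequence in that batch.\ntexts = [\"Hello world!\", \"Text to speech!\"]\nwith torch.inference_mode():\n    processed_batch, batch_lengths = processor(texts)\n\nprint(processed_batch.shape)\nprint(batch_lengths)" ] },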
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Spectrogram Generation\n\n``Tacotron2`` is the model we use to generate a spectrogram from the\nencoded text. For the details of the model, please refer to [the\npaper](https://arxiv.org/abs/1712.05884)_.\n\nIt is easy to instantiate a Tacotron2 model with pretrained weights;\nhowever, note that the input to Tacotron2 models needs to be processed\nby the matching text processor.\n\n:py:class:`torchaudio.pipelines.Tacotron2TTSBundle` bundles the matching\nmodels and processors together so that it is easy to create the pipeline.\n\nFor the available bundles and their usage, please refer to\n:py:class:`~torchaudio.pipelines.Tacotron2TTSBundle`.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH\nprocessor = bundle.get_text_processor()\ntacotron2 = bundle.get_tacotron2().to(device)\n\ntext = \"Hello world! Text to speech!\"\n\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n    processed = processed.to(device)\n    lengths = lengths.to(device)\n    spec, _, _ = tacotron2.infer(processed, lengths)\n\n\n_ = plt.imshow(spec[0].cpu().detach(), origin=\"lower\", aspect=\"auto\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that the ``Tacotron2.infer`` method performs multinomial sampling;\ntherefore, the process of generating the spectrogram involves randomness.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n    fig, ax = plt.subplots(3, 1)\n    for i in range(3):\n        with torch.inference_mode():\n            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)\n        print(spec[0].shape)\n        ax[i].imshow(spec[0].cpu().detach(), origin=\"lower\", aspect=\"auto\")\n\n\nplot()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Waveform Generation\n\nOnce the spectrogram is generated, the last step is to recover the\nwaveform from the spectrogram using a vocoder.\n\n``torchaudio`` provides vocoders based on ``GriffinLim`` and\n``WaveRNN``.\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### WaveRNN Vocoder\n\nContinuing from the previous section, we can instantiate the matching\nWaveRNN model from the same bundle.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH\n\nprocessor = bundle.get_text_processor()\ntacotron2 = bundle.get_tacotron2().to(device)\nvocoder = bundle.get_vocoder().to(device)\n\ntext = \"Hello world! Text to speech!\"\n\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n    processed = processed.to(device)\n    lengths = lengths.to(device)\n    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)\n    waveforms, lengths = vocoder(spec, spec_lengths)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot(waveforms, spec, sample_rate):\n    waveforms = waveforms.cpu().detach()\n\n    fig, [ax1, ax2] = plt.subplots(2, 1)\n    ax1.plot(waveforms[0])\n    ax1.set_xlim(0, waveforms.size(-1))\n    ax1.grid(True)\n    ax2.imshow(spec[0].cpu().detach(), origin=\"lower\", aspect=\"auto\")\n    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)\n\n\nplot(waveforms, spec, vocoder.sample_rate)" ] },
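{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to keep the result, the generated waveform can also be written to disk\nwith ``torchaudio.save``. The next cell is a minimal sketch; the output file name\n``output_wavernn.wav`` is an arbitrary choice for this example.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch of saving the WaveRNN output; the file name is arbitrary.\n# torchaudio.save expects a 2D (channel, time) tensor on the CPU.\ntorchaudio.save(\"output_wavernn.wav\", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)" ] },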
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Griffin-Lim Vocoder\n\nUsing the Griffin-Lim vocoder is the same as WaveRNN. You can instantiate\nthe vocoder object with the\n:py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`\nmethod and pass the spectrogram.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH\n\nprocessor = bundle.get_text_processor()\ntacotron2 = bundle.get_tacotron2().to(device)\nvocoder = bundle.get_vocoder().to(device)\n\nwith torch.inference_mode():\n    processed, lengths = processor(text)\n    processed = processed.to(device)\n    lengths = lengths.to(device)\n    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)\nwaveforms, lengths = vocoder(spec, spec_lengths)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot(waveforms, spec, vocoder.sample_rate)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Waveglow Vocoder\n\nWaveglow is a vocoder published by Nvidia. The pretrained weights are\npublished on Torch Hub. One can instantiate the model using the ``torch.hub``\nmodule.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Workaround to load model mapped on GPU\n# https://stackoverflow.com/a/61840832\nwaveglow = torch.hub.load(\n    \"NVIDIA/DeepLearningExamples:torchhub\",\n    \"nvidia_waveglow\",\n    model_math=\"fp32\",\n    pretrained=False,\n)\ncheckpoint = torch.hub.load_state_dict_from_url(\n    \"https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth\",  # noqa: E501\n    progress=False,\n    map_location=device,\n)\nstate_dict = {key.replace(\"module.\", \"\"): value for key, value in checkpoint[\"state_dict\"].items()}\n\nwaveglow.load_state_dict(state_dict)\nwaveglow = waveglow.remove_weightnorm(waveglow)\nwaveglow = waveglow.to(device)\nwaveglow.eval()\n\nwith torch.no_grad():\n    waveforms = waveglow.infer(spec)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot(waveforms, spec, 22050)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }