{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Speech Recognition with Wav2Vec2\n\n**Author**: [Moto Hira](moto@meta.com)_\n\nThis tutorial shows how to perform speech recognition using using\npre-trained models from wav2vec 2.0\n[[paper](https://arxiv.org/abs/2006.11477)_].\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n\nThe process of speech recognition looks like the following.\n\n1. Extract the acoustic features from audio waveform\n\n2. Estimate the class of the acoustic features frame-by-frame\n\n3. Generate hypothesis from the sequence of the class probabilities\n\nTorchaudio provides easy access to the pre-trained weights and\nassociated information, such as the expected sample rate and class\nlabels. They are bundled together and available under\n:py:mod:`torchaudio.pipelines` module.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\n\ntorch.random.manual_seed(0)\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\nprint(device)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import IPython\nimport matplotlib.pyplot as plt\nfrom torchaudio.utils import download_asset\n\nSPEECH_FILE = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a pipeline\n\nFirst, we will create a Wav2Vec2 model that performs the feature\nextraction and the classification.\n\nThere are two types of Wav2Vec2 pre-trained weights available in\ntorchaudio. The ones fine-tuned for ASR task, and the ones not\nfine-tuned.\n\nWav2Vec2 (and HuBERT) models are trained in self-supervised manner. They\nare firstly trained with audio only for representation learning, then\nfine-tuned for a specific task with additional labels.\n\nThe pre-trained weights without fine-tuning can be fine-tuned\nfor other downstream tasks as well, but this tutorial does not\ncover that.\n\nWe will use :py:data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H` here.\n\nThere are multiple pre-trained models available in :py:mod:`torchaudio.pipelines`.\nPlease check the documentation for the detail of how they are trained.\n\nThe bundle object provides the interface to instantiate model and other\ninformation. Sampling rate and the class labels are found as follow.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H\n\nprint(\"Sample Rate:\", bundle.sample_rate)\n\nprint(\"Labels:\", bundle.get_labels())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Model can be constructed as following. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The model can be constructed as follows. This process will automatically\nfetch the pre-trained weights and load them into the model.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = bundle.get_model().to(device)\n\nprint(model.__class__)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data\n\nWe will use the speech data from the [VOiCES\ndataset](https://iqtlabs.github.io/voices/), which is licensed under\nCreative Commons BY 4.0.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "IPython.display.Audio(SPEECH_FILE)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To load data, we use :py:func:`torchaudio.load`.\n\nIf the sampling rate is different from what the pipeline expects, then\nwe can use :py:func:`torchaudio.functional.resample` for resampling.\n\n- :py:func:`torchaudio.functional.resample` works on CUDA tensors as well.\n- When performing resampling multiple times on the same set of sample rates,\n  using :py:class:`torchaudio.transforms.Resample` might improve the performance.\n\n\n" ] },
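{ "cell_type": "markdown", "metadata": {}, "source": [ "Putting this together, the following is a minimal sketch of loading and\nresampling, assuming the `SPEECH_FILE`, `bundle`, and `device` defined\nabove; resampling is applied only when the file's sample rate differs\nfrom what the pipeline expects.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load the waveform and move it to the same device as the model.\nwaveform, sample_rate = torchaudio.load(SPEECH_FILE)\nwaveform = waveform.to(device)\n\n# Resample only if the file's sample rate differs from the pipeline's.\nif sample_rate != bundle.sample_rate:\n    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)" ] },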
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting acoustic features\n\nWav2Vec2 models fine-tuned for the ASR task can perform feature\nextraction and classification in one step, but for the sake of the\ntutorial, we also show how to perform feature extraction here.\n\n\n" ] },
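{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of that step, assuming the `waveform` loaded above:\n:py:meth:`~torchaudio.models.Wav2Vec2Model.extract_features` returns one\nfeature tensor per transformer layer.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Run the feature extractor without tracking gradients.\nwith torch.inference_mode():\n    features, _ = model.extract_features(waveform)\n\n# One tensor per transformer layer, each of shape (batch, frames, feature dim).\nprint(len(features), features[0].shape)" ] }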