{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# ASR Inference with CUDA CTC Decoder\n\n**Author**: [Yuekai Zhang](mailto:yuekaiz@nvidia.com)\n\nThis tutorial shows how to perform speech recognition inference using a\nCUDA-based CTC beam search decoder.\nWe demonstrate this on a pretrained\n[Zipformer](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_ctc)\nmodel from the [Next-gen Kaldi](https://nadirapovey.com/next-gen-kaldi-what-is-it) project.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n\nBeam search decoding works by iteratively expanding text hypotheses (beams)\nwith next possible characters, and maintaining only the hypotheses with the\nhighest scores at each time step.\n\nThe underlying implementation uses CUDA to accelerate the whole decoding process.\nA mathematical formula for the decoder can be\nfound in the [paper](https://arxiv.org/pdf/1408.2873.pdf), and\na more detailed algorithm can be found in this [blog](https://distill.pub/2017/ctc/).\n\nRunning ASR inference using a CUDA CTC Beam Search decoder\nrequires the following components:\n\n- Acoustic Model: a model predicting modeling units (BPE in this tutorial) from acoustic features\n- BPE Model: the byte-pair encoding (BPE) tokenizer file\n\n\n" ] },
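{ "cell_type": "markdown", "metadata": {}, "source": [ "As a toy illustration of this expand-and-prune loop, the sketch below expands a single beam with three candidate tokens and keeps only the ``beam_size`` best hypotheses. It is for intuition only; the real decoder operates on CTC log-probabilities and runs on the GPU.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# One step of the expand-and-prune loop on toy scores, for intuition only:\n# each beam is extended with every candidate token, then only the\n# beam_size highest-scoring hypotheses are kept.\nimport heapq\n\nbeam_size = 2\nbeams = [(\"\", 0.0)]  # (prefix, log-score)\nnext_token_log_probs = {\"a\": -0.1, \"b\": -1.2, \"c\": -2.3}\n\nexpanded = [\n    (prefix + token, score + token_log_prob)\n    for prefix, score in beams\n    for token, token_log_prob in next_token_log_probs.items()\n]\nbeams = heapq.nlargest(beam_size, expanded, key=lambda beam: beam[1])\nprint(beams)" ] },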
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Acoustic Model and Set Up\n\nFirst, we import the necessary utilities and fetch the data that we will be\nworking with.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\nfrom pathlib import Path\n\nimport IPython\nimport sentencepiece as spm\nfrom torchaudio.models.decoder import cuda_ctc_decoder\nfrom torchaudio.utils import download_asset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the pretrained\n[Zipformer](https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01)\nmodel that is trained on the [LibriSpeech\ndataset](http://www.openslr.org/12). The model is jointly trained with CTC and Transducer loss functions.\nIn this tutorial, we only use the CTC head of the model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def download_asset_external(url, key):\n    path = Path(torch.hub.get_dir()) / \"torchaudio\" / Path(key)\n    if not path.exists():\n        path.parent.mkdir(parents=True, exist_ok=True)\n        torch.hub.download_url_to_file(url, path)\n    return str(path)\n\n\nurl_prefix = \"https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-ctc-2022-12-01\"\nmodel_link = f\"{url_prefix}/resolve/main/exp/cpu_jit.pt\"\nmodel_path = download_asset_external(model_link, \"cuda_ctc_decoder/cpu_jit.pt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will load a sample from the LibriSpeech test-other dataset.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "speech_file = download_asset(\"tutorial-assets/ctc-decoding/1688-142285-0007.wav\")\nwaveform, sample_rate = torchaudio.load(speech_file)\nassert sample_rate == 16000\nIPython.display.Audio(speech_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The transcript corresponding to this audio file is\n\n```\ni really was very much afraid of showing him how much shocked i was at some parts of what he said\n```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Files and Data for Decoder\n\nNext, we load the tokens from the BPE model, which serves as the tokenizer for decoding.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokens\n\nThe tokens are the possible symbols that the acoustic model can predict,\nincluding the blank symbol in CTC. In this tutorial, there are 500 BPE tokens.\nThey can either be passed in as a file, where each line consists of the token\ncorresponding to that line's index, or as a list of tokens, each mapping to a unique index.\n\n```\n# tokens\n<blk>\n<sos/eos>\n<unk>\nS\n▁THE\n▁A\nT\n▁AND\n...\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bpe_link = f\"{url_prefix}/resolve/main/data/lang_bpe_500/bpe.model\"\nbpe_path = download_asset_external(bpe_link, \"cuda_ctc_decoder/bpe.model\")\n\nbpe_model = spm.SentencePieceProcessor()\nbpe_model.load(bpe_path)\ntokens = [bpe_model.id_to_piece(id) for id in range(bpe_model.get_piece_size())]\nprint(tokens)" ] },
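{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the token-to-index mapping concrete, the short check below (illustrative only; ``sample_text`` is an arbitrary phrase) encodes a piece of text into BPE pieces and their ids, then decodes the ids back. Note that this BPE model operates on uppercase text, which is why the decoded transcripts are lowercased later in the tutorial.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Illustrative round trip through the BPE tokenizer (not required for decoding).\nsample_text = \"AFRAID OF SHOWING\"  # an arbitrary phrase; this model expects uppercase text\npieces = bpe_model.encode_as_pieces(sample_text)\nids = [bpe_model.piece_to_id(piece) for piece in pieces]\nprint(pieces)\nprint(ids)\nprint(bpe_model.decode(ids))  # decoding the ids recovers the original text" ] },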
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Construct CUDA Decoder\nIn this tutorial, we will construct a CUDA beam search decoder.\nThe decoder can be constructed using the factory function\n:py:func:`~torchaudio.models.decoder.cuda_ctc_decoder`.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cuda_decoder = cuda_ctc_decoder(tokens, nbest=10, beam_size=10, blank_skip_threshold=0.95)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Inference\n\nNow that we have the data, acoustic model, and decoder, we can perform\ninference. The output of the beam search decoder is of type\n:py:class:`~torchaudio.models.decoder.CUCTCHypothesis`, consisting of the\npredicted token IDs, words (symbols corresponding to the token IDs), and hypothesis scores.\nRecall the transcript corresponding to the\nwaveform is\n\n```\ni really was very much afraid of showing him how much shocked i was at some parts of what he said\n```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "actual_transcript = \"i really was very much afraid of showing him how much shocked i was at some parts of what he said\"\nactual_transcript = actual_transcript.split()\n\ndevice = torch.device(\"cuda\", 0)\nacoustic_model = torch.jit.load(model_path)\nacoustic_model.to(device)\nacoustic_model.eval()\n\nwaveform = waveform.to(device)\n\n# 80-dim log Mel filterbank features, shape (batch, num_frames, 80)\nfeat = torchaudio.compliance.kaldi.fbank(waveform, num_mel_bins=80, snip_edges=False)\nfeat = feat.unsqueeze(0)\nfeat_lens = torch.tensor(feat.size(1), device=device).unsqueeze(0)\n\nencoder_out, encoder_out_lens = acoustic_model.encoder(feat, feat_lens)\nnnet_output = acoustic_model.ctc_output(encoder_out)\n# per-frame log-probabilities over the tokens, shape (batch, num_frames, num_tokens)\nlog_prob = torch.nn.functional.log_softmax(nnet_output, -1)\n\nprint(f\"The shape of log_prob: {log_prob.shape}, the shape of encoder_out_lens: {encoder_out_lens.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The CUDA CTC decoder gives the following result.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))\nbeam_search_transcript = bpe_model.decode(results[0][0].tokens).lower()\nbeam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_transcript.split()) / len(\n    actual_transcript\n)\n\nprint(f\"Transcript: {beam_search_transcript}\")\nprint(f\"WER: {beam_search_wer}\")" ] },
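{ "cell_type": "markdown", "metadata": {}, "source": [ "As a point of comparison, the sketch below applies the simplest possible decoding rule to the same log-probabilities: take the argmax token per frame, merge repeats, and drop blanks. It is a sketch rather than a tuned baseline, and it assumes the blank is index 0, which matches the ``<blk>`` token of this BPE model.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Greedy (best-path) baseline: argmax per frame, then the CTC collapsing rule\n# (merge repeats, drop blanks). A sketch, not a tuned baseline.\n# Assumes the blank is index 0 (the <blk> token of this BPE model).\nbest_path = log_prob[0].argmax(dim=-1).tolist()\ngreedy_ids = []\nprev = 0\nfor token in best_path:\n    if token != prev and token != 0:\n        greedy_ids.append(token)\n    prev = token\n\ngreedy_transcript = bpe_model.decode(greedy_ids).lower()\ngreedy_wer = torchaudio.functional.edit_distance(actual_transcript, greedy_transcript.split()) / len(\n    actual_transcript\n)\n\nprint(f\"Transcript: {greedy_transcript}\")\nprint(f\"WER: {greedy_wer}\")" ] },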
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Beam Search Decoder Parameters\n\nIn this section, we go a little more in depth about some different\nparameters and tradeoffs. For the full list of customizable parameters,\nplease refer to the\n:py:func:`documentation <torchaudio.models.decoder.cuda_ctc_decoder>`.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Helper Function\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def print_decoded(cuda_decoder, bpe_model, log_prob, encoder_out_lens, param, param_value):\n    start_time = time.monotonic()\n    results = cuda_decoder(log_prob, encoder_out_lens.to(torch.int32))\n    decode_time = time.monotonic() - start_time\n    transcript = bpe_model.decode(results[0][0].tokens).lower()\n    score = results[0][0].score\n    print(f\"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### nbest\n\nThis parameter indicates the number of best hypotheses to return. For\ninstance, by setting ``nbest=10`` when constructing the beam search\ndecoder earlier, we can now access the hypotheses with the top 10 scores.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i in range(10):\n    transcript = bpe_model.decode(results[0][i].tokens).lower()\n    score = results[0][i].score\n    print(f\"{transcript} (score: {score})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### beam size\n\nThe ``beam_size`` parameter determines the maximum number of best\nhypotheses to hold after each decoding step. Larger beam sizes\nexplore a wider range of possible hypotheses and can\nproduce hypotheses with higher scores, but beyond a certain point they provide no additional gain.\nWe recommend setting ``beam_size=10`` for the CUDA beam search decoder.\n\nIn the example below, we see an improvement in decoding quality as we\nincrease the beam size from 1 to 3, but notice that a beam size\nof 3 already produces the same output as a beam size of 10.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "beam_sizes = [1, 2, 3, 10]\n\nfor beam_size in beam_sizes:\n    beam_search_decoder = cuda_ctc_decoder(\n        tokens,\n        nbest=1,\n        beam_size=beam_size,\n        blank_skip_threshold=0.95,\n    )\n    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, \"beam size\", beam_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### blank skip threshold\n\nThe ``blank_skip_threshold`` parameter is used to prune frames that have a high blank probability.\nPruning these frames with a well-chosen ``blank_skip_threshold`` can speed up the decoding\nprocess significantly without any drop in accuracy.\nFollowing the CTC rule, we keep at least one blank frame between two non-blank frames\nto avoid mistakenly merging two consecutive identical symbols.\nWe recommend setting ``blank_skip_threshold=0.95`` for the CUDA beam search decoder.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "blank_skip_probs = [0.25, 0.95, 1.0]\n\nfor blank_skip_prob in blank_skip_probs:\n    beam_search_decoder = cuda_ctc_decoder(\n        tokens,\n        nbest=10,\n        beam_size=10,\n        blank_skip_threshold=blank_skip_prob,\n    )\n    print_decoded(beam_search_decoder, bpe_model, log_prob, encoder_out_lens, \"blank_skip_threshold\", blank_skip_prob)\n\ndel cuda_decoder" ] },
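{ "cell_type": "markdown", "metadata": {}, "source": [ "To see why frame skipping helps on this utterance, the short check below (illustrative only; it assumes the blank is index 0, the ``<blk>`` token) measures how many frames have blank probability above the 0.95 threshold.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Fraction of frames whose blank probability exceeds the threshold -- a rough\n# upper bound on how many frames the decoder may skip, since at least one\n# blank frame is kept between two non-blank frames.\n# Assumes the blank is index 0 (the <blk> token).\nblank_probs = log_prob[0, :, 0].exp()\nskippable = (blank_probs > 0.95).float().mean().item()\nprint(f\"{skippable:.1%} of frames have blank probability above 0.95\")" ] },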
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark with flashlight CPU decoder\nWe benchmark the throughput and accuracy of the CUDA decoder against the flashlight CPU decoder using the LibriSpeech test-other set.\nTo reproduce the benchmark results below, you may refer to [this example](https://github.com/pytorch/audio/tree/main/examples/asr/librispeech_cuda_ctc_decoder).\n\n| Decoder | Setting | WER (%) | N-Best Oracle WER (%) | Decoder Cost Time (seconds) |\n|--------------|------------------------------------------|---------|-----------------------|-----------------------------|\n| CUDA decoder | blank_skip_threshold 0.95 | 5.81 | 4.11 | 2.57 |\n| CUDA decoder | blank_skip_threshold 1.0 (no frame-skip) | 5.81 | 4.09 | 6.24 |\n| CPU decoder | beam_size_token 10 | 5.86 | 4.30 | 28.61 |\n| CPU decoder | beam_size_token 500 | 5.86 | 4.30 | 791.80 |\n\nAs the table shows, the CUDA decoder gives a slight improvement in WER and a significant increase in throughput.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }