{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# ASR Inference with CTC Decoder\n\n**Author**: [Caroline Chen](carolinechen@meta.com)_\n\nThis tutorial shows how to perform speech recognition inference using a\nCTC beam search decoder with lexicon constraint and KenLM language model\nsupport. We demonstrate this on a pretrained wav2vec 2.0 model trained\nusing CTC loss.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Overview\n\nBeam search decoding works by iteratively expanding text hypotheses (beams)\nwith next possible characters, and\u00a0maintaining only the hypotheses with the\nhighest scores at each time step. A language model can be incorporated into\nthe scoring computation, and adding a lexicon constraint restricts the\nnext possible tokens for the hypotheses so that only words from the lexicon\ncan be generated.\n\nThe underlying implementation is ported from [Flashlight](https://arxiv.org/pdf/2201.12465.pdf)_'s\nbeam search decoder. A mathematical formula for the decoder optimization can be\nfound in the [Wav2Letter paper](https://arxiv.org/pdf/1609.03193.pdf)_, and\na more detailed algorithm can be found in this [blog](https://towardsdatascience.com/boosting-your-sequence-generation-performance-with-beam-search-language-model-decoding-74ee64de435a)_.\n\nRunning ASR inference using a CTC Beam Search decoder with a language\nmodel and lexicon constraint requires the following components\n\n-  Acoustic Model: model predicting phonetics from audio waveforms\n-  Tokens: the possible predicted tokens from the acoustic model\n-  Lexicon: mapping between possible words and their corresponding\n   tokens sequence\n-  Language Model (LM): n-gram language model trained with the [KenLM\n   library](https://kheafield.com/code/kenlm/)_, or custom language\n   model that inherits :py:class:`~torchaudio.models.decoder.CTCDecoderLM`\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Acoustic Model and Set Up\n\nFirst we import the necessary utilities and fetch the data that we are\nworking with\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import time\nfrom typing import List\n\nimport IPython\nimport matplotlib.pyplot as plt\nfrom torchaudio.models.decoder import ctc_decoder\nfrom torchaudio.utils import download_asset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We use the pretrained [Wav2Vec 2.0](https://arxiv.org/abs/2006.11477)_\nBase model that is finetuned on 10 min of the [LibriSpeech\ndataset](http://www.openslr.org/12)_, which can be loaded in using\n:data:`torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M`.\nFor more detail on running Wav2Vec 2.0 speech\nrecognition pipelines in torchaudio, please refer to [this\ntutorial](./speech_recognition_pipeline_tutorial.html)_.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M\nacoustic_model = bundle.get_model()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will load a sample from the LibriSpeech test-other dataset.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "speech_file = download_asset(\"tutorial-assets/ctc-decoding/1688-142285-0007.wav\")\n\nIPython.display.Audio(speech_file)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The transcript corresponding to this audio file is\n\n```\ni really was very much afraid of showing him how much shocked i was at some parts of what he said\n```\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "waveform, sample_rate = torchaudio.load(speech_file)\n\nif sample_rate != bundle.sample_rate:\n    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Files and Data for Decoder\n\nNext, we load in our token, lexicon, and language model data, which are used\nby the decoder to predict words from the acoustic model output. Pretrained\nfiles for the LibriSpeech dataset can be downloaded through torchaudio,\nor the user can provide their own files.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Tokens\n\nThe tokens are the possible symbols that the acoustic model can predict,\nincluding the blank and silent symbols. It can either be passed in as a\nfile, where each line consists of the tokens corresponding to the same\nindex, or as a list of tokens, each mapping to a unique index.\n\n```\n# tokens.txt\n_\n|\ne\nt\n...\n```\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tokens = [label.lower() for label in bundle.get_labels()]\nprint(tokens)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Lexicon\n\nThe lexicon is a mapping from words to their corresponding tokens\nsequence, and is used to restrict the search space of the decoder to\nonly words from the lexicon. The expected format of the lexicon file is\na line per word, with a word followed by its space-split tokens.\n\n```\n# lexcion.txt\na a |\nable a b l e |\nabout a b o u t |\n...\n...\n```\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Language Model\n\nA language model can be used in decoding to improve the results, by\nfactoring in a language model score that represents the likelihood of\nthe sequence into the beam search computation. Below, we outline the\ndifferent forms of language models that are supported for decoding.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### No Language Model\n\nTo create a decoder instance without a language model, set `lm=None`\nwhen initializing the decoder.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### KenLM\n\nThis is an n-gram language model trained with the [KenLM\nlibrary](https://kheafield.com/code/kenlm/)_. Both the ``.arpa`` or\nthe binarized ``.bin`` LM can be used, but the binary format is\nrecommended for faster loading.\n\nThe language model used in this tutorial is a 4-gram KenLM trained using\n[LibriSpeech](http://www.openslr.org/11)_.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Custom Language Model\n\nUsers can define their own custom language model in Python, whether\nit be a statistical or neural network language model, using\n:py:class:`~torchaudio.models.decoder.CTCDecoderLM` and\n:py:class:`~torchaudio.models.decoder.CTCDecoderLMState`.\n\nFor instance, the following code creates a basic wrapper around a PyTorch\n``torch.nn.Module`` language model.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState\n\n\nclass CustomLM(CTCDecoderLM):\n    \"\"\"Create a Python wrapper around `language_model` to feed to the decoder.\"\"\"\n\n    def __init__(self, language_model: torch.nn.Module):\n        CTCDecoderLM.__init__(self)\n        self.language_model = language_model\n        self.sil = -1  # index for silent token in the language model\n        self.states = {}\n\n        language_model.eval()\n\n    def start(self, start_with_nothing: bool = False):\n        state = CTCDecoderLMState()\n        with torch.no_grad():\n            score = self.language_model(self.sil)\n\n        self.states[state] = score\n        return state\n\n    def score(self, state: CTCDecoderLMState, token_index: int):\n        outstate = state.child(token_index)\n        if outstate not in self.states:\n            score = self.language_model(token_index)\n            self.states[outstate] = score\n        score = self.states[outstate]\n\n        return outstate, score\n\n    def finish(self, state: CTCDecoderLMState):\n        return self.score(state, self.sil)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Downloading Pretrained Files\n\nPretrained files for the LibriSpeech dataset can be downloaded using\n:py:func:`~torchaudio.models.decoder.download_pretrained_files`.\n\nNote: this cell may take a couple of minutes to run, as the language\nmodel can be large\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torchaudio.models.decoder import download_pretrained_files\n\nfiles = download_pretrained_files(\"librispeech-4-gram\")\n\nprint(files)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Construct Decoders\nIn this tutorial, we construct both a beam search decoder and a greedy decoder\nfor comparison.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Beam Search Decoder\nThe decoder can be constructed using the factory function\n:py:func:`~torchaudio.models.decoder.ctc_decoder`.\nIn addition to the previously mentioned components, it also takes in various beam\nsearch decoding parameters and token/word parameters.\n\nThis decoder can also be run without a language model by passing in `None` into the\n`lm` parameter.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "LM_WEIGHT = 3.23\nWORD_SCORE = -0.26\n\nbeam_search_decoder = ctc_decoder(\n    lexicon=files.lexicon,\n    tokens=files.tokens,\n    lm=files.lm,\n    nbest=3,\n    beam_size=1500,\n    lm_weight=LM_WEIGHT,\n    word_score=WORD_SCORE,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Greedy Decoder\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class GreedyCTCDecoder(torch.nn.Module):\n    def __init__(self, labels, blank=0):\n        super().__init__()\n        self.labels = labels\n        self.blank = blank\n\n    def forward(self, emission: torch.Tensor) -> List[str]:\n        \"\"\"Given a sequence emission over labels, get the best path\n        Args:\n          emission (Tensor): Logit tensors. Shape `[num_seq, num_label]`.\n\n        Returns:\n          List[str]: The resulting transcript\n        \"\"\"\n        indices = torch.argmax(emission, dim=-1)  # [num_seq,]\n        indices = torch.unique_consecutive(indices, dim=-1)\n        indices = [i for i in indices if i != self.blank]\n        joined = \"\".join([self.labels[i] for i in indices])\n        return joined.replace(\"|\", \" \").strip().split()\n\n\ngreedy_decoder = GreedyCTCDecoder(tokens)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Run Inference\n\nNow that we have the data, acoustic model, and decoder, we can perform\ninference. The output of the beam search decoder is of type\n:py:class:`~torchaudio.models.decoder.CTCHypothesis`, consisting of the\npredicted token IDs, corresponding words (if a lexicon is provided), hypothesis score,\nand timesteps corresponding to the token IDs. Recall the transcript corresponding to the\nwaveform is\n\n```\ni really was very much afraid of showing him how much shocked i was at some parts of what he said\n```\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "actual_transcript = \"i really was very much afraid of showing him how much shocked i was at some parts of what he said\"\nactual_transcript = actual_transcript.split()\n\nemission, _ = acoustic_model(waveform)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The greedy decoder gives the following result.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "greedy_result = greedy_decoder(emission[0])\ngreedy_transcript = \" \".join(greedy_result)\ngreedy_wer = torchaudio.functional.edit_distance(actual_transcript, greedy_result) / len(actual_transcript)\n\nprint(f\"Transcript: {greedy_transcript}\")\nprint(f\"WER: {greedy_wer}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Using the beam search decoder:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "beam_search_result = beam_search_decoder(emission)\nbeam_search_transcript = \" \".join(beam_search_result[0][0].words).strip()\nbeam_search_wer = torchaudio.functional.edit_distance(actual_transcript, beam_search_result[0][0].words) / len(\n    actual_transcript\n)\n\nprint(f\"Transcript: {beam_search_transcript}\")\nprint(f\"WER: {beam_search_wer}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>The :py:attr:`~torchaudio.models.decoder.CTCHypothesis.words`\n   field of the output hypotheses will be empty if no lexicon\n   is provided to the decoder. To retrieve a transcript with lexicon-free\n   decoding, you can perform the following to retrieve the token indices,\n   convert them to original tokens, then join them together.\n\n   .. code::\n\n      tokens_str = \"\".join(beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens))\n      transcript = \" \".join(tokens_str.split(\"|\"))</p></div>\n\nWe see that the transcript with the lexicon-constrained beam search\ndecoder produces a more accurate result consisting of real words, while\nthe greedy decoder can predict incorrectly spelled words like \u201caffrayd\u201d\nand \u201cshoktd\u201d.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Incremental decoding\n\nIf the input speech is long, one can decode the emission in\nincremental manner.\n\nYou need to first initialize the internal state of the decoder with\n:py:meth:`~torchaudio.models.decoder.CTCDecoder.decode_begin`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "beam_search_decoder.decode_begin()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then, you can pass emissions to\n:py:meth:`~torchaudio.models.decoder.CTCDecoder.decode_begin`.\nHere we use the same emission but pass it to the decoder one frame\nat a time.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for t in range(emission.size(1)):\n    beam_search_decoder.decode_step(emission[0, t:t + 1, :])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, finalize the internal state of the decoder, and retrieve the\nresult.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "beam_search_decoder.decode_end()\nbeam_search_result_inc = beam_search_decoder.get_final_hypothesis()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The result of incremental decoding is identical to batch decoding.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "beam_search_transcript_inc = \" \".join(beam_search_result_inc[0].words).strip()\nbeam_search_wer_inc = torchaudio.functional.edit_distance(\n    actual_transcript, beam_search_result_inc[0].words) / len(actual_transcript)\n\nprint(f\"Transcript: {beam_search_transcript_inc}\")\nprint(f\"WER: {beam_search_wer_inc}\")\n\nassert beam_search_result[0][0].words == beam_search_result_inc[0].words\nassert beam_search_result[0][0].score == beam_search_result_inc[0].score\ntorch.testing.assert_close(beam_search_result[0][0].timesteps, beam_search_result_inc[0].timesteps)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Timestep Alignments\nRecall that one of the components of the resulting Hypotheses is timesteps\ncorresponding to the token IDs.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "timesteps = beam_search_result[0][0].timesteps\npredicted_tokens = beam_search_decoder.idxs_to_tokens(beam_search_result[0][0].tokens)\n\nprint(predicted_tokens, len(predicted_tokens))\nprint(timesteps, timesteps.shape[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Below, we visualize the token timestep alignments relative to the original waveform.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_alignments(waveform, emission, tokens, timesteps, sample_rate):\n\n    t = torch.arange(waveform.size(0)) / sample_rate\n    ratio = waveform.size(0) / emission.size(1) / sample_rate\n\n    chars = []\n    words = []\n    word_start = None\n    for token, timestep in zip(tokens, timesteps * ratio):\n        if token == \"|\":\n            if word_start is not None:\n                words.append((word_start, timestep))\n            word_start = None\n        else:\n            chars.append((token, timestep))\n            if word_start is None:\n                word_start = timestep\n\n    fig, axes = plt.subplots(3, 1)\n\n    def _plot(ax, xlim):\n        ax.plot(t, waveform)\n        for token, timestep in chars:\n            ax.annotate(token.upper(), (timestep, 0.5))\n        for word_start, word_end in words:\n            ax.axvspan(word_start, word_end, alpha=0.1, color=\"red\")\n        ax.set_ylim(-0.6, 0.7)\n        ax.set_yticks([0])\n        ax.grid(True, axis=\"y\")\n        ax.set_xlim(xlim)\n\n    _plot(axes[0], (0.3, 2.5))\n    _plot(axes[1], (2.5, 4.7))\n    _plot(axes[2], (4.7, 6.9))\n    axes[2].set_xlabel(\"time (sec)\")\n    fig.tight_layout()\n\n\nplot_alignments(waveform[0], emission, predicted_tokens, timesteps, bundle.sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Beam Search Decoder Parameters\n\nIn this section, we go a little bit more in depth about some different\nparameters and tradeoffs. For the full list of customizable parameters,\nplease refer to the\n:py:func:`documentation <torchaudio.models.decoder.ctc_decoder>`.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Helper Function\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def print_decoded(decoder, emission, param, param_value):\n    start_time = time.monotonic()\n    result = decoder(emission)\n    decode_time = time.monotonic() - start_time\n\n    transcript = \" \".join(result[0][0].words).lower().strip()\n    score = result[0][0].score\n    print(f\"{param} {param_value:<3}: {transcript} (score: {score:.2f}; {decode_time:.4f} secs)\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### nbest\n\nThis parameter indicates the number of best hypotheses to return, which\nis a property that is not possible with the greedy decoder. For\ninstance, by setting ``nbest=3`` when constructing the beam search\ndecoder earlier, we can now access the hypotheses with the top 3 scores.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for i in range(3):\n    transcript = \" \".join(beam_search_result[0][i].words).strip()\n    score = beam_search_result[0][i].score\n    print(f\"{transcript} (score: {score})\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### beam size\n\nThe ``beam_size`` parameter determines the maximum number of best\nhypotheses to hold after each decoding step. Using larger beam sizes\nallows for exploring a larger range of possible hypotheses which can\nproduce hypotheses with higher scores, but it is computationally more\nexpensive and does not provide additional gains beyond a certain point.\n\nIn the example below, we see improvement in decoding quality as we\nincrease beam size from 1 to 5 to 50, but notice how using a beam size\nof 500 provides the same output as beam size 50 while increase the\ncomputation time.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "beam_sizes = [1, 5, 50, 500]\n\nfor beam_size in beam_sizes:\n    beam_search_decoder = ctc_decoder(\n        lexicon=files.lexicon,\n        tokens=files.tokens,\n        lm=files.lm,\n        beam_size=beam_size,\n        lm_weight=LM_WEIGHT,\n        word_score=WORD_SCORE,\n    )\n\n    print_decoded(beam_search_decoder, emission, \"beam size\", beam_size)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### beam size token\n\nThe ``beam_size_token`` parameter corresponds to the number of tokens to\nconsider for expanding each hypothesis at the decoding step. Exploring a\nlarger number of next possible tokens increases the range of potential\nhypotheses at the cost of computation.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "num_tokens = len(tokens)\nbeam_size_tokens = [1, 5, 10, num_tokens]\n\nfor beam_size_token in beam_size_tokens:\n    beam_search_decoder = ctc_decoder(\n        lexicon=files.lexicon,\n        tokens=files.tokens,\n        lm=files.lm,\n        beam_size_token=beam_size_token,\n        lm_weight=LM_WEIGHT,\n        word_score=WORD_SCORE,\n    )\n\n    print_decoded(beam_search_decoder, emission, \"beam size token\", beam_size_token)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### beam threshold\n\nThe ``beam_threshold`` parameter is used to prune the stored hypotheses\nset at each decoding step, removing hypotheses whose scores are greater\nthan ``beam_threshold`` away from the highest scoring hypothesis. There\nis a balance between choosing smaller thresholds to prune more\nhypotheses and reduce the search space, and choosing a large enough\nthreshold such that plausible hypotheses are not pruned.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "beam_thresholds = [1, 5, 10, 25]\n\nfor beam_threshold in beam_thresholds:\n    beam_search_decoder = ctc_decoder(\n        lexicon=files.lexicon,\n        tokens=files.tokens,\n        lm=files.lm,\n        beam_threshold=beam_threshold,\n        lm_weight=LM_WEIGHT,\n        word_score=WORD_SCORE,\n    )\n\n    print_decoded(beam_search_decoder, emission, \"beam threshold\", beam_threshold)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### language model weight\n\nThe ``lm_weight`` parameter is the weight to assign to the language\nmodel score which to accumulate with the acoustic model score for\ndetermining the overall scores. Larger weights encourage the model to\npredict next words based on the language model, while smaller weights\ngive more weight to the acoustic model score instead.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "lm_weights = [0, LM_WEIGHT, 15]\n\nfor lm_weight in lm_weights:\n    beam_search_decoder = ctc_decoder(\n        lexicon=files.lexicon,\n        tokens=files.tokens,\n        lm=files.lm,\n        lm_weight=lm_weight,\n        word_score=WORD_SCORE,\n    )\n\n    print_decoded(beam_search_decoder, emission, \"lm weight\", lm_weight)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### additional parameters\n\nAdditional parameters that can be optimized include the following\n\n- ``word_score``: score to add when word finishes\n- ``unk_score``: unknown word appearance score to add\n- ``sil_score``: silence appearance score to add\n- ``log_add``: whether to use log add for lexicon Trie smearing\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}