{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Device ASR with Emformer RNN-T\n\n**Author**: [Moto Hira](moto@meta.com)_, [Jeff Hwang](jeffhwang@meta.com)_.\n\nThis tutorial shows how to use Emformer RNN-T and streaming API\nto perform speech recognition on a streaming device input, i.e. microphone\non laptop.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<div class=\"alert alert-info\"><h4>Note</h4><p>This tutorial requires FFmpeg libraries.\n Please refer to `FFmpeg dependency` for details.</p></div>\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>This tutorial was tested on a MacBook Pro and a Dynabook with Windows 10.\n\n This tutorial does NOT work on Google Colab because the server running\n this tutorial does not have a microphone that you can talk to.</p></div>
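\n\nTo quickly check that the FFmpeg integration is available in your torchaudio installation,\nyou can query the versions of the FFmpeg libraries that torchaudio has loaded.\nThis is a minimal sketch; the exact utility location may vary across torchaudio versions.\n\n.. code::\n\n   import torchaudio\n\n   # Print the versions of the FFmpeg libraries torchaudio has loaded.\n   # If the FFmpeg integration is unavailable, this raises an error.\n   print(torchaudio.utils.ffmpeg_utils.get_versions())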
\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Overview\n\nWe use streaming API to fetch audio from audio device (microphone)\nchunk by chunk, then run inference using Emformer RNN-T.\n\nFor the basic usage of the streaming API and Emformer RNN-T\nplease refer to\n[StreamReader Basic Usage](./streamreader_basic_tutorial.html)_ and\n[Online ASR with Emformer RNN-T](./online_asr_tutorial.html)_.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Checking the supported devices\n\nFirstly, we need to check the devices that Streaming API can access,\nand figure out the arguments (``src`` and ``format``) we need to pass\nto :py:func:`~torchaudio.io.StreamReader` class.\n\nWe use ``ffmpeg`` command for this. ``ffmpeg`` abstracts away the\ndifference of underlying hardware implementations, but the expected\nvalue for ``format`` varies across OS and each ``format`` defines\ndifferent syntax for ``src``.\n\nThe details of supported ``format`` values and ``src`` syntax can\nbe found in https://ffmpeg.org/ffmpeg-devices.html.\n\nFor macOS, the following command will list the available devices.\n\n.. code::\n\n $ ffmpeg -f avfoundation -list_devices true -i dummy\n ...\n [AVFoundation indev @ 0x126e049d0] AVFoundation video devices:\n [AVFoundation indev @ 0x126e049d0] [0] FaceTime HD Camera\n [AVFoundation indev @ 0x126e049d0] [1] Capture screen 0\n [AVFoundation indev @ 0x126e049d0] AVFoundation audio devices:\n [AVFoundation indev @ 0x126e049d0] [0] ZoomAudioDevice\n [AVFoundation indev @ 0x126e049d0] [1] MacBook Pro Microphone\n\nWe will use the following values for Streaming API.\n\n.. code::\n\n StreamReader(\n src = \":1\", # no video, audio from device 1, \"MacBook Pro Microphone\"\n format = \"avfoundation\",\n )\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Windows, ``dshow`` device should work.\n\n.. code::\n\n > ffmpeg -f dshow -list_devices true -i dummy\n ...\n [dshow @ 000001adcabb02c0] DirectShow video devices (some may be both video and audio devices)\n [dshow @ 000001adcabb02c0] \"TOSHIBA Web Camera - FHD\"\n [dshow @ 000001adcabb02c0] Alternative name \"@device_pnp_\\\\?\\usb#vid_10f1&pid_1a42&mi_00#7&27d916e6&0&0000#{65e8773d-8f56-11d0-a3b9-00a0c9223196}\\global\"\n [dshow @ 000001adcabb02c0] DirectShow audio devices\n [dshow @ 000001adcabb02c0] \"... (Realtek High Definition Audio)\"\n [dshow @ 000001adcabb02c0] Alternative name \"@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\\wave_{BF2B8AE1-10B8-4CA4-A0DC-D02E18A56177}\"\n\nIn the above case, the following value can be used to stream from microphone.\n\n.. code::\n\n StreamReader(\n src = \"audio=@device_cm_{33D9A762-90C8-11D0-BD43-00A0C911CE86}\\wave_{BF2B8AE1-10B8-4CA4-A0DC-D02E18A56177}\",\n format = \"dshow\",\n )\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Data acquisition\n\nStreaming audio from microphone input requires properly timing data\nacquisition. 
Failing to do so may introduce discontinuities in the\ndata stream.\n\nFor this reason, we will run the data acquisition in a subprocess.\n\nFirstly, we create a helper function that encapsulates the whole\nprocess executed in the subprocess.\n\nThis function initializes the streaming API, acquires data then\nputs it in a queue, which the main process is watching.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\n\n# The data acquisition process will stop after this number of steps.\n# This eliminates the need of process synchronization and makes this\n# tutorial simple.\nNUM_ITER = 100\n\n\ndef stream(q, format, src, segment_length, sample_rate):\n from torchaudio.io import StreamReader\n\n print(\"Building StreamReader...\")\n streamer = StreamReader(src, format=format)\n streamer.add_basic_audio_stream(frames_per_chunk=segment_length, sample_rate=sample_rate)\n\n print(streamer.get_src_stream_info(0))\n print(streamer.get_out_stream_info(0))\n\n print(\"Streaming...\")\n print()\n stream_iterator = streamer.stream(timeout=-1, backoff=1.0)\n for _ in range(NUM_ITER):\n (chunk,) = next(stream_iterator)\n q.put(chunk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notable difference from the non-device streaming is that,\nwe provide ``timeout`` and ``backoff`` parameters to ``stream`` method.\n\nWhen acquiring data, if the rate of acquisition requests is higher\nthan that at which the hardware can prepare the data, then\nthe underlying implementation reports special error code, and expects\nclient code to retry.\n\nPrecise timing is the key for smooth streaming. Reporting this error\nfrom low-level implementation all the way back to Python layer,\nbefore retrying adds undesired overhead.\nFor this reason, the retry behavior is implemented in C++ layer, and\n``timeout`` and ``backoff`` parameters allow client code to control the\nbehavior.\n\nFor the detail of ``timeout`` and ``backoff`` parameters, please refer\nto the documentation of\n:py:meth:`~torchaudio.io.StreamReader.stream` method.\n\n
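Conceptually, the retry behavior that these parameters control can be sketched\nin Python as follows. This is only an illustration, not the actual implementation\n(which lives in the C++ layer, and whose units and error handling differ);\n``acquire_one_chunk`` is a hypothetical placeholder for the low-level read.\n\n.. code::\n\n   import time\n\n   def read_with_retry(acquire_one_chunk, timeout, backoff):\n       # Retry while the device reports that data is not ready yet.\n       # A negative ``timeout`` means \"wait indefinitely\".\n       start = time.monotonic()\n       while True:\n           chunk = acquire_one_chunk()  # hypothetical low-level read\n           if chunk is not None:  # data is ready\n               return chunk\n           if timeout >= 0 and time.monotonic() - start >= timeout:\n               raise RuntimeError(\"Timed out waiting for the device\")\n           time.sleep(backoff)  # wait before retrying\n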
<div class=\"alert alert-info\"><h4>Note</h4><p>The proper value of ``backoff`` depends on the system configuration.\n One way to check whether the ``backoff`` value is appropriate is to save the\n series of acquired chunks as continuous audio and listen to it.\n If the ``backoff`` value is too large, the data stream becomes discontinuous,\n and the resulting audio sounds sped up.\n If the ``backoff`` value is too small or zero, the audio stream is fine,\n but the data acquisition process enters a busy-waiting state, which\n increases CPU consumption.</p></div>
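\nFor example, the check suggested above can be sketched as follows: collect the chunks\ndrained from the queue during a test run into a list, concatenate them, and write them\nout with ``torchaudio.save``. Chunks produced by ``add_basic_audio_stream`` have shape\n``(frames, channels)``, so they are concatenated along the first dimension and\ntransposed before saving. This snippet is an illustration, not part of the tutorial pipeline.\n\n.. code::\n\n   import torch\n   import torchaudio\n\n   # ``chunks`` is assumed to be a list of (frames, channels) tensors\n   # drained from the queue during a test run.\n   waveform = torch.cat(chunks, dim=0).T  # -> (channels, frames)\n   torchaudio.save(\"check_backoff.wav\", waveform, sample_rate=16000)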
\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Building inference pipeline\n\nThe next step is to create components required for inference.\n\nThis is the same process as\n[Online ASR with Emformer RNN-T](./online_asr_tutorial.html)_.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Pipeline:\n \"\"\"Build inference pipeline from RNNTBundle.\n\n Args:\n bundle (torchaudio.pipelines.RNNTBundle): Bundle object\n beam_width (int): Beam size of beam search decoder.\n \"\"\"\n\n def __init__(self, bundle: torchaudio.pipelines.RNNTBundle, beam_width: int = 10):\n self.bundle = bundle\n self.feature_extractor = bundle.get_streaming_feature_extractor()\n self.decoder = bundle.get_decoder()\n self.token_processor = bundle.get_token_processor()\n\n self.beam_width = beam_width\n\n self.state = None\n self.hypotheses = None\n\n def infer(self, segment: torch.Tensor) -> str:\n \"\"\"Perform streaming inference\"\"\"\n features, length = self.feature_extractor(segment)\n self.hypotheses, self.state = self.decoder.infer(\n features, length, self.beam_width, state=self.state, hypothesis=self.hypotheses\n )\n transcript = self.token_processor(self.hypotheses[0][0], lstrip=False)\n return transcript" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class ContextCacher:\n \"\"\"Cache the end of input data and prepend the next input data with it.\n\n Args:\n segment_length (int): The size of main segment.\n If the incoming segment is shorter, then the segment is padded.\n context_length (int): The size of the context, cached and appended.\n \"\"\"\n\n def __init__(self, segment_length: int, context_length: int):\n self.segment_length = segment_length\n self.context_length = context_length\n self.context = torch.zeros([context_length])\n\n def __call__(self, chunk: torch.Tensor):\n if chunk.size(0) < self.segment_length:\n chunk = torch.nn.functional.pad(chunk, (0, self.segment_length - chunk.size(0)))\n chunk_with_context = torch.cat((self.context, chunk))\n self.context = chunk[-self.context_length :]\n return chunk_with_context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. The main process\n\nThe execution flow of the main process is as follows:\n\n1. Initialize the inference pipeline.\n2. Launch data acquisition subprocess.\n3. Run inference.\n4. Clean up\n\n
<div class=\"alert alert-info\"><h4>Note</h4><p>Since the data acquisition subprocess is launched with the ``\"spawn\"``\n method, all the code at global scope is executed in the subprocess\n as well.\n\n We want to instantiate the pipeline only in the main process,\n so we put the instantiation in a function and invoke it within\n the ``__name__ == \"__main__\"`` guard.</p></div>
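\nFor reference, the ``segment_length`` and ``context_length`` computed in ``main()`` below\nare derived from the bundle parameters. A rough worked example, assuming the LibriSpeech\nbundle uses a hop length of 160 samples (10 ms at 16 kHz):\n\n.. code::\n\n   # These values match the sample output shown after the code cell.\n   segment_length = 16 * 160  # = 2560 samples = 0.16 seconds at 16 kHz\n   context_length = 4 * 160   # = 640 samples  = 0.04 seconds at 16 kHz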
\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def main(device, src, bundle):\n print(torch.__version__)\n print(torchaudio.__version__)\n\n print(\"Building pipeline...\")\n pipeline = Pipeline(bundle)\n\n sample_rate = bundle.sample_rate\n segment_length = bundle.segment_length * bundle.hop_length\n context_length = bundle.right_context_length * bundle.hop_length\n\n print(f\"Sample rate: {sample_rate}\")\n print(f\"Main segment: {segment_length} frames ({segment_length / sample_rate} seconds)\")\n print(f\"Right context: {context_length} frames ({context_length / sample_rate} seconds)\")\n\n cacher = ContextCacher(segment_length, context_length)\n\n @torch.inference_mode()\n def infer():\n for _ in range(NUM_ITER):\n chunk = q.get()\n segment = cacher(chunk[:, 0])\n transcript = pipeline.infer(segment)\n print(transcript, end=\"\\r\", flush=True)\n\n import torch.multiprocessing as mp\n\n ctx = mp.get_context(\"spawn\")\n q = ctx.Queue()\n p = ctx.Process(target=stream, args=(q, device, src, segment_length, sample_rate))\n p.start()\n infer()\n p.join()\n\n\nif __name__ == \"__main__\":\n main(\n device=\"avfoundation\",\n src=\":1\",\n bundle=torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH,\n )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ".. code::\n\n Building pipeline...\n Sample rate: 16000\n Main segment: 2560 frames (0.16 seconds)\n Right context: 640 frames (0.04 seconds)\n Building StreamReader...\n SourceAudioStream(media_type='audio', codec='pcm_f32le', codec_long_name='PCM 32-bit floating point little-endian', format='flt', bit_rate=1536000, sample_rate=48000.0, num_channels=1)\n OutputStream(source_index=0, filter_description='aresample=16000,aformat=sample_fmts=fltp')\n Streaming...\n\n hello world\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tag: :obj:`torchaudio.io`\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }