{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Device AV-ASR with Emformer RNN-T\n\n**Author**: [Pingchuan Ma](pingchuanma@meta.com), [Moto\nHira](moto@meta.com).\n\nThis tutorial shows how to run on-device audio-visual speech recognition\n(AV-ASR, or AVSR) with TorchAudio on a streaming device input,\ne.g.\u00a0the microphone on a laptop. AV-ASR is the task of transcribing text from\naudio and visual streams, which has recently attracted a lot of research\nattention due to its robustness against noise.\n\n
This tutorial requires the ffmpeg, sentencepiece, mediapipe,\n opencv-python and scikit-image libraries.\n\n There are multiple ways to install the FFmpeg libraries.\n If you are using the Anaconda Python\n distribution, ``conda install -c conda-forge 'ffmpeg<7'`` will\n install compatible FFmpeg libraries.\n\n You can run\n ``pip install sentencepiece mediapipe opencv-python scikit-image`` to\n install the other libraries mentioned.
To run this tutorial, please make sure you are in the `tutorial` folder.
We tested the tutorial with torchaudio version 2.0.2 on a MacBook Pro (M1 Pro).
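
Before running the tutorial, it can help to confirm that the prerequisites listed above are visible to the current Python environment. The following is a minimal sketch, not part of the original tutorial. The helper name `find_missing` is hypothetical, and the import names `cv2` and `skimage` are the usual module names for the `opencv-python` and `scikit-image` packages. The FFmpeg check only looks for the `ffmpeg` executable on the PATH, which is an approximation: it does not verify that the FFmpeg shared libraries are present or that their version is compatible.

```python
import importlib.util
import shutil

# Map of pip package name -> import module name.
# Assumption: opencv-python is imported as cv2 and scikit-image as skimage.
REQUIRED_MODULES = {
    'sentencepiece': 'sentencepiece',
    'mediapipe': 'mediapipe',
    'opencv-python': 'cv2',
    'scikit-image': 'skimage',
    'torchaudio': 'torchaudio',
}


def find_missing(modules=REQUIRED_MODULES):
    '''Return the pip package names whose import module cannot be located.'''
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


missing = find_missing()

# shutil.which only finds the ffmpeg executable on PATH; it says nothing
# about the shared libraries or their version.
has_ffmpeg_cli = shutil.which('ffmpeg') is not None

print('missing packages:', ', '.join(missing) if missing else 'none')
print('ffmpeg executable on PATH:', has_ffmpeg_cli)
```

If any package is reported as missing, the `pip install` and `conda install` commands given above are the place to start.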