{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# StreamReader Advanced Usages\n\n**Author**: [Moto Hira](moto@meta.com)_\n\nThis tutorial is the continuation of\n[StreamReader Basic Usages](./streamreader_basic_tutorial.html)_.\n\nThis shows how to use :py:class:`~torchaudio.io.StreamReader` for\n\n- Device inputs, such as microphone, webcam and screen recording\n- Generating synthetic audio / video\n- Applying preprocessing with custom filter expressions\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\n\nimport IPython\nimport matplotlib.pyplot as plt\nfrom torchaudio.io import StreamReader\n\nbase_url = \"https://download.pytorch.org/torchaudio/tutorial-assets\"\nAUDIO_URL = f\"{base_url}/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav\"\nVIDEO_URL = f\"{base_url}/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4.mp4\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Audio / Video device input\n\n.. seealso::\n\n - [Accelerated Video Decoding with NVDEC](../hw_acceleration_tutorial.html)_.\n - [Online ASR with Emformer RNN-T](./online_asr_tutorial.html)_.\n - [Device ASR with Emformer RNN-T](./device_asr.html)_.\n\nGiven that the system has proper media devices and libavdevice is\nconfigured to use the devices, the streaming API can\npull media streams from these devices.\n\nTo do this, we pass additional parameters ``format`` and ``option``\nto the constructor. ``format`` specifies the device component and\n``option`` dictionary is specific to the specified component.\n\nThe exact arguments to be passed depend on the system configuration.\nPlease refer to https://ffmpeg.org/ffmpeg-devices.html for the detail.\n\nThe following example illustrates how one can do this on MacBook Pro.\n\nFirst, we need to check the available devices.\n\n.. code::\n\n $ ffmpeg -f avfoundation -list_devices true -i \"\"\n [AVFoundation indev @ 0x143f04e50] AVFoundation video devices:\n [AVFoundation indev @ 0x143f04e50] [0] FaceTime HD Camera\n [AVFoundation indev @ 0x143f04e50] [1] Capture screen 0\n [AVFoundation indev @ 0x143f04e50] AVFoundation audio devices:\n [AVFoundation indev @ 0x143f04e50] [0] MacBook Pro Microphone\n\nWe use `FaceTime HD Camera` as video device (index 0) and\n`MacBook Pro Microphone` as audio device (index 0).\n\nIf we do not pass any ``option``, the device uses its default\nconfiguration. The decoder might not support the configuration.\n\n.. code::\n\n >>> StreamReader(\n ... src=\"0:0\", # The first 0 means `FaceTime HD Camera`, and\n ... # the second 0 indicates `MacBook Pro Microphone`.\n ... format=\"avfoundation\",\n ... )\n [avfoundation @ 0x125d4fe00] Selected framerate (29.970030) is not supported by the device.\n [avfoundation @ 0x125d4fe00] Supported modes:\n [avfoundation @ 0x125d4fe00] 1280x720@[1.000000 30.000000]fps\n [avfoundation @ 0x125d4fe00] 640x480@[1.000000 30.000000]fps\n Traceback (most recent call last):\n File \"\", line 1, in \n ...\n RuntimeError: Failed to open the input: 0:0\n\nBy providing ``option``, we can change the format that the device\nstreams to a format supported by decoder.\n\n.. code::\n\n >>> streamer = StreamReader(\n ... src=\"0:0\",\n ... format=\"avfoundation\",\n ... option={\"framerate\": \"30\", \"pixel_format\": \"bgr0\"},\n ... )\n >>> for i in range(streamer.num_src_streams):\n ... 
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n## Synthetic source streams\n\nAs a part of its device integration, ffmpeg provides a \"virtual device\"\ninterface, which generates synthetic audio / video data\nusing libavfilter.\n\nTo use this, we set ``format=lavfi`` and provide a filter description\nto ``src``.\n\nThe details of the filter description syntax can be found at\nhttps://ffmpeg.org/ffmpeg-filters.html\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Audio Examples\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sine wave\nhttps://ffmpeg.org/ffmpeg-filters.html#sine\n\n.. code::\n\n    StreamReader(src=\"sine=sample_rate=8000:frequency=360\", format=\"lavfi\")\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Signal with arbitrary expression\n\nhttps://ffmpeg.org/ffmpeg-filters.html#aevalsrc\n\n.. code::\n\n    # 5 Hz binaural beats on a 360 Hz carrier\n    StreamReader(\n        src=(\n            'aevalsrc='\n            'sample_rate=8000:'\n            'exprs=0.1*sin(2*PI*(360-5/2)*t)|0.1*sin(2*PI*(360+5/2)*t)'\n        ),\n        format='lavfi',\n    )\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Noise\nhttps://ffmpeg.org/ffmpeg-filters.html#anoisesrc\n\n.. code::\n\n    StreamReader(src=\"anoisesrc=color=pink:sample_rate=8000:amplitude=0.5\", format=\"lavfi\")\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Video Examples\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cellular automaton\nhttps://ffmpeg.org/ffmpeg-filters.html#cellauto\n\n.. code::\n\n    StreamReader(src=\"cellauto\", format=\"lavfi\")\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mandelbrot\nhttps://ffmpeg.org/ffmpeg-filters.html#mandelbrot\n\n.. code::\n\n    StreamReader(src=\"mandelbrot\", format=\"lavfi\")\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### MPlayer Test patterns\nhttps://ffmpeg.org/ffmpeg-filters.html#mptestsrc\n\n.. code::\n\n    StreamReader(src=\"mptestsrc\", format=\"lavfi\")\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### John Conway's Game of Life\nhttps://ffmpeg.org/ffmpeg-filters.html#life\n\n.. code::\n\n    StreamReader(src=\"life\", format=\"lavfi\")\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sierpinski carpet/triangle fractal\nhttps://ffmpeg.org/ffmpeg-filters.html#sierpinski\n\n.. code::\n\n    StreamReader(src=\"sierpinski\", format=\"lavfi\")\n\n\n" ] },
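{ "cell_type": "markdown", "metadata": {}, "source": [ "The snippets above only construct the readers. As a minimal sketch of\nactually pulling data from a synthetic source (assuming the local ffmpeg\nbuild includes the ``sine`` filter), one can stream from it like from any\nother source:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch: pull one chunk from a synthetic sine source.\n# Assumes the local ffmpeg build provides the lavfi \"sine\" source.\nstreamer = StreamReader(src=\"sine=sample_rate=8000:frequency=360\", format=\"lavfi\")\nstreamer.add_basic_audio_stream(frames_per_chunk=8000)  # one-second chunks\n(chunk,) = next(streamer.stream())\nprint(chunk.shape)  # frames x channels, e.g. torch.Size([8000, 1])" ] },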
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Custom filters\n\nWhen defining an output stream, you can use the\n:py:meth:`~torchaudio.io.StreamReader.add_audio_stream` and\n:py:meth:`~torchaudio.io.StreamReader.add_video_stream` methods.\n\nThese methods take a ``filter_desc`` argument, which is a string\nformatted according to ffmpeg's\n[filter expression](https://ffmpeg.org/ffmpeg-filters.html) syntax.\n\nThe difference between ``add_basic_(audio|video)_stream`` and\n``add_(audio|video)_stream`` is that ``add_basic_(audio|video)_stream``\nconstructs the filter expression for you and passes it to the same underlying\nimplementation. Everything ``add_basic_(audio|video)_stream`` does can be\nachieved with ``add_(audio|video)_stream``.\n\n
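For example, the following two calls are roughly equivalent (a sketch;\nthe exact expression that ``add_basic_audio_stream`` generates internally\nmay differ slightly):\n\n.. code::\n\n    # The helper builds the filter expression internally\n    streamer.add_basic_audio_stream(frames_per_chunk=8000, sample_rate=8000)\n\n    # Spelling out a comparable filter expression manually\n    streamer.add_audio_stream(\n        frames_per_chunk=8000,\n        filter_desc=\"aresample=8000,aformat=sample_fmts=fltp\",\n    )\n\n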
.. note::\n\n   - When applying custom filters, the client code must convert\n     the audio/video stream to one of the formats that torchaudio\n     can convert to tensor format.\n     This can be achieved, for example, by applying\n     ``format=pix_fmts=rgb24`` to the video stream and\n     ``aformat=sample_fmts=fltp`` to the audio stream.\n   - Each output stream has a separate filter graph. Therefore, it is\n     not possible to use different input/output streams in a single\n     filter expression. However, it is possible to split one input\n     stream into multiple streams, and merge them later.
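\n\nAs an illustration of the first point, the video examples later in this\nsection always end the filter chain with a pixel-format conversion.\nSpelled out for one of them:\n\n.. code::\n\n    streamer.add_video_stream(\n        frames_per_chunk=30,\n        filter_desc=\"fps=10,edgedetect=mode=canny,format=pix_fmts=rgb24\",\n    )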
\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Audio Examples\n\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# fmt: off\ndescs = [\n # No filtering\n \"anull\",\n # Apply a highpass filter then a lowpass filter\n \"highpass=f=200,lowpass=f=1000\",\n # Manipulate spectrogram\n (\n \"afftfilt=\"\n \"real='hypot(re,im)*sin(0)':\"\n \"imag='hypot(re,im)*cos(0)':\"\n \"win_size=512:\"\n \"overlap=0.75\"\n ),\n # Manipulate spectrogram\n (\n \"afftfilt=\"\n \"real='hypot(re,im)*cos((random(0)*2-1)*2*3.14)':\"\n \"imag='hypot(re,im)*sin((random(1)*2-1)*2*3.14)':\"\n \"win_size=128:\"\n \"overlap=0.8\"\n ),\n]\n# fmt: on" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sample_rate = 8000\n\nstreamer = StreamReader(AUDIO_URL)\nfor desc in descs:\n streamer.add_audio_stream(\n frames_per_chunk=40000,\n filter_desc=f\"aresample={sample_rate},{desc},aformat=sample_fmts=fltp\",\n )\n\nchunks = next(streamer.stream())\n\n\ndef _display(i):\n print(\"filter_desc:\", streamer.get_out_stream_info(i).filter_description)\n fig, axs = plt.subplots(2, 1)\n waveform = chunks[i][:, 0]\n axs[0].plot(waveform)\n axs[0].grid(True)\n axs[0].set_ylim([-1, 1])\n plt.setp(axs[0].get_xticklabels(), visible=False)\n axs[1].specgram(waveform, Fs=sample_rate)\n fig.tight_layout()\n return IPython.display.Audio(chunks[i].T, rate=sample_rate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Original\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Highpass / lowpass filter\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### FFT filter - Robot \ud83e\udd16\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### FFT filter - Whisper\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Video Examples\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# fmt: off\ndescs = [\n # No effect\n \"null\",\n # Split the input stream and apply horizontal flip to the right half.\n (\n \"split [main][tmp];\"\n \"[tmp] crop=iw/2:ih:0:0, hflip [flip];\"\n \"[main][flip] overlay=W/2:0\"\n ),\n # Edge detection\n \"edgedetect=mode=canny\",\n # Rotate image by randomly and fill the background with brown\n \"rotate=angle=-random(1)*PI:fillcolor=brown\",\n # Manipulate pixel values based on the coordinate\n \"geq=r='X/W*r(X,Y)':g='(1-X/W)*g(X,Y)':b='(H-Y)/H*b(X,Y)'\"\n]\n# fmt: on" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "streamer = StreamReader(VIDEO_URL)\nfor desc in descs:\n streamer.add_video_stream(\n frames_per_chunk=30,\n filter_desc=f\"fps=10,{desc},format=pix_fmts=rgb24\",\n )\n\nstreamer.seek(12)\n\nchunks = next(streamer.stream())\n\n\ndef _display(i):\n print(\"filter_desc:\", 
    print(\"filter_desc:\", streamer.get_out_stream_info(i).filter_description)\n    _, axs = plt.subplots(1, 3, figsize=(8, 1.9))\n    chunk = chunks[i]\n    for j in range(3):\n        axs[j].imshow(chunk[10 * j + 1].permute(1, 2, 0))\n        axs[j].set_axis_off()\n    plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Original\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mirror\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Edge detection\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Random rotation\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Pixel manipulation\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_display(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tag: :obj:`torchaudio.io`\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }