{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# StreamWriter Basic Usage\n\n**Author**: [Moto Hira](mailto:moto@meta.com)\n\nThis tutorial shows how to use :py:class:`torchaudio.io.StreamWriter` to\nencode and save audio/video data into various formats/destinations.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-info\"><h4>Note</h4><p>This tutorial requires FFmpeg libraries.\n Please refer to `FFmpeg dependency` for\n the details.</p></div>\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-danger\"><h4>Warning</h4><p>TorchAudio dynamically loads compatible FFmpeg libraries\n installed on the system.\n The types of supported formats (media format, encoder, encoder\n options, etc.) depend on the libraries.\n\n To check the available muxers and encoders, you can use the\n following commands:\n\n```console\nffmpeg -muxers\nffmpeg -encoders\n```\n</p></div>\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\n\nfrom torchaudio.io import StreamWriter\n\nprint(\"FFmpeg library versions\")\nfor k, v in torchaudio.utils.ffmpeg_utils.get_versions().items():\n print(f\" {k}: {v}\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import io\nimport os\nimport tempfile\n\nfrom IPython.display import Audio, Video\n\nfrom torchaudio.utils import download_asset\n\nSAMPLE_PATH = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav\")\nWAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_PATH, channels_first=False)\nNUM_FRAMES, NUM_CHANNELS = WAVEFORM.shape\n\n_BASE_DIR = tempfile.TemporaryDirectory()\n\n\ndef get_path(filename):\n return os.path.join(_BASE_DIR.name, filename)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The basic usage\n\nTo save Tensor data into media formats with StreamWriter, there\nare three necessary steps:\n\n1. Specify the output\n2. Configure streams\n3. Write data\n\nThe following code illustrates how to save audio data as a WAV file.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 1. Define the destination. (local file in this case)\npath = get_path(\"test.wav\")\ns = StreamWriter(path)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 2. Configure the stream. (WAV with the sample rate and channel count of the source)\ns.add_audio_stream(\n sample_rate=SAMPLE_RATE,\n num_channels=NUM_CHANNELS,\n)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 3. Write the data\nwith s.open():\n s.write_audio_chunk(0, WAVEFORM)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Audio(path)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we look into each step in more detail.\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Write destination\n\nStreamWriter supports different types of write destinations:\n\n1. Local files\n2. File-like objects\n3. Streaming protocols (such as RTMP and UDP)\n4. Media devices (speakers and video players) \u2020\n\n\u2020 For media devices, please refer to\n[StreamWriter Advanced Usages](./streamwriter_advanced.html).\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Local files\n\nStreamWriter supports saving media to local files.\n\n\n.. code::\n\n StreamWriter(dst=\"audio.wav\")\n\n StreamWriter(dst=\"audio.mp3\")\n\nThis works for still images and videos as well.\n\n.. code::\n\n StreamWriter(dst=\"image.jpeg\")\n\n StreamWriter(dst=\"video.mpeg\")\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### File-like objects\n\nYou can also pass a file-like object. A file-like object must implement a\n``write`` method conforming to :py:attr:`io.RawIOBase.write`.\n\n.. code::\n\n # Open the local file as fileobj\n with open(\"audio.wav\", \"wb\") as dst:\n  StreamWriter(dst=dst)\n\n.. code::\n\n # In-memory encoding\n buffer = io.BytesIO()\n StreamWriter(dst=buffer)\n\n\n" ] },
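{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch (reusing the variables from the Preparation section above,\nwith the round trip through :py:func:`torchaudio.load` added purely for\nillustration), the following encodes the waveform into an in-memory WAV buffer\nand decodes it back. Note that with a file-like destination there is no file\nextension to infer the container format from, so ``format`` is given explicitly.\n\n.. code::\n\n buffer = io.BytesIO()\n s = StreamWriter(dst=buffer, format=\"wav\")\n s.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)\n with s.open():\n  s.write_audio_chunk(0, WAVEFORM)\n\n # Rewind the buffer and decode the result to verify the round trip\n buffer.seek(0)\n waveform, sample_rate = torchaudio.load(buffer)\n\n\n" ] },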
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming protocols\n\nYou can stream the media with streaming protocols:\n\n.. code::\n\n # Real-Time Messaging Protocol\n StreamWriter(dst=\"rtmp://localhost:1234/live/app\", format=\"flv\")\n\n # UDP\n StreamWriter(dst=\"udp://localhost:48550\", format=\"mpegts\")\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Configuring output streams\n\nOnce the destination is specified, the next step is to configure the streams.\nFor typical audio and still image cases, only one stream is required,\nbut for video with audio, at least two streams (one for audio and the other\nfor video) need to be configured.\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Audio Stream\n\nAn audio stream can be added with the\n:py:meth:`~torchaudio.io.StreamWriter.add_audio_stream` method.\n\nFor writing regular audio files, at minimum ``sample_rate`` and ``num_channels``\nare required.\n\n.. code::\n\n s = StreamWriter(\"audio.wav\")\n s.add_audio_stream(sample_rate=8000, num_channels=2)\n\nBy default, audio streams expect the input waveform tensors to be ``torch.float32`` type.\nIn the above case, the data will be encoded into the default encoding of the WAV format,\nwhich is 16-bit signed integer linear PCM. StreamWriter converts the sample format internally.\n\nIf the encoder supports multiple sample formats and you want to change the encoder sample format,\nyou can use the ``encoder_format`` option.\n\nIn the following example, StreamWriter expects the data type of the input waveform Tensor\nto be ``torch.float32``, but it will convert the samples to 16-bit signed integers when encoding.\n\n.. code::\n\n s = StreamWriter(\"audio.mp3\")\n s.add_audio_stream(\n ...,\n encoder=\"libmp3lame\", # \"libmp3lame\" is often the default encoder for mp3,\n # but we specify it manually for the sake of illustration.\n\n encoder_format=\"s16p\", # The \"libmp3lame\" encoder supports the following sample formats:\n # - \"s16p\" (16-bit signed integer)\n # - \"s32p\" (32-bit signed integer)\n # - \"fltp\" (32-bit floating point)\n )\n\nIf the data type of your waveform Tensor is something other than ``torch.float32``,\nyou can provide the ``format`` option to change the expected data type.\n\nThe following example configures StreamWriter to expect a Tensor of ``torch.int16`` type.\n\n.. code::\n\n # Audio data passed to StreamWriter must be torch.int16\n s.add_audio_stream(..., format=\"s16\")\n\nThe following figure illustrates how the ``format`` and ``encoder_format`` options work\nfor audio streams.\n\n\n\n\n" ] },
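{ "cell_type": "markdown", "metadata": {}, "source": [ "Putting these options together, here is a minimal sketch (the concrete parameter\nvalues are assumptions chosen for illustration): the input Tensors are 16-bit\nsigned integer (``format=\"s16\"``), while \"libmp3lame\" encodes 32-bit floating\npoint samples (``encoder_format=\"fltp\"``), so StreamWriter converts the samples\ninternally.\n\n.. code::\n\n s = StreamWriter(\"audio.mp3\")\n s.add_audio_stream(\n  sample_rate=8000,\n  num_channels=2,\n  encoder=\"libmp3lame\",\n  format=\"s16\", # dtype of the Tensors passed to write_audio_chunk\n  encoder_format=\"fltp\", # sample format used by the encoder\n )\n\n\n" ] },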
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Video Stream\n\nTo add a still image or a video stream, you can use the\n:py:meth:`~torchaudio.io.StreamWriter.add_video_stream` method.\n\nAt minimum, ``frame_rate``, ``height`` and ``width`` are required.\n\n.. code::\n\n s = StreamWriter(\"video.mp4\")\n s.add_video_stream(frame_rate=10, height=96, width=128)\n\nFor still images, please use ``frame_rate=1``.\n\n.. code::\n\n s = StreamWriter(\"image.png\")\n s.add_video_stream(frame_rate=1, ...)\n\nSimilar to the audio stream, you can provide the ``format`` and ``encoder_format``\noptions to control the format of the input data and the encoding.\n\nThe following example encodes video data in YUV422 format.\n\n.. code::\n\n s = StreamWriter(\"video.mov\")\n s.add_video_stream(\n ...,\n encoder=\"libx264\", # libx264 supports different YUV formats, such as\n # yuv420p yuvj420p yuv422p yuvj422p yuv444p yuvj444p nv12 nv16 nv21\n\n encoder_format=\"yuv422p\", # StreamWriter will convert the input data to YUV422 internally\n )\n\nYUV formats are commonly used in video encoding. In many YUV formats, the chroma\nplanes have a different size than the luma plane, which makes it difficult to\nexpress the data directly as a ``torch.Tensor``.\nTherefore, StreamWriter automatically converts the input video Tensor into the target format.\n\nStreamWriter expects the input image tensor to be 4-D (`time`, `channel`, `height`, `width`)\nand ``torch.uint8`` type.\n\nThe default color format is RGB, that is, three color channels corresponding to red, green and blue.\nIf your input uses a different color format, such as BGR or YUV, you can specify it with\nthe ``format`` option.\n\nThe following example specifies BGR format.\n\n.. code::\n\n s.add_video_stream(..., format=\"bgr24\")\n # Image data passed to StreamWriter must have\n # three color channels representing Blue Green Red.\n #\n # The shape of the input tensor has to be\n # (time, channel==3, height, width)\n\n\nThe following figure illustrates how the ``format`` and ``encoder_format`` options work for\nvideo streams.\n\n\n\n\n" ] },
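{ "cell_type": "markdown", "metadata": {}, "source": [ "As a combined sketch (again, the concrete parameter values are assumptions\nchosen for illustration): the input Tensors use the BGR channel order\n(``format=\"bgr24\"``), while \"libx264\" encodes YUV 4:2:0 frames\n(``encoder_format=\"yuv420p\"``), so StreamWriter converts the frames internally.\n\n.. code::\n\n s = StreamWriter(\"video.mp4\")\n s.add_video_stream(\n  frame_rate=30,\n  height=96,\n  width=128,\n  encoder=\"libx264\",\n  format=\"bgr24\", # layout of the Tensors passed to write_video_chunk\n  encoder_format=\"yuv420p\", # pixel format used by the encoder\n )\n\n\n" ] },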
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Write data\n\nOnce streams are configured, the next step is to open the output location\nand start writing data.\n\nUse the :py:meth:`~torchaudio.io.StreamWriter.open` method to open the\ndestination, and then write data with :py:meth:`~torchaudio.io.StreamWriter.write_audio_chunk`\nand/or :py:meth:`~torchaudio.io.StreamWriter.write_video_chunk`.\n\nAudio tensors are expected to have the shape of `(time, channels)`,\nand video/image tensors are expected to have the shape of `(time, channels, height, width)`.\n\nChannels, height and width must match the configuration of the corresponding\nstream, specified with the ``format`` option.\n\nA Tensor representing a still image must have only one frame in the time dimension,\nbut audio and video tensors can have an arbitrary number of frames in the time dimension.\n\nThe following code snippets illustrate this:\n\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Ex) Audio\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Configure stream\ns = StreamWriter(dst=get_path(\"audio.wav\"))\ns.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)\n\n# Write data\nwith s.open():\n s.write_audio_chunk(0, WAVEFORM)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Ex) Image\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Image config\nheight = 96\nwidth = 128\n\n# Configure stream\ns = StreamWriter(dst=get_path(\"image.png\"))\ns.add_video_stream(frame_rate=1, height=height, width=width, format=\"rgb24\")\n\n# Generate image\nchunk = torch.randint(256, (1, 3, height, width), dtype=torch.uint8)\n\n# Write data\nwith s.open():\n s.write_video_chunk(0, chunk)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Ex) Video without audio\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Video config\nframe_rate = 30\nheight = 96\nwidth = 128\n\n# Configure stream\ns = StreamWriter(dst=get_path(\"video.mp4\"))\ns.add_video_stream(frame_rate=frame_rate, height=height, width=width, format=\"rgb24\")\n\n# Generate video chunk (3 seconds)\ntime = int(frame_rate * 3)\nchunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)\n\n# Write data\nwith s.open():\n s.write_video_chunk(0, chunk)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Ex) Video with audio\n\nTo write video with audio, separate streams have to be configured.\nThe first argument of the ``write_audio_chunk`` and ``write_video_chunk`` methods\nis the index of the stream, assigned in the order the streams were added\n(here, ``0`` for audio and ``1`` for video).\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Configure stream\ns = StreamWriter(dst=get_path(\"video.mp4\"))\ns.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)\ns.add_video_stream(frame_rate=frame_rate, height=height, width=width, format=\"rgb24\")\n\n# Generate audio/video chunk (3 seconds)\ntime = int(SAMPLE_RATE * 3)\naudio_chunk = torch.randn((time, NUM_CHANNELS))\ntime = int(frame_rate * 3)\nvideo_chunk = torch.randint(256, (time, 3, height, width), dtype=torch.uint8)\n\n# Write data\nwith s.open():\n s.write_audio_chunk(0, audio_chunk)\n s.write_video_chunk(1, video_chunk)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Writing data chunk by chunk\n\nWhen writing data, it is possible to split the data along the time dimension and\nwrite it in smaller chunks.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Write data in one go\ndst1 = io.BytesIO()\ns = StreamWriter(dst=dst1, format=\"mp3\")\ns.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)\nwith s.open():\n s.write_audio_chunk(0, WAVEFORM)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Write data in smaller chunks\ndst2 = io.BytesIO()\ns = StreamWriter(dst=dst2, format=\"mp3\")\ns.add_audio_stream(SAMPLE_RATE, NUM_CHANNELS)\nwith s.open():\n for start in range(0, NUM_FRAMES, SAMPLE_RATE):\n  end = start + SAMPLE_RATE\n  s.write_audio_chunk(0, WAVEFORM[start:end, ...])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Check that the contents are the same\ndst1.seek(0)\nbytes1 = dst1.read()\n\nprint(f\"bytes1: {len(bytes1)}\")\nprint(f\"{bytes1[:10]}...{bytes1[-10:]}\\n\")\n\ndst2.seek(0)\nbytes2 = dst2.read()\n\nprint(f\"bytes2: {len(bytes2)}\")\nprint(f\"{bytes2[:10]}...{bytes2[-10:]}\\n\")\n\nassert bytes1 == bytes2\n\nimport matplotlib.pyplot as plt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Example - Spectrum Visualizer\n\nIn this section, we use StreamWriter to create a spectrum visualization\nof audio and save it as a video file.\n\nTo create the spectrum visualization, we use\n:py:class:`torchaudio.transforms.Spectrogram` to get the spectral representation\nof the audio, generate raster images of its visualization using matplotlib,\nthen use StreamWriter to convert them to a video with the original audio.\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torchaudio.transforms as T" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare Data\n\nFirst, we prepare the spectrogram data.\nWe use :py:class:`~torchaudio.transforms.Spectrogram`.\n\nWe adjust ``hop_length`` so that one frame of the spectrogram corresponds\nto one video frame.\n\n\n" ] },
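{ "cell_type": "markdown", "metadata": {}, "source": [ "For example (the numbers here are assumptions chosen for illustration):\nwith a 16 kHz sample rate and 20 video frames per second, the hop length\nis 16000 // 20 = 800 samples, so the spectrogram advances by exactly one\nframe per video frame.\n\n.. code::\n\n sample_rate = 16000\n frame_rate = 20\n hop_length = sample_rate // frame_rate # 800 samples per video frame\n assert hop_length * frame_rate == sample_rate\n\n\n" ] },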
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "frame_rate = 20\nn_fft = 4000\n\ntrans = T.Spectrogram(\n n_fft=n_fft,\n hop_length=SAMPLE_RATE // frame_rate, # One FFT per one video frame\n normalized=True,\n power=1,\n)\nspecs = trans(WAVEFORM.T)[0].T" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The resulting spectrogram looks like the following.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "spec_db = T.AmplitudeToDB(stype=\"magnitude\", top_db=80)(specs.T)\n_ = plt.imshow(spec_db, aspect=\"auto\", origin=\"lower\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare Canvas\n\nWe use ``matplotlib`` to visualize the spectrogram per frame.\nWe create a helper function that plots the spectrogram data and\ngenerates a raster image of the figure.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=[3.2, 2.4])\nax.set_position([0, 0, 1, 1])\nax.set_facecolor(\"black\")\nncols, nrows = fig.canvas.get_width_height()\n\n\ndef _plot(data):\n ax.clear()\n x = list(range(len(data)))\n R, G, B = 238 / 255, 76 / 255, 44 / 255\n for coeff, alpha in [(0.8, 0.7), (1, 1)]:\n  d = data**coeff\n  ax.fill_between(x, d, -d, color=[R, G, B, alpha])\n xlim = n_fft // 2 + 1\n ax.set_xlim([-1, xlim])\n ax.set_ylim([-1, 1])\n ax.text(\n xlim,\n 0.95,\n f\"Created with TorchAudio\\n{torchaudio.__version__}\",\n color=\"white\",\n ha=\"right\",\n va=\"top\",\n backgroundcolor=\"black\",\n )\n fig.canvas.draw()\n frame = torch.frombuffer(fig.canvas.tostring_rgb(), dtype=torch.uint8)\n return frame.reshape(nrows, ncols, 3).permute(2, 0, 1)\n\n\n# sphinx_gallery_defer_figures" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Write Video\n\nFinally, we use StreamWriter to write the video.\nWe process one second of audio and video frames at a time.\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = StreamWriter(get_path(\"example.mp4\"))\ns.add_audio_stream(sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)\ns.add_video_stream(frame_rate=frame_rate, height=nrows, width=ncols)\n\nwith s.open():\n i = 0\n # Process by second\n for t in range(0, NUM_FRAMES, SAMPLE_RATE):\n  # Write audio chunk\n  s.write_audio_chunk(0, WAVEFORM[t : t + SAMPLE_RATE, :])\n\n  # write 1 second of video chunk\n  frames = [_plot(spec) for spec in specs[i : i + frame_rate]]\n  if frames:\n   s.write_video_chunk(1, torch.stack(frames))\n  i += frame_rate\n\nplt.close(fig)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Result\n\nThe result looks like the following.\n\n\n\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Video(get_path(\"example.mp4\"), embed=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Watching the video carefully, you can observe that the sound of \"s\"\n(curio\\ **si**\\ ty, be\\ **si**\\ des, thi\\ **s**\\ ) has more energy\nallocated to the higher frequency side (the right side of the video).\n\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Tag: :obj:`torchaudio.io`\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python",
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }