{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Accelerated video decoding with NVDEC\n\n\n**Author**: [Moto Hira](moto@meta.com)_\n\nThis tutorial shows how to use NVIDIA\u2019s hardware video decoder (NVDEC)\nwith TorchAudio, and how it improves the performance of video decoding.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

.. note::\n\n   This tutorial requires FFmpeg libraries compiled with HW\n   acceleration enabled. Please refer to\n   `Enabling GPU video decoder/encoder`\n   for how to build FFmpeg with HW acceleration.

\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\n\nprint(torch.__version__)\nprint(torchaudio.__version__)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\nimport time\n\nimport matplotlib.pyplot as plt\nfrom torchaudio.io import StreamReader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check the prerequisites\n\nFirst, we check that TorchAudio correctly detects FFmpeg libraries\nthat support HW decoder/encoder.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from torchaudio.utils import ffmpeg_utils" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"FFmpeg Library versions:\")\nfor k, ver in ffmpeg_utils.get_versions().items():\n print(f\" {k}:\\t{'.'.join(str(v) for v in ver)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Available NVDEC Decoders:\")\nfor k in ffmpeg_utils.get_video_decoders().keys():\n if \"cuvid\" in k:\n print(f\" - {k}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"Avaialbe GPU:\")\nprint(torch.cuda.get_device_properties(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the following video which has the following properties;\n\n- Codec: H.264\n- Resolution: 960x540\n- FPS: 29.97\n- Pixel format: YUV420P\n\n.. raw:: html\n\n \n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "src = torchaudio.utils.download_asset(\n \"tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decoding videos with NVDEC\n\nTo use HW video decoder, you need to specify the HW decoder when\ndefining the output video stream by passing ``decoder`` option to\n:py:meth:`~torchaudio.io.StreamReader.add_video_stream` method.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = StreamReader(src)\ns.add_video_stream(5, decoder=\"h264_cuvid\")\ns.fill_buffer()\n(video,) = s.pop_chunks()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The video frames are decoded and returned as tensor of NCHW format.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(video.shape, video.dtype)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the decoded frames are sent back to CPU memory, and\nCPU tensors are created.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(video.device)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By specifying ``hw_accel`` option, you can convert the decoded frames\nto CUDA tensor.\n``hw_accel`` option takes string values and pass it\nto :py:class:`torch.device`.\n\n

.. note::\n\n   Currently, the ``hw_accel`` option and\n   :py:meth:`~torchaudio.io.StreamReader.add_basic_video_stream`\n   are not compatible. ``add_basic_video_stream`` adds a post-decoding\n   process, which is designed for frames in CPU memory.\n   Please use :py:meth:`~torchaudio.io.StreamReader.add_video_stream`.

\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "s = StreamReader(src)\ns.add_video_stream(5, decoder=\"h264_cuvid\", hw_accel=\"cuda:0\")\ns.fill_buffer()\n(video,) = s.pop_chunks()\n\nprint(video.shape, video.dtype, video.device)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

.. note::\n\n   When there are multiple GPUs available, ``StreamReader`` by\n   default uses the first GPU. You can change this by providing the\n   ``\"gpu\"`` option.

\n\n.. code::\n\n # Video data is sent to CUDA device 0, decoded and\n # converted on the same device.\n s.add_video_stream(\n ...,\n decoder=\"h264_cuvid\",\n decoder_option={\"gpu\": \"0\"},\n hw_accel=\"cuda:0\",\n )\n\n

.. note::\n\n   The ``\"gpu\"`` option and the ``hw_accel`` option can be specified\n   independently. If they do not match, decoded frames are\n   transferred to the device specified by ``hw_accel``\n   automatically.

\n\n.. code::\n\n # Video data is sent to CUDA device 0, and decoded there.\n # Then it is transfered to CUDA device 1, and converted to\n # CUDA tensor.\n s.add_video_stream(\n ...,\n decoder=\"h264_cuvid\",\n decoder_option={\"gpu\": \"0\"},\n hw_accel=\"cuda:1\",\n )\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n\nLet's look at the frames decoded by HW decoder and compare them\nagainst equivalent results from software decoders.\n\nThe following function seeks into the given timestamp and decode one\nframe with the specificed decoder.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_decode(decoder: str, seek: float):\n s = StreamReader(src)\n s.seek(seek)\n s.add_video_stream(1, decoder=decoder)\n s.fill_buffer()\n (video,) = s.pop_chunks()\n return video[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "timestamps = [12, 19, 45, 131, 180]\n\ncpu_frames = [test_decode(decoder=\"h264\", seek=ts) for ts in timestamps]\ncuda_frames = [test_decode(decoder=\"h264_cuvid\", seek=ts) for ts in timestamps]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

.. note::\n\n   Currently, the HW decoder does not support colorspace conversion,\n   so decoded frames are in YUV format.\n   The ``yuv_to_rgb`` function defined below performs YUV to RGB\n   conversion (and axis shuffling for plotting).

\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def yuv_to_rgb(frames):\n frames = frames.cpu().to(torch.float)\n y = frames[..., 0, :, :]\n u = frames[..., 1, :, :]\n v = frames[..., 2, :, :]\n\n y /= 255\n u = u / 255 - 0.5\n v = v / 255 - 0.5\n\n r = y + 1.14 * v\n g = y + -0.396 * u - 0.581 * v\n b = y + 2.029 * u\n\n rgb = torch.stack([r, g, b], -1)\n rgb = (rgb * 255).clamp(0, 255).to(torch.uint8)\n return rgb.numpy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we visualize the resutls.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n n_rows = len(timestamps)\n fig, axes = plt.subplots(n_rows, 2, figsize=[12.8, 16.0])\n for i in range(n_rows):\n axes[i][0].imshow(yuv_to_rgb(cpu_frames[i]))\n axes[i][1].imshow(yuv_to_rgb(cuda_frames[i]))\n\n axes[0][0].set_title(\"Software decoder\")\n axes[0][1].set_title(\"HW decoder\")\n plt.setp(axes, xticks=[], yticks=[])\n plt.tight_layout()\n\n\nplot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "They are indistinguishable to the eyes of the author.\nFeel free to let us know if you spot something. :)\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HW resizing and cropping\n\nYou can use ``decoder_option`` argument to provide decoder-specific\noptions.\n\nThe following options are often relevant in preprocessing.\n\n- ``resize``: Resize the frame into ``(width)x(height)``.\n- ``crop``: Crop the frame ``(top)x(bottom)x(left)x(right)``.\n Note that the specified values are the amount of rows/columns removed.\n The final image size is ``(width - left - right)x(height - top -bottom)``.\n If ``crop`` and ``resize`` options are used together,\n ``crop`` is performed first.\n\nFor other available options, please run\n``ffmpeg -h decoder=h264_cuvid``.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_options(option):\n s = StreamReader(src)\n s.seek(87)\n s.add_video_stream(1, decoder=\"h264_cuvid\", hw_accel=\"cuda:0\", decoder_option=option)\n s.fill_buffer()\n (video,) = s.pop_chunks()\n print(f\"Option: {option}:\\t{video.shape}\")\n return video[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "original = test_options(option=None)\nresized = test_options(option={\"resize\": \"480x270\"})\ncropped = test_options(option={\"crop\": \"135x135x240x240\"})\ncropped_and_resized = test_options(option={\"crop\": \"135x135x240x240\", \"resize\": \"640x360\"})" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n fig, axes = plt.subplots(2, 2, figsize=[12.8, 9.6])\n axes[0][0].imshow(yuv_to_rgb(original))\n axes[0][1].imshow(yuv_to_rgb(resized))\n axes[1][0].imshow(yuv_to_rgb(cropped))\n axes[1][1].imshow(yuv_to_rgb(cropped_and_resized))\n\n axes[0][0].set_title(\"Original\")\n axes[0][1].set_title(\"Resized\")\n axes[1][0].set_title(\"Cropped\")\n axes[1][1].set_title(\"Cropped and resized\")\n plt.tight_layout()\n return fig\n\n\nplot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparing resizing methods\n\nUnlike software scaling, NVDEC does not provide an option to choose\nthe scaling algorithm.\nIn ML applicatoins, it is often necessary to construct a\npreprocessing pipeline with a 
similar numerical property.\nSo here we compare the result of hardware resizing with software\nresizing of different algorithms.\n\nWe will use the following video, which contains a test pattern\ngenerated using the following command.\n\n.. code::\n\n   ffmpeg -y -f lavfi -t 12.05 -i mptestsrc -movflags +faststart mptestsrc.mp4\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "test_src = torchaudio.utils.download_asset(\"tutorial-assets/mptestsrc.mp4\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function decodes the video and\napplies the specified scaling algorithm.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def decode_resize_ffmpeg(mode, height, width, seek):\n    filter_desc = None if mode is None else f\"scale={width}:{height}:sws_flags={mode}\"\n    s = StreamReader(test_src)\n    s.add_video_stream(1, filter_desc=filter_desc)\n    s.seek(seek)\n    s.fill_buffer()\n    (chunk,) = s.pop_chunks()\n    return chunk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function uses the HW decoder to decode and resize the video.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def decode_resize_cuvid(height, width, seek):\n    s = StreamReader(test_src)\n    s.add_video_stream(1, decoder=\"h264_cuvid\", decoder_option={\"resize\": f\"{width}x{height}\"}, hw_accel=\"cuda:0\")\n    s.seek(seek)\n    s.fill_buffer()\n    (chunk,) = s.pop_chunks()\n    return chunk.cpu()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we execute them and visualize the resulting frames.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "params = {\"height\": 224, \"width\": 224, \"seek\": 3}\n\nframes = [\n    decode_resize_ffmpeg(None, **params),\n    decode_resize_ffmpeg(\"neighbor\", **params),\n    decode_resize_ffmpeg(\"bilinear\", **params),\n    decode_resize_ffmpeg(\"bicubic\", **params),\n    decode_resize_cuvid(**params),\n    decode_resize_ffmpeg(\"spline\", **params),\n    decode_resize_ffmpeg(\"lanczos:param0=1\", **params),\n    decode_resize_ffmpeg(\"lanczos:param0=3\", **params),\n    decode_resize_ffmpeg(\"lanczos:param0=5\", **params),\n]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n    fig, axes = plt.subplots(3, 3, figsize=[12.8, 15.2])\n    for i, f in enumerate(frames):\n        h, w = f.shape[2:4]\n        f = f[..., : h // 4, : w // 4]\n        axes[i // 3][i % 3].imshow(yuv_to_rgb(f[0]))\n    axes[0][0].set_title(\"Original\")\n    axes[0][1].set_title(\"nearest neighbor\")\n    axes[0][2].set_title(\"bilinear\")\n    axes[1][0].set_title(\"bicubic\")\n    axes[1][1].set_title(\"NVDEC\")\n    axes[1][2].set_title(\"spline\")\n    axes[2][0].set_title(\"lanczos(1)\")\n    axes[2][1].set_title(\"lanczos(3)\")\n    axes[2][2].set_title(\"lanczos(5)\")\n\n    plt.setp(axes, xticks=[], yticks=[])\n    plt.tight_layout()\n\n\nplot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "None of them is exactly the same. 
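One way to quantify the differences is to compute the mean absolute error of each resized result against the NVDEC output (a quick sketch reusing the ``frames`` list defined above; the first, unresized entry is skipped, and the comparison happens in YUV space):\n\n.. code::\n\n   ref = frames[4].float()  # the NVDEC result\n   for i, f in enumerate(frames[1:], start=1):\n       print(i, (f.float() - ref).abs().mean().item())\n\n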
To the author's eye, lanczos(1)\nappears to be the most similar to NVDEC.\nBicubic looks close as well.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmark NVDEC with StreamReader\n\nIn this section, we compare the performance of software video\ndecoding and HW video decoding.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decode as CUDA frames\n\nFirst, we compare the time it takes for the software decoder and\nthe hardware decoder to decode the same video.\nTo make the result comparable, when using the software decoder, we move\nthe resulting tensor to CUDA.\n\nThe test procedures are as follows:\n\n- Use the hardware decoder and place the data on CUDA directly.\n- Use the software decoder, generate CPU tensors, and move them to CUDA.\n\n.. note::\n\n   Because the HW decoder currently only supports reading videos as\n   YUV444P format, we decode frames into YUV444P format for the\n   software decoder case as well.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function implements the hardware decoder test case.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_decode_cuda(src, decoder, hw_accel=\"cuda\", frames_per_chunk=5):\n    s = StreamReader(src)\n    s.add_video_stream(frames_per_chunk, decoder=decoder, hw_accel=hw_accel)\n\n    num_frames = 0\n    chunk = None\n    t0 = time.monotonic()\n    for (chunk,) in s.stream():\n        num_frames += chunk.shape[0]\n    elapsed = time.monotonic() - t0\n    print(f\" - Shape: {chunk.shape}\")\n    fps = num_frames / elapsed\n    print(f\" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)\")\n    return fps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function implements the software decoder test case.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_decode_cpu(src, threads, decoder=None, frames_per_chunk=5):\n    s = StreamReader(src)\n    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option={\"threads\": f\"{threads}\"})\n\n    num_frames = 0\n    device = torch.device(\"cuda\")\n    t0 = time.monotonic()\n    for i, (chunk,) in enumerate(s.stream()):\n        if i == 0:\n            print(f\" - Shape: {chunk.shape}\")\n        num_frames += chunk.shape[0]\n        chunk = chunk.to(device)\n    elapsed = time.monotonic() - t0\n    fps = num_frames / elapsed\n    print(f\" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)\")\n    return fps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each video resolution, we run multiple software decoder test\ncases with different numbers of threads.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def run_decode_tests(src, frames_per_chunk=5):\n    fps = []\n    print(f\"Testing: {os.path.basename(src)}\")\n    for threads in [1, 4, 8, 16]:\n        print(f\"* Software decoding (num_threads={threads})\")\n        fps.append(test_decode_cpu(src, threads))\n    print(\"* Hardware decoding\")\n    fps.append(test_decode_cuda(src, decoder=\"h264_cuvid\"))\n    return fps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we run the tests with videos of different resolutions.\n\n## QVGA\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "src_qvga = torchaudio.utils.download_asset(\"tutorial-assets/testsrc2_qvga.h264.mp4\")\nfps_qvga = run_decode_tests(src_qvga)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## VGA\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "src_vga = torchaudio.utils.download_asset(\"tutorial-assets/testsrc2_vga.h264.mp4\")\nfps_vga = run_decode_tests(src_vga)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## XGA\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "src_xga = torchaudio.utils.download_asset(\"tutorial-assets/testsrc2_xga.h264.mp4\")\nfps_xga = run_decode_tests(src_xga)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Result\n\nNow we plot the result.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n    fig, ax = plt.subplots(figsize=[9.6, 6.4])\n\n    for items in zip(fps_qvga, fps_vga, fps_xga, \"ov^sx\"):\n        ax.plot(items[:-1], marker=items[-1])\n    ax.grid(axis=\"both\")\n    ax.set_xticks([0, 1, 2], [\"QVGA (320x240)\", \"VGA (640x480)\", \"XGA (1024x768)\"])\n    ax.legend(\n        [\n            \"Software Decoding (threads=1)\",\n            \"Software Decoding (threads=4)\",\n            \"Software Decoding (threads=8)\",\n            \"Software Decoding (threads=16)\",\n            \"Hardware Decoding (CUDA Tensor)\",\n        ]\n    )\n    ax.set_title(\"Speed of processing video frames\")\n    ax.set_ylabel(\"Frames per second\")\n    plt.tight_layout()\n\n\nplot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe a couple of things:\n\n- Increasing the number of threads in software decoding makes the\n  pipeline faster, but the performance saturates around 8 threads.\n- The performance gain from using the hardware decoder depends on the\n  resolution of the video.\n- At lower resolutions like QVGA, hardware decoding is slower than\n  software decoding.\n- At higher resolutions like XGA, hardware decoding is faster\n  than software decoding.\n\n\nIt is worth noting that the performance gain also depends on the\ntype of GPU.\nWe observed that when decoding VGA videos using V100 or A100 GPUs,\nhardware decoders are slower than software decoders. But when using an A10\nGPU, the hardware decoder is faster than the software decoder.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decode and resize\n\nNext, we add a resize operation to the pipeline.\nWe will compare the following pipelines.\n\n1. Decode the video using the software decoder and read the frames as\n   a PyTorch tensor. 
Resize the tensor using\n   :py:func:`torch.nn.functional.interpolate`, then send\n   the resulting tensor to the CUDA device.\n2. Decode the video using the software decoder, resize the frames with\n   FFmpeg's filter graph, read the resized frames as a PyTorch tensor,\n   then send it to the CUDA device.\n3. Decode and resize the video simultaneously with the HW decoder, and read the\n   resulting frames as CUDA tensors.\n\nPipeline 1 represents common video loading implementations.\n\nPipeline 2 uses FFmpeg's filter graph, which allows manipulating\nraw frames before converting them to tensors.\n\nPipeline 3 has the minimum amount of data transfer from CPU to\nCUDA, which contributes significantly to performant data loading.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function implements pipeline 1. It uses PyTorch's\n:py:func:`torch.nn.functional.interpolate`.\nWe use ``bicubic`` mode, as we saw that the resulting frames are\nclosest to NVDEC resizing.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_decode_then_resize(src, height, width, mode=\"bicubic\", frames_per_chunk=5):\n    s = StreamReader(src)\n    s.add_video_stream(frames_per_chunk, decoder_option={\"threads\": \"8\"})\n\n    num_frames = 0\n    device = torch.device(\"cuda\")\n    chunk = None\n    t0 = time.monotonic()\n    for (chunk,) in s.stream():\n        num_frames += chunk.shape[0]\n        chunk = torch.nn.functional.interpolate(chunk, [height, width], mode=mode, antialias=True)\n        chunk = chunk.to(device)\n    elapsed = time.monotonic() - t0\n    fps = num_frames / elapsed\n    print(f\" - Shape: {chunk.shape}\")\n    print(f\" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)\")\n    return fps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function implements pipeline 2. Frames are resized\nas part of the decoding process, then sent to the CUDA device.\n\nWe use ``bicubic`` mode to make the result comparable with the\nPyTorch-based implementation above.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_decode_and_resize(src, height, width, mode=\"bicubic\", frames_per_chunk=5):\n    s = StreamReader(src)\n    s.add_video_stream(\n        frames_per_chunk, filter_desc=f\"scale={width}:{height}:sws_flags={mode}\", decoder_option={\"threads\": \"8\"}\n    )\n\n    num_frames = 0\n    device = torch.device(\"cuda\")\n    chunk = None\n    t0 = time.monotonic()\n    for (chunk,) in s.stream():\n        num_frames += chunk.shape[0]\n        chunk = chunk.to(device)\n    elapsed = time.monotonic() - t0\n    fps = num_frames / elapsed\n    print(f\" - Shape: {chunk.shape}\")\n    print(f\" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)\")\n    return fps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function implements pipeline 3. 
Resizing is\nperformed by NVDEC and the resulting tensor is placed in CUDA memory.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def test_hw_decode_and_resize(src, decoder, decoder_option, hw_accel=\"cuda\", frames_per_chunk=5):\n    s = StreamReader(src)\n    s.add_video_stream(frames_per_chunk, decoder=decoder, decoder_option=decoder_option, hw_accel=hw_accel)\n\n    num_frames = 0\n    chunk = None\n    t0 = time.monotonic()\n    for (chunk,) in s.stream():\n        num_frames += chunk.shape[0]\n    elapsed = time.monotonic() - t0\n    fps = num_frames / elapsed\n    print(f\" - Shape: {chunk.shape}\")\n    print(f\" - Processed {num_frames} frames in {elapsed:.2f} seconds. ({fps:.2f} fps)\")\n    return fps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function runs the benchmark functions on the given sources.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def run_resize_tests(src):\n    print(f\"Testing: {os.path.basename(src)}\")\n    height, width = 224, 224\n    print(\"* Software decoding with PyTorch interpolate\")\n    cpu_resize1 = test_decode_then_resize(src, height=height, width=width)\n    print(\"* Software decoding with FFmpeg scale\")\n    cpu_resize2 = test_decode_and_resize(src, height=height, width=width)\n    print(\"* Hardware decoding with resize\")\n    cuda_resize = test_hw_decode_and_resize(src, decoder=\"h264_cuvid\", decoder_option={\"resize\": f\"{width}x{height}\"})\n    return [cpu_resize1, cpu_resize2, cuda_resize]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we run the tests.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QVGA\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fps_qvga = run_resize_tests(src_qvga)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## VGA\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fps_vga = run_resize_tests(src_vga)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## XGA\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fps_xga = run_resize_tests(src_xga)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Result\n\nNow we plot the result.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n    fig, ax = plt.subplots(figsize=[9.6, 6.4])\n\n    for items in zip(fps_qvga, fps_vga, fps_xga, \"ov^sx\"):\n        ax.plot(items[:-1], marker=items[-1])\n    ax.grid(axis=\"both\")\n    ax.set_xticks([0, 1, 2], [\"QVGA (320x240)\", \"VGA (640x480)\", \"XGA (1024x768)\"])\n    ax.legend(\n        [\n            \"Software decoding\\nwith resize\\n(PyTorch interpolate)\",\n            \"Software decoding\\nwith resize\\n(FFmpeg scale)\",\n            \"NVDEC\\nwith resizing\",\n        ]\n    )\n    ax.set_title(\"Speed of processing video frames\")\n    ax.set_xlabel(\"Input video resolution\")\n    ax.set_ylabel(\"Frames per second\")\n    plt.tight_layout()\n\n\nplot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The hardware decoder shows a similar trend as in the previous experiment.\nIn fact, the performance is almost the same. Hardware resizing has\nalmost zero overhead for scaling down the frames.\n\nSoftware decoding also shows a similar trend. Performing resizing as\npart of decoding is faster. 
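A hint as to why lies in the pixel format of the source video, which you can inspect with :py:meth:`~torchaudio.io.StreamReader.get_src_stream_info` (a small sketch):\n\n.. code::\n\n   s = StreamReader(src_xga)\n   info = s.get_src_stream_info(s.default_video_stream)\n   print(info.format)  # e.g. \"yuv420p\"\n\n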
One possible explanation is that video\nframes are internally stored as YUV420P, which has half the number\nof pixels compared to RGB24 or YUV444P. This means that if resizing is\nperformed before copying the frame data to a PyTorch tensor, the\nnumber of pixels manipulated and copied is smaller than when resizing\nis applied after the frames are converted to a tensor.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tag: :obj:`torchaudio.io`\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }