{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Audio Resampling\n\n**Author**: [Caroline Chen](carolinechen@meta.com)_, [Moto Hira](moto@meta.com)_\n\nThis tutorial shows how to use torchaudio's resampling API.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\nimport torchaudio.functional as F\nimport torchaudio.transforms as T\n\nprint(torch.__version__)\nprint(torchaudio.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n\nFirst, we import the modules and define the helper functions.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import math\nimport timeit\n\nimport librosa\nimport matplotlib.colors as mcolors\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport resampy\nfrom IPython.display import Audio\n\npd.set_option(\"display.max_rows\", None)\npd.set_option(\"display.max_columns\", None)\n\nDEFAULT_OFFSET = 201\n\n\ndef _get_log_freq(sample_rate, max_sweep_rate, offset):\n \"\"\"Get freqs evenly spaced out in log-scale, between [0, max_sweep_rate // 2]\n\n offset is used to avoid negative infinity `log(offset + x)`.\n\n \"\"\"\n start, stop = math.log(offset), math.log(offset + max_sweep_rate // 2)\n return torch.exp(torch.linspace(start, stop, sample_rate, dtype=torch.double)) - offset\n\n\ndef _get_inverse_log_freq(freq, sample_rate, offset):\n \"\"\"Find the time where the given frequency is given by _get_log_freq\"\"\"\n half = sample_rate // 2\n return sample_rate * (math.log(1 + freq / offset) / math.log(1 + half / offset))\n\n\ndef _get_freq_ticks(sample_rate, offset, f_max):\n # Given the original sample rate used for generating the sweep,\n # find the x-axis value where the log-scale major frequency values fall in\n times, freq = [], []\n for exp in range(2, 5):\n for v in range(1, 10):\n f = v * 10**exp\n if f < sample_rate // 2:\n t = _get_inverse_log_freq(f, sample_rate, offset) / sample_rate\n times.append(t)\n freq.append(f)\n t_max = _get_inverse_log_freq(f_max, sample_rate, offset) / sample_rate\n times.append(t_max)\n freq.append(f_max)\n return times, freq\n\n\ndef get_sine_sweep(sample_rate, offset=DEFAULT_OFFSET):\n max_sweep_rate = sample_rate\n freq = _get_log_freq(sample_rate, max_sweep_rate, offset)\n delta = 2 * math.pi * freq / sample_rate\n cummulative = torch.cumsum(delta, dim=0)\n signal = torch.sin(cummulative).unsqueeze(dim=0)\n return signal\n\n\ndef plot_sweep(\n waveform,\n sample_rate,\n title,\n max_sweep_rate=48000,\n offset=DEFAULT_OFFSET,\n):\n x_ticks = [100, 500, 1000, 5000, 10000, 20000, max_sweep_rate // 2]\n y_ticks = [1000, 5000, 10000, 20000, sample_rate // 2]\n\n time, freq = _get_freq_ticks(max_sweep_rate, offset, sample_rate // 2)\n freq_x = [f if f in x_ticks and f <= max_sweep_rate // 2 else None for f in freq]\n freq_y = [f for f in freq if f in y_ticks and 1000 <= f <= sample_rate // 2]\n\n figure, axis = plt.subplots(1, 1)\n _, _, _, cax = axis.specgram(waveform[0].numpy(), Fs=sample_rate)\n plt.xticks(time, freq_x)\n plt.yticks(freq_y, freq_y)\n axis.set_xlabel(\"Original Signal Frequency (Hz, log scale)\")\n axis.set_ylabel(\"Waveform Frequency (Hz)\")\n axis.xaxis.grid(True, alpha=0.67)\n axis.yaxis.grid(True, alpha=0.67)\n figure.suptitle(f\"{title} (sample rate: {sample_rate} Hz)\")\n plt.colorbar(cax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resampling Overview\n\nTo resample an audio waveform from one freqeuncy to another, you can use\n:py:class:`torchaudio.transforms.Resample` or\n:py:func:`torchaudio.functional.resample`.\n``transforms.Resample`` precomputes and caches the kernel used for resampling,\nwhile ``functional.resample`` computes it on the fly, so using\n``torchaudio.transforms.Resample`` will result in a speedup when resampling\nmultiple waveforms using the same parameters (see Benchmarking section).\n\nBoth resampling methods use [bandlimited sinc\ninterpolation](https://ccrma.stanford.edu/~jos/resample/)_ to compute\nsignal values at arbitrary time steps. The implementation involves\nconvolution, so we can take advantage of GPU / multithreading for\nperformance improvements.\n\n

Note

When using resampling in multiple subprocesses, such as data loading\n with multiple worker processes, your application might create more\n threads than your system can handle efficiently.\n Setting ``torch.set_num_threads(1)`` might help in this case.

\n\nBecause a finite number of samples can only represent a finite number of\nfrequencies, resampling does not produce perfect results, and a variety\nof parameters can be used to control for its quality and computational\nspeed. We demonstrate these properties through resampling a logarithmic\nsine sweep, which is a sine wave that increases exponentially in\nfrequency over time.\n\nThe spectrograms below show the frequency representation of the signal,\nwhere the x-axis corresponds to the frequency of the original\nwaveform (in log scale), y-axis the frequency of the\nplotted waveform, and color intensity the amplitude.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sample_rate = 48000\nwaveform = get_sine_sweep(sample_rate)\n\nplot_sweep(waveform, sample_rate, title=\"Original Waveform\")\nAudio(waveform.numpy()[0], rate=sample_rate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we resample (downsample) it.\n\nWe see that in the spectrogram of the resampled waveform, there is an\nartifact, which was not present in the original waveform.\nThis effect is called aliasing.\n[This page](https://music.arts.uci.edu/dobrian/digitalaudio.htm)_ has\nan explanation of how it happens, and why it looks like a reflection.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resample_rate = 32000\nresampler = T.Resample(sample_rate, resample_rate, dtype=waveform.dtype)\nresampled_waveform = resampler(waveform)\n\nplot_sweep(resampled_waveform, resample_rate, title=\"Resampled Waveform\")\nAudio(resampled_waveform.numpy()[0], rate=resample_rate)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Controling resampling quality with parameters\n\n### Lowpass filter width\n\nBecause the filter used for interpolation extends infinitely, the\n``lowpass_filter_width`` parameter is used to control for the width of\nthe filter to use to window the interpolation. It is also referred to as\nthe number of zero crossings, since the interpolation passes through\nzero at every time unit. Using a larger ``lowpass_filter_width``\nprovides a sharper, more precise filter, but is more computationally\nexpensive.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sample_rate = 48000\nresample_rate = 32000\n\nresampled_waveform = F.resample(waveform, sample_rate, resample_rate, lowpass_filter_width=6)\nplot_sweep(resampled_waveform, resample_rate, title=\"lowpass_filter_width=6\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resampled_waveform = F.resample(waveform, sample_rate, resample_rate, lowpass_filter_width=128)\nplot_sweep(resampled_waveform, resample_rate, title=\"lowpass_filter_width=128\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rolloff\n\nThe ``rolloff`` parameter is represented as a fraction of the Nyquist\nfrequency, which is the maximal frequency representable by a given\nfinite sample rate. ``rolloff`` determines the lowpass filter cutoff and\ncontrols the degree of aliasing, which takes place when frequencies\nhigher than the Nyquist are mapped to lower frequencies. A lower rolloff\nwill therefore reduce the amount of aliasing, but it will also reduce\nsome of the higher frequencies.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sample_rate = 48000\nresample_rate = 32000\n\nresampled_waveform = F.resample(waveform, sample_rate, resample_rate, rolloff=0.99)\nplot_sweep(resampled_waveform, resample_rate, title=\"rolloff=0.99\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resampled_waveform = F.resample(waveform, sample_rate, resample_rate, rolloff=0.8)\nplot_sweep(resampled_waveform, resample_rate, title=\"rolloff=0.8\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Window function\n\nBy default, ``torchaudio``\u2019s resample uses the Hann window filter, which is\na weighted cosine function. It additionally supports the Kaiser window,\nwhich is a near optimal window function that contains an additional\n``beta`` parameter that allows for the design of the smoothness of the\nfilter and width of impulse. This can be controlled using the\n``resampling_method`` parameter.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sample_rate = 48000\nresample_rate = 32000\n\nresampled_waveform = F.resample(waveform, sample_rate, resample_rate, resampling_method=\"sinc_interp_hann\")\nplot_sweep(resampled_waveform, resample_rate, title=\"Hann Window Default\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resampled_waveform = F.resample(waveform, sample_rate, resample_rate, resampling_method=\"sinc_interp_kaiser\")\nplot_sweep(resampled_waveform, resample_rate, title=\"Kaiser Window Default\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Comparison against librosa\n\n``torchaudio``\u2019s resample function can be used to produce results similar to\nthat of librosa (resampy)\u2019s kaiser window resampling, with some noise\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sample_rate = 48000\nresample_rate = 32000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### kaiser_best\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resampled_waveform = F.resample(\n waveform,\n sample_rate,\n resample_rate,\n lowpass_filter_width=64,\n rolloff=0.9475937167399596,\n resampling_method=\"sinc_interp_kaiser\",\n beta=14.769656459379492,\n)\nplot_sweep(resampled_waveform, resample_rate, title=\"Kaiser Window Best (torchaudio)\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "librosa_resampled_waveform = torch.from_numpy(\n librosa.resample(waveform.squeeze().numpy(), orig_sr=sample_rate, target_sr=resample_rate, res_type=\"kaiser_best\")\n).unsqueeze(0)\nplot_sweep(librosa_resampled_waveform, resample_rate, title=\"Kaiser Window Best (librosa)\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mse = torch.square(resampled_waveform - librosa_resampled_waveform).mean().item()\nprint(\"torchaudio and librosa kaiser best MSE:\", mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### kaiser_fast\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "resampled_waveform = F.resample(\n waveform,\n sample_rate,\n resample_rate,\n lowpass_filter_width=16,\n rolloff=0.85,\n resampling_method=\"sinc_interp_kaiser\",\n beta=8.555504641634386,\n)\nplot_sweep(resampled_waveform, resample_rate, title=\"Kaiser Window Fast (torchaudio)\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "librosa_resampled_waveform = torch.from_numpy(\n librosa.resample(waveform.squeeze().numpy(), orig_sr=sample_rate, target_sr=resample_rate, res_type=\"kaiser_fast\")\n).unsqueeze(0)\nplot_sweep(librosa_resampled_waveform, resample_rate, title=\"Kaiser Window Fast (librosa)\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mse = torch.square(resampled_waveform - librosa_resampled_waveform).mean().item()\nprint(\"torchaudio and librosa kaiser fast MSE:\", mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance Benchmarking\n\nBelow are benchmarks for downsampling and upsampling waveforms between\ntwo pairs of sampling rates. We demonstrate the performance implications\nthat the ``lowpass_filter_width``, window type, and sample rates can\nhave. Additionally, we provide a comparison against ``librosa``\\ \u2019s\n``kaiser_best`` and ``kaiser_fast`` using their corresponding parameters\nin ``torchaudio``.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(f\"torchaudio: {torchaudio.__version__}\")\nprint(f\"librosa: {librosa.__version__}\")\nprint(f\"resampy: {resampy.__version__}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def benchmark_resample_functional(\n waveform,\n sample_rate,\n resample_rate,\n lowpass_filter_width=6,\n rolloff=0.99,\n resampling_method=\"sinc_interp_hann\",\n beta=None,\n iters=5,\n):\n return (\n timeit.timeit(\n stmt=\"\"\"\ntorchaudio.functional.resample(\n waveform,\n sample_rate,\n resample_rate,\n lowpass_filter_width=lowpass_filter_width,\n rolloff=rolloff,\n resampling_method=resampling_method,\n beta=beta,\n)\n \"\"\",\n setup=\"import torchaudio\",\n number=iters,\n globals=locals(),\n )\n * 1000\n / iters\n )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def benchmark_resample_transforms(\n waveform,\n sample_rate,\n resample_rate,\n lowpass_filter_width=6,\n rolloff=0.99,\n resampling_method=\"sinc_interp_hann\",\n beta=None,\n iters=5,\n):\n return (\n timeit.timeit(\n stmt=\"resampler(waveform)\",\n setup=\"\"\"\nimport torchaudio\n\nresampler = torchaudio.transforms.Resample(\n sample_rate,\n resample_rate,\n lowpass_filter_width=lowpass_filter_width,\n rolloff=rolloff,\n resampling_method=resampling_method,\n dtype=waveform.dtype,\n beta=beta,\n)\nresampler.to(waveform.device)\n \"\"\",\n number=iters,\n globals=locals(),\n )\n * 1000\n / iters\n )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def benchmark_resample_librosa(\n waveform,\n sample_rate,\n resample_rate,\n res_type=None,\n iters=5,\n):\n waveform_np = waveform.squeeze().numpy()\n return (\n timeit.timeit(\n stmt=\"\"\"\nlibrosa.resample(\n waveform_np,\n orig_sr=sample_rate,\n target_sr=resample_rate,\n res_type=res_type,\n)\n \"\"\",\n setup=\"import librosa\",\n number=iters,\n globals=locals(),\n )\n * 1000\n / iters\n )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def benchmark(sample_rate, resample_rate):\n times, rows = [], []\n waveform = get_sine_sweep(sample_rate).to(torch.float32)\n\n args = (waveform, sample_rate, resample_rate)\n\n # sinc 64 zero-crossings\n f_time = benchmark_resample_functional(*args, lowpass_filter_width=64)\n t_time = benchmark_resample_transforms(*args, lowpass_filter_width=64)\n times.append([None, f_time, t_time])\n rows.append(\"sinc (width 64)\")\n\n # sinc 6 zero-crossings\n f_time = benchmark_resample_functional(*args, lowpass_filter_width=16)\n t_time = benchmark_resample_transforms(*args, lowpass_filter_width=16)\n times.append([None, f_time, t_time])\n rows.append(\"sinc (width 16)\")\n\n # kaiser best\n kwargs = {\n \"lowpass_filter_width\": 64,\n \"rolloff\": 0.9475937167399596,\n \"resampling_method\": \"sinc_interp_kaiser\",\n \"beta\": 14.769656459379492,\n }\n lib_time = benchmark_resample_librosa(*args, res_type=\"kaiser_best\")\n f_time = benchmark_resample_functional(*args, **kwargs)\n t_time = benchmark_resample_transforms(*args, **kwargs)\n times.append([lib_time, f_time, t_time])\n rows.append(\"kaiser_best\")\n\n # kaiser fast\n kwargs = {\n \"lowpass_filter_width\": 16,\n \"rolloff\": 0.85,\n \"resampling_method\": \"sinc_interp_kaiser\",\n \"beta\": 8.555504641634386,\n }\n lib_time = benchmark_resample_librosa(*args, res_type=\"kaiser_fast\")\n f_time = benchmark_resample_functional(*args, **kwargs)\n t_time = benchmark_resample_transforms(*args, **kwargs)\n times.append([lib_time, f_time, t_time])\n rows.append(\"kaiser_fast\")\n\n df = pd.DataFrame(times, columns=[\"librosa\", \"functional\", \"transforms\"], index=rows)\n return df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot(df):\n print(df.round(2))\n ax = df.plot(kind=\"bar\")\n plt.ylabel(\"Time Elapsed [ms]\")\n plt.xticks(rotation=0, fontsize=10)\n for cont, col, color in zip(ax.containers, df.columns, mcolors.TABLEAU_COLORS):\n label = [\"N/A\" if v != v else str(v) for v in df[col].round(2)]\n ax.bar_label(cont, labels=label, color=color, fontweight=\"bold\", fontsize=\"x-small\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downsample (48 -> 44.1 kHz)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = benchmark(48_000, 44_100)\nplot(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downsample (16 -> 8 kHz)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = benchmark(16_000, 8_000)\nplot(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upsample (44.1 -> 48 kHz)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = benchmark(44_100, 48_000)\nplot(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upsample (8 -> 16 kHz)\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = benchmark(8_000, 16_000)\nplot(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary\n\nTo elaborate on the results:\n\n- a larger ``lowpass_filter_width`` results in a larger resampling kernel,\n and therefore increases computation time for both the kernel computation\n and convolution\n- using ``sinc_interp_kaiser`` results in longer computation times than the default\n ``sinc_interp_hann`` because it is more complex to compute the intermediate\n window values\n- a large GCD between the sample and resample rate will result\n in a simplification that allows for a smaller kernel and faster kernel computation.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }