{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Audio Feature Extractions\n\n**Author**: [Moto Hira](moto@meta.com)_\n\n``torchaudio`` implements feature extractions commonly used in the audio\ndomain. They are available in ``torchaudio.functional`` and\n``torchaudio.transforms``.\n\n``functional`` implements features as standalone functions.\nThey are stateless.\n\n``transforms`` implements features as objects,\nusing implementations from ``functional`` and ``torch.nn.Module``.\nThey can be serialized using TorchScript.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torchaudio\nimport torchaudio.functional as F\nimport torchaudio.transforms as T\n\nprint(torch.__version__)\nprint(torchaudio.__version__)\n\nimport librosa\nimport matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview of audio features\n\nThe following diagram shows the relationship between common audio features\nand torchaudio APIs to generate them.\n\n\n\nFor the complete list of available features, please refer to the\ndocumentation.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n\n
<div class=\"alert alert-info\"><h4>Note</h4><p>When running this tutorial in Google Colab, install the required packages by running:</p>\n<p><code>!pip install librosa</code></p></div>
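\nAs mentioned in the introduction, each object in ``torchaudio.transforms`` is a\n``torch.nn.Module`` built on top of a stateless function in\n``torchaudio.functional``, and it can be serialized with TorchScript.\nA minimal sketch of that relationship (illustrative only; it is not executed in\nthis tutorial, the parameter values are arbitrary, and ``waveform`` stands for\nany ``(channel, time)`` tensor):\n\n```python\ntransform = T.Spectrogram(n_fft=512)    # stateful torch.nn.Module\nscripted = torch.jit.script(transform)  # serializable with TorchScript\n\n# roughly the equivalent stateless call\nspec = F.spectrogram(\n    waveform,\n    pad=0,\n    window=torch.hann_window(512),\n    n_fft=512,\n    hop_length=256,\n    win_length=512,\n    power=2.0,\n    normalized=False,\n)\n```\n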
\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from IPython.display import Audio\nfrom matplotlib.patches import Rectangle\nfrom torchaudio.utils import download_asset\n\ntorch.random.manual_seed(0)\n\nSAMPLE_SPEECH = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav\")\n\n\ndef plot_waveform(waveform, sr, title=\"Waveform\", ax=None):\n    waveform = waveform.numpy()\n\n    num_channels, num_frames = waveform.shape\n    time_axis = torch.arange(0, num_frames) / sr\n\n    if ax is None:\n        _, ax = plt.subplots(num_channels, 1)\n    ax.plot(time_axis, waveform[0], linewidth=1)\n    ax.grid(True)\n    ax.set_xlim([0, time_axis[-1]])\n    ax.set_title(title)\n\n\ndef plot_spectrogram(specgram, title=None, ylabel=\"freq_bin\", ax=None):\n    if ax is None:\n        _, ax = plt.subplots(1, 1)\n    if title is not None:\n        ax.set_title(title)\n    ax.set_ylabel(ylabel)\n    ax.imshow(librosa.power_to_db(specgram), origin=\"lower\", aspect=\"auto\", interpolation=\"nearest\")\n\n\ndef plot_fbank(fbank, title=None):\n    fig, axs = plt.subplots(1, 1)\n    axs.set_title(title or \"Filter bank\")\n    axs.imshow(fbank, aspect=\"auto\")\n    axs.set_ylabel(\"frequency bin\")\n    axs.set_xlabel(\"mel bin\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Spectrogram\n\nTo get the frequency make-up of an audio signal as it varies with time,\nyou can use :py:func:`torchaudio.transforms.Spectrogram`.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load audio\nSPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_SPEECH)\n\n# Define transform\nspectrogram = T.Spectrogram(n_fft=512)\n\n# Perform transform\nspec = spectrogram(SPEECH_WAVEFORM)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, axs = plt.subplots(2, 1)\nplot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title=\"Original waveform\", ax=axs[0])\nplot_spectrogram(spec[0], title=\"spectrogram\", ax=axs[1])\nfig.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Audio(SPEECH_WAVEFORM.numpy(), rate=SAMPLE_RATE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The effect of the ``n_fft`` parameter\n\nThe core of spectrogram computation is the (short-time) Fourier transform,\nand the ``n_fft`` parameter corresponds to the $N$ in the following\ndefinition of the discrete Fourier transform.\n\n$$ X_k = \\\\sum_{n=0}^{N-1} x_n e^{-\\\\frac{2\\\\pi i}{N} nk} $$\n\n(For details on the Fourier transform, please refer to\n[Wikipedia](https://en.wikipedia.org/wiki/Fast_Fourier_transform).)\n\nThe value of ``n_fft`` determines the resolution of the frequency axis.\nHowever, with a higher ``n_fft`` value, the energy is distributed across\nmore bins, so when you visualize it, the spectrogram might look more blurry\neven though the frequency resolution is higher.\n\nThe following illustrates this.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
<div class=\"alert alert-info\"><h4>Note</h4><p>``hop_length`` determines the resolution of the time axis.\nBy default (i.e. ``hop_length=None`` and ``win_length=None``),\nthe value of ``win_length // 2`` is used, where ``win_length`` defaults to ``n_fft``.\nHere we use the same ``hop_length`` value across different ``n_fft``\nso that they have the same number of elements in the time axis.</p></div>
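\nAs a quick check of both points (an addition to this tutorial; it assumes the\nvariables defined in the Preparation section), the output of ``T.Spectrogram``\nhas ``n_fft // 2 + 1`` frequency bins, and with a fixed ``hop_length`` the\nnumber of time frames does not depend on ``n_fft``:\n\n```python\nfor n_fft in [32, 128, 512, 2048]:\n    spec = T.Spectrogram(n_fft=n_fft, hop_length=64)(SPEECH_WAVEFORM)\n    # the frequency axis grows with n_fft while the time axis stays the same\n    print(f\"n_fft={n_fft:4d}: {tuple(spec.shape[1:])}  (freq bins, time frames)\")\n```\n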
\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_ffts = [32, 128, 512, 2048]\nhop_length = 64\n\nspecs = []\nfor n_fft in n_ffts:\n    spectrogram = T.Spectrogram(n_fft=n_fft, hop_length=hop_length)\n    spec = spectrogram(SPEECH_WAVEFORM)\n    specs.append(spec)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, axs = plt.subplots(len(specs), 1, sharex=True)\nfor i, (spec, n_fft) in enumerate(zip(specs, n_ffts)):\n    plot_spectrogram(spec[0], ylabel=f\"n_fft={n_fft}\", ax=axs[i])\n    axs[i].set_xlabel(None)\nfig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When comparing signals, it is desirable to use the same sampling rate.\nHowever, if you must use different sampling rates, care must be\ntaken when interpreting the meaning of ``n_fft``.\nRecall that ``n_fft`` determines the resolution of the frequency\naxis for a given sampling rate. In other words, what each bin on\nthe frequency axis represents depends on the sampling rate.\n\nAs we have seen above, changing the value of ``n_fft`` does not change\nthe covered frequency range for the same input signal.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's downsample the audio and apply a spectrogram with the same ``n_fft``\nvalue.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Downsample to half of the original sample rate\nspeech2 = torchaudio.functional.resample(SPEECH_WAVEFORM, SAMPLE_RATE, SAMPLE_RATE // 2)\n# Upsample to the original sample rate\nspeech3 = torchaudio.functional.resample(speech2, SAMPLE_RATE // 2, SAMPLE_RATE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Apply the same spectrogram\nspectrogram = T.Spectrogram(n_fft=512)\n\nspec0 = spectrogram(SPEECH_WAVEFORM)\nspec2 = spectrogram(speech2)\nspec3 = spectrogram(speech3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Visualize it\nfig, axs = plt.subplots(3, 1)\nplot_spectrogram(spec0[0], ylabel=\"Original\", ax=axs[0])\naxs[0].add_patch(Rectangle((0, 3), 212, 128, edgecolor=\"r\", facecolor=\"none\"))\nplot_spectrogram(spec2[0], ylabel=\"Downsampled\", ax=axs[1])\nplot_spectrogram(spec3[0], ylabel=\"Upsampled\", ax=axs[2])\nfig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above visualization, the second plot (\"Downsampled\") might\ngive the impression that the spectrogram is stretched.\nThis is because the meaning of the frequency bins is different from\nthat of the original.\nEven though they have the same number of bins, in the second plot\nthe frequency axis only covers half of the frequency range of the\noriginal.\nThis becomes clearer if we resample the downsampled signal again\nso that it has the same sample rate as the original.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GriffinLim\n\nTo recover a waveform from a spectrogram, you can use\n:py:class:`torchaudio.transforms.GriffinLim`.\n\nThe same set of parameters used for the spectrogram must be used.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Define transforms\nn_fft = 1024\nspectrogram = T.Spectrogram(n_fft=n_fft)\ngriffin_lim = T.GriffinLim(n_fft=n_fft)\n\n# 
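Griffin-Lim iteratively estimates the phase from the magnitude spectrogram\n# and applies the inverse STFT to recover a waveform.\n# 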
Apply the transforms\nspec = spectrogram(SPEECH_WAVEFORM)\nreconstructed_waveform = griffin_lim(spec)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "_, axes = plt.subplots(2, 1, sharex=True, sharey=True)\nplot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title=\"Original\", ax=axes[0])\nplot_waveform(reconstructed_waveform, SAMPLE_RATE, title=\"Reconstructed\", ax=axes[1])\nAudio(reconstructed_waveform, rate=SAMPLE_RATE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mel Filter Bank\n\n:py:func:`torchaudio.functional.melscale_fbanks` generates the filter bank\nfor converting frequency bins to mel-scale bins.\n\nSince this function does not require input audio/features, there is no\nequivalent transform in :py:func:`torchaudio.transforms`.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_fft = 256\nn_mels = 64\nsample_rate = 6000\n\nmel_filters = F.melscale_fbanks(\n int(n_fft // 2 + 1),\n n_mels=n_mels,\n f_min=0.0,\n f_max=sample_rate / 2.0,\n sample_rate=sample_rate,\n norm=\"slaney\",\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_fbank(mel_filters, \"Mel Filter Bank - torchaudio\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison against librosa\n\nFor reference, here is the equivalent way to get the mel filter bank\nwith ``librosa``.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "mel_filters_librosa = librosa.filters.mel(\n sr=sample_rate,\n n_fft=n_fft,\n n_mels=n_mels,\n fmin=0.0,\n fmax=sample_rate / 2.0,\n norm=\"slaney\",\n htk=True,\n).T" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_fbank(mel_filters_librosa, \"Mel Filter Bank - librosa\")\n\nmse = torch.square(mel_filters - mel_filters_librosa).mean().item()\nprint(\"Mean Square Difference: \", mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MelSpectrogram\n\nGenerating a mel-scale spectrogram involves generating a spectrogram\nand performing mel-scale conversion. 
In ``torchaudio``,\n:py:func:`torchaudio.transforms.MelSpectrogram` provides\nthis functionality.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_fft = 1024\nwin_length = None\nhop_length = 512\nn_mels = 128\n\nmel_spectrogram = T.MelSpectrogram(\n sample_rate=sample_rate,\n n_fft=n_fft,\n win_length=win_length,\n hop_length=hop_length,\n center=True,\n pad_mode=\"reflect\",\n power=2.0,\n norm=\"slaney\",\n n_mels=n_mels,\n mel_scale=\"htk\",\n)\n\nmelspec = mel_spectrogram(SPEECH_WAVEFORM)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_spectrogram(melspec[0], title=\"MelSpectrogram - torchaudio\", ylabel=\"mel freq\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison against librosa\n\nFor reference, here is the equivalent means of generating mel-scale\nspectrograms with ``librosa``.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "melspec_librosa = librosa.feature.melspectrogram(\n y=SPEECH_WAVEFORM.numpy()[0],\n sr=sample_rate,\n n_fft=n_fft,\n hop_length=hop_length,\n win_length=win_length,\n center=True,\n pad_mode=\"reflect\",\n power=2.0,\n n_mels=n_mels,\n norm=\"slaney\",\n htk=True,\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_spectrogram(melspec_librosa, title=\"MelSpectrogram - librosa\", ylabel=\"mel freq\")\n\nmse = torch.square(melspec - melspec_librosa).mean().item()\nprint(\"Mean Square Difference: \", mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MFCC\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_fft = 2048\nwin_length = None\nhop_length = 512\nn_mels = 256\nn_mfcc = 256\n\nmfcc_transform = T.MFCC(\n sample_rate=sample_rate,\n n_mfcc=n_mfcc,\n melkwargs={\n \"n_fft\": n_fft,\n \"n_mels\": n_mels,\n \"hop_length\": hop_length,\n \"mel_scale\": \"htk\",\n },\n)\n\nmfcc = mfcc_transform(SPEECH_WAVEFORM)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_spectrogram(mfcc[0], title=\"MFCC\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison against librosa\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "melspec = librosa.feature.melspectrogram(\n y=SPEECH_WAVEFORM.numpy()[0],\n sr=sample_rate,\n n_fft=n_fft,\n win_length=win_length,\n hop_length=hop_length,\n n_mels=n_mels,\n htk=True,\n norm=None,\n)\n\nmfcc_librosa = librosa.feature.mfcc(\n S=librosa.core.spectrum.power_to_db(melspec),\n n_mfcc=n_mfcc,\n dct_type=2,\n norm=\"ortho\",\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_spectrogram(mfcc_librosa, title=\"MFCC (librosa)\")\n\nmse = torch.square(mfcc - mfcc_librosa).mean().item()\nprint(\"Mean Square Difference: \", mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LFCC\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "n_fft = 2048\nwin_length = None\nhop_length = 512\nn_lfcc = 256\n\nlfcc_transform = T.LFCC(\n sample_rate=sample_rate,\n n_lfcc=n_lfcc,\n speckwargs={\n \"n_fft\": n_fft,\n 
\"win_length\": win_length,\n \"hop_length\": hop_length,\n },\n)\n\nlfcc = lfcc_transform(SPEECH_WAVEFORM)\nplot_spectrogram(lfcc[0], title=\"LFCC\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pitch\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pitch = F.detect_pitch_frequency(SPEECH_WAVEFORM, SAMPLE_RATE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot_pitch(waveform, sr, pitch):\n figure, axis = plt.subplots(1, 1)\n axis.set_title(\"Pitch Feature\")\n axis.grid(True)\n\n end_time = waveform.shape[1] / sr\n time_axis = torch.linspace(0, end_time, waveform.shape[1])\n axis.plot(time_axis, waveform[0], linewidth=1, color=\"gray\", alpha=0.3)\n\n axis2 = axis.twinx()\n time_axis = torch.linspace(0, end_time, pitch.shape[1])\n axis2.plot(time_axis, pitch[0], linewidth=2, label=\"Pitch\", color=\"green\")\n\n axis2.legend(loc=0)\n\n\nplot_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }