{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Audio Feature Augmentation\n\n**Author**: [Moto Hira](moto@meta.com)_\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# When running this tutorial in Google Colab, install the required packages\n# with the following.\n# !pip install torchaudio librosa\n\nimport torch\nimport torchaudio\nimport torchaudio.transforms as T\n\nprint(torch.__version__)\nprint(torchaudio.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import librosa\nimport matplotlib.pyplot as plt\nfrom IPython.display import Audio\nfrom torchaudio.utils import download_asset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we will use a speech data from\n[VOiCES dataset](https://iqtlabs.github.io/voices/)_,\nwhich is licensed under Creative Commos BY 4.0.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "SAMPLE_WAV_SPEECH_PATH = download_asset(\"tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav\")\n\n\ndef _get_sample(path, resample=None):\n effects = [[\"remix\", \"1\"]]\n if resample:\n effects.extend(\n [\n [\"lowpass\", f\"{resample // 2}\"],\n [\"rate\", f\"{resample}\"],\n ]\n )\n return torchaudio.sox_effects.apply_effects_file(path, effects=effects)\n\n\ndef get_speech_sample(*, resample=None):\n return _get_sample(SAMPLE_WAV_SPEECH_PATH, resample=resample)\n\n\ndef get_spectrogram(\n n_fft=400,\n win_len=None,\n hop_len=None,\n power=2.0,\n):\n waveform, _ = get_speech_sample()\n spectrogram = T.Spectrogram(\n n_fft=n_fft,\n win_length=win_len,\n hop_length=hop_len,\n center=True,\n pad_mode=\"reflect\",\n power=power,\n )\n return spectrogram(waveform)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SpecAugment\n\n[SpecAugment](https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html)_\nis a popular spectrogram augmentation technique.\n\n``torchaudio`` implements :py:func:`torchaudio.transforms.TimeStretch`,\n:py:func:`torchaudio.transforms.TimeMasking` and\n:py:func:`torchaudio.transforms.FrequencyMasking`.\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TimeStretch\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "spec = get_spectrogram(power=None)\nstretch = T.TimeStretch()\n\nspec_12 = stretch(spec, overriding_rate=1.2)\nspec_09 = stretch(spec, overriding_rate=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualization\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n def plot_spec(ax, spec, title):\n ax.set_title(title)\n ax.imshow(librosa.amplitude_to_db(spec), origin=\"lower\", aspect=\"auto\")\n\n fig, axes = plt.subplots(3, 1, sharex=True, sharey=True)\n plot_spec(axes[0], torch.abs(spec_12[0]), title=\"Stretched x1.2\")\n plot_spec(axes[1], torch.abs(spec[0]), title=\"Original\")\n plot_spec(axes[2], torch.abs(spec_09[0]), title=\"Stretched x0.9\")\n fig.tight_layout()\n\n\nplot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Audio Samples\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def preview(spec, rate=16000):\n ispec = T.InverseSpectrogram()\n waveform = ispec(spec)\n\n return Audio(waveform[0].numpy().T, rate=rate)\n\n\npreview(spec)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "preview(spec_12)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "preview(spec_09)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time and Frequency Masking\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "torch.random.manual_seed(4)\n\ntime_masking = T.TimeMasking(time_mask_param=80)\nfreq_masking = T.FrequencyMasking(freq_mask_param=80)\n\nspec = get_spectrogram()\ntime_masked = time_masking(spec)\nfreq_masked = freq_masking(spec)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot():\n def plot_spec(ax, spec, title):\n ax.set_title(title)\n ax.imshow(librosa.power_to_db(spec), origin=\"lower\", aspect=\"auto\")\n\n fig, axes = plt.subplots(3, 1, sharex=True, sharey=True)\n plot_spec(axes[0], spec[0], title=\"Original\")\n plot_spec(axes[1], time_masked[0], title=\"Masked along time axis\")\n plot_spec(axes[2], freq_masked[0], title=\"Masked along frequency axis\")\n fig.tight_layout()\n\n\nplot()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 0 }