A Quick Look at PyTorch Audio

A few days ago, a work colleague asked me for some advice on machine learning with audio. My quick reply was that I know very little about working with audio data. So, I pointed him to some other ML experts.

I hadn’t even looked at audio data for a long time, so I figured I’d code up a minimal example. I installed the torchaudio library, downloaded the built-in yes-no audio dataset, displayed one sample in its numeric form, displayed the sample as a waveform, displayed it as a spectrogram, and saved the sample as a .wav file so I could play it in a standard media player.

Installing torchaudio was not trivial. I was working with PyTorch 1.10.0, and when I tried a naive “pip install torchaudio” command, pip attempted to install the current version of torchaudio (0.12.1), which requires a newer version of PyTorch than my 1.10.0. So, instead, I downloaded an older version of torchaudio (torchaudio-0.8.0-cp37-none-win_amd64.whl) as a .whl file from the PyPI website and installed it using the command “pip install --no-deps (local .whl file)”. The torchaudio install seemed to work.
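The mismatch comes from torchaudio releases being paired with specific PyTorch releases. Here is a minimal sketch of that pairing, assuming the convention used during the PyTorch 1.4 to 1.13 era, where torchaudio 0.N goes with PyTorch 1.N. The helper name matching_torchaudio_minor is my own invention, patch numbers do not always line up exactly, and the official compatibility table is the real authority. By this rule, a 0.10.x release would presumably have been a closer match for PyTorch 1.10.0 than the 0.8.0 wheel.

```python
def matching_torchaudio_minor(torch_version: str) -> str:
    """Return the torchaudio minor series (e.g. '0.10') assumed to pair
    with a PyTorch 1.x version string such as '1.10.0' or '1.12.1+cu113'.

    Assumption: torchaudio 0.N pairs with PyTorch 1.N for N in 4..13.
    """
    base = torch_version.split("+")[0]          # drop a local tag like +cu113
    major, minor = (int(p) for p in base.split(".")[:2])
    if major != 1 or not 4 <= minor <= 13:
        raise ValueError(f"no simple pairing rule for PyTorch {torch_version}")
    return f"0.{minor}"


suggested = matching_torchaudio_minor("1.10.0")
# Print a pip command that pins torchaudio to the matching minor series.
print(f"pip install torchaudio=={suggested}.*")
```

In practice you would pass in torch.__version__ rather than a hard-coded string.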

However, I did run into a few other install-related glitches before I got my demo to run. I don’t remember the details but there was some sort of missing component(s). By doing a Google search for the error message(s) I was able to eventually get my audio demo to work. Moral: Installing the torchaudio library will almost certainly cause you some minor grief.

The demo code is relatively self-explanatory but has many subtleties. The demo downloads the YesNo audio dataset. The dataset consists of 60 recordings of a man saying “yes” or “no” in Hebrew eight times. The labels are 0 for “no” and 1 for “yes”. The pattern of yes and no is random. For example, recording [3] in the demo is “no, no, yes, no, no, no, yes, no”.
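To make the label encoding concrete, here is a small sketch that turns a list of 0/1 labels into words. The decode_labels helper and the LABEL_NAMES table are hypothetical names used only for illustration, and the example list is the pattern quoted above for recording [3].

```python
# 0 means "no" and 1 means "yes", as described for the YesNo dataset.
LABEL_NAMES = {0: "no", 1: "yes"}


def decode_labels(labels):
    """Convert a sequence of 0/1 labels into a readable string."""
    return ", ".join(LABEL_NAMES[int(v)] for v in labels)


example = [0, 0, 1, 0, 0, 0, 1, 0]   # the pattern quoted for recording [3]
print(decode_labels(example))
```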

Sample [3] of the yes-no audio dataset in .wav format created by the torchaudio.save() method.

The data is downloaded in a form that is derived from the PyTorch Dataset class and so it can be loaded using a DataLoader (but the demo code does not do this).
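For completeness, here is a rough sketch of what DataLoader usage might look like. Because the recordings differ in length, the default batching would fail, so the sketch pads each batch to its longest waveform. To avoid a download, it uses a hypothetical stand-in class, FakeYesNo, that returns (waveform, sample_rate, labels) tuples like the real dataset; the pad_collate function, the tiny waveform lengths, and the 8000 sample rate are all assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class FakeYesNo(Dataset):
    """Hypothetical stand-in that mimics the (waveform, rate, labels) items."""

    def __init__(self, lengths):
        self.lengths = lengths

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, i):
        waveform = torch.zeros(1, self.lengths[i])   # one channel of silence
        labels = [0, 1, 0, 1, 0, 1, 0, 1]            # eight words per item
        return waveform, 8000, labels                # 8000 is a placeholder


def pad_collate(batch):
    """Zero-pad the waveforms in a batch to the longest one."""
    waveforms = [w.squeeze(0) for w, _, _ in batch]
    padded = torch.nn.utils.rnn.pad_sequence(waveforms, batch_first=True)
    labels = torch.tensor([lab for _, _, lab in batch])
    return padded, labels


loader = DataLoader(FakeYesNo([5, 7, 6, 4]), batch_size=2,
                    collate_fn=pad_collate)
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(shapes)
```

With the real dataset, the same collate function should work by passing the YESNO object to DataLoader instead of the stand-in.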

In my opinion, machine learning for audio data has evolved into an area that is distinct and quite different from ML for image data, ML for natural language, and ML for tabular data.

Machine learning with deep neural networks has made incredible advances in the past few years, but even so ML is still in its very early stages. It’s hard to imagine what ML for audio will be like in just a few years. Here are three old album covers from the early 1960s during the first days of the U.S. space program when rockets were crude and rudimentary. Just a few years later, the U.S. put a man on the moon on July 20, 1969.

Demo code:

# audio_demo.py

import torch as T
import torchaudio as TA
import matplotlib.pyplot as plt

# -----------------------------------------------------------

print("\nBegin PyTorch audio demo ")

print("\nLoading the yes-no audio dataset into memory ")
yesno_data = TA.datasets.YESNO(root="./", download=True)

print(type(yesno_data))
input("Press Enter to continue ")  # pause so the type can be inspected

print("\nGetting yes-no sample [3] ")
idx = 3
waveform, sample_rate, labels = yesno_data[idx]

print("\nWaveform data: ")
print(waveform)
print(waveform.shape)

print("\nSample rate: ")
print(sample_rate)

print("\nLabels: ")
print(labels)

print("\nDisplaying data in waveform format ")
plt.figure()
plt.plot(waveform.t().numpy())
plt.show()

print("\nDisplaying data in spectrogram format ")
# default n_fft=400 gives 201 frequency bins (n_fft // 2 + 1)
specgram = TA.transforms.Spectrogram()(waveform)
print("Shape of spectrogram: ", specgram.shape)

plt.figure()
plt.imshow(specgram.log2()[0, :, :].numpy())
plt.show()

print("\nSaving raw waveform data in .wav format file ")
path = "./yes_no_3.wav"  # forward slash works on Windows, Linux, and macOS
TA.save(path, waveform, sample_rate)

print("\nEnd demo ")

# -----------------------------------------------------------