
[PyTorch] Tutorial(6) Audio Processing Module: torchaudio

Last Updated on 2021-05-12 by Clay

Just as torchvision is the PyTorch module that specializes in processing images, torchaudio, which I am recording today, is the PyTorch module that specializes in processing audio.

Text, images, audio, and more: PyTorch really is a convenient deep learning framework!

As always, the official tutorial is attached: PyTorch tutorial


Introduction to torchaudio

If you want to use torchaudio, you need to use the following command to install it.

pip3 install torchaudio

It may take some time.


First, we need to import the packages and modules we need.

# -*- coding: utf-8 -*-
import torchaudio
import matplotlib.pyplot as plt
import requests



This is the official test sound file; you can also use your own. I save the file as test.wav. (According to the official instructions, torchaudio only supports .wav and .mp3.)

# Test music
fileName = 'test.wav'
url = "https://pytorch.org/tutorials//_static/img/steam-train-whistle-daniel_simon-converted-from-mp3.wav"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed

# Save
with open(fileName, 'wb') as f:
    f.write(r.content)



Draw a waveform graph

waveform, sample_rate = torchaudio.load(fileName)

print('Shape of waveform: {}'.format(waveform.size()))
print('Sample rate of waveform: {}'.format(sample_rate))

plt.figure()
plt.plot(waveform.t().numpy())
plt.show()



Output:

Shape of waveform: torch.Size([2, 276858])
Sample rate of waveform: 44100

You don't strictly need numpy() here; matplotlib can usually plot the tensor directly. The most important step is t(), which transposes the waveform from [channels, samples] to [samples, channels]. plt.plot draws one line per column, so after transposing we get one line per channel, which is the waveform we want.
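To make the effect of the transpose concrete, here is a minimal sketch in plain Python. The tiny nested list is a hypothetical stand-in for a stereo waveform, not real audio, and zip(*...) only mimics what t() does on a tensor.

```python
# Hypothetical stand-in for a stereo waveform: 2 channels x 4 samples,
# the same [channel, sample] layout as the loaded waveform.
channels_first = [
    [0.1, 0.2, 0.3, 0.4],  # channel 0
    [0.5, 0.6, 0.7, 0.8],  # channel 1
]

# Transpose to [sample, channel]; this mirrors what waveform.t() does.
samples_first = [list(frame) for frame in zip(*channels_first)]

print(len(channels_first), len(channels_first[0]))  # rows, columns before
print(len(samples_first), len(samples_first[0]))    # rows, columns after
```

After the transpose, each column holds one whole channel, which is the layout a column-per-line plot needs.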

  • waveform: According to the explanation on the official website, this is the "original audio signal"
  • sample_rate: The sampling rate, here 44100 Hz, which is the sampling rate used for CDs.
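The two values fit together in a simple way: dividing the number of samples per channel by the sampling rate gives the clip length in seconds. A small sketch, with the numbers copied from the output above:

```python
# Values copied from the printed shape and sample rate above.
num_channels, num_frames = 2, 276858
sample_rate = 44100  # samples per second, per channel

duration_seconds = num_frames / sample_rate
print('Channels: {}'.format(num_channels))
print('Duration: {:.2f} s'.format(duration_seconds))
```

So the test file is a stereo clip a little over six seconds long.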


In addition, we can also resample:

new_sample_rate = sample_rate // 10  # keep the rate an integer: 4410 Hz
channel = 0

# Build the transform once, then apply it to a single channel
resample = torchaudio.transforms.Resample(sample_rate, new_sample_rate)
transformed = resample(waveform[channel, :].view(1, -1))
print('Shape of transformed waveform:', transformed.size())

plt.figure()
plt.plot(transformed[0, :].numpy())
plt.show()



Output:

Shape of transformed waveform: torch.Size([1, 27686])
tensor([ 4.5531e-03,  1.6837e-02,  8.0987e-03,  …, -5.0898e-06, 6.0601e-06,  2.6707e-05])
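The new length can be checked by hand. This is a rough sketch that assumes the resampled length is the original length scaled by the ratio of the two rates and rounded up; the rounding rule is my assumption, though it agrees with the shape printed above.

```python
import math

orig_rate = 44100
new_rate = orig_rate // 10   # 4410 Hz
num_frames = 276858          # samples per channel before resampling

# Assumed rule: scale the length by new_rate / orig_rate and round up.
expected_frames = math.ceil(num_frames * new_rate / orig_rate)
print(expected_frames)
```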

Spectrogram

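A spectrum, as the name implies, represents a "time domain" signal in the "frequency domain", usually by way of the "Fourier transform"; it is normally plotted with frequency on the x-axis and amplitude on the y-axis. A spectrogram applies that transform to short, successive slices of the signal, so it shows how the spectrum changes over time: time on the x-axis, frequency on the y-axis, and intensity as color.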
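To show the idea without any library, here is a deliberately naive sketch: cut the signal into overlapping frames and take a discrete Fourier transform of each one. The function name naive_spectrogram, the test tone, and the frame sizes are all made up for illustration, and this is not how torchaudio implements it.

```python
import cmath
import math


def naive_spectrogram(signal, n_fft, hop):
    """Return power values indexed as [frequency_bin][frame]."""
    n_bins = n_fft // 2 + 1
    starts = range(0, len(signal) - n_fft + 1, hop)
    spec = [[0.0] * len(starts) for _ in range(n_bins)]
    for t, start in enumerate(starts):
        frame = signal[start:start + n_fft]
        for k in range(n_bins):
            # Discrete Fourier transform of one frame at frequency bin k.
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n, x in enumerate(frame)
            )
            spec[k][t] = abs(coeff) ** 2
    return spec


# Hypothetical test tone: a 1000 Hz sine sampled at 8000 Hz for 0.1 s.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(800)]
spec = naive_spectrogram(tone, n_fft=64, hop=32)

# Find the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec)), key=lambda k: spec[k][0])
print(len(spec), len(spec[0]))   # frequency bins, frames
print(peak_bin * rate / 64)      # frequency of the strongest bin, in Hz
```

torchaudio's Spectrogram does the same job far more efficiently, and it also applies a window and pads the signal, so its numbers will differ from this sketch.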

# Spectrogram
specgram = torchaudio.transforms.Spectrogram()(waveform)
print('Shape of spectrogram:', specgram.size())

plt.figure()
plt.imshow(specgram.log2()[0, :, :].numpy())
plt.show()



Output:

Shape of spectrogram: torch.Size([2, 201, 1385])
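The shape that Spectrogram returns for this file can be worked out by hand. The sketch below assumes the transform's default settings, namely an FFT size of 400, a hop of half that, and centered (padded) frames; if you pass other arguments, the numbers change accordingly.

```python
num_channels, num_frames = 2, 276858  # shape of the loaded waveform

n_fft = 400               # assumed default FFT size
hop_length = n_fft // 2   # assumed default hop between frames

freq_bins = n_fft // 2 + 1                  # one-sided spectrum
time_frames = num_frames // hop_length + 1  # frame count with centered padding

print((num_channels, freq_bins, time_frames))
```

Under those assumptions the result is [channel, frequency bin, time frame] = (2, 201, 1385).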

Additional Notes

Basically, I only followed the few functions that I might use. torchaudio provides many other transforms, and I think I will find time to write them up completely on their own rather than mix them into this brief introduction to visualizing audio files.

The official PyTorch tutorial also covers plotting a Mel spectrogram and more, but I may not use those for the time being.

It is worth mentioning that the official tutorial brings up Kaldi. Awful, I am completely unfamiliar with it! It seems to be a well-known speech recognition toolkit? I think I should study it more if I have time.

