Understanding Sound: A Comprehensive Guide to Digital Audio
What is Sound?
This article provides an overview of digital sound and audio data, supplemented by a Python example that explores audio files, sound waves, and spectrograms.
Machine learning and data analysis on sound is an emerging field with significant potential for Data Science applications. Here, you will learn what sound is and how it can be represented on a computer.
You will also learn what the Fourier Transform does and how to use it to prepare audio data for Machine Learning or data analysis, with a complete Python example.
What Constitutes Audio Data?
You might be familiar with tabular data, the conventional way of organizing information in rows and columns. This format is ideal for many Machine Learning challenges.
The next level of complexity is image data, which is trickier to handle since it cannot be structured as tabular data. Computers display images by assigning values to small squares called pixels. By assigning Red, Green, and Blue values to each pixel, images can be visualized.
When performing Machine Learning on images, this logic is replicated by storing the pixel values in a three-dimensional array (image height, image width, and the three color channels). If you wish to delve deeper into machine learning concerning images, consider resources on object detection and image segmentation.
The next challenge is understanding sound (or audio) data. At its most basic, Audio Data is a numerical representation of sounds.
Sound as Air Pressure
Audio Data can be challenging to work with because, unlike tabular and image data, it lacks a clear and organized structure.
Sound, in its most fundamental (non-digital) form, is a fluctuation in air pressure that the human ear can perceive.
From Analog Sound to Digital Audio: The Role of Microphones
To store sound in a digital format, it must first be converted into a structure that computers can manage.
This conversion involves two main steps:
- A Microphone transforms air pressure variations into voltage. This is a crucial yet often overlooked function of microphones.
- An Analog-to-Digital Converter measures (samples) this voltage many times per second and translates each measurement into a number. The resulting sequence of numbers is the digital waveform, which can be stored on a computer.
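The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `microphone_voltage` is a hypothetical stand-in for the analog signal, and the sample rate, duration, and bit depth are values picked for the example rather than anything a particular converter uses.

```python
import numpy as np

def microphone_voltage(t):
    """Hypothetical analog signal: a 440 Hz tone swinging between -1 V and +1 V."""
    return np.sin(2 * np.pi * 440.0 * t)

sample_rate = 8000   # measurements per second (assumed for illustration)
duration = 0.01      # seconds of sound to capture
bit_depth = 16       # bits used to store each measurement

# Sampling: measure the voltage at evenly spaced instants.
n_samples = round(sample_rate * duration)
t = np.arange(n_samples) / sample_rate
voltage = microphone_voltage(t)

# Quantization: round each voltage to the nearest integer in the
# signed 16-bit range.
max_level = 2 ** (bit_depth - 1) - 1
samples = np.round(voltage * max_level).astype(np.int16)

print(samples[:8])
```

The `samples` array is the digital waveform: one integer per measurement, ready to be written to disk.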
Audio Data: Amplitude and Frequency
Having grasped the basics of converting sound to numbers, the next question is: what do these numbers signify in terms of sound?
There are two critical types of values: amplitude and frequency.
- The amplitude of a sound wave indicates its volume.
- The frequency of a sound wave denotes its pitch.
Thus, a single sound wave is characterized by these two fundamental properties, amplitude and frequency, allowing for the creation of tones that range from loud to soft and low to high.
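To see how these two properties show up in the numbers, here is a minimal sketch that generates two pure tones. The helper names `pure_tone` and `count_upward_zero_crossings` are made up for this example, and the frequencies and amplitudes are arbitrary choices.

```python
import numpy as np

def pure_tone(frequency_hz, amplitude, duration_s=1.0, sample_rate=22050):
    """Return the samples of a sine wave with the given pitch and volume."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2 * np.pi * frequency_hz * t)

low_soft = pure_tone(frequency_hz=110.0, amplitude=0.2)   # low pitch, quiet
high_loud = pure_tone(frequency_hz=880.0, amplitude=0.9)  # high pitch, loud

# Amplitude sets the height of the wave.
print(np.abs(low_soft).max(), np.abs(high_loud).max())

def count_upward_zero_crossings(wave):
    """Count how often the wave passes from negative to non-negative."""
    return int(np.sum((wave[:-1] < 0) & (wave[1:] >= 0)))

# Frequency sets how many cycles fit into one second.
print(count_upward_zero_crossings(low_soft),
      count_upward_zero_crossings(high_loud))
```

The peak values track the amplitudes you passed in, and the number of zero crossings per second tracks the frequencies.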
Real-world Sounds as Composite Waves
Now that you understand the nature of a single sound wave, let’s consider a more complex and realistic scenario. Typically, multiple sounds coexist in an audio file. For instance, in music, several tones are usually heard simultaneously. Similarly, nature recordings often feature various animal sounds and environmental noises all at once.
Audio files containing only one frequency are rare. Consequently, summarizing an audio file as data necessitates describing multiple waves.
A complete sound consists of a blend of waves, representing various frequencies at different amplitudes. In simpler terms, sound is a combination of high and low tones at varying volumes.
To analyze such intricate waves, they must be separated into amplitudes per frequency over time, typically depicted through a spectrogram.
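A blended wave is easy to build yourself, which helps to see why it is hard to take apart again. The sketch below adds three hypothetical tones together; the frequencies and amplitudes are arbitrary values chosen for illustration.

```python
import numpy as np

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate   # one second of sample times

# Three hypothetical tones: (frequency in Hz, amplitude)
tones = [(220.0, 0.5), (440.0, 0.3), (1760.0, 0.1)]

# The composite wave is the sample-by-sample sum of the individual tones.
composite = sum(a * np.sin(2 * np.pi * f * t) for f, a in tones)

print(composite.shape)   # still a single number per sample
print(composite[:5])
```

Nothing in the resulting array says which tones went into it. Recovering that information is the job of the frequency analysis described next.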
Spectrograms: Visualizing Sound Waves
Spectrograms are graphical representations that illustrate sound over time. The graph displays time on the x-axis and frequency on the y-axis, with color indicating the amplitude of a specific frequency at a given moment.
Creating Spectrograms with Fourier Transform
The Fourier Transform is a mathematical technique that decomposes a "complex" sound into the frequencies it contains, giving the amplitude of each one. Applying it to many short, consecutive slices of the sound (the short-time Fourier Transform) produces a spectrogram that displays volume (amplitude) for each frequency over time.
While a detailed explanation is beyond this article's scope, I encourage you to read more about the Fourier Transform.
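If you would like a feel for the mechanics without the math, the following NumPy sketch builds a small spectrogram by hand. It is a simplified illustration, not how any particular library implements it: the test signal, `frame_length`, and `hop_length` are assumptions made for the example.

```python
import numpy as np

sample_rate = 8000
t = np.arange(2 * sample_rate) / sample_rate   # two seconds of sample times

# Hypothetical signal: a 440 Hz tone for the first second, then 1000 Hz.
signal = np.where(t < 1.0,
                  0.8 * np.sin(2 * np.pi * 440.0 * t),
                  0.4 * np.sin(2 * np.pi * 1000.0 * t))

frame_length = 1024   # samples per analysis window
hop_length = 512      # step between the starts of successive windows
window = np.hanning(frame_length)

frames = []
for start in range(0, len(signal) - frame_length + 1, hop_length):
    frame = signal[start:start + frame_length] * window
    # rfft returns one complex value per frequency bin; abs() keeps the
    # magnitude, which is proportional to the amplitude at that frequency.
    frames.append(np.abs(np.fft.rfft(frame)))

# Rows are frequency bins, columns are time frames.
spectrogram = np.array(frames).T
bin_frequencies = np.fft.rfftfreq(frame_length, d=1 / sample_rate)

# The strongest frequency bin in the first and in the last time frame.
first = bin_frequencies[spectrogram[:, 0].argmax()]
last = bin_frequencies[spectrogram[:, -1].argmax()]
print(spectrogram.shape, first, last)
```

The strongest bin sits near 440 Hz at the start and near 1000 Hz at the end, which is exactly the change over time that a spectrogram is meant to reveal.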
Representing Sound in Python with Librosa
Now let’s put theory into practice by manipulating a real music file in Python using the librosa library, which is excellent for audio processing in Python.
To install librosa, simply run !pip install librosa in a Jupyter Notebook, or consider using Google Colab for this task.
When working with sound data, you will typically use .WAV files, which store the samples uncompressed. .FLAC files are compressed losslessly, so they decode to exactly the same numbers. Lossy formats such as .MP3 discard some information, which changes the numerical representation, though it's not impossible to use them.
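To illustrate what "uncompressed" means in practice, the sketch below writes a generated tone to a .WAV file with Python's built-in `wave` module and reads it back. The file name `tone.wav` and the tone itself are placeholders for this example.

```python
import os
import tempfile
import wave

import numpy as np

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
# One second of a 440 Hz tone as little-endian 16-bit integers.
samples = np.round(0.5 * 32767 * np.sin(2 * np.pi * 440.0 * t)).astype('<i2')

path = os.path.join(tempfile.mkdtemp(), 'tone.wav')   # hypothetical file name

with wave.open(path, 'wb') as wav_file:
    wav_file.setnchannels(1)            # mono
    wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(samples.tobytes())

with wave.open(path, 'rb') as wav_file:
    raw = wav_file.readframes(wav_file.getnframes())
    restored = np.frombuffer(raw, dtype='<i2')

# The numbers read back are identical to the numbers written.
print(np.array_equal(samples, restored))
```

Because the file stores the samples directly, the round trip does not alter a single value.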
Librosa includes an example music file that you can import as follows:
Once the music file is imported, you can use a Jupyter Notebook function to play it back:
You will see an audio player in your notebook that allows you to listen to the music.
Next, let’s visualize the sound wave, making it "visible." Remember, real sounds are complex and do not appear as simple waves.
You can plot the sound wave using the following code:
This will produce a plot of the waveform, with time on the x-axis and amplitude on the y-axis.
Next, we will create the most informative visualization of sound data: the spectrogram. To derive the spectrogram input data, apply the Fourier Transform to the wave data y.
While Fourier Transforms involve advanced mathematics, we can leverage a function in librosa to perform these calculations effortlessly, as shown in the following code:
You will receive a two-dimensional array with one row per frequency bin and one column per time frame.
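Before plotting the spectrogram, the data needs to be converted into decibels. Decibels are a logarithmic scale, which matches how we perceive loudness. On the raw amplitude scale, a few loud frequencies would dominate the plot and the quieter ones would be almost invisible.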
You can convert the data using the following code:
Finally, you can display the spectrogram using the following code:
This will yield a graph depicting the spectrogram of the Nutcracker excerpt. You have successfully imported a music file and transformed it into a visual data format. The spectrogram shows the volume at each frequency over time, making it a useful visual representation of music.
Conclusion
In this article, you have acquired foundational knowledge about working with sound and audio in a digital format. You have transformed audio data from a file that you can only listen to into a visual representation as a spectrogram.
Having mastered audio data importation and preparation, you are now prepared to tackle more advanced applications, such as employing machine learning for music genre classification or sound detection.
I hope you found this article beneficial. Thank you for reading, and stay tuned for more content on statistics, mathematics, and data!