TinyML Made Easy: Sound Classification (KWS)

We are continuing our exploration of Machine Learning on a giant tiny device, the Seeed XIAO BLE Sense. And now, classifying sound waves.

In my tutorial, TinyML Made Easy: Anomaly Detection & Motion Classification, we explored Embedded Machine Learning, or simply TinyML, running on the robust and still very tiny Seeed XIAO BLE Sense. In that tutorial, besides installing and testing the device, we explored motion classification using real data signals from its onboard accelerometer. In this new project, we will use the same XIAO BLE Sense to classify sound, specifically working as Keyword Spotting (KWS). KWS is a typical TinyML application and an essential part of a voice assistant.

To start, it is essential to realize that voice assistants on the market, like Google Home or Amazon Echo Dot, only react to humans when they are "woken up" by particular keywords, such as "Hey Google" for the first one and "Alexa" for the second.

In other words, the complete process of recognizing voice commands is based on a multi-stage model or Cascade Detection.

Stage 1: A smaller microprocessor inside the Echo-Dot or Google Home continuously listens to the sound, waiting for the keyword to be spotted. For such detection, a TinyML model at the edge is used (KWS application).

Stage 2: Only when triggered by the KWS application on Stage 1 is the data sent to the cloud and processed on a larger model.

In this project, we will focus on Stage 1 (KWS, or Keyword Spotting), using the XIAO BLE Sense and its onboard digital microphone to spot the keyword.

The diagram below gives an idea of how the final KWS application should work (during inference):

Our KWS application will recognize three classes of sound: the keyword UNIFEI, the keyword IESTI, and SILENCE (no keyword spoken).

The main component of the KWS application is its model. So, we must train such a model with our specific keywords:

The critical component of the Machine Learning workflow is the dataset. Once we have decided on our specific keywords (UNIFEI and IESTI), the whole dataset should be created from scratch. When working with the accelerometer, creating a dataset with data captured by the same type of sensor was essential. In the case of sound, it is different, because what we will classify is audio data.

When we speak a keyword, the sound waves should be converted to audio data. The conversion should be done by sampling the signal generated by the microphone at 16 kHz with a 16-bit depth.

So, any device that can generate audio data with this basic specification (16 kHz / 16 bits) will work fine. As a device, we can use the XIAO BLE Sense itself, a computer, or even your mobile phone.

In the tutorial TinyML Made Easy: Anomaly Detection & Motion Classification, we learned how to install and test our device using the Arduino IDE and how to connect it to Edge Impulse Studio for data capture. For that, we used the EI CLI function Data Forwarder, but according to Jan Jongboom, Edge Impulse CTO, audio goes too fast for the Data Forwarder to capture it; if you already have PCM data, turning it into a WAV file and uploading it with the Uploader is the easiest path. With the accelerometer, our sampling frequency was around 100 Hz; with audio, it is 16 kHz.

So, we cannot connect the XIAO directly to the Studio yet (but Edge Impulse should support it soon!). However, we can capture sound using any smartphone connected online to the Studio. We will not explore this option here, but you can easily follow the EI documentation and tutorials.

The easiest way to capture audio and save it locally as a .wav file is to use an expansion board for the XIAO family of devices, the Seeed Studio XIAO Expansion board.

This expansion board makes it easy and quick to build prototypes and projects, thanks to its rich set of peripherals such as an OLED display, SD card interface, RTC, passive buzzer, RESET/User button, 5 V servo connector, and multiple data interfaces.

This tutorial will focus on classifying keywords, and the MicroSD card available on the device will be very important in helping us with data capture.

Saving recorded audio from the microphone on an SD card

Connect the XIAO BLE Sense to the Expansion Board and insert an SD card into the SD card slot at the back.

Next, download the Seeed_Arduino_Mic library as a zip file (the Seeed_Arduino_FS library, which handles the SD card, may also be required if you do not have it yet):

Then install the downloaded library (Seeed_Arduino_Mic-master.zip) in your Arduino IDE:

Sketch -> Include Library -> Add .ZIP Library...

Next, navigate to File > Examples > Seeed Arduino Mic > mic_Saved_OnSDcard to open the example sketch.

Each time you press the reset button, a 5-second audio sample is recorded and saved on the SD card. I changed the original file to add LEDs that help during the recording process, as sketched below:
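Below is a minimal sketch of the idea only, not the complete modified file: the LED pin macros (LEDR, LEDG) and their active-low behavior are assumptions that depend on the XIAO nRF52840 Sense board package installed, and the microphone and SD card handling from the original example is only indicated by comments.

```cpp
// Minimal sketch of the LED feedback idea (not the full mic_Saved_OnSDcard example).
// LEDR/LEDG pin macros and their active-low behavior are assumptions that depend on the
// Seeed XIAO nRF52840 Sense board package installed in the Arduino IDE.

void setup() {
  pinMode(LEDR, OUTPUT);
  pinMode(LEDG, OUTPUT);
  digitalWrite(LEDR, HIGH);   // both LEDs off (active low)
  digitalWrite(LEDG, HIGH);

  // ... microphone, DMA buffer, and SD card initialization from the original example ...

  digitalWrite(LEDR, LOW);    // red ON: recording the 5-second sample

  // ... record the audio and write the .wav file to the SD card ...

  digitalWrite(LEDR, HIGH);   // red OFF: recording finished
  digitalWrite(LEDG, LOW);    // green ON: file saved, press RESET to record the next sample
}

void loop() {
  // Nothing to do here: one sample is recorded per reset, as in the original example.
}
```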

I realized that a "spike" was sometimes recorded at the beginning and the end of each sample, so I cut the initial 300 ms from each 5 s sample. The spike observed at the end always happened after the recording process and should be eliminated in Edge Impulse Studio before training. Also, I increased the microphone gain to 30 dB.

The complete file (Xiao_mic_Saved_OnSDcard.ino) can be found on my GitHub (3_KWS): Seeed-XIAO-BLE-Sense.

During the recording process, the .wav file names are shown on the Serial Monitor:

Take the SD card from the Expansion Board and insert it into your computer:

The files are ready to be uploaded to Edge Impulse Studio.

Alternatively, you can use your PC or smartphone to capture audio data with a sampling frequency of 16 kHz and a bit depth of 16 bits. A good app for that is Voice Recorder Pro (iOS). You should save your recordings as .wav files and send them to your computer.

Once the raw dataset is created, you should start a new project in Edge Impulse Studio:

Once the project is created, go to the Data Acquisition section and select the Upload Existing Data tool. Choose the files to be uploaded; for example, I started by uploading the samples recorded with the XIAO BLE Sense:

The samples will now appear in the Data acquisition section:

Click on the three dots after the sample name and select Split sample. Once inside the tool, split the data into 1-second records (try to avoid the start and end portions):

This procedure should be repeated for all samples. After that, upload the samples of the other classes (IESTI and SILENCE), captured with the XIAO and with your PC or smartphone.

In the end, my dataset has around 70 1-second samples for each class:

Now, you should split the dataset into Train/Test. You can do it manually (using the three-dots menu to move samples individually), or you can use the Perform Train/Test Split option in the Dashboard's Danger Zone.

We can optionally inspect the whole dataset using the Data Explorer tab. The data points of each class seem well separated, which suggests that the classification model should work:

An impulse takes raw data, uses signal processing to extract features, and then uses a learning block to classify new data.

First, we will take the data points with a 1-second window, augmenting the data by sliding that window every 500 ms. Note that the zero-pad option is set. This is important to fill with zeros any sample smaller than 1 second (in some cases, I reduced the 1000 ms window in the split tool to avoid noise and spikes).

Each 1-second audio sample should be pre-processed and converted to an image. For that, we will use MFCC, which extracts features from audio signals using Mel-Frequency Cepstral Coefficients, a technique that works very well with the human voice.

For classification, we will select Keras, which means we will build our model from scratch (image classification using a Convolutional Neural Network).

The next step is to create the images to be trained in the next phase:

We will keep the default parameter values. We do not spend much memory to pre-process the data (only 17 KB), but the processing time is relatively high (177 ms on a Cortex-M4 CPU such as our XIAO's). Save the parameters and generate the features:

The model that we will use is a Convolutional Neural Network (CNN). We will use two blocks of Conv1D + MaxPooling (with 8 and 16 neurons, respectively) and a 0.25 Dropout. On the last layer, after flattening, there are three neurons, one for each class:

As hyperparameters, we will use a learning rate of 0.005 and train the model for 100 epochs. The result seems OK:

Testing the model with the data set apart before training (Test Data), we got an accuracy of 75%. Given the small amount of data used, this is OK, but I strongly suggest increasing the number of samples.

After collecting more data, the test accuracy improved, going from 75% to around 81%:

Now we can proceed with the project, but before deploying it to our device, it is possible to perform Live Classification using a smartphone to confirm that the model works with live, real data:

The Studio will package all the needed libraries, preprocessing functions, and the trained model, downloading them to your computer. You should select the Arduino Library option and, at the bottom, select Quantized (Int8) and Build.

A Zip file will be created and downloaded to your computer:

On your Arduino IDE, go to the Sketch tab and select the option Add .ZIP Library.

And choose the .zip file downloaded by the Studio:

Now it is time for a real test. We will make inferences completely disconnected from the Studio. Let's change one of the code examples created when you deployed the Arduino Library.

In your Arduino IDE, go to File > Examples, look for your project, and select the example nano_ble33_sense_microphone_continuous:

Even though the XIAO is not the same board as the Arduino Nano 33 BLE Sense, both have the same MCU (nRF52840) and a PDM digital microphone, so the code works as is. Upload the sketch to the XIAO and open the Serial Monitor. Say one of the keywords and confirm that the model is working correctly:

Now that we know that the model is working by detecting our two keywords, let's modify the code so we can see the result with the XIAO BLE Sense completely offline (disconnected from the PC and powered by a battery).

The idea is that whenever the keyword UNIFEI is detected, the red LED will turn ON; if it is IESTI, the green LED will turn ON; and if it is SILENCE (no keyword), both LEDs will stay OFF.

Installing and Testing the SSD1306 OLED Display

In your Arduino IDE, install the u8g2 library and run the code below for testing:
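A minimal test sketch, assuming the Expansion Board's 0.96-inch SSD1306 OLED on hardware I2C and using the text-only u8x8 API of the u8g2 library (the constructor may need adjustment for your display and library version), could look like this:

```cpp
#include <Arduino.h>
#include <U8x8lib.h>
#include <Wire.h>

// SSD1306 128x64 OLED on the Expansion Board, driven over hardware I2C.
U8X8_SSD1306_128X64_NONAME_HW_I2C u8x8(/* reset=*/ U8X8_PIN_NONE);

void setup(void) {
  u8x8.begin();
  u8x8.setFlipMode(1);                          // rotate the display 180 degrees if needed
  u8x8.setFont(u8x8_font_chroma48medium8_r);    // simple 8x8 text font
  u8x8.drawString(0, 0, "Hello World!");        // column 0, row 0
}

void loop(void) {
}
```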

And you should see "Hello World!" displayed on the OLED:

Now, let's create some functions that, depending on the values of pred_index and pred_value, will turn on the proper LED and show the class and its probability on the display. The code below will simulate some inference results and present them on the display and LEDs:
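Here is a minimal sketch of those functions; the helper names (turn_off_leds, display_results), the class order, the simulated probability, and the LED pin macros are illustrative placeholders rather than the exact code in my repository:

```cpp
#include <U8x8lib.h>

// OLED on the Expansion Board (same constructor used in the "Hello World" test)
U8X8_SSD1306_128X64_NONAME_HW_I2C u8x8(/* reset=*/ U8X8_PIN_NONE);

// Placeholder class order: it must match the label order of your Edge Impulse model
const char* classes[] = {"IESTI", "SILENCE", "UNIFEI"};

void turn_off_leds() {
  digitalWrite(LEDR, HIGH);   // RGB LED assumed active low (board-package dependent)
  digitalWrite(LEDG, HIGH);
}

// Turn on the proper LED and show class and probability on the OLED
void display_results(int pred_index, float pred_value) {
  turn_off_leds();
  if (pred_index == 2) digitalWrite(LEDR, LOW);        // UNIFEI  -> red LED
  else if (pred_index == 0) digitalWrite(LEDG, LOW);   // IESTI   -> green LED
  // SILENCE -> both LEDs stay off

  u8x8.clear();
  u8x8.drawString(0, 0, classes[pred_index]);
  char prob[16];
  snprintf(prob, sizeof(prob), "Prob: %d%%", (int)(pred_value * 100));
  u8x8.drawString(0, 2, prob);
}

void setup() {
  pinMode(LEDR, OUTPUT);
  pinMode(LEDG, OUTPUT);
  turn_off_leds();
  u8x8.begin();
  u8x8.setFlipMode(1);
  u8x8.setFont(u8x8_font_chroma48medium8_r);
}

void loop() {
  // Simulate one inference result per class, just to test the LEDs and the display
  for (int i = 0; i < 3; i++) {
    display_results(i, 0.87);
    delay(2000);
  }
}
```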

Running the above code, you should get the below result:

Now, you should merge the above code (initialization and functions) with the nano_ble33_sense_microphone_continuous.ino that you initially used for testing your model. You should also add code to loop(), replacing the original lines that print the inference results on the Serial Monitor, so that the most probable class is found and shown on the LEDs and the display.
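As a sketch of that snippet: result, classification[], and EI_CLASSIFIER_LABEL_COUNT come from the Edge Impulse generated example, while display_results() is the placeholder helper shown above and the 0.7 confidence threshold is just illustrative.

```cpp
// Inside loop(), after run_classifier_continuous() has filled 'result':
// find the class with the highest probability and update the LEDs and display.
int   pred_index = 0;
float pred_value = 0.0f;

for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
  if (result.classification[ix].value > pred_value) {
    pred_index = (int)ix;
    pred_value = result.classification[ix].value;
  }
}

// Only react when the model is reasonably confident (threshold is illustrative)
if (pred_value >= 0.7f) {
  display_results(pred_index, pred_value);
}
```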

Here you can see how the final project looks:

The Seeed XIAO BLE Sense is really a "giant tiny device"! Despite its size, it is powerful, reliable, inexpensive, low power, and has suitable sensors for the most common embedded machine learning applications, such as movement and sound. Even though Edge Impulse does not officially support the XIAO BLE Sense (yet!), we also realized that it can use the Studio for training and deployment.

Before we finish, consider that sound classification is much more than just voice. For example, you can develop TinyML projects around sound in several areas, such as:

If you want to learn more about Embedded Machine Learning (TinyML), please see these references:

As always, I hope this project can help others find their way in the exciting world of AI!

Greetings from the south of the world!

See you at my next project!
