Acoustic-Visual Multimodal Scene Recognition

a case study of audio/image aerial scene classification

franky
Jun 17, 2022

Multimodal Scene Recognition with image/audio/video inputs presents challenges in integrating data, designing models, and identifying co-training strategies that do not arise in Single-Modality (e.g., image-only or audio-only) Scene Recognition. This post consists of two parts. The first part introduces five multimodal scene recognition datasets for reference; they cover location classification, event classification, and crowd counting. The second part describes a case study of an audio/image aerial scene classification problem based on a pre-trained audio/image embedding approach. The complete pipeline of data processing and model building, together with the performance results, is described in detail.

Five Multimodal Scene Recognition Datasets

TAU Urban Audio-Visual Scenes 2021 ~ A location classification dataset consisting of 12,292 videos with corresponding sound clips, categorized into 10 classes: airport, metro station, public square, shopping mall, street pedestrian, street traffic, traveling by bus, traveling by metro, traveling by train, and urban park. [paper][dataset][webpage]

Figure 1. TAU Urban Audio-Visual Scenes 2021 — Location Classification.

ADVANCE (AuDio Visual Aerial sceNe reCognition datasEt) ~ A location classification dataset consisting of 5,075 paired images and sound clips categorized into 13 classes: airport, beach, bridge, farmland, forest, grassland, harbor, lake, residential area, orchard, sparse shrub land, sports land, and train station. [paper][dataset][website]

Figure 2. ADVANCE — Location Classification.

DISCO (auDIoviSual Crowd cOunting dataset) ~ A crowd counting dataset consisting of 1,935 paired images and audio clips. The minimum, average, and maximum numbers of people per image are 1, 87.9, and 709, respectively. [paper][dataset][website]

Figure 3. DISCO — Crowd Counting.

Crowded Scene Recognition Dataset ~ An event classification dataset consisting of 341 videos categorized into 5 classes: riot, noise-street, firework-event, music-event, and sport-atmosphere. [paper][dataset]

Figure 4. Crowded Scene Recognition Dataset — Event Classification.

XD-Violence ~ An event classification dataset consisting of 2,349 non-violent videos and 2,405 violent videos, where the violent videos are categorized into 6 classes: abuse, car accident, explosion, fighting, riot, and shooting. [paper][webpage]

Figure 5. XD-Violence — Event Classification.

Aerial Scene Recognition on ADVANCE Dataset

ADVANCE (AuDio Visual Aerial sceNe reCognition datasEt) is a location classification dataset consisting of 5,075 paired images and sound clips. These image/audio pairs are categorized into 13 classes: airport, beach, bridge, farmland, forest, grassland, harbor, lake, residential area, orchard, sparse shrub land, sports land, and train station. The image files are 512 × 512 JPG files and the audio files are 10-second WAV files. Figure 6 illustrates the geographic coordinates and samples of the image/audio pairs, as well as the distribution of pairs across classes.

Figure 6. (a) Coordinates and samples of image/audio pairs; (b) Distribution of image/audio pairs across classes.

The pipeline of data processing and model building is shown in Figure 7.

Figure 7. The pipeline of data processing and model building.

The first step of the pipeline is to convert the image/audio pairs to the corresponding image/audio embeddings using OpenL3. OpenL3 is an open-source Python library for computing deep audio and image embeddings. OpenL3 is an enhancement of L³-Net; both use audio-visual correspondence to align audio and video and then generate their representations accordingly. In the case of OpenL3, the models are trained on audio/video from two AudioSet subsets: an environmental subset and a music subset. The audio embedding from OpenL3 has been evaluated on three downstream audio datasets (UrbanSound8K, the Environmental Sound Classification dataset ESC-50, and the DCASE 2013 Scene Classification dataset). To apply this mechanism to the image/audio pairs in the ADVANCE dataset, OpenL3 is configured to produce representations with an embedding size of 512 for both image and audio. The shape of an image embedding is (1, 512), and the shape of an audio embedding is (96, 512) because the latter includes a temporal dimension. Finally, the image embedding is duplicated and interleaved with the audio embedding to create a joined representation of shape (102, 512), which is fed to a convolutional neural network for feature extraction and representation classification. The architecture and configuration of the convolutional neural network are shown in Figure 8.
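
The sketch below shows one way to compute the embeddings and assemble the joined representation with OpenL3. The file paths are placeholders, and the interleaving pattern (one copy of the image embedding after every ~16 audio frames, six copies in total) is only an assumption that reproduces the stated (102, 512) shape; the post does not spell out the exact pattern.

```python
import numpy as np
import openl3
import soundfile as sf
from PIL import Image

# Load one image/audio pair (the paths below are placeholders, not the
# actual ADVANCE directory layout).
image = np.array(Image.open("advance/airport/0001.jpg"))   # (512, 512, 3)
audio, sr = sf.read("advance/airport/0001.wav")            # ~10 s of samples

# OpenL3 embeddings with the "env" (environmental) content type, size 512.
aud_emb, _ = openl3.get_audio_embedding(
    audio, sr, content_type="env", input_repr="mel256", embedding_size=512
)                                       # roughly (96, 512) for a 10 s clip
img_emb = openl3.get_image_embedding(
    image, content_type="env", input_repr="mel256", embedding_size=512
)
img_emb = np.atleast_2d(img_emb)        # (1, 512)

# Interleave: duplicate the image embedding and insert one copy after every
# block of audio frames. Splitting 96 audio frames into 6 blocks of 16 and
# appending the image embedding after each block yields (102, 512).
chunks = []
for block in np.array_split(aud_emb, 6, axis=0):
    chunks.append(block)
    chunks.append(img_emb)
joint = np.concatenate(chunks, axis=0)
print(joint.shape)                      # (102, 512) for a (96, 512) audio embedding
```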

Figure 8. The architecture and configuration of the convolutional neural network implemented by Keras.
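
As a rough illustration, a compact 1D CNN over the (102, 512) joined representation can be built in Keras as follows; the layer types, sizes, and hyperparameters are placeholders and may differ from the exact configuration shown in Figure 8.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A compact 1D CNN over the (102, 512) joined representation with a
# 13-way softmax for the ADVANCE classes. Layer sizes are placeholders.
model = models.Sequential([
    tf.keras.Input(shape=(102, 512)),
    layers.Conv1D(256, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.3),
    layers.Dense(13, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```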

To the best of my observation and knowledge, there are two advantages to the above pipeline of data processing and model building. First, because the ADVANCE dataset is small in terms of both the total number of image/audio pairs and the average number of pairs per class, using a pre-trained image/audio embedding mechanism based on an environmental dataset should work better than co-training an image-specific model and an audio-specific model from scratch on such a small dataset. Second, interleaving the image embedding with the audio embedding naturally mirrors the original idea of audio-visual correspondence. Furthermore, such a joined representation significantly simplifies the architecture of the downstream model, and a convolutional neural network should be able to capture useful feature maps from it. Luckily, the performance results stated below more or less justify these claims.

After training for 200 epochs, the proposed model reaches an accuracy between 93% and 96%. The details of a trial run reaching an accuracy of 95.8% are shown below:

  • the strategy for splitting and augmenting the data is given in Figure 9 (a code sketch of this step follows the figures below)
  • the history of accuracy and loss is given in Figure 10
  • the confusion matrix and the classification report are given in Figure 11
Figure 9. The strategy for splitting and augmenting the data. (train/test ratio: 9:1, augment by upsampling)
Figure 10. The history of accuracy and loss.
Figure 11. Confusion Matrix and Classification Report.
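
As a sketch of the splitting and augmentation strategy in Figure 9, the snippet below performs a stratified 9:1 train/test split and a naive upsampling of minority classes, then trains the model from the earlier sketch for 200 epochs. The .npy file names are hypothetical stand-ins for the preprocessed joined representations and labels, and the post's exact augmentation scheme may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical files holding the (N, 102, 512) joined representations
# and the (N,) integer class labels produced by the steps above.
X = np.load("advance_joint_embeddings.npy")
y = np.load("advance_labels.npy")

# Stratified 9:1 train/test split, as in Figure 9.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

# Naive upsampling: repeat minority-class samples until every class matches
# the largest class; the post's exact augmentation scheme may differ.
target = np.bincount(y_train).max()
parts_X, parts_y = [], []
for c in np.unique(y_train):
    Xc, yc = X_train[y_train == c], y_train[y_train == c]
    Xc_up, yc_up = resample(Xc, yc, n_samples=target, replace=True, random_state=42)
    parts_X.append(Xc_up)
    parts_y.append(yc_up)
X_train_bal = np.concatenate(parts_X)
y_train_bal = np.concatenate(parts_y)

# Train the model sketched above for 200 epochs.
model.fit(X_train_bal, y_train_bal, epochs=200, batch_size=32,
          validation_data=(X_test, y_test))
```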

Conclusions

This post has presented an overview of multimodal scene recognition with image/audio/video inputs. The first part of this article provided a brief introduction to five multimodal scene recognition datasets, covering location classification, event classification, and crowd counting. These datasets should provide interesting challenges for researchers and practitioners in the field. The second part of this article described a case study of an image/audio aerial scene recognition problem. The key components of the proposed pipeline are a pre-trained image/audio embedding mechanism, an interleaved and augmented image/audio joined representation, and a convolutional neural network for feature extraction and representation classification. The proposed pipeline delivered reasonable performance on the image/audio aerial scene recognition problem. However, more advanced approaches such as multimodal transformers could be an interesting area for further investigation.

Thanks for reading this post and sharing your thoughts on this topic.
