Ambient acoustic context dataset for building responsive, context-augmented voice assistants

Curating AudioSet with fine-grained labels for a workplace setting

While Google's AudioSet provides a large-scale audio dataset covering a wide range of events, each event is labeled only at the level of a 10-second audio segment (weakly labeled), making it difficult to know when the event actually occurred within the segment. We took a crowdsourcing approach and curated AudioSet to provide one-second label granularity (strongly labeled) for typical activities in a workplace setting. With these fine-grained labels, researchers can build more responsive and accurate audio understanding machine learning models.

Dataset description

The dataset contains around 57,000 1-second segments covering activities that occur in a workplace setting. We curated Google AudioSet to annotate the audio labels at 1-second granularity using Amazon Mechanical Turk. We asked crowd workers to listen to each 1-second segment and choose the correct label. To ensure annotation quality, we excluded audio segments that did not reach majority agreement among the workers. Disclaimer: While we tried to ensure the annotations are correct for all segments, some segments may still carry incorrect annotations. Please make sure to account for such cases in your data pipelines.
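The majority-agreement filter described above can be sketched as follows. This is an illustrative example, not the exact script we used; the function name and the strict-majority threshold are assumptions.

```python
from collections import Counter

def majority_label(worker_labels, min_fraction=0.5):
    """Return the label chosen by a strict majority of workers,
    or None if no label exceeds min_fraction of the votes
    (in which case the segment would be excluded)."""
    if not worker_labels:
        return None
    label, count = Counter(worker_labels).most_common(1)[0]
    return label if count / len(worker_labels) > min_fraction else None
```

For example, `majority_label(["Typing", "Typing", "Silence"])` returns `"Typing"`, while a 1-vs-1 split such as `["Typing", "Silence"]` yields `None` and the segment is dropped.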

Event                           # of segments
Clicking                                   88
Door                                      113
Conversation                              120
Male Speech                               124
Female Speech                             136
Chatter                                   158
Knock                                     180
Walk                                      220
Hubbub                                    227
Television                                239
Clapping                                  240
Silence                                   251
Typing                                    315
Applause                                  360
Laughter                                  530
Crowd                                   1,218
Other, unidentifiable events           24,081
Speech                                 28,671

Dataset structure

The structure of the dataset largely follows that of Google AudioSet:

┣ [audio_segment_id]_[index].wav
┣ [audio_segment_id]_[index]_labels.txt
┣ ...
┗ ...

As shown in the above structure, each directory contains 1-second audio segments ([audio_segment_id]_[index].wav) and corresponding label files ([audio_segment_id]_[index]_labels.txt). The wav files can be used directly for training and testing a machine learning model, or converted into audio embeddings using the VGGish model.

[start_seconds], [id], [name]

The label files are structured as shown above: each entry contains the start time of the event and the event's ID and name. The ID and name follow the Google AudioSet Ontology convention.
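A label line in this format can be parsed as below. The function name and the dictionary keys are assumptions for illustration; note the split limit of 2, which keeps event names containing commas (e.g. "Other, unidentifiable events") intact.

```python
def parse_label_line(line):
    """Parse a "[start_seconds], [id], [name]" line into a dict.

    Splits on the first two commas only, so commas inside the
    event name are preserved.
    """
    start, event_id, name = [field.strip() for field in line.split(",", 2)]
    return {"start_seconds": float(start), "id": event_id, "name": name}
```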


Augmenting Conversational Agents with Ambient Acoustic Contexts
Chunjong Park, Chulhong Min, Sourav Bhattacharya, Fahim Kawsar
In 22nd International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI ’20), October 5–8, 2020