The goal of acoustic scene classification is to classify a test recording into one of ten predefined acoustic scene classes.
- dcase_util – used to ease data handling, https://github.com/DCASE-REPO/dcase_util
- keras – neural network API for fast experimentation, used on top of the tensorflow machine learning framework
- scikit-learn – set of machine learning tools, here used to evaluate the system output

The TUT Urban Acoustic Scenes 2018 development dataset is used in this example:
The dataset can be downloaded and accessed easily by using the dataset handler class from dcase_util:
import dcase_util

db = dcase_util.datasets.TUTUrbanAcousticScenes_2018_DevelopmentSet(
    data_path=dataset_storage_path
).initialize()
Audio file count : 8640 Scene class count: 10
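These counts could be read directly from the dataset object, for example (a short sketch using the same db handle):

# Number of audio files and scene classes in the dataset
print('Audio file count :', len(db.audio_files))
print('Scene class count:', len(db.scene_labels()))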
Basic statistics of the dataset:
MetaDataContainer :: Class
  Items                             : 8640
  Unique
    Files                           : 8640
    Scene labels                    : 10
    Event labels                    : 0
    Tags                            : 0
    Identifiers                     : 286
    Datasets                        : 0
    Source labels                   : 1

  Scene statistics
        Scene label             Count   Identifiers
        --------------------   ------   -----------
        airport                   864            22
        bus                       864            36
        metro                     864            29
        metro_station             864            40
        park                      864            25
        public_square             864            24
        shopping_mall             864            22
        street_pedestrian         864            28
        street_traffic            864            25
        tram                      864            35
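The overview above can be printed directly from the dataset metadata; a minimal sketch, assuming the full metadata is exposed as a dcase_util MetaDataContainer via db.meta:

# Print the metadata container overview (item counts and scene statistics)
db.meta.show()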
dcase_util can be used to split the original training set into a new training set and a validation set (70/30 split done according to recording locations). For reference, the dataset's provided training/test split per scene class:

Scene label | Train set (items) | Test set (items) | Train split % | Train (locations) | Test (locations) |
---|---|---|---|---|---|
airport | 599 | 265 | 69 | 15 | 7 |
bus | 622 | 242 | 71 | 26 | 10 |
metro | 603 | 261 | 69 | 20 | 9 |
metro_station | 605 | 259 | 70 | 28 | 12 |
park | 622 | 242 | 71 | 18 | 7 |
public_square | 648 | 216 | 75 | 18 | 6 |
shopping_mall | 585 | 279 | 67 | 16 | 6 |
street_pedestrian | 617 | 247 | 71 | 20 | 8 |
street_traffic | 618 | 246 | 71 | 18 | 7 |
tram | 603 | 261 | 69 | 24 | 11 |
Overall | 6122 | 2518 | 70 | 203 | 83 |
training_files, validation_files = db.validation_split(
    validation_amount=0.3,  # split target 30%
    fold=1,                 # cross-validation fold id
    # balance based on scenes and locations
    balancing_mode='identifier_two_level_hierarchy',
    disable_progress_bar=True
)
train_meta = db.train(1).filter(file_list=training_files)
validation_meta = db.train(1).filter(file_list=validation_files)
Training items : 4134 Validation items: 1988
Scene label | Train (locations) | Validation (locations) | Validation split % (locations) | Train set (items) | Validation set (items) | Validation split % (items) |
---|---|---|---|---|---|---|
airport | 9 | 6 | 40.0 | 411 | 188 | 31.4 |
bus | 16 | 10 | 38.5 | 413 | 209 | 33.6 |
metro | 13 | 7 | 35.0 | 422 | 181 | 30.0 |
metro_station | 18 | 10 | 35.7 | 408 | 197 | 32.6 |
park | 12 | 6 | 33.3 | 425 | 197 | 31.7 |
public_square | 12 | 6 | 33.3 | 433 | 215 | 33.2 |
shopping_mall | 10 | 6 | 37.5 | 360 | 225 | 38.5 |
street_pedestrian | 13 | 7 | 35.0 | 422 | 195 | 31.6 |
street_traffic | 12 | 6 | 33.3 | 425 | 193 | 31.2 |
tram | 16 | 8 | 33.3 | 415 | 188 | 31.2 |
Overall | 131 | 72 | 35.5 | 4134 | 1988 | 32.5 |
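The item counts in this table could be reproduced from the metadata containers, for example (a short sketch using MetaDataContainer.filter):

# Count items per scene class in the new training and validation sets
for scene_label in db.scene_labels():
    print(
        scene_label,
        len(train_meta.filter(scene_label=scene_label)),
        len(validation_meta.filter(scene_label=scene_label))
    )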
The feature extractor is initialized with the desired parameters and used to extract log-mel features:
# Load audio
audio = dcase_util.containers.AudioContainer().load(
    filename=db.audio_files[0], mono=True
)

# Create feature extractor
mel_extractor = dcase_util.features.MelExtractor(
    n_mels=40,
    win_length_seconds=0.04,
    hop_length_seconds=0.02,
    fs=audio.fs
)

# Extract features
mel_data = mel_extractor.extract(y=audio)
mel_data shape (frequency, time): (40, 501)
1) Feature matrix
# Load audio
audio = dcase_util.containers.AudioContainer().load(
    filename=train_meta[0].filename, mono=True
)

# Extract log-mel energies
sequence_length = 500  # 10s / 0.02s = 500
mel_extractor = dcase_util.features.MelExtractor(
    n_mels=40, win_length_seconds=0.04, hop_length_seconds=0.02, fs=audio.fs
)
features = mel_extractor.extract(audio.data)[:, :sequence_length]
features shape (frequency, time): (40, 500)
2) Target vector (one-hot encoded)
import numpy

# List of scene labels
scene_labels = db.scene_labels()

# Empty target vector
target_vector = numpy.zeros(len(scene_labels))

# Place one at correct position
target_vector[scene_labels.index(train_meta[0].scene_label)] = 1
array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
target vector shape (classes, ): (10,)
All learning data is collected into X_train and Y_train matrices:
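The exact collection loop is not shown in this tutorial; a minimal sketch of how it could look, reusing mel_extractor, sequence_length and scene_labels from above:

X_train = []
Y_train = []
for item in train_meta:
    # Extract log-mel features for the item
    audio = dcase_util.containers.AudioContainer().load(
        filename=item.filename, mono=True
    )
    X_train.append(mel_extractor.extract(audio.data)[:, :sequence_length])

    # One-hot encoded target vector for the item's scene label
    target_vector = numpy.zeros(len(scene_labels))
    target_vector[scene_labels.index(item.scene_label)] = 1
    Y_train.append(target_vector)

X_train = numpy.array(X_train)
Y_train = numpy.array(Y_train)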
X_train shape (sequence, frequency, time): (4134, 40, 500)
Y_train shape (sequence, classes): (4134, 10)
Matrix data:
The neural network consists of two CNN blocks, a global pooling layer, and the output layer.
Next, we create the neural network structure layer by layer.
Input layer:
from keras.layers import Input, Reshape, Conv2D, BatchNormalization, Activation, MaxPooling2D, Dropout, GlobalMaxPooling2D, Dense

feature_vector_length = 40  # number of mel bands

input_layer = Input(
    shape=(feature_vector_length, sequence_length),
    name='Input'
)
Reshaping layer to add channel axis into input data:
x = Reshape(
    target_shape=(feature_vector_length, sequence_length, 1),
    name='Input_Reshape'
)(input_layer)
Output shape (sequence, frequency, time, channel): (None, 40, 500, 1)
Two convolutional layer groups, each consisting of:
1) Convolution to capture context and extract high-level features: kernel size 5x5, 64 filters
x = Conv2D(
    filters=64,
    kernel_size=(5, 5),
    activation='linear',
    padding='same',
    data_format='channels_last',
    name='Conv1'
)(x)
2) Batch normalization to enable higher learning rates
x = BatchNormalization(
    axis=-1,
    name='Conv1_BatchNorm'
)(x)
3) Activation (ReLU) to introduce non-linearity
x = Activation(
    activation='relu', name='Conv1_Activation'
)(x)
4) Max pooling (2D) to downsample and retain the dominant features
x = MaxPooling2D(
    pool_size=(2, 4), name='Conv1_Pooling'
)(x)
5) Dropout to avoid overfitting
x = Dropout(
    rate=0.2, name='Conv1_DropOut'
)(x)
Output shape of CNN layer group 1 (sequence, frequency, time, feature): (None, 20, 125, 64)
Second convolutional layer group:
x = Conv2D(filters=64, kernel_size=(5, 5), activation='linear', padding='same', data_format='channels_last', name='Conv2')(x)
x = BatchNormalization(axis=-1, name='Conv2_BatchNorm')(x)
x = Activation(activation='relu', name='Conv2_Activation')(x)
x = MaxPooling2D(pool_size=(2, 2), name='Conv2_Pooling')(x)
x = Dropout(rate=0.2, name='Conv2_DropOut')(x)
Output shape of CNN layer group 2 (sequence, frequency, time, feature): (None, 10, 62, 64)
Global max pooling is applied to the output of the last convolutional layer group to summarize it into a single feature vector:
x = GlobalMaxPooling2D(
    data_format='channels_last',
    name='GlobalPooling'
)(x)
Output shape (sequence, feature): (None, 64)
The output layer is a fully-connected layer with softmax activation:
output_layer = Dense(
    units=len(db.scene_labels()),
    activation='softmax',
    name='Output'
)(x)
Output shape (sequence, class): (None, 10)
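The layer graph defined above still needs to be wrapped into a Keras model before compiling; a minimal sketch:

import keras

# Build the model from the functional layer graph
model = keras.models.Model(inputs=input_layer, outputs=output_layer)

# Print a layer-by-layer summary of output shapes and parameter counts
model.summary()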
Key parameters:

- Loss: categorical_crossentropy
- Metric: categorical_accuracy
model.compile(
    loss='categorical_crossentropy',
    metrics=['categorical_accuracy'],
    optimizer=keras.optimizers.Adam(learning_rate=0.001, decay=0.001)
)
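The training call below uses callback_list, which is not defined in this section. A hypothetical example of what it might contain, using standard Keras callbacks (the callbacks actually used in the tutorial may differ):

callback_list = [
    # Stop training early if the validation loss stops improving (assumed example)
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=10),
    # Keep the model weights giving the best validation loss (assumed example)
    keras.callbacks.ModelCheckpoint(
        filepath='best_model.h5', monitor='val_loss', save_best_only=True
    )
]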
import os, time
from codecarbon import EmissionsTracker

# Track power consumption during the training
tracker = EmissionsTracker(
    "Sound Classification Tutorial",
    output_dir=os.path.join('data', 'training_codecarbon')
)
tracker.start()

# Track time
start_time = time.time()

# Start training process
history = model.fit(
    x=X_train, y=Y_train,
    validation_data=(X_validation, Y_validation),
    callbacks=callback_list,
    verbose=0,
    epochs=100,
    batch_size=16
)
# Stop tracking
stop_time = time.time()
tracker.stop()
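The elapsed time can then be reported directly, while codecarbon writes its energy and emission estimates into the configured output_dir (by default an emissions.csv file). A short sketch:

# Report elapsed wall-clock time for the training run
print('Training time: {:.1f} min'.format((stop_time - start_time) / 60.0))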
Training
  Loss                              Metric
  categorical_crossentropy          categorical_accuracy
  Epoch     Train       Val         Train       Val
  -------   -------     -------     -------     -------
  1         1.5075      1.6670      0.4557      0.3602
  2         1.3583      1.4756      0.4981      0.4542
  3         1.2583      1.4834      0.5467      0.4814
  4         1.1897      1.4104      0.5622      0.4306
  5         1.1327      1.3432      0.5936      0.4628
  6         1.0801      1.9897      0.6101      0.3099
  7         1.0256      1.2844      0.6352      0.5231
  8         0.9925      1.4193      0.6432      0.4527
  9         0.9589      1.3612      0.6589      0.4899
  10        0.9415      1.3491      0.6626      0.4643
  11        0.9026      1.2894      0.6802      0.5040
  12        0.8873      1.4195      0.6848      0.4437
  13        0.8590      1.5074      0.6971      0.4366
  14        0.8474      1.1322      0.6971      0.5568
  15        0.8359      1.3068      0.7042      0.5367
  16        0.8178      1.2733      0.7083      0.5392
  17        0.7952      1.1023      0.7170      0.5805
  18        0.7781      1.2346      0.7339      0.5161
  19        0.7725      1.0566      0.7308      0.5850
  20        0.7717      1.3855      0.7281      0.5065
  21        0.7492      1.1186      0.7472      0.5830
  22        0.7438      1.1939      0.7436      0.5418
  23        0.7258      1.0572      0.7511      0.5946
  24        0.7284      1.0736      0.7446      0.5800
  25        0.7125      1.0651      0.7559      0.5805
  26        0.7073      1.1462      0.7535      0.5629
  27        0.7035      1.3761      0.7574      0.4764
  28        0.6953      1.1581      0.7511      0.5392
  29        0.6875      1.0682      0.7651      0.5936
  30        0.6741      1.0484      0.7721      0.6056
  31        0.6759      1.3890      0.7683      0.5020
  32        0.6818      1.1066      0.7663      0.5830
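The same curves can be plotted from the history object returned by model.fit; a sketch using matplotlib (metric keys follow standard Keras naming):

import matplotlib.pyplot as plt

# Training and validation loss per epoch
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('categorical_crossentropy')
plt.legend()
plt.show()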
Extract features for test item:
# Get test item
item = db.test(fold=1)[0]

# Extract features
features = mel_extractor.extract(
    dcase_util.containers.AudioContainer().load(filename=item.filename, mono=True)
)[:, :sequence_length]
Reshape the matrix to match the model input:
input_data = numpy.expand_dims(features, 0)
Feed input data into the model to get probabilities for each scene class:
probabilities = model.predict(x=input_data)
Classify by selecting the class giving maximum output:
frame_decisions = dcase_util.data.ProbabilityEncoder().binarization(
    probabilities=probabilities.T,
    binarization_type='frame_max'
).T
Scene label:
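The winning class can be mapped back to a scene label, for example (a short sketch using the scene_labels list defined earlier):

# Scene label corresponding to the class with the highest probability
estimated_label = scene_labels[int(probabilities.argmax())]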
import sklearn.metrics

# Get confusion matrix with counts
confusion_matrix = sklearn.metrics.confusion_matrix(y_true, y_pred)

# Transform matrix into percentages, normalize row-wise
conf = confusion_matrix * 100.0 / confusion_matrix.sum(axis=1)[:, numpy.newaxis]

# Fetch class-wise accuracies from diagonal
class_wise_accuracies = numpy.diag(conf)

# Calculate overall accuracy
macro_averaged_accuracy = numpy.mean(class_wise_accuracies)
Macro-averaged accuracy: 61.6 %
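The y_true and y_pred lists used above contain reference and predicted scene labels over the whole test set; a minimal sketch of how they could be collected with the pipeline defined earlier:

y_true = []
y_pred = []
for item in db.test(fold=1):
    # Extract features and add the batch axis expected by the model
    audio = dcase_util.containers.AudioContainer().load(filename=item.filename, mono=True)
    features = mel_extractor.extract(audio.data)[:, :sequence_length]
    probabilities = model.predict(x=numpy.expand_dims(features, 0))

    # Reference and estimated scene labels
    y_true.append(item.scene_label)
    y_pred.append(scene_labels[int(probabilities.argmax())])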
Scene label | Accuracy (%) |
---|---|
airport | 69.1 |
bus | 56.2 |
metro | 62.5 |
metro_station | 35.5 |
park | 85.5 |
public_square | 53.2 |
shopping_mall | 64.5 |
street_pedestrian | 44.5 |
street_traffic | 81.3 |
tram | 64.0 |
Average | 61.6 |