Collecting and Preprocessing Music Data

By Bill Sharlow

Day 2: Building an AI-Powered Music Composer

Welcome back to our journey of building an AI-powered music composer! Today, we’ll dive deeper into the process of collecting and preprocessing music data, laying the foundation for training our AI model to generate captivating musical compositions.

Collecting Music Data

Before we can train our AI model, we need a diverse and representative dataset of music compositions to learn from. Here are some sources where you can collect music data:

  1. Online Databases: Explore online repositories and databases of MIDI files, such as the MIDI Archive or MuseScore, where you can find a wide range of musical compositions across various genres and styles.
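  2. Music APIs: Use music APIs such as Spotify or SoundCloud to access a large catalog of audio tracks and their metadata for analysis and modeling. The Echo Nest API, which older tutorials often recommend, was retired after Spotify acquired the company, so check what each service currently offers before building on it.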
  3. Personal Collection: If you’re a musician or composer yourself, consider using your own collection of MIDI files or recordings to train the AI model on your unique musical style and compositions.
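
If you go the personal-collection route, the first practical step is simply finding the files. Here’s a minimal sketch that assumes your MIDI files sit somewhere under a single folder and use the .mid or .midi extension. The helper name find_midi_files is our own and not part of any library:

```python
from pathlib import Path


def find_midi_files(root_dir):
    """Return a sorted list of MIDI file paths found anywhere under root_dir."""
    root = Path(root_dir)
    # Match both common extensions, ignoring case (e.g. SONG.MID).
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in {".mid", ".midi"}
    )
```

Searching recursively means you can keep your collection organized in subfolders by genre or composer without changing the code.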

Preprocessing Music Data

Once you’ve collected the music data, it’s essential to preprocess it into a format suitable for training machine learning models. Here’s a basic preprocessing pipeline for MIDI data:

  1. Loading MIDI Files: Use MIDI processing libraries like mido or pretty_midi to load MIDI files and extract musical information such as note pitch, duration, velocity, and timing.
  2. Sequence Alignment: Bring the sequences from different MIDI files onto a common footing, for example by expressing timing on a shared grid and by padding or truncating to a common length. This matters most when the dataset contains compositions of varying lengths or tempos.
  3. Feature Extraction: Extract relevant features from the MIDI data, such as note sequences, chord progressions, and rhythm patterns, to represent the musical content in a machine-readable format.
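
The three steps above can be sketched without committing to a particular library. The code below assumes the note events have already been read (for example, from mido messages) into plain (type, pitch, velocity, delta_seconds) tuples. That tuple layout, the function names, and the 0.125-second grid are illustrative assumptions, not part of mido or pretty_midi:

```python
def pair_notes(events):
    """Turn note_on/note_off events into (pitch, start, duration, velocity) notes.

    events: iterable of (type, pitch, velocity, delta_seconds) tuples, where
    delta_seconds is the time elapsed since the previous event.
    """
    now = 0.0
    active = {}  # pitch -> (start_time, velocity) for notes still sounding
    notes = []
    for kind, pitch, velocity, delta in events:
        now += delta
        if kind == "note_on" and velocity > 0:
            active[pitch] = (now, velocity)
        elif kind in ("note_on", "note_off") and pitch in active:
            # A note_on with velocity 0 is treated as a note_off.
            start, vel = active.pop(pitch)
            notes.append((pitch, start, now - start, vel))
    return sorted(notes, key=lambda n: (n[1], n[0]))


def quantize(notes, step=0.125):
    """Snap start times and durations to a shared grid, measured in steps."""
    return [
        (pitch, round(start / step), max(1, round(duration / step)), vel)
        for pitch, start, duration, vel in notes
    ]


def pitch_intervals(notes):
    """Simple feature: semitone differences between consecutive notes."""
    pitches = [n[0] for n in notes]
    return [b - a for a, b in zip(pitches, pitches[1:])]


# Example: middle C for half a second, then the E above it for a quarter second
events = [
    ("note_on", 60, 64, 0.0),
    ("note_off", 60, 0, 0.5),
    ("note_on", 64, 80, 0.0),
    ("note_on", 64, 0, 0.25),
]
grid_notes = quantize(pair_notes(events))
print(grid_notes)                   # [(60, 0, 4, 64), (64, 4, 2, 80)]
print(pitch_intervals(grid_notes))  # [4]
```

Intervals are only one possible feature. Chord progressions and rhythm patterns would need their own extraction logic on top of the same note list.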

Example Code: MIDI Data Preprocessing

Let’s continue our example code from the previous day and expand it to preprocess MIDI data by extracting note sequences:

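import os

import mido
import numpy as np

def process_midi_file(file_path):
    """Return the pitches of all sounding notes in a MIDI file as a NumPy array."""
    midi_file = mido.MidiFile(file_path)
    notes = []

    for msg in midi_file:
        # A note_on with velocity 0 acts as a note-off, so skip it
        if msg.type == 'note_on' and msg.velocity > 0:
            notes.append(msg.note)

    return np.array(notes)

def extract_note_sequences(data_dir):
    """Return one note sequence per MIDI file found in data_dir."""
    note_sequences = []

    # Iterate over MIDI files in the directory (sorted for a stable order)
    for file_name in sorted(os.listdir(data_dir)):
        if file_name.lower().endswith(('.mid', '.midi')):
            file_path = os.path.join(data_dir, file_name)
            notes = process_midi_file(file_path)
            # Store this file's notes as one sequence
            note_sequences.append(notes)

    return note_sequences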

# Example usage
data_dir = 'path/to/your/midi/data'
note_sequences = extract_note_sequences(data_dir)
print("Extracted Note Sequences:", note_sequences)

In this code snippet, we define a function extract_note_sequences to iterate over MIDI files in a directory, extract note sequences using the process_midi_file function from the previous day, and store them in a list.
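
Because the extracted sequences will differ in length, a common follow-up is to slice them into fixed-length windows. Here’s a minimal sketch, assuming each sequence is a flat list or array of MIDI pitch numbers. The name make_windows and the default window length are hypothetical choices for illustration:

```python
def make_windows(note_sequences, seq_len=4):
    """Slice each sequence into (input window, next note) training pairs."""
    inputs, targets = [], []
    for seq in note_sequences:
        seq = list(seq)  # accepts plain lists or NumPy arrays
        # Sequences shorter than seq_len + 1 yield no pairs and are skipped.
        for i in range(len(seq) - seq_len):
            inputs.append(seq[i:i + seq_len])
            targets.append(seq[i + seq_len])
    return inputs, targets
```

Each input window is paired with the note that follows it, which is one simple way to frame the data for a model that predicts the next note.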


In today’s blog post, we’ve delved into the crucial steps of collecting and preprocessing music data for training our AI-powered music composer. By sourcing diverse music datasets and converting them into a machine-readable format, we’ve set the stage for training our AI model to generate original musical compositions.

In the next blog post, we’ll explore the implementation of recurrent neural networks (RNNs) for music generation, taking our project one step closer to fruition. Stay tuned for more exciting developments in our AI music composition journey!

If you have any questions or thoughts, feel free to share them in the comments section below!
