Run pipeline on a new corpus#

You can set up our data extraction and analysis pipeline to run on your own corpus of musical recordings.

Setting up#

First things first, we’d recommend that you clone our repository, so that you can see our example corpus files:

git clone https://github.com/HuwCheston/Jazz-Trio-Database

The corpus files will then be located inside .\references, with the file extension .xlsx (i.e., they should be opened using Microsoft Excel or a similar spreadsheet package).

Corpus structure#

Each corpus file is a multipage spreadsheet, where one page corresponds to one musical ensemble. The structure for each page, however, is always the same. Every row should correspond to an individual recording: each column should contain some metadata about the recording, most of which was compiled by scraping the MusicBrainz database. See below for a description of these columns:

Column description

Field	Description
`recording_title`	Title of the recording
`release_title`	Title of the earliest album released that contains the track
`recording_date_estimate`	Estimated date of recording
`piano`, `bass`, `drums`	Names of the musicians playing the corresponding instruments
`recording_position`	Track number of the recording on `release_title`
`recording_length`	Duration of the recording, in format `%M:%S`
`channel_overrides`	Key-value pairs relating to panning: `piano: l` means the piano is panned to the left channel
`recording_id_for_lbz`	The estimated URL for the recording on ListenBrainz
`link`	The `recording_id_for_lbz` column, as a clickable link
`is_acceptable(Y/N)`	Whether or not the track meets the inclusion criteria and should be processed (see below)
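
As an illustration of the `recording_length` format, here is a minimal sketch of converting a `%M:%S` string into seconds. The function name `length_to_seconds` is hypothetical and is not part of the repository. It splits the string manually rather than using `datetime.strptime`, because `%M` only accepts values up to 59 and so would reject tracks longer than an hour.

```python
def length_to_seconds(value: str) -> int:
    """Convert a `%M:%S` string such as '05:32' into a number of seconds.

    Hypothetical helper for illustration only; not part of the pipeline.
    """
    minutes_text, sep, seconds_text = value.strip().partition(":")
    if not sep:
        raise ValueError(f"Expected format %M:%S, got {value!r}")
    minutes, seconds = int(minutes_text), int(seconds_text)
    # Minutes may exceed 59 for long recordings, but seconds may not
    if minutes < 0 or not 0 <= seconds < 60:
        raise ValueError(f"Expected format %M:%S, got {value!r}")
    return minutes * 60 + seconds


print(length_to_seconds("05:32"))
```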

The following columns are only relevant when is_acceptable(Y/N) == Y (see below for when this applies):

Field	Description
`start_timestamp`	The beginning of the excerpt to process
`end_timestamp`	The end of the excerpt to process
`youtube_link`	A stable link to the performance on YouTube, to be ripped with `yt_dlp`
`channel_overrides`	Whether to use mono left-right outputs for individual instruments: in format `instrument: channel, instrument: channel`, e.g. `piano: l, bass: r`
`time_signature`	The number of quarter note beats per measure for the track
`first_downbeat`	The first clear quarter note downbeat in the track
`notes`	Written comments about the track
`rating_audio_X`	Subjective rating (1–3, 3 = best) of source-separation quality for instrument `X`
`rating_detection_X`	Subjective rating of onset detection quality for instrument `X`
`rating_comments`	Comments given by the rater when making subjective judgements
`has_annotations`	Whether the track has ground truth annotation files: see Create new manual annotations
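
To make the `channel_overrides` format concrete, the sketch below parses a cell such as `piano: l, bass: r` into a dictionary. It is only an illustration of the format described in the table and not necessarily how the pipeline itself reads the column. The function name and the sets of accepted instruments and channels (`l` and `r`) are assumptions inferred from the examples above.

```python
# Assumed vocabularies, inferred from the column descriptions above
VALID_INSTRUMENTS = {"piano", "bass", "drums"}
VALID_CHANNELS = {"l", "r"}


def parse_channel_overrides(cell):
    """Parse 'piano: l, bass: r' into {'piano': 'l', 'bass': 'r'}.

    Hypothetical helper for illustration only; not part of the pipeline.
    """
    overrides = {}
    if not cell or not cell.strip():
        # An empty cell means that no overrides were given
        return overrides
    for pair in cell.split(","):
        instrument, sep, channel = pair.partition(":")
        instrument = instrument.strip().lower()
        channel = channel.strip().lower()
        if not sep or instrument not in VALID_INSTRUMENTS or channel not in VALID_CHANNELS:
            raise ValueError(f"Malformed channel override: {pair.strip()!r}")
        overrides[instrument] = channel
    return overrides


print(parse_channel_overrides("piano: l, bass: r"))
```

Running a check like this over the column before starting the pipeline may help catch typos (for example `piano; l`) early.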

Selecting tracks for inclusion#

For each track (row), you’ll then need to decide whether it should be processed: listen to it, and set is_acceptable(Y/N) to Y if it is appropriate. Appropriate tracks should also have additional metadata filled in: see the second table of column descriptions, above.
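
Before running the pipeline, it can be useful to check that every track marked `Y` has its additional metadata filled in. The sketch below does this for rows represented as plain dictionaries, standing in for spreadsheet rows. The function name is hypothetical, and the list of columns treated as mandatory is an assumption made for illustration: adjust it to whatever your project requires.

```python
# Assumption: these columns are treated as mandatory for acceptable tracks
REQUIRED_WHEN_ACCEPTABLE = (
    "start_timestamp",
    "end_timestamp",
    "youtube_link",
    "time_signature",
    "first_downbeat",
)


def missing_metadata(row):
    """Return the names of required columns that are empty for an acceptable track.

    Hypothetical helper for illustration only; not part of the pipeline.
    Tracks that are not marked 'Y' are skipped and return an empty list.
    """
    flag = str(row.get("is_acceptable(Y/N)", "")).strip().upper()
    if flag != "Y":
        return []
    return [name for name in REQUIRED_WHEN_ACCEPTABLE if row.get(name) in (None, "")]


# Made-up example rows, for demonstration only
rows = [
    {"recording_title": "Example A", "is_acceptable(Y/N)": "N"},
    {"recording_title": "Example B", "is_acceptable(Y/N)": "Y", "time_signature": 4},
]
for row in rows:
    print(row["recording_title"], missing_metadata(row))
```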

The decision for whether to include a track should be made according to the requirements of your project. When creating our corpus files, we followed the below criteria:

Reading the corpus in Python#

We’ve provided a helper class which will convert an .xlsx corpus file into the correct format needed for our pipeline. This is located inside .\src\utils.py as CorpusMaker. Create an instance using the .from_excel() constructor, passing in the filename of the corpus, as follows:

from src import utils

corpus = utils.CorpusMaker.from_excel(fname='...')

If the above command works without errors, you should be good to go. The dataframe of tracks (equivalent to the original spreadsheet) is available through the tracks attribute of the resulting utils.CorpusMaker instance (type: pd.DataFrame).

Tip

By default, tracks which did not pass the selection criteria (i.e., those where is_acceptable(Y/N) != "Y") will not be processed by utils.CorpusMaker. To disable this filtering and include all tracks in the output, pass keep_all_tracks=True when calling utils.CorpusMaker.from_excel().

Using the corpus with our pipeline#

Once you’ve confirmed that the corpus can be accessed in Python, ensure that the corpus file is present inside the .\references directory. Then, when running either .\src\clean\make_dataset.py or .\src\detect\detect_onsets.py, pass the -corpus flag, followed by the filename, to direct the program to use your corpus file instead of the defaults. For more information, see the respective sections on this page.
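
The two steps above (checking that the file is in place, then passing `-corpus`) can be sketched in Python as follows. The helper `build_pipeline_command` and its parameters are hypothetical. Whether the flag expects the filename with or without its extension is not stated here, so the optional `flag_value` argument lets you pass whichever form the script accepts.

```python
import sys
from pathlib import Path


def build_pipeline_command(repo_root, corpus_fname,
                           script="src/clean/make_dataset.py", flag_value=None):
    """Check that the corpus file exists, then build the command to run a script.

    Hypothetical helper for illustration only; not part of the pipeline.
    """
    repo_root = Path(repo_root)
    corpus_path = repo_root / "references" / corpus_fname
    if not corpus_path.is_file():
        raise FileNotFoundError(f"Expected corpus file at {corpus_path}")
    # Assumption: by default, the value after -corpus is the filename as given
    value = flag_value if flag_value is not None else corpus_fname
    return [sys.executable, str(repo_root / script), "-corpus", value]
```

The returned list could then be passed to `subprocess.run(cmd, check=True)` from the root of the cloned repository. To target onset detection instead, pass `script="src/detect/detect_onsets.py"`.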