Run pipeline on a new corpus#
You can set up our data extraction and analysis pipeline to run on your own corpus of musical recordings.
Setting up#
First things first, we’d recommend that you clone our repository, so that you can see our example corpus files:
git clone https://github.com/HuwCheston/Jazz-Trio-Database
The corpus files will then be located inside .\references, with the file extension .xlsx (i.e., they should be opened using Microsoft Excel or a similar spreadsheet package).
Corpus structure#
Each corpus file is a multipage spreadsheet, where one page corresponds to one musical ensemble. The structure for each page, however, is always the same. Every row should correspond to an individual recording: each column should contain some metadata about the recording, most of which was compiled by scraping the MusicBrainz database. See below for a description of these columns:
Column description
| Field | Description |
|---|---|
| | Title of the recording |
| | Title of the earliest album released that contains the track |
| | Estimated date of recording |
| | Names of the musicians playing the corresponding instruments |
| | Track number of the recording on |
| | Duration of the recording, in format |
| | Key-value pairs relating to panning: |
| | The estimated URL for the recording on ListenBrainz |
| | The |
| | Whether or not the track meets the inclusion criteria and should be processed (see below) |
The following columns are only relevant when is_acceptable(Y/N) == Y (see below for descriptions of where this might be the case)
| Field | Description |
|---|---|
| | The beginning of the excerpt to process |
| | The end of the excerpt to process |
| | A stable link to the performance on YouTube, to be ripped with |
| | Whether to use mono left-right outputs for individual instruments: in format |
| | The number of quarter note beats per measure for the track |
| | The first clear quarter note downbeat in the track |
| | Written comments about the track |
| | Subjective rating (1–3, 3 = best) of source-separation quality for instrument |
| | Subjective rating of onset detection quality for instrument |
| | Comments given by the rater when making subjective judgements |
| | Whether the track has ground truth annotation files: see Create new manual annotations |
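Because each page of a corpus file corresponds to one ensemble, listing the page names is a quick way to check what a file contains before editing it. The following is a minimal sketch using only the Python standard library; the function name `list_sheet_names` is our own and is not part of the repository, and it assumes a standard .xlsx workbook (a zip archive containing `xl/workbook.xml`).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by workbook.xml in standard (transitional) .xlsx files
SHEET_NS = {"main": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def list_sheet_names(xlsx_file):
    """Return the page (sheet) names of an .xlsx workbook, in workbook order.

    `xlsx_file` may be a path or a binary file-like object.
    Hypothetical helper: not part of the repository.
    """
    with zipfile.ZipFile(xlsx_file) as archive:
        workbook = ET.fromstring(archive.read("xl/workbook.xml"))
    # Each <sheet> element inside <sheets> carries the page name
    sheets = workbook.findall("main:sheets/main:sheet", SHEET_NS)
    return [sheet.get("name") for sheet in sheets]
```

Pass the path to one of the files inside .\references to get a list with one entry per ensemble.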
Selecting tracks for inclusion#
For each track (row), you’ll then need to listen to it and decide whether it should be processed, setting is_acceptable(Y/N) to Y for appropriate tracks. Appropriate tracks should also have additional metadata: see the column descriptions above.
The decision for whether to include a track should be made according to the requirements of your project. When creating our corpus files, we followed the below criteria:
Inclusion criteria
Instruments:
Piano, bass, drums (i.e., Piano trio) only
Piano:
Bass:
Must be acoustic/upright bass, not electric (listen out for sound of fingers ‘slapping’ strings)
Drums:
No auxiliary percussion (congas, shakers) - whether played by the drummer or someone else
No mallets
No brushes
Tempo:
Medium to Up, i.e., approx 100 to 300 beats-per-minute
Feel:
Swung quavers only
No ballads
No ‘straight 8s’: e.g. latin, afro-cuban, rock
No material that switches quickly between non-swing and swing feels (and back) during solos
Section:
Piano solo only: everybody needs to be improvising!
In recordings with two piano solos, choose the first
Avoid material with solo breaks interspersed throughout ensemble playing
Material with a break leading into the solo is OK, as long as the solo then continues uninterrupted
Quality:
Avoid obviously bootlegged examples (e.g. recorded by concert attendees)
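If you want to apply criteria like these consistently across raters, it can help to record your listening notes in a structured form and check them programmatically. The sketch below is illustrative only: the `ListeningNotes` class, its field names, and `failed_criteria` are hypothetical and not part of the pipeline, and the "Section" criteria (which concern where the excerpt starts and ends) are left to the listener.

```python
from dataclasses import dataclass


@dataclass
class ListeningNotes:
    """Hypothetical record of what a listener noted about one track."""
    instruments: tuple        # e.g. ("piano", "bass", "drums")
    tempo_bpm: float          # approximate tempo in beats per minute
    swung: bool               # swung quavers throughout, no feel changes
    acoustic_bass: bool       # upright bass, not electric
    aux_percussion: bool      # congas, shakers, etc.
    mallets_or_brushes: bool  # drummer uses mallets or brushes
    bootleg: bool             # obviously bootlegged recording


def failed_criteria(notes):
    """Return the names of the criteria a track fails (empty list = include)."""
    checks = {
        "instruments": sorted(notes.instruments) == ["bass", "drums", "piano"],
        "bass": notes.acoustic_bass,
        "drums": not (notes.aux_percussion or notes.mallets_or_brushes),
        "tempo": 100 <= notes.tempo_bpm <= 300,
        "feel": notes.swung,
        "quality": not notes.bootleg,
    }
    return [name for name, passed in checks.items() if not passed]
```

Returning the list of failed criteria, rather than a single yes/no, makes it easier to fill in the written comments column when a track is rejected.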
Reading the corpus in Python#
We’ve provided a helper class which will convert a .xlsx corpus file into the correct format needed for our pipeline. This is located inside .\src\utils.py as CorpusMaker. Call the class with the .from_excel() constructor, passing in the filename of the corpus, as follows:
from src import utils
corpus = utils.CorpusMaker.from_excel(fname='...')
If the above command works without errors, you should be good to go. You can access the dataframe of tracks (equivalent to the original spreadsheet) through the tracks attribute of the resulting utils.CorpusMaker object (type: pd.DataFrame).
Tip
By default, tracks which did not pass the selection criteria (i.e., those where is_acceptable(Y/N) != "Y") will not be processed by utils.CorpusMaker. To disable this filtering and include all tracks in the output, pass keep_all_tracks=True when calling utils.CorpusMaker.from_excel().
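To make the effect of keep_all_tracks concrete, here is a rough sketch of the filtering behaviour described in the tip, written against plain dictionaries rather than a dataframe. It is an assumption about the behaviour, not the actual implementation inside utils.CorpusMaker; the function `select_tracks` and the `name` key used in the usage note are hypothetical.

```python
ACCEPT_COLUMN = "is_acceptable(Y/N)"


def select_tracks(rows, keep_all_tracks=False):
    """Keep only rows marked "Y", unless keep_all_tracks is set.

    `rows` is an iterable of dicts, one per spreadsheet row.
    Hypothetical helper illustrating the documented behaviour only.
    """
    if keep_all_tracks:
        return list(rows)
    # Rows with a missing or non-"Y" value are treated as rejected
    return [row for row in rows if row.get(ACCEPT_COLUMN) == "Y"]
```

For example, given rows `[{"name": "a", "is_acceptable(Y/N)": "Y"}, {"name": "b", "is_acceptable(Y/N)": "N"}]`, the default call keeps only the first row, while `keep_all_tracks=True` keeps both.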
Using the corpus with our pipeline#
Once you’ve confirmed that the corpus can be accessed in Python, ensure that the corpus file is present inside the .\references directory. Then, when running either .\src\clean\make_dataset.py or .\src\detect\detect_onsets.py, pass the -corpus flag, followed by the filename, to direct the program to use your corpus file instead of the defaults. For more information, see the respective sections on this page.
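If you prefer to launch the scripts from Python, the command can be assembled as in the sketch below. Only the script paths and the -corpus flag come from this page; the helper `build_pipeline_command` and the corpus name `my_corpus` are hypothetical, and we assume the scripts are run from the repository root with the current Python interpreter. Check the scripts themselves for whether the filename should include the .xlsx extension.

```python
import sys
from pathlib import Path


def build_pipeline_command(script, corpus_fname):
    """Build the argument list for running a pipeline script on a custom corpus.

    Hypothetical helper: assumes the script accepts `-corpus <filename>`.
    """
    return [sys.executable, str(Path(script)), "-corpus", corpus_fname]


# "my_corpus" is a placeholder for your own corpus filename
cmd = build_pipeline_command(Path("src", "clean", "make_dataset.py"), "my_corpus")

# To actually run it (from the repository root):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list, rather than a single string, avoids shell-quoting problems if your corpus filename contains spaces.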