# Run pipeline on a new corpus
(new-corpus)=

You can set up our data extraction and analysis pipeline to run on your own corpus of musical recordings.

## Setting up

First, we recommend cloning our repository so that you can see our example corpus files:

```
git clone https://github.com/HuwCheston/Jazz-Trio-Database
```

The corpus files will then be located inside `.\references`, with the file extension `.xlsx` (i.e., they should be opened using Microsoft Excel or a similar spreadsheet package).

## Corpus structure

Each corpus file is a multipage spreadsheet, where one page corresponds to one musical ensemble. The structure of each page, however, is always the same. Every row should correspond to an individual recording, and each column should contain some metadata about that recording, most of which was compiled by scraping the [MusicBrainz database](https://musicbrainz.org/doc/MusicBrainz_Database). See below for a description of these columns:

:::{dropdown} Column description

| Field                     | Description                                                                                   |
|---------------------------|-----------------------------------------------------------------------------------------------|
| `recording_title`         | Title of the recording                                                                        |
| `release_title`           | Title of the earliest album released that contains the track                                  |
| `recording_date_estimate` | Estimated date of recording                                                                   |
| `piano`, `bass`, `drums`  | Names of the musicians playing the corresponding instruments                                  |
| `recording_position`      | Track number of the recording on `release_title`                                              |
| `recording_length`        | Duration of the recording, in format `%M:%S`                                                  |
| `channel_overrides`       | Key-value pairs relating to panning: `piano: l` means the piano is panned to the left channel |
| `recording_id_for_lbz`    | The estimated URL for the recording on [ListenBrainz](https://listenbrainz.org/)              |
| `link`                    | The `recording_id_for_lbz` column, as a clickable link                                        |
| `is_acceptable(Y/N)`      | Whether or not the track meets the inclusion criteria and should be processed (see below)     |

The following columns are only relevant when `is_acceptable(Y/N) == Y` (see below for descriptions of where this might be the case)

| Field                | Description                                                                                                                                       |
|----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| `start_timestamp`    | The beginning of the excerpt to process                                                                                                           |
| `end_timestamp`      | The end of the excerpt to process                                                                                                                 |
| `youtube_link`       | A stable link to the performance on YouTube, to be ripped with `yt_dlp`                                                                           |
| `channel_overrides`  | Whether to use mono left-right outputs for individual instruments: in format `instrument: channel, instrument: channel`, i.e. `piano: l, bass: r` |
| `time_signature`     | The number of quarter note beats per measure for the track                                                                                        |
| `first_downbeat`     | The first clear quarter note downbeat in the track                                                                                                |
| `notes`              | Written comments about the track                                                                                                                  |
| `rating_audio_X`     | Subjective rating (1–3, 3 = best) of source-separation quality for instrument `X`                                                                 |
| `rating_detection_X` | Subjective rating of onset detection quality for instrument `X`                                                                                   |
| `rating_comments`    | Comments given by the rater when making subjective judgements                                                                                     |
| `has_annotations`    | Whether the track has ground truth annotation files: see {ref}`Create new manual annotations `                                                    |

:::

## Selecting tracks for inclusion

For each track (row), you'll then need to listen to it and decide whether it should be processed, setting `is_acceptable(Y/N)` to `Y` for acceptable tracks. Acceptable tracks also need additional metadata: see the dropdown above.

The decision for whether to include a track should be made according to the requirements of your project.
When creating our corpus files, we followed the criteria below:

:::{dropdown} Inclusion criteria

***Instruments:***
- Piano, bass, drums (i.e., Piano trio) only
- Piano:
  - No [electric/synthesiser/rhodes piano](https://www.youtube.com/watch?v=d1GQZLEnXFs)
- Bass:
  - Must be acoustic/upright bass, [not electric (listen out for sound of fingers 'slapping' strings)](https://www.youtube.com/watch?v=N5uuC0x5JPk)
  - No [use of the bow](https://www.youtube.com/watch?v=r9D7zdJFLp0)
- Drums:
  - No [auxiliary percussion (congas, shakers)](https://www.youtube.com/watch?v=4vFSxGhV29M), whether played by the drummer or someone else
  - No [mallets](https://www.youtube.com/watch?v=Xl-nblp_SQs)
  - No [brushes](https://www.youtube.com/watch?v=Lr5RiPvxzBQ)

***Tempo:***
- Medium to Up, i.e., approx 100 to 300 beats-per-minute

***Feel:***
- Swung quavers only
- No [ballads](https://www.youtube.com/watch?v=a2LFVWB)
- No ['straight 8s'](https://www.youtube.com/watch?v=DiQagjy5INI): e.g. latin, afro-cuban, rock
- No [rock-y/rnb-y style straight 8s stuff](https://www.youtube.com/watch?v=y7dpXzDR4Ug)
- No ['free' playing](https://www.youtube.com/watch?v=v9mV_1WSNTw)
- No material that quickly [changes from non-swing to swing feels](https://www.youtube.com/watch?v=L34b0ut8Loc) (and back) during solos

***Section:***
- Piano solo only: everybody needs to be improvising!
- In recordings with [two piano solos](https://www.youtube.com/watch?v=jYQupyUOYpo), choose the first
- Avoid material with [solo breaks interspersed](https://www.youtube.com/watch?v=PrEcT2Q51lw) throughout ensemble playing
- Material with a break leading into the solo is ok, as long as the solo then continues uninterrupted

***Quality:***
- Avoid [obviously bootlegged examples](https://www.youtube.com/watch?v=PFqhZ63PtVY) (e.g. recorded by attendees at concerts)

:::

## Reading the corpus in Python

We've provided a helper class which will convert a `.xlsx` corpus file into the correct format needed for our pipeline.
This is located inside `.\src\utils.py` as `CorpusMaker`. Call the class with the `.from_excel()` constructor, passing in the filename of the corpus, as follows:

```
from src import utils
corpus = utils.CorpusMaker.from_excel(fname='...')
```

If the above command works without errors, you should be good to go. You can access the dataframe of tracks (equivalent to the original spreadsheet) through the `tracks` attribute of the resulting `utils.CorpusMaker` object (type: `pd.DataFrame`).

```{tip}
By default, tracks which did not pass the selection criteria (i.e., those where `is_acceptable(Y/N) != "Y"`) will not be processed by `utils.CorpusMaker`. To disable this behaviour and include all tracks in the output, pass `keep_all_tracks=True` when calling `utils.CorpusMaker.from_excel()`.
```

## Using the corpus with our pipeline

Once you've confirmed that the corpus can be accessed in Python, ensure that the corpus file is present inside the `.\references` directory. Then, when running either `.\src\clean\make_dataset.py` or `.\src\detect\detect_onsets.py`, pass the `-corpus` flag, followed by the filename of your corpus, to direct the program to use your file instead of the defaults.

For more information, see {ref}`the respective sections on this page `.
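
Before handing a new corpus file to the pipeline, it can be worth sanity-checking a few of the free-text columns described in the column description above. The following is a minimal sketch only: the helper names (`parse_channel_overrides`, `length_to_seconds`, `is_acceptable`) are hypothetical and are not part of `src.utils`, and the sketch assumes that channels are always written as `l` or `r` and that a row is available as a plain dictionary keyed by column name.

```python
VALID_INSTRUMENTS = ("piano", "bass", "drums")
VALID_CHANNELS = ("l", "r")  # assumption: only left/right are ever used


def parse_channel_overrides(cell):
    """Turn a string like 'piano: l, bass: r' into {'piano': 'l', 'bass': 'r'}."""
    overrides = {}
    # An empty cell (None or a blank string) means no panning information
    if cell is None or not str(cell).strip():
        return overrides
    for pair in str(cell).split(","):
        instrument, _, channel = pair.partition(":")
        instrument = instrument.strip().lower()
        channel = channel.strip().lower()
        if instrument not in VALID_INSTRUMENTS or channel not in VALID_CHANNELS:
            raise ValueError(f"Unrecognised channel override: {pair.strip()!r}")
        overrides[instrument] = channel
    return overrides


def length_to_seconds(cell):
    """Convert a `%M:%S` string such as '04:35' into a number of seconds."""
    minutes, seconds = str(cell).strip().split(":")
    return int(minutes) * 60 + int(seconds)


def is_acceptable(row):
    """Check the `is_acceptable(Y/N)` column, ignoring case and whitespace."""
    return str(row.get("is_acceptable(Y/N)", "")).strip().upper() == "Y"


# Example row, written out by hand rather than read from a real corpus file
row = {
    "is_acceptable(Y/N)": "Y",
    "recording_length": "04:35",
    "channel_overrides": "piano: l, bass: r",
}
if is_acceptable(row):
    print(parse_channel_overrides(row["channel_overrides"]))
    print(length_to_seconds(row["recording_length"]))
```

Raising a `ValueError` on a malformed override is a design choice in this sketch: it surfaces typos in the spreadsheet early, rather than letting them pass silently into later processing.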