(build-database)=
# Building the database

## Windows installation

### Requirements

* Windows 10
* Python 3.10 **accessible via `PATH`** *(i.e., running `python` in `cmd.exe` should work without errors)*
* FFmpeg **accessible via `PATH`**
* Git

(build-database-setup)=
### Setting up

First things first, you'll need to clone the source code repository for the Jazz Trio Database to a new folder on your local machine by running the following command in a local terminal:

```
git clone https://github.com/HuwCheston/Jazz-Trio-Database
```

We'd now recommend that you create a new virtual environment to keep the dependencies required for the Jazz Trio Database separate from your main Python environment. To do so, run the following:

```
cd Jazz-Trio-Database
pip install virtualenv
python -m venv venv
python test_environment.py
call venv\Scripts\activate.bat
```

Your terminal prompt should now have a `(venv)` prefix, reading something like `(venv) C:\Jazz-Trio-Database`. Next, install the project dependencies:

```
pip install -r requirements.txt
```

(build-database-separate)=
### Audio download and separation

Next, you can run the following command to download and separate the audio:

```
python src\clean\make_dataset.py
>>> processing "Blues for Gwen" from 1962 album Inception, leader McCoy Tyner ...
>>> ...
>>> ...
>>> dataset corpus_chronology made in X secs !
```

This command will begin by downloading the audio for a given track from YouTube (stored in `.\data\raw`), then separating it into individual piano, bass, and drums stems (stored in `.\data\processed`). Spleeter [(Hennequin et al., 2020)](https://doi.org/10.21105/joss.02154) is used for piano separation, and Demucs [(Rouard et al., 2022)](https://doi.org/10.48550/arXiv.2211.08553) for bass and drums separation.

```{warning}
By default, if a YouTube source cannot be found, the program will retry the download a set number of times before terminating prematurely. Sources that cannot be found will be printed directly to the console. If this occurs, first check your internet connection and the source itself. If a particular source has been removed from YouTube, please [open a new issue on the GitHub repository](https://github.com/HuwCheston/Jazz-Trio-Database/issues/new) or [contact us](mailto:hwc31@cam.ac.uk?subject=Missing YouTube link&cc=huwcheston@gmail.com).
```

By default, this command will process audio for the tracks contained in `.\references\corpus_chronology.xlsx`: this corpus contains 300 tracks by 30 different pianists, and is the dataset used in Cheston et al. (2024). To process audio from a different corpus, pass the `-corpus` flag to the above command, followed by the name of the corpus spreadsheet inside `.\references`. For example, to process audio from `.\references\corpus_bill_evans.xlsx`:

```
python src\clean\make_dataset.py -corpus corpus_bill_evans
>>> processing "A Sleepin Bee" from 1968 album At the Montreux Jazz Festival, leader Bill Evans ...
>>> ...
>>> ...
>>> dataset corpus_bill_evans made in X secs !
```

```{tip}
If the program detects that audio for a particular track has already been downloaded or separated, it will skip that stage when re-running the command. To force downloading or separation, pass the `-force-download` or `-force-separation` flags when running the command. If you want to suppress the use of either Spleeter or Demucs for separation (not recommended), you can additionally pass the `-no-spleeter` or `-no-demucs` flags.
```

(build-database-detect)=
### Onset detection

After the audio for a given corpus file has been separated and extracted, we can move on to extracting onset data from the recordings. To do so, run the following command:

```
python src\detect\detect_onsets.py
>>> detecting onsets in 300 tracks (0 from disc) using -1 CPUs ...
>>> ...
>>> ...
>>> onsets detected for all tracks in corpus corpus_chronology in X secs !
```

```{warning}
By default, the onset detection program will use every CPU core available on your machine. To control this, pass in the `-n_jobs` flag, followed by an integer corresponding to the number of cores to use.
```

This command will process onsets in the source-separated piano, bass, and drums files for each track in the corpus (contained in `.\data\processed`), and beats in the overall audio mixture (contained in `.\data\raw`). Once again, this command defaults to the tracks described in `.\references\corpus_chronology.xlsx`; this can be changed by passing the `-corpus` flag and a valid corpus file, as above.

```{tip}
If the program detects that onsets have already been detected for a particular track, it will be skipped when re-running the command. To force processing, pass in the `-ignore_cache` flag when running the command.
```

By default, the program will detect onsets for every track in the provided corpus file. If you just want to check out the results without processing every track, you can pass in the `-annotated-only` flag when running the command to process only those tracks with corresponding manual annotation files (located inside `.\references\manual_annotation`).

```{tip}
Alternatively, you can pass in the `-one-track-only` flag to (you guessed it!) process only the first track in the corpus.
```

### Check the results

Once the program has finished running, you can check out the results by listening to the click tracks contained in `.\reports\click_tracks`. By default, for the piano, bass, and drums stems, the high-pitched tones in each click track correspond to onsets matched with a quarter-note beat, and the lower-pitched tones to onsets that were not matched. In the audio mixture (`mix`), high-pitched tones correspond to the estimated downbeat, and low-pitched tones to every other beat.

```{tip}
To run the onset detection without generating click tracks, pass in the `-no_click` argument when running the command.
```

The program will also output a Python `pickle` file inside the `.\models` directory. This contains the serialised `OnsetMaker` class instances (defined in `src\detect\detect_utils.py`). You can now run many of the analysis scripts contained inside the `.\notebooks` directory, or pass the serialised `OnsetMaker` instances to the feature extraction classes defined in `src\features\features_utils.py`.
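For illustration, the snippet below sketches one way of loading those serialised instances back into Python. This isn't part of the repository's documented API, and the pickle filename shown is an assumption: check the `.\models` directory for the file your run actually produced. Run it from the repository root with the virtual environment active, so that the `OnsetMaker` class definition can be imported during unpickling.

```python
# Hypothetical sketch: load the serialised OnsetMaker instances for analysis.
# The filename below is an assumption -- substitute the pickle file that the
# onset detection run created inside the .\models directory.
import pickle

with open(r".\models\corpus_chronology.p", "rb") as fh:
    onset_makers = pickle.load(fh)  # assumed: a collection of OnsetMaker instances

# Inspect what was loaded before passing it on to the feature extraction
# classes in src\features\features_utils.py.
print(type(onset_makers), len(onset_makers))
```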
## Linux installation

***TODO***
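In the meantime, the Windows steps above should translate fairly directly to Linux. The following is a minimal, untested sketch, assuming the same requirements (Python 3.10, FFmpeg, and Git available on `PATH`); note the forward slashes and the different activation command, and that on some distributions `python` may need to be `python3`:

```
git clone https://github.com/HuwCheston/Jazz-Trio-Database
cd Jazz-Trio-Database
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python src/clean/make_dataset.py
python src/detect/detect_onsets.py
```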