Installation

This section describes how to set up the OpusDistillery pipeline locally, as well as on three of our supported clusters (Puhti, Mahti, and Lumi).

Locally

System Requirements

  • Ubuntu 18.04 (the pipeline can work on other Linux distributions, but may require fixes to the setup scripts; see the Marian installation instructions for more details).

  • One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory.

  • cuDNN installed

  • At least 16 CPU cores (some steps of the pipeline parallelize across cores quite well, so the more the better).

  • 64 GB of RAM (128 GB or more may be required for larger datasets).

  • 200+ GB of disk space (mostly for datasets and their transformations). The exact amount depends on the chosen datasets and can be significantly higher.

Installation

  1. Clone the repo:

git clone https://github.com/Helsinki-NLP/OpusDistillery/
cd OpusDistillery

  2. Choose a Snakemake profile from profiles/ or create a new one.

  3. Adjust paths in the Makefile if needed and set the PROFILE variable to the name of your profile.
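
For example, the profile is selected by setting PROFILE in the Makefile, or equivalently per invocation on the make command line (slurm-moz here is only an illustration; use your own profile's name):

make run PROFILE="slurm-moz"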

  4. Adjust Snakemake and workflow settings in profiles/<profile>/config.yaml; see the Snakemake CLI reference for details.
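
As a sketch, this config.yaml holds Snakemake command-line options as key-value pairs. The keys below are standard Snakemake flags, but the values are purely illustrative, not the profile's actual defaults:

cores: 16               # upper bound on CPU cores Snakemake may use
use-conda: true         # let Snakemake manage per-rule conda environments
keep-going: true        # keep running independent jobs after a failure
rerun-incomplete: true  # re-run jobs whose outputs are incomplete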

  5. Configure the experiment and datasets in configs/config.prod.yml (or configs/config.test.yml for a test run).
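
The schema of this file is defined by the pipeline itself, so treat the fragment below only as a hypothetical sketch of the kind of structure to expect; consult the shipped config files for the real keys:

# configs/config.test.yml -- hypothetical sketch, not the real schema
experiment:
  name: test-distillation    # hypothetical experiment identifier
datasets:
  train:
    - opus_Books/v1          # hypothetical dataset identifier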

  6. Change the source code if needed for the experiment.

  7. (Cluster mode) Adjust cluster settings in the cluster profile. For slurm-moz, this is profiles/slurm-moz/config.cluster.yml. You can also modify profiles/slurm-moz/submit.sh or create a new Snakemake profile.
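
The available keys depend on your Slurm setup; the fragment below is a hypothetical example with placeholder values, not the file's actual contents:

partition: gpu          # Slurm partition to submit GPU jobs to
account: project_12345  # billing account (placeholder)
time: "24:00:00"        # wall-clock limit per job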

  8. (Cluster mode) It might require further tuning of the requested resources in the Snakefile (see the sketch after this list):

    • Use threads for a rule to adjust parallelism

    • Use resources: mem_mb=<memory> to adjust total memory requirements per task (the default is set in profiles/slurm-moz/config.yaml)
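
As a generic illustration (this rule is made up rather than taken from the pipeline's Snakefile), threads and per-task memory are declared in Snakemake like this:

rule example_train:
    threads: 8            # cores requested for this rule
    resources:
        mem_mb=64000      # total memory per task, in MB
    shell:
        "echo 'training would run here'"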

  9. Install Mamba, a fast Conda package manager:

make conda

  10. Install Snakemake:

make snakemake

  11. Update git submodules:

make git-modules

  12. Install requirements:

source ../mambaforge/etc/profile.d/conda.sh ; conda activate ; conda activate snakemake
pip install -r requirements.txt 

You are all set!

On a Cluster

System Requirements

  • Slurm cluster with CPU and Nvidia GPU nodes

  • CUDA 11.2 (also tested on 11.5)

  • cuDNN library installed

  • Singularity module if running with containerization (recommended)

  • If running without containerization, there is no procedure to configure the environment automatically. All required modules (for example, parallel) should be preinstalled and loaded in ~/.bashrc; see the illustration below.
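
As an illustration only (module names differ between clusters, so treat these as placeholders rather than a verified list):

# in ~/.bashrc -- hypothetical module loads; adjust to your cluster
module load parallel      # GNU parallel, needed by some pipeline steps
module load cuda cudnn    # GPU toolkit and libraries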

Installation on Puhti and Mahti

  1. Clone the repository.

  2. Download the Ftt.sif container to the repository root (ask Ona)

  3. Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):

    1. The environment needs to be created with a non-containerized Python, as otherwise Apptainer integration will not work. On Puhti and Mahti, the Python executables in /usr/bin/ should work: /usr/bin/python3.9 -m venv snakemake_env.

    2. Activate the virtual environment: source ./snakemake_env/bin/activate.

    3. Install snakemake: pip install snakemake.
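
Put together, this step amounts to the following, run from the parent directory of the repository as suggested above:

/usr/bin/python3.9 -m venv snakemake_env   # must use a non-containerized Python
source ./snakemake_env/bin/activate
pip install snakemake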

  4. Install micromamba (e.g. in the parent dir of the repository): curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba

  5. Return to the repository directory and update Git submodules: make git-modules

  6. Create a data directory (e.g. in the parent dir of the repository) and create a tmp dir in it.
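
For example, from the parent directory of the repository (the name data is only a placeholder):

mkdir -p data/tmp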

  7. If the data directory is not located in the parent directory of the repository, edit profiles/slurm-puhti/config.yaml or profiles/slurm-mahti/config.yaml: change the bindings in the singularity-args section to point to your data directory, and enter the data directory path as the root value of the config section.
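
A sketch of the relevant fragment (the paths are placeholders and the surrounding keys in the real profile may differ):

singularity-args: "--bind /scratch/<project>/data"   # mount the data directory inside the container
config:
  - root=/scratch/<project>/data                     # pipeline root pointing at the data directory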

  8. Edit profiles/slurm-puhti/config.cluster.yaml to change the CSC account to one you have access to.

  9. Load the CUDA modules: module load gcc/9.4.0 cuda cudnn

  10. Run the pipeline: make run-hpc PROFILE="slurm-puhti" or make run PROFILE="slurm-mahti". See Basic Usage for more information.

Installation on Lumi

  1. Clone the repository.

  2. Download the Ftt.sif container to the repository root (ask Ona)

  3. Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):

    1. The environment needs to be created with a non-containerized Python, as otherwise Apptainer integration will not work. On Lumi, use the cray-python module (it is not containerized): module load cray-python; python -m venv snakemake_env.

    2. Activate the virtual environment: source ./snakemake_env/bin/activate.

    3. Install snakemake: pip install snakemake.
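
Put together, this step amounts to the following, run from the parent directory of the repository as suggested above:

module load cray-python                    # non-containerized Python on Lumi
python -m venv snakemake_env
source ./snakemake_env/bin/activate
pip install snakemake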

  4. Install micromamba (e.g. in the parent dir of the repository): curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba

  5. Return to the repository directory and update Git submodules: make git-modules

  6. Create a data directory (e.g. in the parent dir of the repository) and create a tmp dir in it.

  7. If the data directory is not located in the parent directory of the repository, edit profiles/slurm-lumi/config.yaml: change the bindings in the singularity-args section to point to your data directory, and enter the data directory path as the root value of the config section (see the illustrative fragment in the Puhti and Mahti section above).

  8. Edit profiles/slurm-lumi/config.cluster.yaml to change the CSC account to one you have access to.

  9. Load the ROCm module: module load rocm.

  10. Copy the Marian executables from /scratch/project_462000447/lumi-marian to 3rd_party/lumi-marian/build (compiling lumi-marian is currently hacky, so this workaround makes things easier).
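
For example, assuming the prebuilt binaries sit at the top level of that directory (adjust the paths if the layout differs):

mkdir -p 3rd_party/lumi-marian/build
cp /scratch/project_462000447/lumi-marian/* 3rd_party/lumi-marian/build/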

  11. Run export SINGULARITYENV_LD_LIBRARY_PATH=$LD_LIBRARY_PATH to make sure Marian can find all the libraries when it runs containerized.

  12. Run the pipeline: make run-hpc PROFILE="slurm-lumi". See Basic Usage for more information.