Installation¶
This section describes how to set up the OpusDistillery pipeline locally, as well as on three of our supported clusters.
Locally¶
System Requirements¶
- Ubuntu 18.04 (it can work on other Linux distributions, but might require fixes to the setup scripts; see the Marian installation instructions for details)
- One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory
- CUDNN installed
- At least 16 CPU cores (some steps of the pipeline utilize multiple cores well, so the more the better)
- 64 GB RAM (128 GB or more might be required for bigger datasets)
- 200+ GB of disk space, mostly for datasets and their transformations; the exact amount depends on the chosen datasets and can be significantly higher
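These requirements can be checked quickly with standard Linux and Nvidia tools (none of this is specific to OpusDistillery):

```bash
nvidia-smi   # GPUs, driver and CUDA version, per-GPU memory
nproc        # number of CPU cores
free -h      # installed RAM
df -h .      # free disk space on the current filesystem
```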
Installation¶
Clone the repo:
git clone https://github.com/Helsinki-NLP/OpusDistillery/
cd OpusDistillery
1. Choose a Snakemake profile from profiles/ or create a new one.
2. Adjust paths in the Makefile if needed and set the PROFILE variable to the name of your profile.
3. Adjust Snakemake and workflow settings in profiles/<profile>/config.yaml; see the Snakemake CLI reference for details.
4. Configure the experiment and datasets in configs/config.prod.yml (or configs/config.test.yml for a test run).
5. Change the source code if needed for the experiment.
6. (Cluster mode) Adjust cluster settings in the cluster profile. For slurm-moz this is profiles/slurm-moz/config.cluster.yml. You can also modify profiles/slurm-moz/submit.sh or create a new Snakemake profile.
7. (Cluster mode) It might require further tuning of the requested resources in the Snakefile (see the sketch after this list):
   - Use threads for a rule to adjust parallelism.
   - Use resources: mem_mb=<memory> to adjust the total memory requirement per task (the default is set in profiles/slurm-moz/config.yaml).
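A minimal sketch of the two settings from the last step, with a hypothetical rule name and placeholder values:

```python
# In the Snakefile -- the rule name and numbers are placeholders.
rule train_teacher:
    threads: 16                # parallelism for this rule
    resources:
        mem_mb=64000           # total memory per task; overrides the profile default
    shell:
        "echo training with {threads} threads and {resources.mem_mb} MB"
```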
Install Mamba, a fast Conda package manager:
make conda
Install Snakemake:
make snakemake
Update the Git submodules:
make git-modules
Install the requirements:
source ../mambaforge/etc/profile.d/conda.sh ; conda activate ; conda activate snakemake
pip install -r requirements.txt
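Optionally, verify that the activated environment sees both tools:

```bash
conda --version
snakemake --version
```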
You are all set!
On a Cluster¶
System Requirements¶
- Slurm cluster with CPU and Nvidia GPU nodes
- CUDA 11.2 (it was also tested on 11.5)
- CUDNN library installed
- Singularity module if running with containerization (recommended)
- If running without containerization, there is no procedure to configure the environment automatically: all the required modules (for example, parallel) should be preinstalled and loaded in ~/.bashrc, as in the sketch below
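A minimal sketch of such a ~/.bashrc addition, assuming the cluster provides a parallel module (module names vary between clusters):

```bash
# In ~/.bashrc -- only needed when running without containerization.
module load parallel   # GNU parallel is used by several pipeline steps
```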
Installation on Puhti and Mahti¶
Clone the repository.
Download the Ftt.sif container to the repository root (ask Ona).
Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository). The environment needs to be created with a non-containerized Python, as otherwise the Apptainer integration will not work. On Puhti and Mahti, the Python executables in /usr/bin should work:
/usr/bin/python3.9 -m venv snakemake_env
Activate the virtual environment:
source ./snakemake_env/bin/activate
Install Snakemake:
pip install snakemake
Install micromamba (e.g. in the parent dir of the repository):
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
Return to the repository directory and update Git submodules:
make git-modules
Create a data directory (e.g. in the parent dir of the repository) and create a tmp dir in it.
If the data directory is not located in the parent directory of the repository, edit profiles/slurm-puhti/config.yaml or profiles/slurm-mahti/config.yaml: change the bindings in the singularity-args section to point to your data directory, and set the data directory path as the root value of the config section, as in the sketch below.
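A minimal sketch of the relevant parts of profiles/slurm-puhti/config.yaml, assuming the data directory lives at /scratch/myproject/data (the exact keys shipped with the profile may differ slightly):

```yaml
# Bind the data directory into the container...
singularity-args: "--bind /scratch/myproject/data:/scratch/myproject/data"
# ...and point the pipeline root at it.
config:
  - root=/scratch/myproject/data
```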
Edit profiles/slurm-puhti/config.cluster.yaml to change the CSC account to one you have access to.
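For example, the account entry might look like this (the key name is an assumption based on the file's purpose, and the project number is a placeholder):

```yaml
# profiles/slurm-puhti/config.cluster.yaml
account: project_2001234   # a CSC project you are a member of
```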
Load the CUDA modules:
module load gcc/9.4.0 cuda cudnn
Run the pipeline:
make run-hpc PROFILE="slurm-puhti"
or
make run PROFILE="slurm-mahti"
More information in Basic Usage.
Installation on Lumi¶
Clone the repository.
Download the Ftt.sif container to the repository root (ask Ona).
Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository). The environment needs to be created with a non-containerized Python, as otherwise the Apptainer integration will not work. On Lumi, use the cray-python module (it is not containerized):
module load cray-python; python -m venv snakemake_env
Activate the virtual environment:
source ./snakemake_env/bin/activate
Install Snakemake:
pip install snakemake
Install micromamba (e.g. in the parent dir of the repository):
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
Return to the repository directory and update Git submodules:
make git-modules
Create a data directory (e.g. in the parent dir of the repository) and create a tmp dir in it.
If the data directory is not located in the parent directory of the repository, edit profiles/slurm-lumi/config.yaml: change the bindings in the singularity-args section to point to your data directory, and set the data directory path as the root value of the config section (see the sketch in the Puhti section above, substituting slurm-lumi).
Edit profiles/slurm-lumi/config.cluster.yaml to change the CSC account to one you have access to.
Load the ROCm module:
module load rocm
Copy the Marian executables from /scratch/project_462000447/lumi-marian to 3rd_party/lumi-marian/build (compiling lumi-marian is currently hacky, so this workaround makes things easier).
Run export SINGULARITYENV_LD_LIBRARY_PATH=$LD_LIBRARY_PATH to make sure Marian can find all the libraries when it runs containerized.
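A sketch of these two steps, assuming the prebuilt binaries sit directly under the shared directory (adjust the source path if the layout there differs):

```bash
# Copy the prebuilt Marian binaries into the expected build directory.
mkdir -p 3rd_party/lumi-marian/build
cp -r /scratch/project_462000447/lumi-marian/* 3rd_party/lumi-marian/build/
# Propagate the host library path into the container so Marian finds its libraries.
export SINGULARITYENV_LD_LIBRARY_PATH=$LD_LIBRARY_PATH
```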
Run the pipeline:
make run-hpc PROFILE="slurm-lumi"
More information in Basic Usage.