NVidia deep learning frameworks


Deep Learning Frameworks + SDK

NVIDIA provides GPU-tuned frameworks for deep learning, packaged as Docker containers under the NVIDIA GPU CLOUD (NGC). You can also find scripts and models for deep learning there.

Deep learning frameworks documentation:

  • Kaldi
  • MXNet
  • NVCaffe
  • PyTorch
  • TensorFlow

SDK documentation

NGC containers are released monthly and versioned as YY.MM. The changelog and the HW/driver support matrix can be found in the Support Matrix.

Terms Of Use

Some NGC items are available only after registration at https://ngc.nvidia.com/.

How to run an NGC container in MetaCentrum with Singularity

Get NGC API key

If you have not done so already, you first need to register at https://ngc.nvidia.com/ to get an NGC API key. After you log in, you can find this API key under the Setup menu in your personal tab.

Build Singularity image

Building the image is a resource-intensive process and must be done as an interactive job with a large enough scratch directory (at least 10 GB). Some temporary directories are by default bound to /tmp, which has a limited user quota on MetaCentrum; it is therefore advisable to bind them to the scratch directory instead. Set the following environment variables:

SINGULARITY_CACHEDIR       # on /storage with >10 GB of free space if you intend to build images repeatedly, otherwise in scratch
SINGULARITY_TMPDIR         # in scratch
SINGULARITY_LOCALCACHEDIR  # in scratch

Run Singularity image

Use the singularity run image.simg or singularity shell image.simg command plus other options (see below).

  • run will launch the container; in the case of the frameworks, a Jupyter Notebook or Jupyter Lab is usually available
  • shell will launch an interactive shell
  • exec will run a particular command inside the container (see the sketch below)
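
A minimal sketch of the three invocation modes, assuming an image file called image.simg in the current directory:

singularity run image.simg                     # start the container's default runscript (e.g. Jupyter)
singularity shell image.simg                   # open an interactive shell inside the container
singularity exec image.simg python --version   # run a single command inside the container and exit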

Directory bind parameters

You need to bind local directories into the image to be able to access local data and storage volumes.

Parameters
-H /storage/XXXX/home/YOUR_USERNAME - necessary, Singularity does not automatically resolve the MetaCentrum /storage symlinks
-B /local/dir:/inside/dir - recommended for scratch, tmp, and other temporary directories

NVIDIA drivers

--nv - the container will use local host's NVIDIA drivers (and not the ones it was built with)
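
Putting the options together, a typical invocation might look like the following sketch (all paths are placeholders and need to be adjusted):

singularity shell --nv -H /storage/XXXX/home/YOUR_USERNAME -B $SCRATCHDIR:/scratch image.simg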


Example no. 1: TensorFlow

Build image

Within interactive job

qsub -I -l select=1:ncpus=2:mem=4gb:scratch_local=10gb -l walltime=1:00:00

run a script of the following form:

#!/bin/bash
export NGCDIR="/storage/brno2/home/melounova/ngc_sandbox" # directory where the image will go
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=Yj..........Az # API Key you get after logging in at https://ngc.nvidia.com/
export SINGULARITY_CACHEDIR="/storage/brno2/home/melounova/.singularity" # the cache dir must exist
mkdir $SCRATCHDIR/tmp
export SINGULARITY_TMPDIR=$SCRATCHDIR/tmp
export SINGULARITY_LOCALCACHEDIR="$SCRATCHDIR"

singularity -v build $NGCDIR/TensorFlow.simg docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3 # build the image TensorFlow.simg
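
Once the build finishes, you can quickly check that the image works, e.g. by printing the bundled TensorFlow version (a minimal sketch; the import itself does not need a GPU):

singularity exec $NGCDIR/TensorFlow.simg python -c "import tensorflow as tf; print(tf.__version__)"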

Run image

Run interactive job with GPU

qsub -I -q gpu -l select=1:ngpus=1:scratch_local=10gb

and again set the temporary directories:

export SINGULARITY_CACHEDIR="/storage/brno2/home/melounova/.singularity"
mkdir $SCRATCHDIR/tmp
export SINGULARITY_TMPDIR=$SCRATCHDIR/tmp
export SINGULARITY_LOCALCACHEDIR="$SCRATCHDIR"

Then running the image

singularity run --nv -H /storage/XXX/home/USERNAME/ ./ngc_sandbox/TensorFlow.simg # change to your paths

will get you a shell in the container. We recommend setting a Jupyter password in this shell with the jupyter notebook password command.
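
Inside the container shell, this is a one-liner (it interactively prompts for the new password):

jupyter notebook password    # stores a hashed password under ~/.jupyter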

To run Jupyter Lab in the container, type

singularity exec --nv -H /storage/XXX/home/USERNAME/ ./ngc_sandbox/TensorFlow.simg jupyter-lab

After issuing this command, Jupyter Lab will be launched on the node and will be available at http://NODE_NAME.metacentrum.cz:8888. To log in, use the Jupyter password you set in the previous step.

Through the Jupyter Lab/Notebook web interface you can then run your calculations.
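
To quickly verify that TensorFlow inside the container sees the allocated GPU, you can also run a one-liner instead of the full Jupyter Lab (a sketch; paths are placeholders as above):

singularity exec --nv -H /storage/XXX/home/USERNAME/ ./ngc_sandbox/TensorFlow.simg python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"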

Example no. 2: Kaldi

Build the Kaldi container in the same way as in the previous example, substituting only the build command:

singularity -v build $NGCDIR/Kaldi.simg docker://nvcr.io/nvidia/kaldi:20.03-py3

This container does not include Jupyter; however, there is a GPU benchmark example, librispeech, in the /workspace/nvidia-examples/librispeech directory. It is not possible to change files inside the container, so the example directory must first be copied out.

Run an interactive GPU job as in the previous example (don't forget to set the cache and tmp directories!).

Shell into the image and copy out the benchmark directory:

singularity shell --nv -H /storage/brno2/home/melounova /storage/brno2/home/melounova/ngc_sandbox/Kaldi.simg # change to your paths!
cp -r /workspace/ ./
cd workspace/nvidia-examples/librispeech/

In default_parameters.inc it is necessary to set the number of GPUs and the path to the workspace directory:

vi default_parameters.inc
...
WORKSPACE=${WORKSPACE:-"/storage/praha1/home/melounova/workspace/"} # change to your paths!
GPU=${GPU:-1}
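
If you prefer a non-interactive edit, the same change can be made e.g. with sed (a sketch only; adjust the path to your own workspace copy):

sed -i 's|^WORKSPACE=.*|WORKSPACE=${WORKSPACE:-"/storage/XXX/home/USERNAME/workspace/"}|' default_parameters.inc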

Download and prepare the testing data:

./prepare_data.sh

Run benchmark calculation:

./run_benchmark.sh
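
Once the interactive run works and the data have been prepared with ./prepare_data.sh, the benchmark can also be submitted as a regular batch job. A rough sketch of such a script follows (all paths are placeholders and the requested resources are only illustrative):

#!/bin/bash
#PBS -q gpu
#PBS -l select=1:ncpus=2:ngpus=1:mem=8gb:scratch_local=10gb
#PBS -l walltime=2:00:00
HOMEDIR=/storage/XXX/home/USERNAME                  # placeholder, adjust to your home directory
export SINGULARITY_CACHEDIR=$HOMEDIR/.singularity   # cache dir must exist
mkdir -p $SCRATCHDIR/tmp
export SINGULARITY_TMPDIR=$SCRATCHDIR/tmp
export SINGULARITY_LOCALCACHEDIR=$SCRATCHDIR
singularity exec --nv -H $HOMEDIR $HOMEDIR/ngc_sandbox/Kaldi.simg \
  bash -c "cd $HOMEDIR/workspace/nvidia-examples/librispeech && ./run_benchmark.sh"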

How to run an NGC container in MetaCentrum with Podman

It is also possible to run Docker images directly; this option is currently limited to CERIT-SC machines (frontend zuphux.cerit-sc.cz, PBS server cerit-pbs.cerit-sc.cz). To avoid the issue of root privileges, we use a Podman wrapper to run the Docker images. On the other hand, the Podman approach is less complicated.

Example no. 1: TensorFlow

Prepare a script called e.g. run.sh:

#!/usr/bin/podmanwrapper docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
#PODMAN_OPT -p 8888:8888/tcp
jupyter-lab
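
Since run.sh is later invoked directly, it has to be executable:

chmod u+x run.sh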

Run an interactive job on the CERIT-SC PBS server:

qsub -I -q @cerit-pbs.cerit-sc.cz -l select=mem=4gb:scratch_local=1gb:os=debian10 -l walltime=02:00:00

and run the script run.sh, e.g.

melounova@black1:~$ /storage/brno2/home/melounova/ngc_sandbox/podman_tensorflow/run.sh

After the image launches, you will be prompted to open Jupyter in your browser:

   To access the notebook, open this file in a browser:
       file:///root/.local/share/jupyter/runtime/nbserver-68-open.html
   Or copy and paste one of these URLs:
       http://hostname:8888/?token=1d198ea6385ca538d97a4141d3ee0be8912973ecb3bd8c68

NOTE: Don't forget to replace "hostname" with the real hostname of the computational node (black1.cerit-sc.cz in this example).


License

Terms Of Use

  • Free for academic use, non-commercial use only
  • Must be cited as “This software contains source code provided by NVIDIA Corporation.”