In this section we describe how to build Conda environments for deep learning projects that use Horovod to enable distributed training across multiple GPUs (either on the same node or spread across multiple nodes).
Install NVIDIA CUDA Toolkit 10.1 (see the NVIDIA documentation), which is the most recent version of the NVIDIA CUDA Toolkit supported by all three deep learning frameworks that Horovod currently supports.
Why not just use the cudatoolkit package?
Typically when installing PyTorch, TensorFlow, or Apache MXNet with GPU support using Conda, you add the appropriate version of the cudatoolkit package to your environment.yml file.
Unfortunately, for the moment at least, the cudatoolkit packages available via Conda do not
include the NVIDIA CUDA Compiler (NVCC), which is required in order to build Horovod extensions
for PyTorch, TensorFlow, or MXNet.
What about the cudatoolkit-dev package?
While there are cudatoolkit-dev packages available from conda-forge that do include NVCC, we have had difficulty getting these packages to install consistently. Some of the available builds require manual intervention to accept license agreements, which makes them unsuitable for installation on remote systems (a critical requirement). Other builds seem to work on Ubuntu but not on other flavors of Linux.
Despite this, we would encourage you to try adding cudatoolkit-dev to your environment.yml file and see what happens! The package is well maintained, so perhaps it will become more stable in the future.
The nvcc_linux-64 meta-package
The most robust approach to obtain NVCC and still use Conda to manage all the other dependencies is to install the NVIDIA CUDA Toolkit on your system and then install the meta-package nvcc_linux-64 from conda-forge, which configures your Conda environment to use the NVCC installed on the system together with the other CUDA Toolkit components installed inside the Conda environment.
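Because this approach depends on a system-wide NVCC, it can help to confirm that one is visible before creating the environment. The following is a minimal sketch, assuming that CUDA_HOME (when set) points at the system CUDA Toolkit installation; the helper name find_system_nvcc is hypothetical and not part of Conda or Horovod.

```shell
# Hypothetical helper: print the path of the system NVCC, or return 1 if none is found.
# Assumption: CUDA_HOME, when set, points at the system CUDA Toolkit installation.
find_system_nvcc() {
    if [ -n "${CUDA_HOME:-}" ] && [ -x "$CUDA_HOME/bin/nvcc" ]; then
        printf '%s\n' "$CUDA_HOME/bin/nvcc"
    elif command -v nvcc >/dev/null 2>&1; then
        command -v nvcc
    else
        return 1
    fi
}

# Report what was found without aborting the shell when NVCC is missing.
if nvcc_path=$(find_system_nvcc); then
    echo "Found NVCC at $nvcc_path"
    "$nvcc_path" --version
else
    echo "NVCC not found; install the NVIDIA CUDA Toolkit and set CUDA_HOME" >&2
fi
```

The check prefers CUDA_HOME over whatever nvcc happens to be on PATH, since CUDA_HOME is the value that the build commands in this section pass to Horovod.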
The environment.yml file
We prefer to specify as many dependencies as possible in the Conda environment.yml file and only specify dependencies in requirements.txt (for install via pip) that are not available via Conda channels. Check the Horovod installation guide for details of required dependencies.
Use the recommended channel priorities. Note that conda-forge has priority over defaults and pytorch has priority over conda-forge.
name: null
channels:
  - pytorch
  - conda-forge
  - defaults
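Since channel order determines priority, it can be worth printing the order that an environment.yml actually declares. The sketch below is one way to do that with awk; the helper name list_channels is hypothetical, and the parsing is line-based rather than a real YAML parser, so it assumes the simple layout used in this section.

```shell
# Hypothetical helper: print the entries under the top-level "channels:" key,
# one per line, in the order they are listed (highest priority first).
# Line-based sketch, not a YAML parser; trailing comments are not stripped.
list_channels() {
    awk '
        /^channels:/                     { in_channels = 1; next }
        in_channels && /^[[:space:]]*-/  { sub(/^[[:space:]]*-[[:space:]]*/, ""); print; next }
        in_channels && /^[^[:space:]#]/  { in_channels = 0 }
    ' "$1"
}

# Only runs when an environment.yml exists in the current directory.
if [ -f environment.yml ]; then
    list_channels environment.yml
fi
```

For the channel list shown in this section, the first line printed should be pytorch and the last should be defaults.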
There are a few things worth noting about the dependencies.
- Use Conda to install the remaining CUDA components, cudnn and nccl (and the optional cupti).
- Include two meta-packages, cxx-compiler and nvcc_linux-64, to make sure that suitable C and C++ compilers are installed and that the resulting Conda environment is aware of the manually installed CUDA Toolkit.
- Rather than specifying the openmpi package directly, opt for the mpi4py Conda package, which provides a CUDA-aware build of OpenMPI.
- Include cmake to ensure that the Horovod extensions for Gloo are built.
Below are the core required dependencies. The complete environment.yml file is available on GitHub.
dependencies:
  - bokeh=1.4
  - cmake=3.16 # ensures that Gloo library extensions will be built
  - cudnn=7.6
  - cupti=10.1
  - cxx-compiler=1.0 # ensures C and C++ compilers are available
  - jupyterlab=1.2
  - mpi4py=3.0 # installs cuda-aware openmpi
  - nccl=2.5
  - nodejs=13
  - nvcc_linux-64=10.1 # configures environment to be "cuda-aware"
  - pip=20.0
  - pip:
    - mxnet-cu101mkl==1.6.* # MXNET is installed prior to horovod
    - -r file:requirements.txt
  - python=3.7
  - pytorch=1.5
  - tensorboard=2.1
  - tensorflow-gpu=2.1
  - torchvision=0.6
The requirements.txt file
The requirements.txt file is where all of the pip dependencies, including Horovod itself, are listed for installation. In addition to Horovod, we typically also use pip to install JupyterLab extensions that enable GPU and CPU resource monitoring via jupyterlab-nvdashboard and TensorBoard support via jupyter-tensorboard.
horovod==0.19.*
jupyterlab-nvdashboard==0.2.*
jupyter-tensorboard==0.2.*

# make sure horovod is re-compiled if environment is re-built
--no-binary=horovod
Note the use of the --no-binary
option at the end of the file. Including this option ensures
that Horovod will be re-built whenever the Conda environment is re-built.
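Because the --no-binary line is easy to lose when editing requirements.txt, a small pre-flight check can catch its absence. This is a minimal sketch; the helper name forces_source_build is hypothetical, and it only recognises the explicit package form of the option (not the special :all: value).

```shell
# Hypothetical helper: succeed if FILE (default: requirements.txt) contains a
# --no-binary option naming PACKAGE, e.g. "--no-binary=horovod".
# Usage: forces_source_build PACKAGE [FILE]
forces_source_build() {
    grep -Eq "^[[:space:]]*--no-binary[=[:space:]]+(.*,)?$1(,|[[:space:]]|\$)" "${2:-requirements.txt}"
}

# Only runs when a requirements.txt exists in the current directory.
if [ -f requirements.txt ]; then
    if forces_source_build horovod; then
        echo "horovod will be built from source"
    else
        echo "warning: --no-binary=horovod is missing from requirements.txt" >&2
    fi
fi
```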
After adding any necessary dependencies that should be downloaded via Conda to the
environment.yml
file and any dependencies that should be downloaded via pip
to the
requirements.txt
file, create the Conda environment in a sub-directory env
of your
project directory by running the following commands.
$ export ENV_PREFIX=$PWD/env
$ export HOROVOD_CUDA_HOME=$CUDA_HOME
$ export HOROVOD_NCCL_HOME=$ENV_PREFIX
$ export HOROVOD_GPU_OPERATIONS=NCCL
$ conda env create --prefix $ENV_PREFIX --file environment.yml --force
By default Horovod will try to build extensions for all detected frameworks. See the Horovod documentation on environment variables for details of the additional variables that can be set prior to building Horovod.
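As an illustration of those additional variables, the sketch below restricts the build to a single framework. The choice of a PyTorch-only project is an assumption made for the example; the HOROVOD_WITH_* and HOROVOD_WITHOUT_* names are taken from the Horovod installation documentation and should be confirmed against the Horovod version you install.

```shell
# Example only: a PyTorch-only project (an assumption for illustration).
# Set these in the same shell, before running `conda env create`.
export HOROVOD_WITH_PYTORCH=1        # fail the build if PyTorch support cannot be compiled
export HOROVOD_WITHOUT_TENSORFLOW=1  # skip the TensorFlow extension
export HOROVOD_WITHOUT_MXNET=1       # skip the MXNet extension

# Show every Horovod build variable currently exported.
env | grep '^HOROVOD_' | sort
```

Requiring a framework explicitly turns a silently skipped extension into a build failure, which is usually easier to diagnose than a missing framework at training time.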
Once the new environment has been created you can activate the environment with the following command.
$ conda activate $ENV_PREFIX
The postBuild file
If you wish to use any JupyterLab extensions included in the environment.yml and requirements.txt files, then you may need to rebuild the JupyterLab application. For simplicity, we typically include the instructions for re-building JupyterLab in a postBuild script. Here is what this script looks like in the example Horovod environments.
jupyter labextension install --no-build jupyterlab-nvdashboard
jupyter labextension install --no-build jupyterlab_tensorboard
jupyter lab build
Use the following commands to source the postBuild
script.
$ conda activate $ENV_PREFIX # optional if environment already active
$ . postBuild
To see the full list of packages installed into the environment, run the following command.
$ conda activate $ENV_PREFIX # optional if environment already active
$ conda list
After building the Conda environment, check that Horovod has been built with support for the deep learning frameworks TensorFlow, PyTorch, and Apache MXNet, and for the controllers MPI and Gloo, with the following command.
$ conda activate $ENV_PREFIX # optional if environment already active
$ horovodrun --check-build
You should see output similar to the following.
Horovod v0.19.4:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [X] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo
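When the environment is built unattended, this check can be scripted rather than read by eye. The sketch below assumes the "[X] Name" line format shown in the sample output; the helper name build_has is hypothetical. It does not distinguish between sections, so a name such as MPI that appears under both controllers and tensor operations matches if either is enabled.

```shell
# Hypothetical helper: read a `horovodrun --check-build` report on stdin and
# succeed if the named component is marked as available with "[X]".
build_has() {
    grep -Eq "^[[:space:]]*\[X\][[:space:]]+$1[[:space:]]*\$"
}

# Only runs when horovodrun is on PATH (i.e. the environment is active).
if command -v horovodrun >/dev/null 2>&1; then
    report=$(horovodrun --check-build)
    for component in TensorFlow PyTorch MXNet NCCL; do
        if printf '%s\n' "$report" | build_has "$component"; then
            echo "ok: $component"
        else
            echo "MISSING: $component" >&2
        fi
    done
fi
```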
We typically wrap these commands into a shell script create-conda-env.sh. Running the shell script will set the Horovod build variables, create the Conda environment, activate the Conda environment, and build JupyterLab with any additional extensions.
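#!/bin/bash --login
# --login loads the login profile, which is typically where conda's shell setup lives
set -e

# build variables read when pip compiles Horovod
export ENV_PREFIX="$PWD/env"
export HOROVOD_CUDA_HOME="$CUDA_HOME"
export HOROVOD_NCCL_HOME="$ENV_PREFIX"
export HOROVOD_GPU_OPERATIONS=NCCL

# create (or overwrite) the environment, then rebuild JupyterLab inside it
conda env create --prefix "$ENV_PREFIX" --file environment.yml --force
conda activate "$ENV_PREFIX"
. postBuild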
We recommend that you put scripts inside a bin
directory in your project root directory. Run
the script from the project root directory as follows.
./bin/create-conda-env.sh # assumes that $CUDA_HOME is set properly
If you add (remove) dependencies to (from) the environment.yml
file or the
requirements.txt
file after the environment has already been created, then you can
re-create the environment with the following command.
$ conda env create --prefix $ENV_PREFIX --file environment.yml --force
However, whenever we add (remove) dependencies we prefer to re-run the Bash script which will re-build both the Conda environment and JupyterLab.
$ ./bin/create-conda-env.sh
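To decide whether a rebuild is due, one option is to compare the modification times of the specification files with that of the existing environment. The following is a minimal sketch; the helper name needs_rebuild is hypothetical, and using the conda-meta/history file inside the environment prefix as the reference timestamp is an assumption made for illustration.

```shell
# Hypothetical helper: succeed (exit 0) if the environment at ENV_PREFIX is
# missing or older than any of the listed specification files.
# Assumption: ENV_PREFIX/conda-meta/history marks when the environment was last changed.
# Usage: needs_rebuild ENV_PREFIX FILE...
needs_rebuild() {
    marker="$1/conda-meta/history"
    shift
    [ -f "$marker" ] || return 0    # no environment yet
    for spec in "$@"; do
        # find prints the file name only if it is newer than the marker;
        # a missing specification file is treated as unchanged.
        if [ -n "$(find "$spec" -newer "$marker" 2>/dev/null)" ]; then
            return 0
        fi
    done
    return 1
}

ENV_PREFIX=${ENV_PREFIX:-$PWD/env}
if needs_rebuild "$ENV_PREFIX" environment.yml requirements.txt; then
    echo "Environment missing or out of date: re-run ./bin/create-conda-env.sh"
else
    echo "Environment is up to date"
fi
```

The check only looks at timestamps, so it reports a rebuild after any edit to the files, including edits that do not change the dependencies.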