The steps described on this page can be followed to build a Docker image suitable for running distributed Spark applications that use XGBoost and leverage RAPIDS to take advantage of NVIDIA GPUs.
A Python application that requires this Docker image is provided by Bright as a Jupyter notebook. It is distributed with the cm-jupyter package and can be found under /cm/shared/examples/jupyter/notebooks/.
1. Software versions
The resulting Docker image will provide software with the following main versions:
- Operating System: Debian GNU/Linux 10
- Apache Spark: 3.0.1
- NVIDIA CUDA: 10.2.89
- NVIDIA Spark RAPIDS plugin: 0.1.0
- NumPy: 1.19.2
- Python: 3.7.3
- RAPIDS cuDF: 0.14
- XGBoost4J: 1.0.0
- XGBoost4J-Spark: 1.0.0
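Once the image has been built (section 4 below), these versions can be spot-checked from a throwaway container. A minimal sketch, assuming the image tag used later in this article and that the image entrypoint allows the command to be overridden:
# docker run --rm head-node-name:5000/spark-xgboost-v1 python3 --version
# docker run --rm head-node-name:5000/spark-xgboost-v1 python3 -c "import numpy; print(numpy.__version__)"
# docker run --rm head-node-name:5000/spark-xgboost-v1 sh -c '"$SPARK_HOME"/bin/spark-submit --version'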
2. Prerequisites
The steps described in this article have been tested in the following environment:
- Docker version: 19.03.13
- Docker registry (environment: Bright CM 9.1, with default values for cm-docker-registry-setup)
- GPU-capable host (environment: AWS EC2, with g4dn.12xlarge instance)
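Before building the image, the Docker and GPU prerequisites can be verified on the build host. A quick sanity check, assuming the NVIDIA driver is already installed on that host:
# docker --version
# nvidia-smi --query-gpu=name,driver_version --format=csv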
3. Dockerfile
The following Dockerfile will be used in this knowledge base article:
FROM brightcomputing/jupyter-kernel-sample:k8s-spark-py37-1.2.1
# ref: https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/10.2/ubuntu18.04-x86_64/base/Dockerfile
# Install CUDA repositories
RUN apt-get update \
&& apt-get install -y --no-install-recommends gnupg2 curl ca-certificates unzip \
&& curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - \
&& echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list \
&& echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list \
&& rm -rf /var/lib/apt/lists/*
ENV CUDA_VERSION 10.2.89
ENV CUDA_PKG_VERSION 10-2=$CUDA_VERSION-1
# Install CUDA packages
RUN mkdir -p /usr/share/man/man1/ \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
cuda-cudart-$CUDA_PKG_VERSION \
cuda-compat-10-2 \
cuda-toolkit-10-2 \
cuda-nvtx-10-2 \
&& rm -rf /var/lib/apt/lists/*
# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
&& echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64
# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.2 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441"
# ref: https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/on-prem-cluster/standalone-python.md
# Install cuDF and RAPIDS
RUN cd $SPARK_HOME/jars \
&& curl -fsSL https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/{cudf-0.14-cuda10-2.jar} -o '#1' \
&& curl -fsSL https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/{rapids-4-spark_2.12-0.1.0.jar} -o '#1'
# Install XGBoost & NumPy
RUN cd $SPARK_HOME/jars \
&& curl -fsSL https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/1.0.0-0.1.0/{xgboost4j_3.0-1.0.0-0.1.0.jar} -o '#1' \
&& curl -fsSL https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/{xgboost4j-spark_3.0-1.0.0-0.1.0.jar} -o '#1' \
&& pip3 install numpy==1.19.2
ENV LIBS_PATH ${SPARK_HOME}/jars
ENV SPARK_JARS ${LIBS_PATH}/cudf-0.14-cuda10-2.jar,${LIBS_PATH}/xgboost4j_3.0-1.0.0-0.1.0.jar,${LIBS_PATH}/xgboost4j-spark_3.0-1.0.0-0.1.0.jar
ENV JAR_RAPIDS ${LIBS_PATH}/rapids-4-spark_2.12-0.1.0.jar
# Make XGBoost available in Python
ENV PYTHONPATH ${PYTHONPATH}:${LIBS_PATH}/xgboost4j-spark_3.0-1.0.0-0.1.0.jar
# Download example dataset for NVIDIA notebook
# https://github.com/NVIDIA/spark-xgboost-examples/blob/880f8b8a6fde21f2f8308450883c3a980f6d434e/examples/notebooks/python/mortgage-gpu.ipynb
RUN curl https://raw.githubusercontent.com/NVIDIA/spark-xgboost-examples/880f8b8a6fde21f2f8308450883c3a980f6d434e/datasets/mortgage-small.tar.gz -o /tmp/880f8b8a6fd-mortgage-small.tar.gz
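The SPARK_JARS, JAR_RAPIDS and PYTHONPATH variables baked into the image are intended to be handed to Spark when an application is launched. A minimal sketch of how they might be consumed from a shell inside a container based on this image; the master URL and the application file app.py are placeholders, and spark.plugins=com.nvidia.spark.SQLPlugin together with spark.rapids.sql.enabled=true are the standard settings for enabling the RAPIDS Accelerator, not something configured by the image itself:
# "$SPARK_HOME"/bin/spark-submit \
    --master spark://<spark-master>:7077 \
    --jars ${SPARK_JARS},${JAR_RAPIDS} \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.sql.enabled=true \
    app.py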
4. Image creation
On a cluster where the Docker registry is reachable at head-node-name on port 5000, connect to a host that meets all the prerequisites and run:
# mkdir spark-xgboost-image
# cd spark-xgboost-image
# curl https://support.brightcomputing.com/kb-articles/spark-xgboost/Dockerfile -o Dockerfile
# docker build -t head-node-name:5000/spark-xgboost-v1 .
# docker push head-node-name:5000/spark-xgboost-v1
It should now be possible to pull the image just created:
# docker pull head-node-name:5000/spark-xgboost-v1
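The pulled image can then be referenced by Spark applications that run their executors on Kubernetes, such as the Jupyter notebook mentioned at the beginning of this article. As an illustration only, and assuming spark-submit is available on the submitting host, a job could point its executors at the image with a configuration along these lines (the Kubernetes API server address, namespace and application file are placeholders):
# spark-submit \
    --master k8s://https://<kubernetes-api-server>:6443 \
    --deploy-mode cluster \
    --conf spark.kubernetes.namespace=<namespace> \
    --conf spark.kubernetes.container.image=head-node-name:5000/spark-xgboost-v1 \
    local:///path/to/app.py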