This page describes how to build a Docker image suitable for running distributed Spark applications that use XGBoost and leverage RAPIDS to take advantage of NVIDIA GPUs.
A Python application requiring this Docker image is provided by Bright as a Jupyter notebook. It is distributed with the cm-jupyter package and can be found under /cm/shared/examples/jupyter/notebooks/.
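For example, on a cluster where the cm-jupyter package is installed, the example notebooks can be listed with:
# ls /cm/shared/examples/jupyter/notebooks/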
1. Software versions
The resulting Docker image will provide software with the following main versions:
- Operating System: Debian GNU/Linux 10
- Apache Spark: 3.0.1
- NVIDIA CUDA: 10.2.89
- NVIDIA Spark RAPIDS plugin: 0.1.0
- NumPy: 1.19.2
- Python: 3.7.3
- RAPIDS cuDF: 0.14
- XGBoost4J: 1.0.0
- XGBoost4J-Spark: 1.0.0
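Once the image has been built and pushed (section 4 below), several of these versions can be spot-checked by running one-off containers against it. The commands below are only a sketch and assume the image tag head-node-name:5000/spark-xgboost-v1 used later in this article:
# docker run --rm head-node-name:5000/spark-xgboost-v1 python3 --version
# docker run --rm head-node-name:5000/spark-xgboost-v1 python3 -c "import numpy; print(numpy.__version__)"
# docker run --rm head-node-name:5000/spark-xgboost-v1 nvcc --version
# docker run --rm head-node-name:5000/spark-xgboost-v1 sh -c '$SPARK_HOME/bin/spark-submit --version'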
2. Prerequisites
The steps described in this article have been tested with this environment:
- Docker version: 19.03.13
- Docker registry (environment: Bright CM 9.1, with default values for cm-docker-registry-setup)
- GPU-capable host (environment: AWS EC2, with g4dn.12xlarge instance)
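A quick, non-exhaustive way to confirm that a host meets these prerequisites is to check the Docker version and GPU visibility before building (nvidia-smi requires the NVIDIA driver to be installed on the host):
# docker --version
# nvidia-smi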
3. Dockerfile
The following Dockerfile will be used in this knowledge base article:
FROM brightcomputing/jupyter-kernel-sample:k8s-spark-py37-1.2.1

# ref: https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/10.2/ubuntu18.04-x86_64/base/Dockerfile
# Install CUDA repositories
RUN apt-get update \
    && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates unzip \
    && curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - \
    && echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list \
    && echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list \
    && rm -rf /var/lib/apt/lists/*

ENV CUDA_VERSION 10.2.89
ENV CUDA_PKG_VERSION 10-2=$CUDA_VERSION-1

# Install CUDA packages
RUN mkdir -p /usr/share/man/man1/ \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
        cuda-cudart-$CUDA_PKG_VERSION \
        cuda-compat-10-2 \
        cuda-toolkit-10-2 \
        cuda-nvtx-10-2 \
    && rm -rf /var/lib/apt/lists/*

# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf \
    && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.2 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441"

# ref: https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/on-prem-cluster/standalone-python.md
# Install cuDF and RAPIDS
RUN pushd $SPARK_HOME/jars \
    && curl https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/{cudf-0.14-cuda10-2.jar} -o '#1' \
    && curl https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/{rapids-4-spark_2.12-0.1.0.jar} -o '#1' \
    && popd

# Install XGBoost & NumPy
RUN pushd $SPARK_HOME/jars \
    && curl https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/1.0.0-0.1.0/{xgboost4j_3.0-1.0.0-0.1.0.jar} -o '#1' \
    && curl https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/{xgboost4j-spark_3.0-1.0.0-0.1.0.jar} -o '#1' \
    && popd \
    && pip3 install numpy==1.19.2

ENV LIBS_PATH ${SPARK_HOME}/jars
ENV SPARK_JARS ${LIBS_PATH}/cudf-0.14-cuda10-2.jar,${LIBS_PATH}/xgboost4j_3.0-1.0.0-0.1.0.jar,${LIBS_PATH}/xgboost4j-spark_3.0-1.0.0-0.1.0.jar
ENV JAR_RAPIDS ${SPARK_HOME}/rapids-4-spark_2.12-0.1.0.jar

# Make XGBoost available in Python
ENV PYTHONPATH ${PYTHONPATH}:${LIBS_PATH}/xgboost4j-spark_3.0-1.0.0-0.1.0.jar

# Download example dataset for NVIDIA notebook
# https://github.com/NVIDIA/spark-xgboost-examples/blob/880f8b8a6fde21f2f8308450883c3a980f6d434e/examples/notebooks/python/mortgage-gpu.ipynb
RUN curl https://raw.githubusercontent.com/NVIDIA/spark-xgboost-examples/880f8b8a6fde21f2f8308450883c3a980f6d434e/datasets/mortgage-small.tar.gz -o /tmp/880f8b8a6fd-mortgage-small.tar.gz
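The SPARK_JARS and JAR_RAPIDS variables defined above are intended to be referenced when configuring the Spark application that runs on this image. As a rough illustration only (the master URL and application script are placeholders, and the exact configuration used by the Bright notebook may differ), such an application could be submitted with the RAPIDS SQL plugin and GPU resources enabled as follows:
# spark-submit \
    --master spark://<spark-master>:7077 \
    --jars ${SPARK_JARS},${JAR_RAPIDS} \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    <application>.py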
4. Image creation
In a cluster with a Docker registry reachable at domain head-node-name on port 5000, connect to a host that meets all the prerequisites and run:
# mkdir spark-xgboost-image
# cd spark-xgboost-image
# curl https://support.brightcomputing.com/kb-articles/spark-xgboost/Dockerfile -o Dockerfile
# docker build -t head-node-name:5000/spark-xgboost-v1 .
# docker push head-node-name:5000/spark-xgboost-v1
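The local copy of the image can be listed to confirm that the build succeeded:
# docker images head-node-name:5000/spark-xgboost-v1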
It should now be possible to pull the image just created:
# docker pull head-node-name:5000/spark-xgboost-v1
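On a host with the NVIDIA driver and NVIDIA container runtime installed, a quick sanity check (not part of the build steps above) is to run nvidia-smi inside a throwaway container to confirm that the image can access the GPUs:
# docker run --rm --gpus all head-node-name:5000/spark-xgboost-v1 nvidia-smi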