
How to Deploy Spark with Kubernetes on Bright 9.0, 9.1, 9.2.

The steps described on this page can be followed to run a distributed Spark application using Kubernetes on Bright 9.0, 9.1 or 9.2.

1. Software versions

The Docker image that will be used for Spark provides software with the following main versions:

  • Operating System: Ubuntu, CentOS, Rocky, RHEL or SLES
  • Apache Spark: 3.2.1 (older or newer versions likely also work)
  • OpenJDK: 8

This article uses RHEL 8 with BCM 9.1.

2. Prerequisites

The steps described in this article have been tested with this environment:

  • Kubernetes: versions 1.16.1 through 1.21.4 have been tested (default values for cm-kubernetes-setup are sufficient for 9.0 and up)
  • Docker version: 19.03.15 or up (provided by cm-kubernetes-setup in some cases, otherwise use cm-docker-setup)
  • Docker registry (default values for cm-docker-registry-setup are sufficient, note that this is cm-container-registry-setup for 9.1+)
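
The installed versions can be checked on the head node, for example:

# module load kubernetes
# kubectl version --short
# module load docker
# docker version --format '{{.Server.Version}}'
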
3. Locally install Spark

In order to run Spark applications, the spark-submit binary is required.

The binary can be downloaded as follows:

# sudo yum install -y git java-1.8.0-openjdk-devel
# wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
# tar -zxvf spark-3.2.1-bin-hadoop3.2.tgz
# cd spark-3.2.1-bin-hadoop3.2/
# export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
# export PATH=$PATH:$JAVA_HOME/bin

At this point it’s possible to use the ./bin/spark-submit command:

# ./bin/spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_312
Branch HEAD
Compiled by user hgao on 2022-01-20T19:26:14Z
Revision 4f25b3f71238a00508a356591553f2dfa89f8290
Url https://github.com/apache/spark
Type --help for more information.

The spark-submit command requires a valid $JAVA_HOME environment variable; if in doubt, double-check the value with $JAVA_HOME/bin/java -version.
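
For example, to confirm that $JAVA_HOME points at the JDK installed earlier:

# $JAVA_HOME/bin/java -version

The reported version should be an OpenJDK 1.8.0 build, matching the java-1.8.0-openjdk-devel package installed above.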

4. Image creation

The Spark job needs to be containerized. The following helper scripts from the upstream project will be used to create an image with the SparkPi example, using Docker.

This step might require setting up Docker on the host if it is not yet available (cm-docker-setup). Alternatively, another node in the cluster that already has Docker can be used.

In a cluster with a Docker registry reachable at hostname head-node-name and port 5000, run:

# module load docker
# ./bin/docker-image-tool.sh -r head-node-name:5000/brightcomputing -t v3.2.1 -f ./kubernetes/dockerfiles/spark/Dockerfile build
# ./bin/docker-image-tool.sh -r head-node-name:5000/brightcomputing -t v3.2.1 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
# docker push head-node-name:5000/brightcomputing/spark:v3.2.1 
# docker push head-node-name:5000/brightcomputing/spark-py:v3.2.1

It should now be possible to pull the images just created:

# docker pull head-node-name:5000/brightcomputing/spark:v3.2.1
# docker pull head-node-name:5000/brightcomputing/spark-py:v3.2.1

Note: when Kubernetes uses containerd instead of Docker, registry access is configured differently by cm-container-registry-setup, through /etc/containerd/certs.d/head-node-name:5000/hosts.toml, so that these images can still be pulled.
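
For reference, such a hosts.toml typically looks similar to the following sketch (illustrative only; the exact server URL, capabilities and CA path generated by cm-container-registry-setup may differ):

server = "https://head-node-name:5000"

[host."https://head-node-name:5000"]
  capabilities = ["pull", "resolve"]
  # CA certificate for a registry with a self-signed certificate (this path is an assumption)
  ca = "/etc/containerd/certs.d/head-node-name:5000/ca.crt"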

5. Configure Kubernetes for Spark
# module load kubernetes
# cat << EOF > spark.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
secrets:
- name: spark-token-gmdfz
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
EOF

# kubectl apply -f spark.yaml

We create the above ServiceAccount to run this specific example, and bind it to the edit ClusterRole. Any other account with sufficient privileges will do, of course; the service account is passed as a parameter to spark-submit in the next step.
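
The result can be verified with:

# kubectl get serviceaccount spark -n default
# kubectl get clusterrolebinding spark-role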

6. Run Spark with Kubernetes

Kubernetes can now be used to start a Spark cluster on-demand.

The following example will compute the first digits of π. spark-submit connects to the Kubernetes API server, which listens on port 10443 on the head node; Kubernetes then schedules a pod for the Spark driver. In addition, 3 Spark executors will be started (and cleaned up) by Kubernetes.

# module load kubernetes
# ./bin/spark-submit \
    --master k8s://https://localhost:10443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=head-node-name:5000/brightcomputing/spark-py:v3.2.1 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar

Note: The full path of the examples jar file refers to a location inside the provided Docker image; the jar file was placed there during the image creation step. You can run the image manually to verify this:

# docker run -it head-node-name:5000/brightcomputing/spark-py:v3.2.1 /bin/bash
185@a58ddeaf0a1d:/opt/spark/work-dir$ ls /opt/spark/examples/jars/
scopt_2.12-3.7.1.jar  spark-examples_2.12-3.2.1.jar

Kubernetes will now schedule Spark pods. Their status will switch from Pending to Running to Succeeded. The final output in the terminal should be similar to the following:

22/03/17 13:10:23 INFO LoggingPodStatusWatcherImpl: Application status for spark-897b6dd3c86a470a971b1c2180f19fa8 (phase: Running)
22/03/17 13:10:23 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-pi-e7ad607f97c884e1-driver
	 namespace: default
	 labels: spark-app-selector -> spark-897b6dd3c86a470a971b1c2180f19fa8, spark-role -> driver
	 pod uid: 4fb53b0d-a2c5-4877-95ac-36ed0e3bdce8
	 creation time: 2022-03-17T12:09:36Z
	 service account name: spark
	 volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-4gxk8
	 node name: node003
	 start time: 2022-03-17T12:09:36Z
	 phase: Succeeded
	 container status: 
		 container name: spark-kubernetes-driver
		 container image: rb-kube92spark:5000/brightcomputing/spark-py:v3.2.1
		 container state: terminated
		 container started at: 2022-03-17T12:09:49Z
		 container finished at: 2022-03-17T12:10:21Z
		 exit code: 0
		 termination reason: Completed
22/03/17 13:10:23 INFO LoggingPodStatusWatcherImpl: Application status for spark-897b6dd3c86a470a971b1c2180f19fa8 (phase: Succeeded)
22/03/17 13:10:23 INFO LoggingPodStatusWatcherImpl: Container final statuses:


	 container name: spark-kubernetes-driver
	 container image: rb-kube92spark:5000/brightcomputing/spark-py:v3.2.1
	 container state: terminated
	 container started at: 2022-03-17T12:09:49Z
	 container finished at: 2022-03-17T12:10:21Z
	 exit code: 0
	 termination reason: Completed
22/03/17 13:10:23 INFO LoggingPodStatusWatcherImpl: Application spark-pi with submission ID default:spark-pi-e7ad607f97c884e1-driver finished
22/03/17 13:10:23 INFO ShutdownHookManager: Shutdown hook called
22/03/17 13:10:23 INFO ShutdownHookManager: Deleting directory /tmp/spark-b85fa3d9-2e75-4ee8-9613-d99719efaee1

Note: For longer-running jobs, the Spark UI can be viewed. It is exposed as a regular Kubernetes service for each Spark driver (job).

The above example would first result in scheduling the Driver Pod:

# module load kubernetes/default/1.16.1
# kubectl get pod
NAME                               READY   STATUS              RESTARTS   AGE
spark-pi-45239b7f9c96916d-driver   0/1     ContainerCreating   0          17s

Once running, it will schedule executors:

# kubectl get pod
NAME                               READY   STATUS              RESTARTS   AGE
spark-pi-45239b7f9c96916d-driver   1/1     Running             0          32s
spark-pi-5639b57f9c96ecd0-exec-1   0/1     ContainerCreating   0          9s
spark-pi-5639b57f9c96ecd0-exec-2   0/1     ContainerCreating   0          9s
spark-pi-5639b57f9c96ecd0-exec-3   0/1     ContainerCreating   0          9s

The Spark UI should be available already as a Kubernetes service at this point.
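
For example, the UI can be reached by port-forwarding the driver service (the service name follows the driver pod name and will differ per run; the name below is illustrative):

# kubectl get svc
# kubectl port-forward svc/spark-pi-45239b7f9c96916d-driver-svc 4040:4040

The Spark UI is then available on http://localhost:4040 for as long as the driver is running.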

The executors will transition to Running; once the driver transitions to Completed, the executors are cleaned up (Terminated). The driver pod is kept around, so that any output can still be extracted, for example:

# kubectl logs spark-pi-45239b7f9c96916d-driver | grep ^Pi
Pi is roughly 3.1370956854784273
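
Once the output has been retrieved, the completed driver pod can be removed manually:

# kubectl delete pod spark-pi-45239b7f9c96916d-driver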