For data processing tasks, there are several approaches you can take:
- using SQL to leverage a database engine to perform data transformations
- dataframe-based frameworks such as pandas, ray, dask and polars
- big data processing frameworks such as spark
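To give a small taste of the dataframe approach, here's a minimal pandas sketch (the data and column names are made up for illustration) doing the same work as a SQL `GROUP BY`:

```python
import pandas as pd

# Toy sales data standing in for a real workload (hypothetical columns)
df = pd.DataFrame({
    "city": ["SG", "SG", "KL", "KL"],
    "amount": [10, 20, 5, 15],
})

# Equivalent to: SELECT city, SUM(amount) FROM sales GROUP BY city
totals = df.groupby("city", as_index=False)["amount"].sum()
print(totals)  # SG -> 30, KL -> 20
```

The same transformation looks nearly identical in polars or dask; the frameworks mostly differ in how they scale beyond a single machine.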
Check out this article for a polars vs spark benchmark.
At larger data scales, the non-spark options can still work, but only with heavy vertical scaling, which gets very expensive. For comparison, our team had to scale a database up to 4 vCPU / 16 GB and a job still took the whole night, whereas spark on a single node can process the same data in 2 minutes flat.
Using spark would solve a lot of scaling problems; the only catch is that spark in cluster mode is very hard to set up. That's why most, if not all, cloud providers offer a managed spark runtime (AWS EMR, GCP Dataproc, Azure Databricks). But because these are managed services, you can't test locally (and you'll have to wait at least 5 minutes for the cluster to start and finish bootstrapping).
As of the Apache Spark 3.1 release in March 2021, spark on kubernetes is generally available. This is great in many ways, including but not limited to:
- you have full control of your infra
- it takes less than 10 seconds to start a spark cluster
- logs can be stored in a central location, to be viewed later via spark history server
- minio can be used as a local storage backend (better throughput than calling S3 over your home/work internet connection)
- it's cheaper than all managed solutions, even the serverless variants (more on this later)
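To illustrate the minio point above: here's a sketch of wiring pyspark's s3a connector to a MinIO endpoint. The endpoint, credentials and bucket are placeholders, and this assumes the hadoop-aws jars are on the classpath — treat it as a config sketch, not a drop-in snippet.

```python
from pyspark.sql import SparkSession

# All endpoint/credential values below are placeholders for a local MinIO deployment
spark = (
    SparkSession.builder
    .appName("minio-example")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.minio.svc.cluster.local:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")  # required for MinIO
    .getOrCreate()
)

# read from a hypothetical bucket served by MinIO
df = spark.read.parquet("s3a://my-bucket/input/")
```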
Start a local kubernetes cluster
Minikube, Kind or K3d should work.
```bash
k3d cluster create spark-cluster --api-port 0.0.0.0:6443

# verify that the cluster is created
kubectl get nodes
```
Create a namespace for spark jobs
```bash
kubectl create namespace spark
```
Create a service account for spark
```bash
kubectl create serviceaccount spark --namespace spark

kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=spark:spark \
  --namespace=spark
```
Install spark locally (so you can run `spark-submit`)
```bash
brew install temurin
brew install apache-spark
```
Run a demo spark job
If you hit the "to use support for EC Keys" error, see: https://stackoverflow.com/questions/75796747/spark-submit-error-to-use-support-for-ec-keys-you-must-explicitly-add-this-depe
```bash
# get the kubernetes control plane url via `kubectl cluster-info`
spark-submit \
  --master k8s://https://0.0.0.0:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=spark:3.4.1 \
  local:///opt/spark/examples/src/main/python/pi.py
```
Tearing down k3d
```bash
k3d cluster delete spark-cluster
```
Cost comparison
Since spark-submit on kubernetes is akin to serverless spark, we will compare costs against serverless spark runtimes. The scenario:
- CPU: 8 vCPU
- Memory: 16 GB
- Runtime: 15 Hours
- Ephemeral storage: 20 GB
- Region: Singapore
Remarks: spark on kubernetes in this article assumes you're using GKE Autopilot, which has a management fee of $60/month (unless it's your only cluster). Compute uses spot pricing.
| Serverless Runtime | Reference | Total Cost | Cost at 33x workloads (break-even point) |
|---|---|---|---|
| AWS EMR Serverless | https://aws.amazon.com/emr/pricing/ | 9.61 USD | 317.22 USD |
| GCP Dataproc Serverless | https://cloud.google.com/products/calculator | 4.26 USD | 140.58 USD |
| Azure Databricks Serverless | https://www.databricks.com/product/pricing/product-pricing/instance-types | 5.61 USD | 185.13 USD |
| (Bonus) spark on kubernetes | https://cloud.google.com/kubernetes-engine/pricing | 2.417 USD | 139.76 USD (including cluster management fee) |
The script I used to find the break-even point:
```python
aws = 9.61272
gcp = 4.26
databricks = 5.61
k8s = 2.417

found_break_even_point = False
multiplier = 0
while not found_break_even_point:
    multiplier += 1
    print(multiplier)

    aws_multiplied = aws * multiplier
    gcp_multiplied = gcp * multiplier
    databricks_multiplied = databricks * multiplier
    k8s_reference = (k8s * multiplier) + 60  # add gke autopilot management fee

    if k8s_reference < min([aws_multiplied, gcp_multiplied, databricks_multiplied]):
        found_break_even_point = True
```
Why GKE Autopilot
Traditionally, when you create a kubernetes cluster, you have to attach nodes yourself. This means that if your spark jobs' compute requirements exceed the available capacity, the jobs won't be able to start. With GKE Autopilot, GCP automatically provisions nodes and attaches them to your cluster, so you won't ever face "insufficient cpu / memory" errors.
If you're already using kubernetes and also use spark for data processing, migrating workloads to k8s can cut costs significantly. In our case, existing spark jobs run on a 4 vCPU / 16 GB VM (109.79 USD/month), which puts a big dent in our bills. If we migrated to spark on k8s, it would only cost us around 16.41 USD/month 😱. That's a whopping 85% price reduction 🤯.
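The quoted reduction follows directly from those two monthly figures:

```python
vm_monthly = 109.79   # current 4 vCPU / 16 GB VM, USD/month (figure from this article)
k8s_monthly = 16.41   # estimated spark-on-k8s cost, USD/month (figure from this article)

saving = (vm_monthly - k8s_monthly) / vm_monthly
print(f"{saving:.0%}")  # prints 85%
```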
For more advanced use cases, check out https://github.com/kahnwong/spark-on-k8s.