Faster Spark workloads with Comet

For big data processing, Spark is still king. Over the years, many improvements have been made to Spark performance. Databricks created Photon, an engine that accelerates Spark queries, but it is proprietary to Databricks. Other alternatives do exist (see here for more details), but they are not trivial to set up. Apache Arrow DataFusion Comet, surprisingly, takes very little time to set up....

April 7, 2024 · 2 min · Karn Wong

Slim down Python Docker image size with Poetry and pip

Python package management is not straightforward, since the default package manager (pip) does not behave like Node’s npm, in the sense that it doesn’t track dependency versions. This is why you should use Poetry to manage Python packages: it creates a lock file, so you can be sure that every re-install yields the same versions. However, this poses a challenge when you want to build a Docker image with Poetry, because you need an extra pip install poetry step (unless you bake this into your base Python image)....

April 7, 2024 · 2 min · Karn Wong

Dataframe write performance to Postgres

Previously, I talked about dataframe performance, but that didn’t include writing data to the destination. At a large scale, big data means you need to use Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs need big data, so small-data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than Polars, so only Spark and Polars remain in the benchmark....

March 17, 2024 · 2 min · Karn Wong

How to connect to Cloud SQL from Cloud Run (no, you don't need a VPC)

A minimal application architecture consists of a database and an application backend. Serverless databases are still in their infancy, but thankfully container-based runtimes are very much alive and doing well. On GCP, a serverless container-based runtime does exist, known as Cloud Run. Standard database access pattern: per standard security practices, you should not expose your database to the public, which means you should use a proxy/tunnel or a private network to reach your database....

February 10, 2024 · 3 min · Karn Wong

What is platform engineering?

Back in 2017-2018, everyone wanted to be a data scientist. Then reality hit: they needed a data engineer for a successful machine learning project. Things didn’t end there, since they also needed a machine learning engineer to create production-ready code. Some people think you only need an MLE and suddenly your ML project becomes a reality. Sadly, reality begs to differ, because you also need someone to deploy and scale it. Enter the DevOps engineer (who understands ML, this is very important)....

January 21, 2024 · 2 min · Karn Wong