What SQL can't do for data engineering

I often hear people ask “if you can do data engineering with SQL, then what’s the point of learning spark or python?” Data ingestion Let’s circle back at bit. I think we all can agree that: there’s a point in time where there’s no data in the data warehouse (which DE-who-use-SQL’s use as base of operation). The source data could be anything from hard CSV/Excel or API endpoints. No data in datawarehouse, DE can’t use SQL to do stuff with the data....

May 15, 2022 · 3 min · Karn Wong

Use SSH key during docker build without embedding the key via ssh-agent

Imagine working in a company, and they have a super cool internal module! The module works great, except that it is a private module, which means you need to install it by cloning the source repo and install it from source. That shouldn’t be an issue if you work on your local machine. But for production usually this means you somehow need to bundle this awesome module into your docker image....

February 6, 2022 · 2 min · Karn Wong

Use pyspark locally with docker

For data that doesn’t fit into memory, spark is often a recommened solution, since it can utilize map-reduce to work with data in a distributed manner. However, setting up local spark development from scratch involves multiple steps, and definitely not for a faint of heart. Thankfully using docker means you can skip a lot of steps 😃 Instructions Install Docker Desktop Create docker-compose.yml in a directory somewhere version: '3....

December 21, 2021 · 3 min · Karn Wong

Reduce docker image size with alpine

Creating scripts are easy. But creating a small docker image is not 😅. Not all Linux flavors are created equal, some are bigger than others, etc. But this difference is very crucial when it comes to reducing docker image size. A simple bash script docker image Given a Dockerfile (change apk to apt for ubuntu): FROMalpine:3WORKDIR/appRUN apk update && apk add jq curlCOPY water-cut-notify.sh ./ENTRYPOINT ["sh", "/app/water-cut-notify.sh"] Base image Docker image size alpine 11....

December 19, 2021 · 1 min · Karn Wong

Secrets management with SOPS, AWS SSM and Terraform

At my organization we use sops to check in encrypted secrets into git repos. This solves plaintext credentials in version control. However, say, you have 5 repos using the same database credentials, rotating secrets means you have to go into each repo and update the SOPS credentials manually. Also worth nothing that, for GitHub actions, authenticating AWS means you have to add repo secrets. This means for all the repos you have CI enabled, you have to populate the repo secrets with AWS credentials....

November 30, 2021 · 4 min · Karn Wong