Data engineer archetypes

I have been working in the data industry for almost half a decade. Over time I have noticed recurring archetypes across data engineering roles. Below are the main skill combinations I have seen over the years. This is by no means an exhaustive list, rather what I see most often. SQL + SSIS: using SQL to manipulate data via SSIS, where the data engine is Microsoft SQL Server. Commonly found in enterprise organizations that run the Microsoft stack....

August 26, 2022 · 2 min · Karn Wong

What SQL can't do for data engineering

I often hear people ask, “if you can do data engineering with SQL, then what’s the point of learning Spark or Python?” Data ingestion: let’s circle back a bit. I think we can all agree that there’s a point in time where there’s no data in the data warehouse (which SQL-centric DEs use as their base of operations). The source data could be anything from CSV/Excel files to API endpoints. With no data in the data warehouse, a DE can’t use SQL to do anything with the data....
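To make the gap concrete, here is a minimal ingestion sketch in Python, assuming pandas and SQLAlchemy; the file name, connection URL, and table name are all hypothetical. The point is that this landing step happens before any SQL can run:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical warehouse connection; until this load runs, there is
# nothing in the warehouse for SQL to query.
engine = create_engine("postgresql://user:password@localhost:5432/warehouse")

# The source could just as easily be an Excel file or an API response.
df = pd.read_csv("source.csv")
df.to_sql("raw_events", engine, if_exists="append", index=False)
```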

May 15, 2022 · 3 min · Karn Wong

Use pyspark locally with docker

For data that doesn’t fit into memory, Spark is often the recommended solution, since it can use map-reduce to process data in a distributed manner. However, setting up local Spark development from scratch involves multiple steps, and it is definitely not for the faint of heart. Thankfully, using Docker means you can skip a lot of steps 😃 Instructions: install Docker Desktop, then create docker-compose.yml in a directory somewhere:

```yaml
version: "3.3"
services:
  pyspark:
    container_name: pyspark
    image: jupyter/pyspark-notebook:latest
    ports:
      - "8888:8888"
    volumes:
      - ....
```
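Once the container is up (`docker compose up`), the notebook image ships with pyspark preinstalled. A minimal sketch of what you might run inside Jupyter (the CSV path is hypothetical):

```python
from pyspark.sql import SparkSession

# local[*] runs Spark inside the container using all available cores
spark = (SparkSession.builder
         .appName("local-dev")
         .master("local[*]")
         .getOrCreate())

df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical file
df.show(5)
```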

December 21, 2021 · 3 min · Karn Wong

Don't write large tables to postgres with pandas

We have a few tables where the data size is > 3 GB (in parquet, so around 10 GB uncompressed). Loading one into postgres takes an hour. (Most of our tables are pretty small, which is why we don’t use a columnar database.) I wanted to explore whether there’s a faster way. The conclusion: writing to postgres with Spark seems to be the fastest, given that we can’t use COPY, since our data contains free text, which makes CSV parsing impossible....
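For reference, a minimal sketch of the Spark-to-postgres write path, assuming the postgres JDBC driver is on Spark's classpath; the URL, credentials, and table name are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the driver jar is available, e.g. via
# --packages org.postgresql:postgresql:42.7.3
spark = SparkSession.builder.appName("pg-write").getOrCreate()

df = spark.read.parquet("big_table.parquet")  # hypothetical source

(df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # hypothetical
    .option("dbtable", "big_table")
    .option("user", "user")
    .option("password", "password")
    .option("driver", "org.postgresql.Driver")
    .option("batchsize", "10000")  # bigger batches cut per-row round trips
    .mode("overwrite")
    .save())
```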

June 27, 2021 · 1 min · Karn Wong

Data engineering toolset (that I use) glossary

Big data

- Spark: Map-reduce framework for dealing with big data, especially data that doesn’t fit into memory. Utilizes parallelization.

Cloud

- AWS: Cloud platform offering many of the tools used in software engineering.
- AWS Fargate: A launch mode for ECS tasks that automatically shuts down once a container exits. With the EC2 launch mode, you have to turn off the machine yourself.
- AWS Lambda: Serverless functions; can be used with a docker image too....

June 4, 2021 · 2 min · Karn Wong