DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it’s been 6 years already. Things have come a long way, and so has the size of the data I’m working with! Pandas has its own issues, namely no native support for nested schemas. In addition, it’s very heavy-handed regarding data type inference. That can be a blessing, but it’s a bane for data engineering work, where you have to make sure that your data conforms to an agreed-upon schema (hello data contracts!...

April 7, 2023 · 3 min · Karn Wong

Google Analytics v4 ingestion via BigQuery

Background: You want to track who accesses your site, and Google Analytics can do that. To see the data, you can use the Google Analytics dashboard; the default settings are good enough for most use cases. But what if you have a lot of tracking data, and you want a streamlined way to analyze it? You could use Data Studio for this, so it’s cool for the moment. But what if you want to use Google Analytics data in conjunction with other data?...

March 19, 2023 · 7 min · Karn Wong

Data transformation - python vs sql showdown

For most people, using SQL to transform data is a no-brainer, seeing as it’s a very versatile language and doesn’t have as steep a learning curve as Python. There are some cases where SQL is more suitable for a task, but the reverse can also happen. For instance, given a string conversion problem: if a character occurs only once in the string, replace it with #; if it occurs multiple times, replace it with &. For example, "one" becomes ###, "three" becomes ###&&, and "Heartbreak hotel" becomes &&&&&#&&&##&#&&#. A solution in Python would be:...

March 18, 2023 · 1 min · Karn Wong
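
The excerpt above cuts off before the post's own solution, so that code is not reproduced here. As an illustration only, here is a minimal Python sketch of the conversion rule. It assumes the rule is applied per character and that counting ignores case, which is inferred from the "Heartbreak hotel" example rather than stated in the excerpt. The function name `encode` is hypothetical.

```python
from collections import Counter


def encode(text: str) -> str:
    """Replace each character with '#' if it occurs once, '&' otherwise.

    Assumption: counting is case-insensitive, so 'H' and 'h' are treated
    as the same character (inferred from the excerpt's examples).
    """
    lowered = text.lower()
    counts = Counter(lowered)  # occurrences of each character
    return "".join("#" if counts[ch] == 1 else "&" for ch in lowered)


# The three inputs from the excerpt.
for sample in ("one", "three", "Heartbreak hotel"):
    print(sample, "->", encode(sample))
```

Under those assumptions, the sketch reproduces the three outputs listed in the excerpt.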

Intro to Dagster Cloud

Imagine you have a few data pipelines to schedule. The simplest solution would be a cron job. Time goes by and the next thing you know, you have around 50 pipelines to manage. The fun starts when you have to hunt down which pipeline isn’t running normally. And by then it would be super hard to do tracing if you haven’t set up logging and monitoring. Luckily there are tools we can use to improve the situation....

September 27, 2022 · 3 min · Karn Wong

Data engineer archetypes

I have been working in the data industry for almost half a decade. Over time I have noticed so-called archetypes within various data engineering roles. Below are the main skills and combinations I have seen over the years. This is by no means an exhaustive list, but rather what I often see. SQL + SSIS: using SQL to manipulate data via SSIS, where the data engine is Microsoft SQL Server. Commonly found in enterprise organizations that use the Microsoft stack....

August 26, 2022 · 2 min · Karn Wong