Data Engineering

DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it’s 6 years already. Things have come a long way, and so has the size of the data I’m working with! Pandas has its own issues, namely no native support for nested schemas. In addition, it’s very heavy-handed regarding data type inference. That can be a blessing, but it’s a bane for data engineering work, where you have to make sure that your data conforms to an agreed-upon schema (hello data contracts!...

Google Analytics v4 ingestion via BigQuery

Background: You want to track who accesses your site, and Google Analytics can do that. To see the data, you can use the Google Analytics dashboard; the default settings are good enough for most use cases. But what if you have a lot of tracking data and want a streamlined way to analyze it? You could use Data Studio for this, so that works for the moment. But what if you want to use Google Analytics data in conjunction with other data?...

Data transformation - Python vs SQL showdown

For most people, using SQL to transform data is a no-brainer, seeing as it’s a very versatile language and doesn’t have as steep a learning curve as Python. There are some cases where SQL is more suitable for a task, but the reverse can also be true. For instance, given a string conversion problem: if a character occurs only once in the string, replace it with #; if it occurs multiple times, replace it with &. Examples: "one" becomes ###, "three" becomes ###&&, and "Heartbreak hotel" becomes &&&&&#&&&##&#&&#. A solution in Python would be:...
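The excerpt stops before the post's own solution. As a minimal sketch of one way to approach the problem (not necessarily the author's), the snippet below counts characters and maps each one to `#` or `&`. It assumes the counting is per character and ignores case, which is what the "Heartbreak hotel" example suggests (`H` and `h` are both treated as repeated). The function name `encode` is hypothetical.

```python
from collections import Counter


def encode(text: str) -> str:
    """Map each character to '#' if it occurs once in text, else '&'."""
    # Assumption: counting is case-insensitive, inferred from the
    # "Heartbreak hotel" example, where 'H' and 'h' count as repeats.
    lowered = text.lower()
    counts = Counter(lowered)
    # Spaces are counted like any other character.
    return "".join("#" if counts[ch] == 1 else "&" for ch in lowered)


for sample in ("one", "three", "Heartbreak hotel"):
    print(sample, "->", encode(sample))
```

Run against the three examples in the excerpt, this reproduces `###`, `###&&`, and `&&&&&#&&&##&#&&#`.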

Intro to Dagster Cloud

Imagine you have a few data pipelines to schedule. The simplest solution would be a cron job. Time goes by, and the next thing you know, you have around 50 pipelines to manage. The fun starts when you have to hunt down which pipeline didn’t run normally. By then, it is very hard to trace the problem if you haven’t set up logging and monitoring. Luckily, there are tools we can use to improve the situation....

Data engineer archetypes

I have been working in the data industry for almost half a decade. Over time I have noticed so-called archetypes within various data engineering roles. Below are the main skills and combinations I have seen over the years. This is by no means an exhaustive list, but rather what I often see. SQL + SSIS: Using SQL to manipulate data via SSIS, where the data engine is Microsoft SQL Server. Commonly found in enterprise organizations that use the Microsoft stack....