Dataframe write performance to Postgres
Previously, I talked about dataframe performance, but that post did not cover writing data to a destination. At large scale, big data means you need Spark for data processing (unless you prefer SQL, in which case this post is irrelevant). But not many orgs need big data, so small-data frameworks should work, since they are easier to set up and use than Spark. Initially I wanted to include pandas as well, but sadly it performs significantly worse than Polars, so only Spark and Polars remain in the benchmark....
How to connect to Cloud SQL from Cloud Run (no, you don't need a VPC)
A minimal application architecture consists of a database and an application backend. Serverless databases are still in their infancy, but thankfully container-based runtimes are very much alive and doing well. On GCP, a serverless container-based runtime does exist, known as Cloud Run. Standard database access pattern: Per standard security practices, you should not expose your database to the public; this means you should use a proxy/tunnel or a private network to reach your database....
What is platform engineering?
Back in 2017-2018, everyone wanted to be a data scientist. Then reality hit: they needed a data engineer for a successful machine learning project. Things didn’t end there, since they also needed a machine learning engineer to create production-ready code. Some people think you only need an MLE and suddenly your ML project will become a reality. Sadly, reality begs to differ, because you also need someone to deploy and scale it: enter the DevOps engineer (who understands ML, this is very important)....
Collaboration model for data science projects
Many data science teams struggle to implement end-to-end machine learning projects. It’s a very common phenomenon, so if you are experiencing this, you are not alone. I have worked in every stage of the data science project lifecycle, in addition to normal web service deployments, and this is how I think we should collaborate. Collaboration model between teams. Note: The diagram does not signify order of communication. Rather, it states the communication pathways between teams....
Should data scientists deploy models to production?
Over the years I’ve heard stories of data teams struggling to deploy machine learning models to production. Clearly there is a pattern here. This article is my reflection on the matter. So what’s the problem? Data scientists, by definition, create mathematical models from data so that some unknowns can become known. This is colloquially known as “prediction.” For example, if you have sales data from last year, you can use it to forecast next year’s sales performance....