Use pyspark locally with docker

For data that doesn’t fit into memory, Spark is often a recommended solution, since it can use map-reduce to work with data in a distributed manner. However, setting up local Spark development from scratch involves multiple steps and is definitely not for the faint of heart. Thankfully, using Docker means you can skip a lot of steps 😃

Instructions

1. Install Docker Desktop
2. Create docker-compose.yml in a directory somewhere:

```yaml
version: "3.3"
services:
  pyspark:
    container_name: pyspark
    image: jupyter/pyspark-notebook:latest
    ports:
      - "8888:8888"
    volumes:
      - ...
```
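Once the container is running, the Jupyter notebook it serves already has PySpark available. A minimal sketch of what a first notebook cell might look like (the app name and CSV path below are placeholders, not taken from the post):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session inside the pyspark-notebook container.
spark = SparkSession.builder.appName("local-dev").getOrCreate()

# Read a CSV from a volume mounted into the container; the path is a placeholder.
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
```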

December 21, 2021 · 3 min · Karn Wong

Impute pipelines

Imagine having a dataset that you need to use for training a prediction model, but some of the features are missing. The good news is that you don’t need to throw any data away; you just have to impute the missing values. Below are the steps you can take to create an imputation pipeline. GitHub link here!

```python
from random import randint

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn....
```
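The post builds its pipeline step by step; as a rough sketch of the general idea (not the post’s exact code, with column names and data invented for illustration), an imputation pipeline with scikit-learn could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with missing values; columns and values are made up for this sketch.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 48000],
    "city": ["bangkok", np.nan, "chiangmai", "bangkok"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Impute numeric columns with the median, categorical columns with the most
# frequent value, then one-hot encode the categorical features.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(df)
print(X)
```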

May 22, 2020 · 8 min · Karn Wong