Shapefile to data lake

Background: we use Spark to read from and write to our data lake. For spatial data and analysis, we use Sedona. A shapefile is converted to TSV, then read by Spark for further processing and archival. Recently I had to archive shapefiles in our data lake. It wasn’t rosy, for the following reasons. Invalid geometries: Sedona (and GeoPandas too) whines if it encounters an invalid geometry during geometry casting. Invalid geometries can have many causes, one of them being unclean polygon clipping....

April 23, 2021 · 2 min · Karn Wong
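
To make the invalid-geometry problem from the excerpt above concrete, here is a minimal sketch in plain Python that flags two common defects in a polygon ring: a ring that is not closed, and a ring whose edges cross each other (the classic "bowtie"). The function names `ring_problems` and `_properly_cross` are hypothetical, and the check covers only a small part of what Sedona or GeoPandas validate; it is an illustration under those assumptions, not the method used in the post.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Cross product of (b - a) and (c - a): >0 left turn, <0 right turn, 0 collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _properly_cross(p1: Point, p2: Point, p3: Point, p4: Point) -> bool:
    """True if segments p1-p2 and p3-p4 cross at a single interior point.

    Segments that merely touch at an endpoint give a zero product and are
    not reported, so neighbouring edges of a ring never count as crossing.
    """
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def ring_problems(ring: List[Point]) -> List[str]:
    """Return a list of human-readable problems found in a polygon ring."""
    problems: List[str] = []
    if len(ring) < 4:
        problems.append("fewer than 4 points")
    if ring and ring[0] != ring[-1]:
        problems.append("ring is not closed")

    # Compare every pair of non-neighbouring edges for a proper crossing.
    edges = list(zip(ring, ring[1:]))
    for i in range(len(edges)):
        for j in range(i + 2, len(edges)):
            if _properly_cross(*edges[i], *edges[j]):
                problems.append("self-intersection")
                return problems
    return problems


# The ring from a well-formed polygon, and a bowtie whose edges cross.
valid_ring = [(30, 10), (40, 40), (20, 40), (10, 20), (30, 10)]
bowtie_ring = [(0, 0), (2, 2), (2, 0), (0, 2), (0, 0)]

print(ring_problems(valid_ring))
print(ring_problems(bowtie_ring))
```

A pre-check like this could be run before geometry casting to decide which rows need repair, though in practice a library validity function is the more complete option.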

Workarounds for archiving large shapefile in data lake

If you work with spatial data, chances are you are familiar with the shapefile, a file format for viewing and editing spatial data. Essentially, a shapefile is just tabular data like a CSV, but it carries a geometry data type that GIS tools such as QGIS or ArcGIS understand right away. If you have a CSV file with a geometry column in WKT format (something like POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))), you’ll have to specify which column is to be used for geometry....

January 31, 2021 · 2 min · Karn Wong
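
As a rough illustration of the point in the excerpt above, that a CSV reader has to be told which column holds the geometry, here is a standard-library-only sketch. The column name `geom`, the sample row `zone_a`, and the helpers `parse_simple_polygon` and `read_geometries` are all hypothetical. The parser handles only a single-ring `POLYGON` without holes; a real pipeline would more likely hand the WKT string to a spatial library.

```python
import csv
import io
import re
from typing import Dict, List, Tuple

# Hypothetical sample data: one attribute column and one WKT geometry column.
SAMPLE_CSV = (
    "name,geom\n"
    'zone_a,"POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))"\n'
)

# Matches a POLYGON with exactly one ring (no holes).
_POLYGON_RE = re.compile(r"^\s*POLYGON\s*\(\(([^()]*)\)\)\s*$", re.IGNORECASE)


def parse_simple_polygon(wkt: str) -> List[Tuple[float, float]]:
    """Parse a single-ring WKT POLYGON into a list of (x, y) tuples."""
    match = _POLYGON_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a single-ring POLYGON: {wkt!r}")
    ring = []
    for pair in match.group(1).split(","):
        x, y = pair.split()
        ring.append((float(x), float(y)))
    return ring


def read_geometries(text: str, geometry_column: str) -> List[Dict[str, object]]:
    """Read CSV text and parse the named column as geometry; other columns stay strings."""
    rows: List[Dict[str, object]] = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed: Dict[str, object] = dict(row)
        parsed[geometry_column] = parse_simple_polygon(row[geometry_column])
        rows.append(parsed)
    return rows


rows = read_geometries(SAMPLE_CSV, geometry_column="geom")
print(rows[0]["name"], rows[0]["geom"])
```

Without the explicit `geometry_column` argument the reader has no way to tell the WKT text apart from any other string column, which is the gap a shapefile closes by design.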

Visualizing map region prefix/suffix

import pandas as pd
import numpy as np
import geopandas as gpd
import geoplot as gplt
import matplotlib.pyplot as plt
from geoplot import polyplot
from pythainlp.tokenize import word_tokenize, syllable_tokenize

Data structure
name: target region name
geometry: spatial column
*: parent region name, e.g. in “district” dataset it would have a “province” column

Dissolving dataset in case you have multiple region level in the same file

## assuming you have a district dataset and want to dissolve to province only
district_filename = "FILE_PATH_HERE"
gdf = gpd....

September 3, 2020 · 4 min · Karn Wong