Energy Justice Analysis of Climate Data with ClimRR

Sam Foreman

Climate Analysis with ClimRR

Sam Foreman

Argonne National Laboratory

August 7, 2023

Getting Started

Useful links + References

Useful links + References:

Learning Goals

The goal of this project is to teach students to¹:

Use Unix commands to work with files and navigate directories
Use JupyterHub + basic familiarity with how to use Jupyter notebooks on HPC systems
Identify some of the common file types and data formats for geospatial data
- (+ common python libraries for geospatial data analysis)
Perform exploratory data analysis on geospatial data
- (+ understand different operations for manipulating and interacting with this data)
- tabular vs. gridded data
Perform simple visualizations in Python to display different types of data (e.g. maps, line charts, interactive visualizations, etc)
- using figures to illustrate a point or idea
- know what types of plots to use for which situations / data types
Understand control flow / basic structure of a Python script
- using Python in Jupyter²
- import-ing libraries, etc

Reading and Writing Files

We use GeoPandas, an open source project to make working with geospatial data in python easier.

GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types.

Geometric operations are performed by shapely.

GeoPandas further depends on fiona and matplotlib for plotting.

GeoPandas can read almost any vector-based spatial data format including ESRI shapefile, GeoJSON files and more using the command

import geopandas as gpd
import geodatasets
gdf = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))
gdf.head(n=2)

/Users/samforeman/miniconda3/envs/ClimRR/lib/python3.11/site-packages/pyproj/__init__.py:91: UserWarning: Valid PROJ data directory not found. Either set the path using the environmental variable PROJ_DATA (PROJ 9.1+) | PROJ_LIB (PROJ<9.1) or with `pyproj.datadir.set_data_dir`.
  warnings.warn(str(err))

	community	NID	POP2010	POP2000	POPCH	POPPERCH	popplus	popneg	geometry
0	DOUGLAS	35	18238	26470	-8232	-31.099358	0	1	MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ...
1	OAKLAND	36	5918	6110	-192	-3.142390	0	1	MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ...

which returns a GeoDataFrame object.

A GeoDataFrame is a tabular data structure that contains a GeoSeries.

The most important property of a GeoDataFrame is that it always has one GeoSeries column that holds a special status.

This GeoSeries is referred to as the GeoDataFrame’s “geometry”. When a spatial method is applied to a GeoDataFrame (or a spatial attribute like area is called), this commands will always act on the “geometry” column³.

Imports / setup

We’ve prepared some useful functions in src/ClimRR/data.py to simplify the process of loading and working with our data.

Imports

import matplotlib_inline
import matplotlib.pyplot as plt
import geopandas as gpd
import warnings

import matplotlib.pyplot as plt

from enrich.style import STYLES
from rich.theme import Theme
from rich.console import Console as Console
from ClimRR import get_logger, set_plot_style
from ClimRR.data import DATA_DIR

matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

set_plot_style()
theme = Theme(STYLES)
log = get_logger()
console = Console(
    theme=Theme(STYLES),
    log_path=False,
    markup=True,
    width=512
)

Data Organization

The ClimRR Data can be downloaded as a *.zip file, and contains:

Shapefile (as an additional *.zip inside the original)
Individual *.csv’s for each of the climate variables
- {AnnualTemperatureMaximum,...,HeatingDegreeDays}.csv

*.zip contents:

📂 ClimRR Data Download/
┣━━ 📂 GridCells2Shapefile/
┃   ┣━━ 📄 GridCells2.cpg
┃   ┣━━ 📄 GridCells2.dbf
┃   ┣━━ 📄 GridCells2.prj
┃   ┣━━ 📄 GridCells2.sbn
┃   ┣━━ 📄 GridCells2.sbx
┃   ┣━━ 📄 GridCells2.shp
┃   ┣━━ 📄 GridCells2.shp.xml
┃   ┗━━ 📄 GridCells2.shx
┣━━ 📂 GridCellsShapefile/
┃   ┣━━ 📄 GridCells.cpg
┃   ┣━━ 📄 GridCells.dbf
┃   ┣━━ 📄 GridCells.prj
┃   ┣━━ 📄 GridCells.sbn
┃   ┣━━ 📄 GridCells.sbx
┃   ┣━━ 📄 GridCells.shp
┃   ┣━━ 📄 GridCells.shp.xml
┃   ┗━━ 📄 GridCells.shx
┣━━ 📄 AnnualTemperatureMaximum.csv
┣━━ 📄 AnnualTemperatureMinimum.csv
┣━━ 📄 ClimRR Metadata and Data Dictionary.pdf
┣━━ 📄 ConsecutiveDayswithNoPrecipitation.csv
┣━━ 📄 FireWeatherIndex_Wildfire.csv
┣━━ 📄 GridCells2Shapefile.zip
┣━━ 📄 GridCellsShapefile.zip
┣━━ 📄 HeatingDegreeDays.csv
┣━━ 📄 Precipitation_inches_AnnualTotal.csv
┣━━ 📄 README.txt
┣━━ 📄 SeasonalTemperatureMaximum.csv
┣━━ 📄 SeasonalTemperatureMinimum.csv
┗━━ 📄 WindSpeed.csv

Types of Data

Figure 1: (Image credit: National Ecological Observatory Network (NEON))

One of the most common file formats for vector data is the ESRI shapefile, which is what we will be working with for this project.

Metadata

Metadata is “data about the data”⁴ and is (by design) meant to give additional information or provide context about a dataset.

Examples might include:

When was this data created?
- By who? For what? Where at? When? Why??
How is this data licensed?
Is there a reference for this data? (DOI ? URL ? etc.)
What variables or fields are contained in this data?
- What do they represent? Are there units?
If the data is geospatial, what geographical or temporal area is included?
Additional (often contextual) information about the data
- e.g. “this data was created to inform a population about upcoming weather events” or similar

Metadata is often expected to be of a certain form, or to follow specific conventions / guidelines.

This is important to keep in mind and will allow others to understand your data without needing an explanation (e.g. “what does this abbreviation mean?”, “how is this variable defined?”, etc.)

Some common metadata conventions for GIS data include:

Schema.org is another useful reference and provides a general framework for dealing with metadata.

FAIR Data

FAIR stands for “Findable, Accessible, Interoperable, Reusable”, and provides a set of guidelines for data sharing.

In the age of “big data”, its important that we use (and promote!) tools that facilitate the effective sharing of data.

Ideally, our data would be completely self-contained and provide, via metadata, all of the information required to understand and work with it.

Load Shapefile and inspect

A shapefile is provided in the ClimRR Data Download (ANL) and can be loaded using geopandas.read_file(...) which will return a geopandas.GeoDataFrame:

shpfile = DATA_DIR.joinpath(
    "GridCellsShapefile/GridCells.shp"
)
shape = gpd.read_file(shpfile)

Each entry in this table defines a single grid cell (12km x 12 km) which collectively tile the United States.

We can get a better understanding of whats going on by looking at the first few entries:

shape.head(n=2)

	OBJECTID	Crossmodel	Shape_Leng	Shape_Area	geometry
0	1	R161C438	63614.764866	2.529273e+08	POLYGON ((-9530601.177 4726046.614, -9534793.8...
1	2	R125C222	61384.219597	2.355013e+08	POLYGON ((-12959076.287 4395610.472, -12974301...

We see that each row has the following columns: {OBJECTID, Crossmodel, Shape_Leng, Shape_Area, geometry}.

In particular, the Crossmodel⁵ column is a text ID that uniquely identifies an individual cell.

To be explicit, let’s look at the WindSpeed.csv file.

Dealing with Geometry

Our shapefile contains a grid of cells (12km x 12km) which tile the continental US.

We can inspect a single cell:

cell = shape[shape["Crossmodel"] == 'R146C497']
cell.head()

	OBJECTID	Crossmodel	Shape_Leng	Shape_Area	geometry
4	5	R146C497	60142.919468	2.260731e+08	POLYGON ((-8733007.764 4224658.634, -8738250.3...

ax = cell.boundary.plot()
ax.set_axis_off()
_ = ax.set_title('Grid from shapefile: 12 x 12 km')
plt.tight_layout()

cell.explore()

Make this Notebook Trusted to load map: File -> Trust Notebook

Load data from `*.csv` files

Each entry (row) in the .csv has a Crossmodel column (e.g. R146C497) which corresponds to a row in our shapefile that uniquely determines its location on the Earth.

We can associate with each of the .csvs the geometry used in our shapefile to position our data on the globe.

import pandas as pd
csvs = [i for i in DATA_DIR.rglob('*.csv')]
data = {}
for f in csvs:
    key = f.stem
    tmp = pd.read_csv(f.as_posix())
    gdf = shape.merge(tmp, on='Crossmodel')
    gdf['boundary'] = gdf.boundary
    gdf['centroid'] = gdf.centroid
    data[key] = gdf
    console.log(f"data['{key}'].shape={data[key].shape}")

[20:42:54] data['FireWeatherIndex_Wildfire'].shape=(62834, 35)

           data['HeatingDegreeDays'].shape=(62834, 10)

[20:42:55] data['AnnualTemperatureMinimum'].shape=(62834, 18)

           data['SeasonalTemperatureMaximum'].shape=(62834, 27)

           data['ConsecutiveDayswithNoPrecipitation'].shape=(55896, 19)

           data['SeasonalTemperatureMinimum'].shape=(62834, 27)

           data['WindSpeed'].shape=(62834, 18)

           data['AnnualTemperatureMaximum'].shape=(62834, 18)

           data['Precipitation_inches_AnnualTotal'].shape=(55896, 18)

Look at the WindSpeed data

Lets inspect one of the entries in our data[(...)] dictionary, WindSpeed, for example:

data["WindSpeed"].head()

	OBJECTID	Crossmodel	Shape_Leng	Shape_Area	geometry	hist	rcp45_midc	rcp45_endc	rcp85_midc	rcp85_endc	mid45_hist	end45_hist	mid85_hist	end85_hist	mid85_45	end85_45	boundary	centroid
0	1	R161C438	63614.764866	2.529273e+08	POLYGON ((-9530601.177 4726046.614, -9534793.8...	7.21540	7.19415	7.38917	7.30470	7.22690	-0.021256	0.173764	0.089297	0.011499	0.110553	-0.162264	LINESTRING (-9530601.177 4726046.614, -9534793...	POINT (-9540369.710 4720470.575)
1	2	R125C222	61384.219597	2.355013e+08	POLYGON ((-12959076.287 4395610.472, -12974301...	8.32612	8.11360	8.26028	8.17420	8.02081	-0.212523	-0.065843	-0.151919	-0.305307	0.060603	-0.239465	LINESTRING (-12959076.287 4395610.472, -129743...	POINT (-12967596.341 4402326.143)
2	3	R121C235	61111.892875	2.334164e+08	POLYGON ((-12754805.395 4355815.951, -12770000...	8.58573	8.59828	8.56058	8.54483	8.55421	0.012547	-0.025149	-0.040898	-0.031519	-0.053446	-0.006370	LINESTRING (-12754805.395 4355815.951, -127700...	POINT (-12763132.114 4362694.465)
3	4	R169C431	64716.234995	2.617618e+08	POLYGON ((-9605729.481 4879238.815, -9609863.1...	9.17284	9.21681	9.44966	9.26548	9.14917	0.043968	0.276813	0.092635	-0.023674	0.048667	-0.300487	LINESTRING (-9605729.481 4879238.815, -9609863...	POINT (-9615619.029 4873482.586)
4	5	R146C497	60142.919468	2.260731e+08	POLYGON ((-8733007.764 4224658.634, -8738250.3...	8.25430	8.19130	8.34417	8.29698	8.29411	-0.062996	0.089874	0.042684	0.039807	0.105680	-0.050067	LINESTRING (-8733007.764 4224658.634, -8738250...	POINT (-8742676.917 4220233.536)

We see that each entry has a geometry column, as well as columns for {hist,rcp45_midc, rcp45_endc, rcp85_midc, rcp85_endc, ...} which contains the numerical value of the WindSpeed in each cell under different scenarios at different points in time.

Let’s look at the WindSpeed for our individual cell:

cell_wind = data["WindSpeed"][data["WindSpeed"]["Crossmodel"] == 'R146C497']

ax = cell_wind.plot(column='hist', legend=True)
ax.set_axis_off()
_ = ax.set_title("WindSpeed [Hist] for CELL: R146C497")

Visualizing our Data

Lets inspect the first few entries from our Shapefile:

Figure 2: As we include more cells, we see the outline of the US beginning to take shape.

Footnotes

Building on ideas from Earth and Environmental Science ↩︎
Getting Python setup correctly can be surprisingly difficult↩︎
The “geometry” column – no matter its name – can be accessed through the geometry attribute (gdf.geometry), and the name of the geometry column can be found by typing gdf.geometry.name.↩︎
Metadata ↩︎
Truncated name for “Crossmodel_CellName”.↩︎

Citation

BibTeX citation:

@online{foreman2023,
  author = {Foreman, Sam},
  title = {Energy {Justice} {Analysis} of {Climate} {Data} with
    {ClimRR}},
  date = {2023-08-07},
  url = {https://saforem2.github.io/climate-analysis},
  langid = {en}
}

For attribution, please cite this work as:

Foreman, Sam. 2023. “Energy Justice Analysis of Climate Data with ClimRR.” Intro to HPC Bootcamp @ NERSC: ClimRR. August 7, 2023. https://saforem2.github.io/climate-analysis.