====== Python/Pandas Quick Reference Card ======
A "cheat sheet" of commonly used Python data tools, control structures and their syntax.
Full pandas [[http://pandas.pydata.org/pandas-docs/stable/index.html|documentation]].
Python 2.7 [[https://docs.python.org/2/contents.html|documentation]].
===== Comments =====
# single line comment
'''
multi
line
comments
'''
(The triple-quoted form is actually a bare string literal, conventionally used as a block comment.)
===== Common library imports =====
import geopandas as gpd
from geopandas.tools import sjoin
from geopandas.tools import overlay
import pandas as pd
from pandas.stats.api import ols  # removed in pandas 0.20; see Statistical models below
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import StringIO  # Python 2; on Python 3 use: from io import StringIO
import os
import sys
Note that on Anaconda, the typical path for packages is something like:
/opt/conda/envs/myenv/lib/python3.6/site-packages
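To confirm where a given package is actually loaded from, check its ''%%__file__%%'' attribute (pandas used here only as an example):
import pandas
print(pandas.__file__)  # prints the site-packages path in use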
===== Data import and export =====
Read and write: CSV, Excel, shapefile, etc. FIXME
filename = communityName + '\\output\\df_nonRes_bldgs.csv'
df_general = pd.read_csv(filename)
Specify a single worksheet to read:
df_tripsOD = pd.read_excel(filename, sheet_name='HBS_24')
Specify multiple worksheets to read (returns a dict of DataFrames):
l_purps = ['HBS-24','HBW-24','HBO-24','NHB-24']
d_df_tripsOD = pd.read_excel(filename, sheet_name=l_purps)
df_nonRes_bldgs.to_csv(filename, index=False)
df_structure.to_excel(filename)
df_parcels_0 = gpd.GeoDataFrame.from_file('myParcels.shp')
df_structure = gpd.read_file('myLayers.gdb', layer='structure')
Round-trip export/import for JSON files, as Python dictionaries or lists. Requires ''import json''.
filename = 'folderX/list_CEUDSector_ind.json'
with open(filename, 'w', encoding='utf-8') as f:
    json.dump(l_CEUDSector_ind, f, ensure_ascii=False, indent=4)
filename = 'folderX/list_CEUDSector_ind.json'
with open(filename) as data_file:
    l_CEUDSector_ind = json.load(data_file)
===== Data interrogation =====
df_buildings.info()
option ''null_counts=True'' (renamed ''show_counts'' in newer pandas)
To see the individual cell types
df_bld_0.DFRPPropertyNumber.map(type)
df_buildings.describe()
df_buildings.head()
df_buildings.tail()
df_buildings.sample(10)
df_buildings.StrucType.value_counts()
pt_structure = pd.pivot_table(df_structure,
                              values='OBJECTID',
                              index=['propType'],
                              columns=['strucType'],
                              margins=False,
                              aggfunc=np.sum)
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html#pandas.pivot_table|ref]]
Common aggfunc params: ''len'', ''np.sum'', ''np.mean''
df_buildings.StrucType.unique()
df_buildings.StrucType.nunique()
===== Selecting data =====
Selecting rows
df_farm = df_structure[
    (df_structure.PropertyCode >= 200) &
    (df_structure.PropertyCode < 300)]
Can use bitwise operators: AND ''&'', OR ''|'', NOT ''~''
Other boolean operators: ''=='', ''>='', ''>'', ''<='', ''<'', ''!='', ''isin(myList)'', ''isnull()'', ''notnull()''
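For example, with ''isin()'' and ''isnull()'' (the strucType values are hypothetical):
df_houses = df_structure[df_structure.strucType.isin(['house', 'duplex'])]
df_noCode = df_structure[df_structure.PropertyCode.isnull()]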
Selecting columns
df_condosLUT = df_condos.loc[:,['RollNumber','CondoRollN']]
===== Setting and filling data =====
df_buildings.loc[df_buildings.UNIQUEID == '6554354',
                 'ASMTCODE'] = '303 - Warehouses (Including Cold Storage Plants)'
**fillna()** [[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html|ref]]
df_building['YearBuilt'].fillna(2011, inplace=True)
Conditional if-then-else column-wise logic
df_structure['finalID'] = np.where(df_structure.condoID.isnull(),
df_structure.normalID, df_structure.condoID)
**where()**, **mask()**
''where()'' replaces values where the condition is //False//; ''mask()'' replaces values where it is //True//.
df_8.mask(df_8 > 5.0, 5.0)
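The equivalent ''where()'' form of the same 5.0 cap:
df_8.where(df_8 <= 5.0, 5.0)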
===== Multiindexing =====
Turns columns A and B into the multiindex for the output dataframe.
df.set_index(['A','B'])
Remove multiindexing.
df.reset_index()
Remove an unwanted level
df.columns = df.columns.get_level_values('CEUDEndUse_res')
===== Dropping/deleting data =====
**drop()** column
df_parcels.drop(['index_right'], axis=1, inplace=True)
**drop_duplicates()**
s_parcelARNs_noDups = df_parcels_2a.PID_str.drop_duplicates(keep='first')
keep options: ''first'', ''last'', ''False''
===== Plotting =====
df_buildings.boxplot(column=['area'],
                     by=['CATEGORY'], figsize=(12,12))
df_look['bsmtArea_sf'].plot(kind='hist', bins=40, title='bsmt. area',
                            xlim=[0,max_footprint])
plt.show()
df_aptByParcel.plot.scatter(x='PhysDwPerBldg',
                            y='numStoreys_avg', sharex=False)
df_aptByParcel.plot.hexbin(x='PhysDwPerBldg',
                           y='numStoreys_avg', sharex=False, gridsize=25)
Histograms, maps, KDEs, boxplots, etc. FIXME
===== Pivottable.js =====
from pivottablejs import pivot_ui
url = 'http://servername.xyz.com:8889/files/city_share/'
filename = 'pivottable/df_nonRes.html'
pivot_ui(df_CAPEX_1,
         rows=['buildingID'],
         rowOrder='value_z_to_a',
         cols=['year'],
         vals=['capexInt_dlrPerSqft'],
         exclusions={'buildingID':['']},
         aggregatorName='Sum',
         rendererName='Heatmap',
         outfile_path=filename, url=url + filename)
documentation: [[https://github.com/nicolaskruchten/pivottable/wiki/Parameters|parameters]] and [[https://github.com/nicolaskruchten/jupyter_pivottablejs|python/jupyter]]
===== Plotly =====
import plotly.express as px
import plotly.io as pio
fig = px.scatter(pt_peo_adjFactors_bid_3,
                 x='electricity',
                 y='naturalGas',
                 hover_name=pt_peo_adjFactors_bid_3.index,
                 marginal_x='histogram',
                 marginal_y='histogram')
fig.show()
filename = "pivottable/baseline_adjFactors_electricity-naturalGas.html"
pio.write_html(fig, file=filename)
===== Geo-processing data =====
Overlay, centroid, area, CRS transform, etc. (see the sketch below) FIXME
**sjoin()**
df_parcels_1c = sjoin(df_parcels_1b, df_trafficZones_2, how="left")
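A minimal sketch of the other common geo-operations, assuming ''df_parcels'' and ''df_zones'' are polygon GeoDataFrames and EPSG:26917 is the target projection (all hypothetical):
df_utm = df_parcels.to_crs(epsg=26917)        # reproject to a metric CRS
df_utm['area_m2'] = df_utm.geometry.area      # polygon areas
centroids = df_utm.geometry.centroid          # centroid points
df_clip = gpd.overlay(df_utm, df_zones, how='intersection')  # polygon overlay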
===== Axis =====
Specifies the axis along which a function is computed; by default ''axis=0''. This is consistent with ''numpy.mean'' when ''axis'' is given explicitly (in ''numpy.mean'', ''axis=None'' by default, which computes the mean over the flattened array): ''axis=0'' runs down the rows (the pandas index), and ''axis=1'' runs across the columns.
+------------+----------+----------+
|            |    A     |    B     |
+------------+----------+----------+
|     0      | 0.626386 | 1.52325  |  ---axis=1--->
+------------+----------+----------+
      |
   axis=0
      |
      v
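A quick demonstration on a toy DataFrame (values arbitrary):
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df.sum(axis=0)  # column totals: A=3, B=30
df.sum(axis=1)  # row totals: 11, 22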
===== Aggregating data =====
**groupby()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby|ref]]
gb_single = df_single.groupby(
['ID_TAZ','buildingClass', 'yearbuiltClass'], as_index=False)
df_singleUnits = gb_single.NosUnits.sum()
Other groupby convenience functions: ''size()'', ''mean()'', ''first()'', ''last()'', ''nth()''
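For example, counting records per group with ''size()'' (reusing a column from the example above):
s_counts = df_single.groupby('buildingClass').size()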
===== Statistical models =====
**OLS regression**
Note: ''pandas.stats.api.ols'' was removed in pandas 0.20; a statsmodels sketch follows.
from pandas.stats.api import ols
regr_1 = ols(y=df_apt['footprint_avg_sqft'], x=df_apt['PhysDwPerBldg'],
             intercept=False)
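A rough statsmodels equivalent (same ''df_apt'' columns; omitting ''sm.add_constant()'' corresponds to ''intercept=False''):
import statsmodels.api as sm
regr_1 = sm.OLS(df_apt['footprint_avg_sqft'],
                df_apt['PhysDwPerBldg']).fit()
print(regr_1.summary())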
===== Strings =====
String concatenation
df_structure['propType'] = df_structure.PropertyCode + ' - ' + df_structure.propCodeDesc  # assumes PropertyCode is already a string; see astype()
String length
df_parcels.ROLLNO.str.len().value_counts()
String slice
df_parcels_2b['ARN'] = df_parcels_2b.ARN.str.slice(0,15)
where first and second indices are inclusive and exclusive respectively.
String (and number) formatting example
x = df_pop2016_0.value_total.sum()
print("{:,}".format(x))
Formats the number with commas (e.g. ''29,232,957.0''). See [[https://mkaz.blog/code/python-string-format-cookbook/|resource]]
===== Python data and control structures =====
**traversal**
for var in traversable_object:
    stmt ...
For a dictionary:
for k, v in d_test.items():
    print(k)
    print(v)
or
for k in d_test:
    print(k)
    print(d_test[k])
**dictionaries**
[[https://www.w3schools.com/python/python_ref_dictionary.asp|handy dictionary methods]]: ''get()'', ''keys()'', ''values()''
**lists**
[[https://www.w3schools.com/python/python_ref_list.asp|handy list methods]]: ''append()''
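Quick illustrations of these methods on toy data:
d_test = {'a': 1}
d_test.get('b', 0)    # 0 -- default returned when key is missing
list(d_test.keys())   # ['a']
list(d_test.values()) # [1]
l_items = []
l_items.append('x')   # l_items is now ['x']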
**function**
def function_name(params):
    stmt ...
    [return var]
Function used in DataFrame calculations with **apply()** example:
def getFootprintSizeClass(row):
    if (row['dwellType'] == 'single'):
        if (row['footprintClass'] == '(0, 500]'):
            return 'tiny'
        elif (
            (row['footprintClass'] == '(500, 750]') |
            (row['footprintClass'] == '(750, 1000]')):
            return 'small'
        elif (row['footprintClass'] == '(1000, 1500]'):
            return 'medium'
    elif (row['dwellType'] == 'double'):
        if (
            (row['footprintClass'] == '(0, 500]') |
            (row['footprintClass'] == '(500, 750]')):
            return 'small'
        elif (
            (row['footprintClass'] == '(750, 1000]') |
            (row['footprintClass'] == '(1000, 1500]') |
            (row['footprintClass'] == '(1500, 10000]') |
            (row['footprintClass'] == '(10000, 100000]')):
            return 'large'
    else:
        return
df_building['footprintSizeClass'] = df_building.apply(getFootprintSizeClass, axis=1)
===== Other useful Python syntax =====
increment integer
int_i += 1
===== Github =====
Prior to committing, strip ipynb notebook files with the Linux shell as per this [[http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/|blog post]]
nbstrip_jq myNotebook.ipynb > myNotebook_stripped.ipynb
And then
git status
git add <file>
git add --all
git commit -m "description"
git push [origin master]
Get commit history
git log
Or the full reference history (handy for recovering lost commits)
git reflog
Pull
git pull origin master
===== Handy Matrix & Vector Operations =====
**shares / norm**
Each row sums to 1
df_shr = df_quantity.div(df_quantity.sum(axis='columns'), axis='index')
Each column sums to 1
df_shr = df_quantity.div(df_quantity.sum(axis='index'), axis='columns')
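A toy check of the row-share version (values arbitrary):
df_quantity = pd.DataFrame({'A': [1, 2], 'B': [3, 2]})
df_shr = df_quantity.div(df_quantity.sum(axis='columns'), axis='index')
# row 0 -> A=0.25, B=0.75; row 1 -> A=0.5, B=0.5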
===== Jupyter Notebook Markdown Tricks =====
Links
[Github](http://github.com)
Embed an image
![title](img/S_res_1.png)
===== Python Debugger (PDB) =====
[[https://docs.python.org/3/library/pdb.html|ref]]
import pdb; pdb.set_trace()
''n'' for next line; ''c'' to continue; ''s'' to step into; ''q'' (or ''exit'') to quit
===== Other functions =====
==== A ====
**astype()** - cast pandas Series from one dtype type to another.
df_propCodes['propType'] = df_propCodes['propCode'].astype(str) \
    + ' - ' + df_propCodes.propCodeDesc
types: ''str'', ''int'', ''float''
also see: ''pd.to_numeric()''
df['elec'] = pd.to_numeric(df['elec'], errors='coerce')
==== C ====
**cut()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html#pandas.cut|ref]]
footprintBins = [0, 500, 750, 1000, 1500, 10000, 100000]
df_building['footprintClass'] = pd.cut(
df_building.footprintPerUnit, footprintBins)
more [[python:cut]] examples
**concat()** - combine blocks of records into a single DataFrame. Default axis is 0 (vertical); rows are added. See [[https://stackoverflow.com/questions/50501787/python-pandas-user-warning-sorting-because-non-concatenation-axis-is-not-aligne|article]] on sort parameter
df_double = pd.concat([
    df_semi,
    df_duplex],
    sort=True,
    axis=0)
==== M ====
**melt()** [[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=melt#pandas.DataFrame.melt|ref]]
df_emissions_2 = pd.melt(df_emissions_1,
                         id_vars=['department','scope','fuelType'],
                         value_vars=['2016','2017','2018'],
                         var_name='Time',
                         value_name='Values')
Note that //value_vars// is optional; if omitted, all columns not listed in //id_vars// are used.
**merge()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge|ref]]
df_building_2 = pd.merge(df_building_1,
                         df_parcel, left_on='ParcelID', right_on='PID', how='left')
Note that merging returns a plain DataFrame, so GeoDataFrame attributes need to be reconstructed:
df_building_2 = gpd.GeoDataFrame(pd.merge(df_building_1,
                                          df_parcel, left_on='ParcelID', right_on='PID', how='left'))
df_building_2.crs = df_building_1.crs
**map()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html|ref]]
d_storeysClass = {
    0   : '1Story',
    1   : '1Story',
    1.5 : '2Story',
    2   : '2Story',
    3   : '3Story'
}
df_building['storeysClass'] = df_building.numStoreys.map(d_storeysClass)
**modulo**
x % y
Gives the remainder after division (e.g. ''7 % 3'' gives ''1'').
==== R ====
**rename()**
df_structure.rename(columns={
    'oldName1' : 'newName1',
    'oldName2' : 'newName2'},
    inplace=True)
**replace()** - string replace
df_bldgEnergy_6b['buildingID'] = (df_bldgEnergy_6b.Site_Address
    .str.replace(r'\s', '_', regex=True)   # whitespace
    .str.replace('.', '', regex=False)
    .str.replace(',', '_', regex=False)
    .str.replace('&', 'and', regex=False)
    .str.replace('[()]', '', regex=True)   # brackets
    .str.replace('/', '_', regex=False))
**reset_index()** - turn multi-index columns back into regular columns
df.reset_index()
==== S ====
**sort_index()**, **sort_values()**
s_strucCounts.sort_index()
s_strucCounts.sort_values()
df_a.sort_values('fieldX', ascending=False)
**set()** [[https://stackoverflow.com/questions/6486450/python-compute-list-difference|ref]]
Useful for list differences
set([1,2,3]) - set([2,3,4])  # {1}
See other set [[https://www.w3schools.com/python/python_ref_set.asp|operations]] including:
set1.union(set2)
set1.intersection(set2)
==== T ====
**%time** (once) and **%timeit** (multiple iterations) magic functions for execution runtime.
%time df_footprints = sjoin(df_footprints, df_parcels, how="left")
%timeit df_footprints = sjoin(df_footprints, df_parcels, how="left")
==== Z ====
**zfill** - pads a string with leading 0s to reach the specified length
df_building_1a['PID'] = df_building_1a['PID'].str.zfill(8)