====== Python/Pandas Quick Reference Card ======
A "cheat sheet" of commonly used Python data tools, control structures and their syntax.
Full pandas [[http://pandas.pydata.org/pandas-docs/stable/index.html|documentation]].
Python 2.7 [[https://docs.python.org/2/contents.html|documentation]].
===== Comments =====
# single line comment
'''
multi
line
comments
'''
(The triple-quoted form is actually a bare string literal, conventionally used as a block comment.)
===== Common library imports =====
import geopandas as gpd
from geopandas.tools import sjoin
from geopandas.tools import overlay
import pandas as pd
from pandas.stats.api import ols  # removed in pandas 0.20; see Statistical models below
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import StringIO  # Python 2; on Python 3 use: from io import StringIO
import os
import sys
Note that on Anaconda, the typical path for packages is something like:
/opt/conda/envs/myenv/lib/python3.6/site-packages
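To confirm where a given package is actually loaded from, check its ''%%__file__%%'' attribute (pandas used here only as an example):
import pandas
print(pandas.__file__)  # prints the site-packages path in use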
===== Data import and export =====
Read and write: CSV, Excel, shapefile, etc. FIXME
filename = communityName + '\\output\\df_nonRes_bldgs.csv'
df_general = pd.read_csv(filename)
Specify a single worksheet to read:
df_tripsOD = pd.read_excel(filename, sheet_name='HBS_24')
Specify multiple worksheets to read (returns a dict of DataFrames):
l_purps = ['HBS-24','HBW-24','HBO-24','NHB-24']
d_df_tripsOD = pd.read_excel(filename, sheet_name=l_purps)
df_nonRes_bldgs.to_csv(filename, index=False)
df_structure.to_excel(filename)
df_parcels_0 = gpd.GeoDataFrame.from_file('myParcels.shp')
df_structure = gpd.read_file('myLayers.gdb', layer='structure')
Round-trip export/import for JSON files, as Python dictionaries or lists. Requires ''import json''.
filename = 'folderX/list_CEUDSector_ind.json'
with open(filename, 'w', encoding='utf-8') as f:
    json.dump(l_CEUDSector_ind, f, ensure_ascii=False, indent=4)
filename = 'folderX/list_CEUDSector_ind.json'
with open(filename) as data_file:
    l_CEUDSector_ind = json.load(data_file)
===== Data interrogation =====
df_buildings.info()
option ''null_counts=True'' (renamed ''show_counts'' in newer pandas)
To see the individual cell types
df_bld_0.DFRPPropertyNumber.map(type)
df_buildings.describe()
df_buildings.head()
df_buildings.tail()
df_buildings.sample(10)
df_buildings.StrucType.value_counts()
pt_structure = pd.pivot_table(df_structure,
                              values='OBJECTID',
                              index=['propType'],
                              columns=['strucType'],
                              margins=False,
                              aggfunc=np.sum)
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html#pandas.pivot_table|ref]]
Common aggfunc params: ''len'', ''np.sum'', ''np.mean''
df_buildings.StrucType.unique()
df_buildings.StrucType.nunique()
===== Selecting data =====
Selecting rows
df_farm = df_structure[
    (df_structure.PropertyCode >= 200) &
    (df_structure.PropertyCode < 300)]
Can use bitwise operators: AND ''&'', OR ''|'', NOT ''~''
Other boolean operators: ''=='', ''>='', ''>'', ''<='', ''<'', ''!='', ''isin(myList)'', ''isnull()'', ''notnull()''
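For example, with ''isin()'' and ''isnull()'' (the strucType values are hypothetical):
df_houses = df_structure[df_structure.strucType.isin(['house', 'duplex'])]
df_noCode = df_structure[df_structure.PropertyCode.isnull()]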
Selecting columns
df_condosLUT = df_condos.loc[:,['RollNumber','CondoRollN']]
===== Setting and filling data =====
df_buildings.loc[df_buildings.UNIQUEID == '6554354',
                 'ASMTCODE'] = '303 - Warehouses (Including Cold Storage Plants)'
**fillna()** [[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html|ref]]
df_building['YearBuilt'].fillna(2011, inplace=True)
Conditional if-then-else column-wise logic
df_structure['finalID'] = np.where(df_structure.condoID.isnull(),
df_structure.normalID, df_structure.condoID)
**where()**, **mask()**
''where()'' replaces values where the condition is //False//; ''mask()'' replaces values where it is //True//.
df_8.mask(df_8 > 5.0, 5.0)
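The equivalent ''where()'' form of the same 5.0 cap:
df_8.where(df_8 <= 5.0, 5.0)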
===== Multiindexing =====
Turns columns A and B into the multiindex for the output dataframe.
df.set_index(['A','B'])
Remove multiindexing.
df.reset_index()
Remove an unwanted level
df.columns = df.columns.get_level_values('CEUDEndUse_res')
===== Dropping/deleting data =====
**drop()** column
df_parcels.drop(['index_right'], axis=1, inplace=True)
**drop_duplicates()**
s_parcelARNs_noDups = df_parcels_2a.PID_str.drop_duplicates(keep='first')
keep options: ''first'', ''last'', ''False''
===== Plotting =====
df_buildings.boxplot(column=['area'],
                     by=['CATEGORY'], figsize=(12,12))
df_look['bsmtArea_sf'].plot(kind='hist', bins=40, title='bsmt. area',
                            xlim=[0,max_footprint])
plt.show()
df_aptByParcel.plot.scatter(x='PhysDwPerBldg',
                            y='numStoreys_avg', sharex=False)
df_aptByParcel.plot.hexbin(x='PhysDwPerBldg',
                           y='numStoreys_avg', sharex=False, gridsize=25)
Histograms, maps, KDEs, boxplots, etc. FIXME
===== Pivottable.js =====
from pivottablejs import pivot_ui
url = 'http://servername.xyz.com:8889/files/city_share/'
filename = 'pivottable/df_nonRes.html'
pivot_ui(df_CAPEX_1,
         rows=['buildingID'],
         rowOrder='value_z_to_a',
         cols=['year'],
         vals=['capexInt_dlrPerSqft'],
         exclusions={'buildingID':['']},
         aggregatorName='Sum',
         rendererName='Heatmap',
         outfile_path=filename, url=url + filename)
documentation: [[https://github.com/nicolaskruchten/pivottable/wiki/Parameters|parameters]] and [[https://github.com/nicolaskruchten/jupyter_pivottablejs|python/jupyter]]
===== Plotly =====
import plotly.express as px
import plotly.io as pio
fig = px.scatter(pt_peo_adjFactors_bid_3,
                 x='electricity',
                 y='naturalGas',
                 hover_name=pt_peo_adjFactors_bid_3.index,
                 marginal_x='histogram',
                 marginal_y='histogram')
fig.show()
filename = "pivottable/baseline_adjFactors_electricity-naturalGas.html"
pio.write_html(fig, file=filename)
===== Geo-processing data =====
Overlay, centroid, area, CRS transform, etc. (see the sketch below) FIXME
**sjoin()**
df_parcels_1c = sjoin(df_parcels_1b, df_trafficZones_2, how="left")
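A minimal sketch of the other common geo-operations, assuming ''df_parcels'' and ''df_zones'' are polygon GeoDataFrames and EPSG:26917 is the target projection (all hypothetical):
df_utm = df_parcels.to_crs(epsg=26917)        # reproject to a metric CRS
df_utm['area_m2'] = df_utm.geometry.area      # polygon areas
centroids = df_utm.geometry.centroid          # centroid points
df_clip = gpd.overlay(df_utm, df_zones, how='intersection')  # polygon overlay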
===== Axis =====
Specifies the axis along which a function is computed; by default ''axis=0''. This is consistent with ''numpy.mean'' when ''axis'' is given explicitly (in ''numpy.mean'', ''axis=None'' by default, which computes the mean over the flattened array): ''axis=0'' runs down the rows (the pandas index), and ''axis=1'' runs across the columns.
+------------+----------+----------+
|            |    A     |    B     |
+------------+----------+----------+
|     0      | 0.626386 | 1.52325  |  ---axis=1--->
+------------+----------+----------+
      |
   axis=0
      |
      v
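A quick demonstration on a toy DataFrame (values arbitrary):
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df.sum(axis=0)  # column totals: A=3, B=30
df.sum(axis=1)  # row totals: 11, 22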
===== Aggregating data =====
**groupby()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby|ref]]
gb_single = df_single.groupby(
['ID_TAZ','buildingClass', 'yearbuiltClass'], as_index=False)
df_singleUnits = gb_single.NosUnits.sum()
Other groupby convenience functions: ''size()'', ''mean()'', ''first()'', ''last()'', ''nth()''
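For example, counting records per group with ''size()'' (reusing a column from the example above):
s_counts = df_single.groupby('buildingClass').size()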
===== Statistical models =====
**OLS regression**
Note: ''pandas.stats.api.ols'' was removed in pandas 0.20; a statsmodels sketch follows.
from pandas.stats.api import ols
regr_1 = ols(y=df_apt['footprint_avg_sqft'], x=df_apt['PhysDwPerBldg'],
             intercept=False)
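A rough statsmodels equivalent (same ''df_apt'' columns; omitting ''sm.add_constant()'' corresponds to ''intercept=False''):
import statsmodels.api as sm
regr_1 = sm.OLS(df_apt['footprint_avg_sqft'],
                df_apt['PhysDwPerBldg']).fit()
print(regr_1.summary())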
===== Strings =====
String concatenation
df_structure['propType'] = df_structure.PropertyCode + ' - ' + df_structure.propCodeDesc  # assumes PropertyCode is already a string; see astype()
String length
df_parcels.ROLLNO.str.len().value_counts()
String slice
df_parcels_2b['ARN'] = df_parcels_2b.ARN.str.slice(0,15)
where first and second indices are inclusive and exclusive respectively.
String (and number) formatting example
x = df_pop2016_0.value_total.sum()
print("{:,}".format(x))
Formats the number with commas (e.g. ''29,232,957.0''). See [[https://mkaz.blog/code/python-string-format-cookbook/|resource]]
===== Python data and control structures =====
**traversal**
for var in traversable_object:
    stmt ...
For a dictionary:
for k, v in d_test.items():
    print(k)
    print(v)
or
for k in d_test:
    print(k)
    print(d_test[k])
**dictionaries**
[[https://www.w3schools.com/python/python_ref_dictionary.asp|handy dictionary methods]]: ''get()'', ''keys()'', ''values()''
**lists**
[[https://www.w3schools.com/python/python_ref_list.asp|handy list methods]]: ''append()''
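Quick illustrations of these methods on toy data:
d_test = {'a': 1}
d_test.get('b', 0)    # 0 -- default returned when key is missing
list(d_test.keys())   # ['a']
list(d_test.values()) # [1]
l_items = []
l_items.append('x')   # l_items is now ['x']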
**function**
def function_name(params):
    stmt ...
    [return var]
Function used in DataFrame calculations with **apply()** example:
def getFootprintSizeClass(row):
    if (row['dwellType'] == 'single'):
        if (row['footprintClass'] == '(0, 500]'):
            return 'tiny'
        elif (
            (row['footprintClass'] == '(500, 750]') |
            (row['footprintClass'] == '(750, 1000]')):
            return 'small'
        elif (row['footprintClass'] == '(1000, 1500]'):
            return 'medium'
    elif (row['dwellType'] == 'double'):
        if (
            (row['footprintClass'] == '(0, 500]') |
            (row['footprintClass'] == '(500, 750]')):
            return 'small'
        elif (
            (row['footprintClass'] == '(750, 1000]') |
            (row['footprintClass'] == '(1000, 1500]') |
            (row['footprintClass'] == '(1500, 10000]') |
            (row['footprintClass'] == '(10000, 100000]')):
            return 'large'
    else:
        return
df_building['footprintSizeClass'] = df_building.apply(getFootprintSizeClass, axis=1)
===== Other useful Python syntax =====
increment integer
int_i += 1
===== Github =====
Prior to committing, strip ipynb notebook files with the Linux shell as per this [[http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/|blog post]]
nbstrip_jq myNotebook.ipynb > myNotebook_stripped.ipynb
And then
git status
git add <file>
git add --all
git commit -m "description"
git push [origin master]
Get commit history
git log
Or the full reference history (handy for recovering lost commits)
git reflog
Pull
git pull origin master
===== Handy Matrix & Vector Operations =====
**shares / norm**
Each row sums to 1
df_shr = df_quantity.div(df_quantity.sum(axis='columns'), axis='index')
Each column sums to 1
df_shr = df_quantity.div(df_quantity.sum(axis='index'), axis='columns')
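A toy check of the row-share version (values arbitrary):
df_quantity = pd.DataFrame({'A': [1, 2], 'B': [3, 2]})
df_shr = df_quantity.div(df_quantity.sum(axis='columns'), axis='index')
# row 0 -> A=0.25, B=0.75; row 1 -> A=0.5, B=0.5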
===== Jupyter Notebook Markdown Tricks =====
Links
[Github](http://github.com)
Embed an image
![title](img/S_res_1.png)
===== Python Debugger (PDB) =====
[[https://docs.python.org/3/library/pdb.html|ref]]
import pdb; pdb.set_trace()
''n'' for next line; ''c'' to continue; ''s'' to step into; ''q'' (or ''exit'') to quit
===== Other functions =====
==== A ====
**astype()** - cast pandas Series from one dtype type to another.
df_propCodes['propType'] = df_propCodes['propCode'].astype(str) \
    + ' - ' + df_propCodes.propCodeDesc
types: ''str'', ''int'', ''float''
also see: ''pd.to_numeric()''
df['elec'] = pd.to_numeric(df['elec'], errors='coerce')
==== C ====
**cut()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html#pandas.cut|ref]]
footprintBins = [0, 500, 750, 1000, 1500, 10000, 100000]
df_building['footprintClass'] = pd.cut(
df_building.footprintPerUnit, footprintBins)
more [[python:cut]] examples
**concat()** - combine blocks of records into a single DataFrame. Default axis is 0 (vertical); rows are added. See [[https://stackoverflow.com/questions/50501787/python-pandas-user-warning-sorting-because-non-concatenation-axis-is-not-aligne|article]] on sort parameter
df_double = pd.concat([
    df_semi,
    df_duplex],
    sort=True,
    axis=0)
==== M ====
**melt()** [[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=melt#pandas.DataFrame.melt|ref]]
df_emissions_2 = pd.melt(df_emissions_1,
                         id_vars=['department','scope','fuelType'],
                         value_vars=['2016','2017','2018'],
                         var_name='Time',
                         value_name='Values')
Note that //value_vars// is optional; if omitted, all columns not listed in //id_vars// are used.
**merge()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge|ref]]
df_building_2 = pd.merge(df_building_1,
                         df_parcel, left_on='ParcelID', right_on='PID', how='left')
Note that merging returns a plain DataFrame, so GeoDataFrame attributes need to be reconstructed:
df_building_2 = gpd.GeoDataFrame(pd.merge(df_building_1,
                                          df_parcel, left_on='ParcelID', right_on='PID', how='left'))
df_building_2.crs = df_building_1.crs
**map()**
[[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html|ref]]
d_storeysClass = {
    0   : '1Story',
    1   : '1Story',
    1.5 : '2Story',
    2   : '2Story',
    3   : '3Story'
}
df_building['storeysClass'] = df_building.numStoreys.map(d_storeysClass)
**modulo**
x % y
Gives the remainder after division (e.g. ''7 % 3'' gives ''1'').
==== R ====
**rename()**
df_structure.rename(columns={
    'oldName1' : 'newName1',
    'oldName2' : 'newName2'},
    inplace=True)
**replace()** - string replace
df_bldgEnergy_6b['buildingID'] = (df_bldgEnergy_6b.Site_Address
    .str.replace(r'\s', '_', regex=True)   # whitespace
    .str.replace('.', '', regex=False)
    .str.replace(',', '_', regex=False)
    .str.replace('&', 'and', regex=False)
    .str.replace('[()]', '', regex=True)   # brackets
    .str.replace('/', '_', regex=False))
**reset_index()** - turn multi-index columns back into regular columns
df.reset_index()
==== S ====
**sort_index()**, **sort_values()**
s_strucCounts.sort_index()
s_strucCounts.sort_values()
df_a.sort_values('fieldX', ascending=False)
**set()** [[https://stackoverflow.com/questions/6486450/python-compute-list-difference|ref]]
Useful for list differences
set([1,2,3]) - set([2,3,4])  # {1}
See other set [[https://www.w3schools.com/python/python_ref_set.asp|operations]] including:
set1.union(set2)
set1.intersection(set2)
==== T ====
**%time** (once) and **%timeit** (multiple iterations) magic functions for execution runtime.
%time df_footprints = sjoin(df_footprints, df_parcels, how="left")
%timeit df_footprints = sjoin(df_footprints, df_parcels, how="left")
==== Z ====
**zfill** - pads a string with leading 0s to reach the specified length
df_building_1a['PID'] = df_building_1a['PID'].str.zfill(8)