Pandas DataFrames

Overview

Teaching: 15 min
Exercises: 15 min

Questions

How can I do statistical analysis of tabular data?

Objectives

Select individual values from a Pandas dataframe.

Select entire rows or entire columns from a dataframe.

Select a subset of both rows and columns from a dataframe in a single operation.

Select a subset of a dataframe by a single Boolean criterion.

Note about Pandas DataFrames/Series

A DataFrame is a collection of Series; The DataFrame is the way Pandas represents a table, and Series is the data-structure Pandas use to represent a column.

Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.

What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.

Selecting values

To access a value at the position [i,j] of a DataFrame, we have two options, depending on what is the meaning of i in use. Remember that a DataFrame provides a index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

Can specify location by numerical index analogously to 2D version of character selection in strings.

import pandas
data_all = pandas.read_csv('data/jarvis_all.csv', index_col='formula')
print(data_all.iloc[0, 0])

21.0673

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

Can specify location by row name analogously to 2D version of dictionary keys.

data_subset = pandas.read_csv('data/jarvis_subset.csv', index_col='formula')
print(data_subset.loc["As", "epsx"])

21.0673

Use `:` on its own to mean all columns or all rows.

Just like Python’s usual slicing notation.

print(data_subset.loc["As", :])

epsx            21.0673
epsy            29.1228
epsz            17.6066
fin_en         -12.2422
form_enp          0.025
gv               19.993
icsd             609832
kp_leng              65
kv               34.556
mbj_gap          0.6495
mepsx           15.0619
mepsy           14.9875
mepsz           12.4788
mpid             mp-158
op_gap           0.0221
jid         JVASP-11997
Name: As, dtype: object

Would get the same result printing data_subset.loc["As"] (without a second index).

print(data_subset.loc[:, "epsx"])

formula
As       21.0673
MoO3      6.3933
AgI       6.0838
ZrBrN     7.0316
InBr3     4.6738
Name: epsx, dtype: float64

Would get the same result printing data_subset["epsx"]
Also get the same result printing data_subset.epsx (since it’s a column name)

Select multiple columns or rows using `DataFrame.loc` and a named slice.

data_all_sorted = data_all.sort_index()
print(data_all_sorted.loc['BaO':'BaS2', 'epsx':'epsz'])

            epsx     epsy     epsz
formula
BaO       5.0126   5.0126   4.4529
BaO2      3.6134   3.6709   3.9081
BaP3     11.5224  11.1754  10.7349
BaPbO3    8.9412   9.0321   8.9705
BaPdS2   10.5437   9.2255   7.9498
BaS2      5.1928   6.7944   4.8265

In the above code, we discover that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.

Result of slicing can be used in further operations.

Usually don’t just print a slice.
All the statistical operators that work on entire dataframes work the same way on slices.
E.g., calculate max of a slice.

print(data_all_sorted.loc['BaO':'BaS2', 'epsx':'epsz'].max())

epsx    11.5224
epsy    11.1754
epsz    10.7349
dtype: float64

print(data_all_sorted.loc['BaO':'BaS2', 'epsx':'epsz'].min())

epsx    3.6134
epsy    3.6709
epsz    3.9081
dtype: float64

Use comparisons to select data based on value.

Comparison is applied element by element.
Returns a similarly-shaped dataframe of True and False.

# Use a subset of data to keep output readable.
subset = data_all_sorted.loc['BaO':'BaS2', 'epsx':'epsz']
print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 6)

ubset of data:
             epsx     epsy     epsz
formula
BaO       5.0126   5.0126   4.4529
BaO2      3.6134   3.6709   3.9081
BaP3     11.5224  11.1754  10.7349
BaPbO3    8.9412   9.0321   8.9705
BaPdS2   10.5437   9.2255   7.9498
BaS2      5.1928   6.7944   4.8265

Where are values large?
           epsx   epsy   epsz
formula
BaO      False  False  False
BaO2     False  False  False
BaP3      True   True   True
BaPbO3    True   True   True
BaPdS2    True   True   True
BaS2     False   True  False

Select values or NaN using a Boolean mask.

A frame full of Booleans is sometimes called a mask because of how it can be used.

mask = subset > 6
print(subset[mask])

            epsx     epsy     epsz
formula
BaO          NaN      NaN      NaN
BaO2         NaN      NaN      NaN
BaP3     11.5224  11.1754  10.7349
BaPbO3    8.9412   9.0321   8.9705
BaPdS2   10.5437   9.2255   7.9498
BaS2         NaN   6.7944      NaN

Get the value where the mask is true, and NaN (Not a Number) where it is false.
Useful because NaNs are ignored by operations like max, min, average, etc.

print(subset[subset > 6].describe())

            epsx       epsy       epsz
count   3.000000   4.000000   3.000000
mean   10.335767   9.056850   9.218400
std     1.303102   1.792204   1.409002
min     8.941200   6.794400   7.949800
25%     9.742450   8.472675   8.460150
50%    10.543700   9.128800   8.970500
75%    11.033050   9.712975   9.852700
max    11.522400  11.175400  10.734900

Select-Apply-Combine operations

Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyze their data.

For instance, let’s say we want to calculate the birefringence and determine materials that are optically isotropic.

Firstly we isolate a dataframe with just the dielectric constants.
We then take the difference of the square root and find the maximum. An optically isotropic material should be 0. For example, Silicon should be optically isotropic.

import numpy

dielectric_rt = data_all[['epsx', 'epsy', 'epsz', 'epsx']].apply(numpy.sqrt)
print(dielectric_rt.iloc[:5])

            epsx      epsy      epsz      epsx
formula
As       4.589913  5.396554  4.196022  4.589913
MoO3     2.528498  2.635640  2.525787  2.528498
AgI      2.466536  2.466536  2.562265  2.466536
ZrBrN    2.651716  2.512509  2.286351  2.651716
InBr3    2.161897  2.178532  2.184537  2.161897

Two of the entries for Silicon have low birefringence.

birefringence = dielectric_rt.diff(axis=1).apply('abs').aggregate('max', axis=1)
print(birefringence.iloc[10:20])
print()
print(birefringence.loc["Si"])

formula
VOF3         0.193174
ZnCl2        0.002078
ScHO2        0.026299
Zr(MoO4)2    0.168051
CoF2         0.984077
RbPS3        0.111686
BeO          0.000000
CaCl2        0.030760
SrClF        0.018030
SrBrF        0.028253
dtype: float64

formula
Si    0.122102
Si    0.000000
Si    0.000010
Si    0.427006
Si    0.200976
dtype: float64

To find the optically isotropic materials use

isotropic = birefringence[birefringence < 1e-5]
print(isotropic[:10])

formula
BeO         0.0
SrCl2       0.0
NbSbRu      0.0
K3MoF6      0.0
HfVF6       0.0
MoF6        0.0
Rb2SnBr6    0.0
Sr3BiN      0.0
LiMgBi      0.0
CsK2Sb      0.0
dtype: float64

Selection of Individual Values

Assume Pandas has been imported into your notebook and the JARVIS data has been loaded:
import pandas

df = pandas.read_csv('data/jarvis_all.csv', index_col='formula')
Write an expression to find the formation energy per atom for BeO

Solution

The selection can be done by using the labels for both the row (“BeO”) and the column (“form_enp”):
print(df.loc['BeO', 'form_enp'])
The output is
-3.025

Extent of Slicing

Do the two statements below produce the same output?

Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?
print(data_all.iloc[0:2, 0:2])
print(data_all.loc['As':'MoO3', 'epsx':'epsz'])

Solution

No, they do not produce the same output! The output of the first statement is:
            epsx     epsy
formula
As       21.0673  29.1228
MoO3      6.3933   6.9466
The second statement gives:
            epsx     epsy     epsz
formula
As       21.0673  29.1228  17.6066
MoO3      6.3933   6.9466   6.3796
Clearly, the second statement produces an additional column and an additional row compared to the first statement. What conclusion can we draw? We see that a numerical slice, 0:2, omits the final index (i.e. index 2) in the range provided, while a named slice, ‘epsx’:’epsz’, includes the final element.

Reconstructing Data

Explain what each line in the following short program does: what is in first, second, etc.?
first = pandas.read_csv('data/jarvis_all.csv', index_col='formula')
second = first[first.index.str.contains('Si')]
third = second.drop('Si')
third.to_csv('result.csv')

Solution

Read in data

Only include materials with silicon

Remove materials that are pure silicon

Save data to a CSV file

Selecting Indices

Explain in simple terms what idxmin and idxmax do in the short program below. When would you use these methods?
data_all = pandas.read_csv('data/jarvis_all.csv')
dielec = data_all.loc[:, 'epsx':'epsz']
print(dielec.idxmin())
print(dielec.idxmax())

Solution

For each column in data, idxmin will return the index value corresponding to each column’s minimum; idxmax will do accordingly the same for each column’s maximum value.

You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.

Key Points

Use DataFrame.iloc[..., ...] to select values by integer location.

Use : on its own to mean all columns or all rows.

Select multiple columns or rows using DataFrame.loc and a named slice.

Result of slicing can be used in further operations.

Use comparisons to select data based on value.

Select values or NaN using a Boolean mask.

previous episode

Plotting and Programming in Python

next episode

Pandas DataFrames

Overview

Note about Pandas DataFrames/Series

Selecting values

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

Use `:` on its own to mean all columns or all rows.

Select multiple columns or rows using `DataFrame.loc` and a named slice.

Result of slicing can be used in further operations.

Use comparisons to select data based on value.

Select values or NaN using a Boolean mask.

Select-Apply-Combine operations

Selection of Individual Values

Solution

Extent of Slicing

Solution

Reconstructing Data

Solution

Selecting Indices

Solution

Key Points

previous episode

next episode

previous episode

Plotting and Programming in Python

next episode

Pandas DataFrames

Overview

Note about Pandas DataFrames/Series

Selecting values

Use DataFrame.iloc[..., ...] to select values by their (entry) position

Use DataFrame.loc[..., ...] to select values by their (entry) label.

Use : on its own to mean all columns or all rows.

Select multiple columns or rows using DataFrame.loc and a named slice.

Result of slicing can be used in further operations.

Use comparisons to select data based on value.

Select values or NaN using a Boolean mask.

Select-Apply-Combine operations

Selection of Individual Values

Solution

Extent of Slicing

Solution

Reconstructing Data

Solution

Selecting Indices

Solution

Key Points

previous episode

next episode

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

Use `:` on its own to mean all columns or all rows.

Select multiple columns or rows using `DataFrame.loc` and a named slice.