Looping Over Data Sets

Overview

Teaching: 5 min
Exercises: 10 min

Questions

How can I process many data sets with a single command?

Objectives

Be able to read and write globbing expressions that match sets of files.

Use glob to create lists of files.

Write for loops to perform operations on files given their names in a list.

Use a `for` loop to process files given a list of their names.

A filename is just a character string.
And lists can contain character strings.

import pandas

for filename in ['data/jarvis_all.csv', 'data/jarvis_subset.csv']:
    data = pandas.read_csv(filename, index_col='formula')
    print(filename, data.min())

data/jarvis_all.csv epsx             1.0681
epsy             1.0681
epsz             1.0681
fin_en         -319.593
form_enp         -4.135
gv                -9.62
icsd             100028
kp_leng              40
kv               -0.822
mbj_gap          0.0002
mepsx          -426.369
mepsy          -425.156
mepsz            1.0578
mpid           mp-10010
op_gap           0.0001
jid         JVASP-10010
dtype: object
data/jarvis_subset.csv epsx             4.6738
epsy              4.746
epsz             4.7722
fin_en         -103.757
form_enp         -2.086
gv                5.947
icsd              27393
kp_leng              50
kv               16.378
mbj_gap          0.6495
mepsx            3.6685
mepsy            3.7637
mepsz            3.7977
mpid             mp-158
op_gap           0.0221
jid         JVASP-11997
dtype: object

Use `glob.glob` to find sets of files whose names match a pattern.

In Unix, the term “globbing” means “matching a set of files with a pattern”.
The most common patterns are:
- * meaning “match zero or more characters”
- ? meaning “match exactly one character”
Python contains the glob library to provide pattern matching functionality
The glob library contains a function also called glob to match file patterns
E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
Result is a (possibly empty) list of character strings.

import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))

all csv files in data directory: ['data/jarvis_all.csv', 'data/jarvis_subset.csv']

print('all PDB files:', glob.glob('*.pdb'))

all PDB files: []

Use `glob` and `for` to process batches of files.

Helps a lot if the files are named and stored systematically and consistently so that simple patterns will find the right data.

for filename in glob.glob('data/jarvis_*.csv'):
    data = pandas.read_csv(filename)
    print(filename, data['gv'].min())

data/jarvis_all.csv -9.62
data/jarvis_subset.csv 5.947

This includes all data, as well as per-region data.
Use a more specific pattern in the exercises to exclude the whole data set.
But note that the minimum of the entire data set is also the minimum of one of the data sets, which is a nice check on correctness.

Determining Matches

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

data/gapminder_gdp_africa.csv

data/gapminder_gdp_americas.csv

data/gapminder_gdp_asia.csv

1 and 2 are not matched.

Solution

1 is not matched by the glob.

Minimum File Size

Modify this program so that it prints the number of records in the file that has the fewest records.
import glob
import pandas
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pandas.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')
Notice that the shape method returns a tuple with the number of rows and columns of the data frame.
Solution
import glob
import pandas
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    dataframe = pandas.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print('smallest file has', fewest, 'records')

Key Points

Use a for loop to process files given a list of their names.

Use glob.glob to find sets of files whose names match a pattern.

Use glob and for to process batches of files.

previous episode

Plotting and Programming in Python

next episode

Looping Over Data Sets

Overview

Use a `for` loop to process files given a list of their names.

Use `glob.glob` to find sets of files whose names match a pattern.

Use `glob` and `for` to process batches of files.

Determining Matches

Solution

Minimum File Size

Solution

Key Points

previous episode

next episode

previous episode

Plotting and Programming in Python

next episode

Looping Over Data Sets

Overview

Use a for loop to process files given a list of their names.

Use glob.glob to find sets of files whose names match a pattern.

Use glob and for to process batches of files.

Determining Matches

Solution

Minimum File Size

Solution

Key Points

previous episode

next episode

Use a `for` loop to process files given a list of their names.

Use `glob.glob` to find sets of files whose names match a pattern.

Use `glob` and `for` to process batches of files.