Looping Over Data Sets

Overview

Teaching: 5 min
Exercises: 10 min
Questions
  • How can I process many data sets with a single command?

Objectives
  • Be able to read and write globbing expressions that match sets of files.

  • Use glob to create lists of files.

  • Write for loops to perform operations on files given their names in a list.

Use a for loop to process files given a list of their names.

import pandas

for filename in ['data/jarvis_all.csv', 'data/jarvis_subset.csv']:
    data = pandas.read_csv(filename, index_col='formula')
    print(filename, data.min())
data/jarvis_all.csv epsx             1.0681
epsy             1.0681
epsz             1.0681
fin_en         -319.593
form_enp         -4.135
gv                -9.62
icsd             100028
kp_leng              40
kv               -0.822
mbj_gap          0.0002
mepsx          -426.369
mepsy          -425.156
mepsz            1.0578
mpid           mp-10010
op_gap           0.0001
jid         JVASP-10010
dtype: object
data/jarvis_subset.csv epsx             4.6738
epsy              4.746
epsz             4.7722
fin_en         -103.757
form_enp         -2.086
gv                5.947
icsd              27393
kp_leng              50
kv               16.378
mbj_gap          0.6495
mepsx            3.6685
mepsy            3.7637
mepsz            3.7977
mpid             mp-158
op_gap           0.0221
jid         JVASP-11997
dtype: object

Use glob.glob to find sets of files whose names match a pattern.

import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data/jarvis_all.csv', 'data/jarvis_subset.csv']
print('all PDB files:', glob.glob('*.pdb'))
all PDB files: []
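
Glob patterns support more than `*`. As a minimal sketch, `*` matches any run of characters, `?` matches exactly one character, and `[...]` matches one character from a set. The file names below are made up for illustration and are created empty in a temporary directory, so none of this touches the lesson's data.

```python
import glob
import os
import tempfile

# Hypothetical file names, created empty in a throwaway directory.
names = ['run1.csv', 'run2.csv', 'run10.csv', 'notes.txt']

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        open(os.path.join(tmp, name), 'w').close()

    def match(pattern):
        # Return matching base names, sorted so the result is predictable.
        # glob.escape protects any special characters in the directory name.
        full_pattern = os.path.join(glob.escape(tmp), pattern)
        return sorted(os.path.basename(p) for p in glob.glob(full_pattern))

    star = match('*.csv')            # any characters before .csv
    question = match('run?.csv')     # exactly one character after 'run'
    bracket = match('run[12].csv')   # one character, either 1 or 2

print(star)
print(question)
print(bracket)
```

Here `run10.csv` is matched by `*.csv` but not by `run?.csv`, because `?` stands for a single character only.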

Use glob and for to process batches of files.

for filename in glob.glob('data/jarvis_*.csv'):
    data = pandas.read_csv(filename)
    print(filename, data['gv'].min())
data/jarvis_all.csv -9.62
data/jarvis_subset.csv 5.947
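
glob.glob does not promise any particular order; the order of the returned names depends on the file system. If the order of the output matters, one option is to wrap the call in sorted. The sketch below uses made-up empty files in a temporary directory rather than the lesson's data.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create the hypothetical files in deliberately non-alphabetical order.
    for name in ['jarvis_c.csv', 'jarvis_a.csv', 'jarvis_b.csv']:
        open(os.path.join(tmp, name), 'w').close()

    pattern = os.path.join(glob.escape(tmp), 'jarvis_*.csv')
    # sorted() returns a new list in alphabetical order.
    ordered = [os.path.basename(path) for path in sorted(glob.glob(pattern))]

for name in ordered:
    print(name)
```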

Determining Matches

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

  1. data/gapminder_gdp_africa.csv
  2. data/gapminder_gdp_americas.csv
  3. data/gapminder_gdp_asia.csv
  4. 1 and 2 are not matched.

Solution

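1 is not matched by the glob: the pattern requires the letters as somewhere in the file name before .csv, and gapminder_gdp_africa.csv does not contain them.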
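
One way to check an answer like this without having the files on disk is the standard library's fnmatch module, which applies the same kind of wildcard rules to plain strings. One difference is that in fnmatch `*` can also match `/`, which does not matter for this pattern. This is a sketch for checking the exercise, not part of the original solution.

```python
import fnmatch

candidates = [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
]
pattern = 'data/*as*.csv'

# fnmatchcase compares strings only; it never touches the file system.
matched = [name for name in candidates if fnmatch.fnmatchcase(name, pattern)]
unmatched = [name for name in candidates if name not in matched]

print('matched:', matched)
print('not matched:', unmatched)
```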

Minimum File Size

Modify this program so that it prints the number of records in the file that has the fewest records.

import glob
import pandas
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pandas.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')

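Notice that shape is an attribute of the data frame, not a method, so it is written without parentheses. It holds a tuple with the number of rows and columns, which means dataframe.shape[0] is the number of rows.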

Solution

import glob
import pandas
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    dataframe = pandas.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print('smallest file has', fewest, 'records')
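
The solution starts from infinity because every real record count is smaller than it, so the first file always replaces the starting value. The sketch below shows the same running-minimum pattern on a plain list; the row counts are made-up numbers standing in for dataframe.shape[0].

```python
# Hypothetical row counts, standing in for dataframe.shape[0] of each file.
row_counts = [120, 45, 300]

fewest = float('inf')  # larger than any real count
for count in row_counts:
    # Keep whichever is smaller: the best value so far or the new count.
    fewest = min(fewest, count)

print('smallest file has', fewest, 'records')
```

If the glob matches no files, the loop body never runs and fewest stays at infinity, which is a visible hint that the pattern matched nothing.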

Key Points

  • Use a for loop to process files given a list of their names.

  • Use glob.glob to find sets of files whose names match a pattern.

  • Use glob and for to process batches of files.