Data collection and preparation#

Introduction to Pandas#

Pandas is a library designed for working with tabular data that also offers powerful time-series analysis tools. It borrows many ideas from R and is built on top of NumPy, making it a versatile tool for data manipulation and analysis.

These slides lean heavily on a fantastic Pandas Workshop put together by Stefanie Molin.

We will begin by introducing the Series, DataFrame, and Index classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data.

import pandas as pd

Anatomy of a DataFrame#

A DataFrame is composed of one or more Series. The names of the Series form the column names, and the row labels form the Index.

meteorites = pd.read_csv('_data/Meteorite_Landings.csv', nrows=5)
meteorites
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
0 Aachen 1 Valid L5 21 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333)
1 Aarhus 2 Valid H6 720 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333)
2 Abee 6 Valid EH4 107000 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0)
3 Acapulco 10 Valid Acapulcoite 1914 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9)
4 Achiras 370 Valid L6 780 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95)

Source: NASA’s Open Data Portal

Series:#

meteorites.name
0      Aachen
1      Aarhus
2        Abee
3    Acapulco
4     Achiras
Name: name, dtype: object

Columns:#

meteorites.columns
Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')

Index:#

meteorites.index
RangeIndex(start=0, stop=5, step=1)

Creating DataFrames#

We can create DataFrames from a variety of sources, such as other Python objects, flat files, web scraping, and API requests. Here, we will see just a couple of examples, but be sure to check out this page in the documentation for a complete list.

Using a flat file#

import pandas as pd

meteorites = pd.read_csv('_data/Meteorite_Landings.csv')

From scratch#

Another common approach is to pass a dictionary as the argument to pd.DataFrame():

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

pd.DataFrame(data)
apples oranges
0 3 0
1 2 3
2 0 7
3 1 2

We can also specify an index:

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

pd.DataFrame(data, index=['Ke', 'Julian', 'Duong', 'Andreas'])
apples oranges
Ke 3 0
Julian 2 3
Duong 0 7
Andreas 1 2

If we have an existing DataFrame, we can also specify the index using set_index():

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2],
    'names': ['Ke', 'Julian', 'Duong', 'Andreas']
}

df = pd.DataFrame(data)
df
apples oranges names
0 3 0 Ke
1 2 3 Julian
2 0 7 Duong
3 1 2 Andreas
df.set_index('names')
apples oranges
names
Ke 3 0
Julian 2 3
Duong 0 7
Andreas 1 2

Tip: df.to_csv('data.csv') writes this data to a new file called data.csv.

Inspecting the data#

Now that we have some data, we need to perform an initial inspection of it. This gives us information on what the data looks like, how many rows/columns there are, and how much data we have.

Let’s inspect the meteorites data.

What does the data look like?#

meteorites.head()
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
0 Aachen 1 Valid L5 21.0 Fell 01/01/1880 12:00:00 AM 50.77500 6.08333 (50.775, 6.08333)
1 Aarhus 2 Valid H6 720.0 Fell 01/01/1951 12:00:00 AM 56.18333 10.23333 (56.18333, 10.23333)
2 Abee 6 Valid EH4 107000.0 Fell 01/01/1952 12:00:00 AM 54.21667 -113.00000 (54.21667, -113.0)
3 Acapulco 10 Valid Acapulcoite 1914.0 Fell 01/01/1976 12:00:00 AM 16.88333 -99.90000 (16.88333, -99.9)
4 Achiras 370 Valid L6 780.0 Fell 01/01/1902 12:00:00 AM -33.16667 -64.95000 (-33.16667, -64.95)

Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:

meteorites.tail()
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
45711 Zillah 002 31356 Valid Eucrite 172.0 Found 01/01/1990 12:00:00 AM 29.03700 17.01850 (29.037, 17.0185)
45712 Zinder 30409 Valid Pallasite, ungrouped 46.0 Found 01/01/1999 12:00:00 AM 13.78333 8.96667 (13.78333, 8.96667)
45713 Zlin 30410 Valid H4 3.3 Found 01/01/1939 12:00:00 AM 49.25000 17.66667 (49.25, 17.66667)
45714 Zubkovsky 31357 Valid L6 2167.0 Found 01/01/2003 12:00:00 AM 49.78917 41.50460 (49.78917, 41.5046)
45715 Zulu Queen 30414 Valid L3.7 200.0 Found 01/01/1976 12:00:00 AM 33.98333 -115.68333 (33.98333, -115.68333)

Get some information about the DataFrame#

meteorites.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45716 non-null  object 
 1   id           45716 non-null  int64  
 2   nametype     45716 non-null  object 
 3   recclass     45716 non-null  object 
 4   mass (g)     45585 non-null  float64
 5   fall         45716 non-null  object 
 6   year         45425 non-null  object 
 7   reclat       38401 non-null  float64
 8   reclong      38401 non-null  float64
 9   GeoLocation  38401 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 3.5+ MB

Extracting subsets#

A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.

Selecting columns#

We can select columns as attributes if their names would be valid Python variables:

meteorites.name
0            Aachen
1            Aarhus
2              Abee
3          Acapulco
4           Achiras
            ...    
45711    Zillah 002
45712        Zinder
45713          Zlin
45714     Zubkovsky
45715    Zulu Queen
Name: name, Length: 45716, dtype: object

If they aren't, we have to select them as keys using bracket notation. This syntax also makes it possible to select multiple columns at once, as we will see shortly:

meteorites['mass (g)']
0            21.0
1           720.0
2        107000.0
3          1914.0
4           780.0
           ...   
45711       172.0
45712        46.0
45713         3.3
45714      2167.0
45715       200.0
Name: mass (g), Length: 45716, dtype: float64
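
Passing a list of column names inside the brackets selects multiple columns at once and returns a DataFrame rather than a Series:

meteorites[['name', 'mass (g)']]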

Selecting rows#

meteorites[100:104]
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
100 Benton 5026 Valid LL6 2840.0 Fell 01/01/1949 12:00:00 AM 45.95000 -67.55000 (45.95, -67.55)
101 Berduc 48975 Valid L6 270.0 Fell 01/01/2008 12:00:00 AM -31.91000 -58.32833 (-31.91, -58.32833)
102 Béréba 5028 Valid Eucrite-mmict 18000.0 Fell 01/01/1924 12:00:00 AM 11.65000 -3.65000 (11.65, -3.65)
103 Berlanguillas 5029 Valid L6 1440.0 Fell 01/01/1811 12:00:00 AM 41.68333 -3.80000 (41.68333, -3.8)

Indexing#

We use iloc[] to select rows and columns by their position:

meteorites.iloc[100:104, [0, 3, 4, 6]]
name recclass mass (g) year
100 Benton LL6 2840.0 01/01/1949 12:00:00 AM
101 Berduc L6 270.0 01/01/2008 12:00:00 AM
102 Béréba Eucrite-mmict 18000.0 01/01/1924 12:00:00 AM
103 Berlanguillas L6 1440.0 01/01/1811 12:00:00 AM

We use loc[] to select by name:

meteorites.loc[100:104, 'mass (g)':'year']
mass (g) fall year
100 2840.0 Fell 01/01/1949 12:00:00 AM
101 270.0 Fell 01/01/2008 12:00:00 AM
102 18000.0 Fell 01/01/1924 12:00:00 AM
103 1440.0 Fell 01/01/1811 12:00:00 AM
104 960.0 Fell 01/01/2004 12:00:00 AM

Filtering with Boolean masks#

A Boolean mask is an array-like structure of Boolean values – it’s a way to specify which rows/columns we want to select (True) and which we don’t (False).

Here’s an example of a Boolean mask for meteorites weighing more than 50 grams that were found on Earth (i.e., they were not observed falling):

(meteorites['mass (g)'] > 50) & (meteorites.fall == 'Found')
0        False
1        False
2        False
3        False
4        False
         ...  
45711     True
45712    False
45713    False
45714     True
45715     True
Length: 45716, dtype: bool

Important: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (&, |, ~) instead of logical operators (and, or, not).

We can use a Boolean mask to select the subset of meteorites weighing more than 1 million grams (1,000 kilograms or roughly 2,205 pounds) that were observed falling:

meteorites[(meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell')]
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
29 Allende 2278 Valid CV3 2000000.0 Fell 01/01/1969 12:00:00 AM 26.96667 -105.31667 (26.96667, -105.31667)
419 Jilin 12171 Valid H5 4000000.0 Fell 01/01/1976 12:00:00 AM 44.05000 126.16667 (44.05, 126.16667)
506 Kunya-Urgench 12379 Valid H5 1100000.0 Fell 01/01/1998 12:00:00 AM 42.25000 59.20000 (42.25, 59.2)
707 Norton County 17922 Valid Aubrite 1100000.0 Fell 01/01/1948 12:00:00 AM 39.68333 -99.86667 (39.68333, -99.86667)
920 Sikhote-Alin 23593 Valid Iron, IIAB 23000000.0 Fell 01/01/1947 12:00:00 AM 46.16000 134.65333 (46.16, 134.65333)

Tip: Boolean masks can be used with loc[] and iloc[].
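
For example, a quick sketch combining a mask with loc[] to pull out just the names of the heavy meteorites that were observed falling:

# same mask as above, but loc[] lets us select a column at the same time
meteorites.loc[
    (meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell'),
    'name'
]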

Calculating summary statistics#

In the next section of this workshop, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the meteorites data by calculating summary statistics.

How many of the meteorites were found versus observed falling?#

meteorites.fall.value_counts(normalize=True)
fall
Found    0.975785
Fell     0.024215
Name: proportion, dtype: float64

Tip: We passed in normalize=True to see this result as proportions of the total rather than raw counts. Check the documentation for additional functionality.

What was the mass of the average meteorite?#

meteorites['mass (g)'].mean()
13278.078548601512

Important: The mean isn’t always the best measure of central tendency. If there are outliers in the distribution, the mean will be skewed. Here, the mean is being pulled higher by some very heavy meteorites – the distribution is right-skewed.

Taking a look at some quantiles at the extremes of the distribution shows that the mean is between the 95th and 99th percentile of the distribution, so it isn’t a good measure of central tendency here:

meteorites['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])
0.01        0.44
0.05        1.10
0.50       32.60
0.95     4000.00
0.99    50600.00
Name: mass (g), dtype: float64

A better measure in this case is the median (50th percentile), since it is robust to outliers:

meteorites['mass (g)'].median()
32.6

What was the mass of the heaviest meteorite?#

The idxmax() method returns the row label of the heaviest meteorite, not the mass itself:

meteorites['mass (g)'].idxmax()
16392

Let’s extract the information on this meteorite:

meteorites.loc[meteorites['mass (g)'].idxmax()]
name                             Hoba
id                              11890
nametype                        Valid
recclass                    Iron, IVB
mass (g)                   60000000.0
fall                            Found
year           01/01/1920 12:00:00 AM
reclat                      -19.58333
reclong                      17.91667
GeoLocation     (-19.58333, 17.91667)
Name: 16392, dtype: object

Fun fact: This meteorite landed in Namibia and is a tourist attraction.

[Image: the Hoba meteorite]
Source: Wikipedia

How many different types of meteorite classes are represented in this dataset?#

meteorites.recclass.nunique()
466

Some examples:

meteorites.recclass.unique()[:14]
array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',
       'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an'],
      dtype=object)

Note: All fields preceded with “rec” are the values recommended by The Meteoritical Society. Check out this Wikipedia article for some information on meteorite classes.

Get some summary statistics on the data itself#

We can get common summary statistics for all columns at once. By default, this will only be numeric columns, but here, we will summarize everything together:

meteorites.describe(include='all')
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
count 45716 45716.000000 45716 45716 4.558500e+04 45716 45425 38401.000000 38401.000000 38401
unique 45716 NaN 2 466 NaN 2 266 NaN NaN 17100
top Aachen NaN Valid L6 NaN Found 01/01/2003 12:00:00 AM NaN NaN (0.0, 0.0)
freq 1 NaN 45641 8285 NaN 44609 3323 NaN NaN 6214
mean NaN 26889.735104 NaN NaN 1.327808e+04 NaN NaN -39.122580 61.074319 NaN
std NaN 16860.683030 NaN NaN 5.749889e+05 NaN NaN 46.378511 80.647298 NaN
min NaN 1.000000 NaN NaN 0.000000e+00 NaN NaN -87.366670 -165.433330 NaN
25% NaN 12688.750000 NaN NaN 7.200000e+00 NaN NaN -76.714240 0.000000 NaN
50% NaN 24261.500000 NaN NaN 3.260000e+01 NaN NaN -71.500000 35.666670 NaN
75% NaN 40656.750000 NaN NaN 2.026000e+02 NaN NaN 0.000000 157.166670 NaN
max NaN 57458.000000 NaN NaN 6.000000e+07 NaN NaN 81.166670 354.473330 NaN

Important: NaN values signify missing data. For instance, the fall column contains strings, so there is no value for mean; likewise, mass (g) is numeric, so we don’t have entries for the categorical summary statistics (unique, top, freq).

Check out the documentation for more descriptive statistics#

Group-by operations#

Rather than perform aggregations, like mean() or describe(), on the full dataset at once, we can perform these calculations per group by first calling groupby():

import numpy as np
meteorites.groupby('recclass').describe(include=np.number)
id mass (g) ... reclat reclong
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
recclass
Acapulcoite 54.0 27540.851852 17364.454850 10.0 16427.75 31242.0 38500.75 56689.0 54.0 490.424407 ... 12.662498 35.23600 38.0 40.802177 96.847026 -141.50000 0.000000 30.833335 159.361125 160.41700
Acapulcoite/Lodranite 6.0 29920.333333 21322.728209 8041.0 11263.00 29408.5 48659.50 52373.0 6.0 31.793333 ... 0.000000 0.00000 5.0 95.803296 87.469926 0.00000 0.000000 157.176110 160.491390 161.34898
Acapulcoite/lodranite 3.0 46316.000000 8881.227167 36290.0 42876.00 49462.0 51329.00 53196.0 3.0 44.933333 ... 0.000000 0.00000 2.0 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.00000
Achondrite-prim 9.0 36680.000000 8512.903647 31230.0 31269.00 33299.0 33574.00 53843.0 9.0 1078.000000 ... 0.000000 0.00000 3.0 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.00000
Achondrite-ung 57.0 41284.561404 15758.680448 4257.0 33357.00 47712.0 53609.00 57268.0 57.0 895.845614 ... 0.000000 45.70000 37.0 15.765216 32.217549 -4.33333 0.000000 0.000000 26.000000 162.56681
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Unknown 7.0 11619.285714 6561.200594 425.0 8808.00 12348.0 16417.50 18111.0 0.0 NaN ... -7.216670 50.66667 5.0 44.340000 79.543348 -58.43333 2.333330 29.433330 122.000000 126.36667
Ureilite 300.0 28814.100000 17559.021207 285.0 11327.75 30954.0 45589.75 57374.0 300.0 490.014900 ... 19.216908 60.24556 214.0 55.952609 70.744550 -141.50000 0.000000 35.666670 156.233510 175.00000
Ureilite-an 4.0 28119.750000 16106.130662 14511.0 15866.25 24526.5 36780.00 48915.0 4.0 1287.125000 ... -29.468790 20.74575 3.0 117.873990 74.016539 32.41275 96.081375 159.750000 160.604610 161.45922
Ureilite-pmict 23.0 20649.652174 16273.007538 5617.0 6364.00 16978.0 33546.00 52585.0 23.0 262.685652 ... 27.033873 27.05417 18.0 65.349957 67.814906 0.00000 16.365830 16.390080 135.975000 160.53778
Winonaite 25.0 30857.640000 13966.904948 10169.0 20167.00 25346.0 44930.00 54512.0 25.0 1129.013200 ... 20.499997 53.03639 18.0 7.926038 62.545438 -111.40000 -3.239580 0.000000 35.666670 168.00000

466 rows × 32 columns

meteorites.groupby('recclass')['mass (g)'].mean().head()
recclass
Acapulcoite               490.424407
Acapulcoite/Lodranite      31.793333
Acapulcoite/lodranite      44.933333
Achondrite-prim          1078.000000
Achondrite-ung            895.845614
Name: mass (g), dtype: float64
meteorites.groupby('recclass').agg({'mass (g)':['mean', 'std'], 'name': ['count']}).head()
mass (g) name
mean std count
recclass
Acapulcoite 490.424407 1279.406632 54
Acapulcoite/Lodranite 31.793333 26.215771 6
Acapulcoite/lodranite 44.933333 73.678106 3
Achondrite-prim 1078.000000 1246.035613 9
Achondrite-ung 895.845614 2217.850529 57

We are only scratching the surface; some additional functionality to be aware of includes the following (the first two points are sketched after this list):

  • We can group by multiple columns – this creates a hierarchical index.

  • Groups can be excluded from calculations with the filter() method.

  • We can group on content in the index using the level parameter, e.g., groupby(level=0), or the level’s name, e.g., groupby(level='year').

  • We can group by date ranges if we use a pd.Grouper() object.
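
A minimal sketch of the first two points, using the meteorites data from above:

# grouping by two columns produces a hierarchical (MultiIndex) result
meteorites.groupby(['fall', 'nametype'])['mass (g)'].mean()

# keep only the rows belonging to classes with more than 1,000 meteorites
meteorites.groupby('recclass').filter(lambda group: len(group) > 1000)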

Data wrangling#

To prepare our data for analysis, we need to perform data wrangling.

In this lecture, we will learn how to clean and reformat data (e.g., renaming columns and fixing data type mismatches), restructure/reshape it, and enrich it (e.g., discretizing columns, calculating aggregations, and combining data sources).

Data cleaning#

In this section, we will take a look at creating, renaming, and dropping columns; type conversion; and sorting – all of which make our analysis easier. We will be working with the 2019 Yellow Taxi Trip Data provided by NYC Open Data.

import pandas as pd

taxis = pd.read_csv('_data/2019_Yellow_Taxi_Trip_Data.csv')
taxis.head()
vendorid tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance ratecodeid store_and_fwd_flag pulocationid dolocationid payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
0 2 2019-10-23T16:39:42.000 2019-10-23T17:14:10.000 1 7.93 1 N 138 170 1 29.5 1.0 0.5 7.98 6.12 0.3 47.90 2.5
1 1 2019-10-23T16:32:08.000 2019-10-23T16:45:26.000 1 2.00 1 N 11 26 1 10.5 1.0 0.5 0.00 0.00 0.3 12.30 0.0
2 2 2019-10-23T16:08:44.000 2019-10-23T16:21:11.000 1 1.36 1 N 163 162 1 9.5 1.0 0.5 2.00 0.00 0.3 15.80 2.5
3 2 2019-10-23T16:22:44.000 2019-10-23T16:43:26.000 1 1.00 1 N 170 163 1 13.0 1.0 0.5 4.32 0.00 0.3 21.62 2.5
4 2 2019-10-23T16:45:11.000 2019-10-23T16:58:49.000 1 1.96 1 N 163 236 1 10.5 1.0 0.5 0.50 0.00 0.3 15.30 2.5

Source: NYC Open Data collected via SODA.

Dropping columns#

Let’s start by dropping the ID columns and the store_and_fwd_flag column, which we won’t be using.

mask = taxis.columns.str.contains('id$|store_and_fwd_flag', regex=True)
columns_to_drop = taxis.columns[mask]
columns_to_drop
Index(['vendorid', 'ratecodeid', 'store_and_fwd_flag', 'pulocationid',
       'dolocationid'],
      dtype='object')
taxis = taxis.drop(columns=columns_to_drop)
taxis.head()
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
0 2019-10-23T16:39:42.000 2019-10-23T17:14:10.000 1 7.93 1 29.5 1.0 0.5 7.98 6.12 0.3 47.90 2.5
1 2019-10-23T16:32:08.000 2019-10-23T16:45:26.000 1 2.00 1 10.5 1.0 0.5 0.00 0.00 0.3 12.30 0.0
2 2019-10-23T16:08:44.000 2019-10-23T16:21:11.000 1 1.36 1 9.5 1.0 0.5 2.00 0.00 0.3 15.80 2.5
3 2019-10-23T16:22:44.000 2019-10-23T16:43:26.000 1 1.00 1 13.0 1.0 0.5 4.32 0.00 0.3 21.62 2.5
4 2019-10-23T16:45:11.000 2019-10-23T16:58:49.000 1 1.96 1 10.5 1.0 0.5 0.50 0.00 0.3 15.30 2.5

Tip: Another way to do this is to select the columns we want to keep: taxis.loc[:, ~mask].

Renaming columns#

Next, let’s rename the datetime columns:

taxis = taxis.rename(
    columns={
        'tpep_pickup_datetime': 'pickup', 
        'tpep_dropoff_datetime': 'dropoff'
    }
)
taxis.columns
Index(['pickup', 'dropoff', 'passenger_count', 'trip_distance', 'payment_type',
       'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'congestion_surcharge'],
      dtype='object')

Type conversion#

Notice anything off with the data types?

taxis.dtypes
pickup                    object
dropoff                   object
passenger_count            int64
trip_distance            float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
congestion_surcharge     float64
dtype: object

Both pickup and dropoff should be stored as datetimes. Let’s fix this:

taxis[['pickup', 'dropoff']] = taxis[['pickup', 'dropoff']].apply(pd.to_datetime)
taxis.dtypes
pickup                   datetime64[ns]
dropoff                  datetime64[ns]
passenger_count                   int64
trip_distance                   float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object

Tip: There are other ways to perform type conversion. For numeric values, we can use the pd.to_numeric() function, and we will see the astype() method, which is a more generic method, a little later.
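
As a quick sketch of astype() (the downcast to int8 here is purely illustrative):

# passenger counts are small integers, so they fit comfortably in int8
taxis.passenger_count.astype('int8')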

Creating new columns#

Let’s calculate the following for each row:

  1. elapsed time of the trip

  2. the tip percentage

  3. the total taxes, tolls, fees, and surcharges

  4. the average speed of the taxi

taxis = taxis.assign(
    elapsed_time=lambda x: x.dropoff - x.pickup, # 1
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip, # 2
    fees=lambda x: x.cost_before_tip - x.fare_amount, # 3
    avg_speed=lambda x: x.trip_distance.div(
        x.elapsed_time.dt.total_seconds() / 60 / 60
    ) # 4
)

Our new columns get added to the right:

taxis.head(2)
pickup dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
0 2019-10-23 16:39:42 2019-10-23 17:14:10 1 7.93 1 29.5 1.0 0.5 7.98 6.12 0.3 47.9 2.5 0 days 00:34:28 39.92 0.1999 10.42 13.804642
1 2019-10-23 16:32:08 2019-10-23 16:45:26 1 2.00 1 10.5 1.0 0.5 0.00 0.00 0.3 12.3 0.0 0 days 00:13:18 12.30 0.0000 1.80 9.022556

Some things to note:

  • We used lambda functions to 1) avoid typing taxis repeatedly and 2) be able to access the cost_before_tip and elapsed_time columns in the same assign() call in which we create them.

  • To create a single new column, we can also use df['new_col'] = <values>, as sketched below.
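
For instance, a minimal sketch (the trip_minutes column is hypothetical, and we work on a copy so the taxis DataFrame used below is unchanged):

demo = taxis.copy()
demo['trip_minutes'] = demo.elapsed_time.dt.total_seconds() / 60  # new column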

Sorting by values#

We can use the sort_values() method to sort based on any number of columns:

taxis.sort_values(['passenger_count', 'pickup'], ascending=[False, True]).head()
pickup dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
5997 2019-10-23 15:55:19 2019-10-23 16:08:25 6 1.58 2 10.0 1.0 0.5 0.0 0.0 0.3 14.3 2.5 0 days 00:13:06 14.3 0.000000 4.3 7.236641
443 2019-10-23 15:56:59 2019-10-23 16:04:33 6 1.46 2 7.5 1.0 0.5 0.0 0.0 0.3 11.8 2.5 0 days 00:07:34 11.8 0.000000 4.3 11.577093
8722 2019-10-23 15:57:33 2019-10-23 16:03:34 6 0.62 1 5.5 1.0 0.5 0.7 0.0 0.3 10.5 2.5 0 days 00:06:01 9.8 0.071429 4.3 6.182825
4198 2019-10-23 15:57:38 2019-10-23 16:05:07 6 1.18 1 7.0 1.0 0.5 1.0 0.0 0.3 12.3 2.5 0 days 00:07:29 11.3 0.088496 4.3 9.461024
8238 2019-10-23 15:58:31 2019-10-23 16:29:29 6 3.23 2 19.5 1.0 0.5 0.0 0.0 0.3 23.8 2.5 0 days 00:30:58 23.8 0.000000 4.3 6.258342

To pick out the largest/smallest rows, use nlargest() / nsmallest() instead. Looking at the 3 trips with the longest elapsed time, we see some possible data integrity issues:

taxis.nlargest(3, 'elapsed_time')
pickup dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
7576 2019-10-23 16:52:51 2019-10-24 16:51:44 1 3.75 1 17.5 1.0 0.5 0.0 0.0 0.3 21.8 2.5 0 days 23:58:53 21.8 0.0 4.3 0.156371
6902 2019-10-23 16:51:42 2019-10-24 16:50:22 1 11.19 2 39.5 1.0 0.5 0.0 0.0 0.3 41.3 0.0 0 days 23:58:40 41.3 0.0 1.8 0.466682
4975 2019-10-23 16:18:51 2019-10-24 16:17:30 1 0.70 2 7.0 1.0 0.5 0.0 0.0 0.3 11.3 2.5 0 days 23:58:39 11.3 0.0 4.3 0.029194

Exercise 1#

Read in the meteorite data from the Meteorite_Landings.csv file, rename the mass (g) column to mass, and drop all the latitude and longitude columns. Sort the result by mass in descending order.

Working with the index#

So far, we haven’t really worked with the index because it’s just been a row number; however, we can change the values we have in the index to access additional features of the pandas library.

Setting and sorting the index#

Currently, we have a RangeIndex, but we can switch to a DatetimeIndex by specifying a datetime column when calling set_index():

taxis = taxis.set_index('pickup')
taxis.head(3)
dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
pickup
2019-10-23 16:39:42 2019-10-23 17:14:10 1 7.93 1 29.5 1.0 0.5 7.98 6.12 0.3 47.9 2.5 0 days 00:34:28 39.92 0.199900 10.42 13.804642
2019-10-23 16:32:08 2019-10-23 16:45:26 1 2.00 1 10.5 1.0 0.5 0.00 0.00 0.3 12.3 0.0 0 days 00:13:18 12.30 0.000000 1.80 9.022556
2019-10-23 16:08:44 2019-10-23 16:21:11 1 1.36 1 9.5 1.0 0.5 2.00 0.00 0.3 15.8 2.5 0 days 00:12:27 13.80 0.144928 4.30 6.554217

Since we have a sample of the full dataset, let’s sort the index to order by pickup time:

taxis = taxis.sort_index()

Tip: taxis.sort_index(axis=1) will sort the columns by name. The axis parameter is present throughout the pandas library: axis=0 targets rows and axis=1 targets columns.

Resetting the index#

We will be working with time series later in this section, but sometimes we want to reset our index to row numbers and restore the columns. We can make pickup a column again with the reset_index() method:

taxis = taxis.reset_index()
taxis.head()
pickup dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
0 2019-10-23 07:05:34 2019-10-23 08:03:16 3 14.68 1 50.0 1.0 0.5 4.0 0.0 0.3 55.8 0.0 0 days 00:57:42 51.8 0.077220 1.8 15.265165
1 2019-10-23 07:48:58 2019-10-23 07:52:09 1 0.67 2 4.5 1.0 0.5 0.0 0.0 0.3 8.8 2.5 0 days 00:03:11 8.8 0.000000 4.3 12.628272
2 2019-10-23 08:02:09 2019-10-24 07:42:32 1 8.38 1 32.0 1.0 0.5 5.5 0.0 0.3 41.8 2.5 0 days 23:40:23 36.3 0.151515 4.3 0.353989
3 2019-10-23 08:18:47 2019-10-23 08:36:05 1 2.39 2 12.5 1.0 0.5 0.0 0.0 0.3 16.8 2.5 0 days 00:17:18 16.8 0.000000 4.3 8.289017
4 2019-10-23 09:27:16 2019-10-23 09:33:13 2 1.11 2 6.0 1.0 0.5 0.0 0.0 0.3 7.8 0.0 0 days 00:05:57 7.8 0.000000 1.8 11.193277

Exercise 2#

Using the meteorite data from the Meteorite_Landings.csv file, update the year column to only contain the year, convert it to a numeric data type, and create a new column indicating whether the meteorite was observed falling before 1970.

Set the index to the id column and extract all the rows with IDs between 10,036 and 10,040 (inclusive) with loc[].

Hint 1: Use year.str.slice() to grab a substring.

Hint 2: Make sure to sort the index before using loc[] to select the range.

Bonus: There’s a data entry error in the year column. Can you find it? (Don’t spend too much time on this.)

Time series#

  • When working with time series data, pandas provides us with additional functionality to not just compare the observations in our dataset, but to use their relationship in time to analyze the data.

  • In this section, we will see a few such operations for selecting date/time ranges, calculating changes over time, performing window calculations, and resampling the data to different date/time intervals.

Selecting based on date and time#

We will continue to use the taxis dataset and this time we will set the dropoff column as the index and sort the data:

taxis = taxis.set_index('dropoff').sort_index()

We can now select ranges from our data based on the datetime the same way we did with row numbers:

taxis['2019-10-24 12':'2019-10-24 13']
pickup passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
dropoff
2019-10-24 12:30:08 2019-10-23 13:25:42 4 0.76 2 5.0 1.0 0.5 0.00 0.0 0.3 9.30 2.5 0 days 23:04:26 9.3 0.0 4.3 0.032938
2019-10-24 12:42:01 2019-10-23 13:34:03 2 1.58 1 7.5 1.0 0.5 2.36 0.0 0.3 14.16 2.5 0 days 23:07:58 11.8 0.2 4.3 0.068301

We can also represent this range with shorthand. Note that we must use loc[] here:

taxis.loc['2019-10-24 12']
pickup passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
dropoff
2019-10-24 12:30:08 2019-10-23 13:25:42 4 0.76 2 5.0 1.0 0.5 0.00 0.0 0.3 9.30 2.5 0 days 23:04:26 9.3 0.0 4.3 0.032938
2019-10-24 12:42:01 2019-10-23 13:34:03 2 1.58 1 7.5 1.0 0.5 2.36 0.0 0.3 14.16 2.5 0 days 23:07:58 11.8 0.2 4.3 0.068301

However, if we want to look at this time range across days, we need another strategy.

We can pull out the dropoffs that happened between a certain time range on any day with the between_time() method:

taxis.between_time('12:00', '13:00')
pickup passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
dropoff
2019-10-23 12:53:49 2019-10-23 12:35:27 5 2.49 1 13.5 1.0 0.5 2.20 0.0 0.3 20.00 2.5 0 days 00:18:22 17.8 0.123596 4.3 8.134301
2019-10-24 12:30:08 2019-10-23 13:25:42 4 0.76 2 5.0 1.0 0.5 0.00 0.0 0.3 9.30 2.5 0 days 23:04:26 9.3 0.000000 4.3 0.032938
2019-10-24 12:42:01 2019-10-23 13:34:03 2 1.58 1 7.5 1.0 0.5 2.36 0.0 0.3 14.16 2.5 0 days 23:07:58 11.8 0.200000 4.3 0.068301

Tip: The at_time() method can be used to extract all entries at a given time (e.g., 12:35:27).
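
For instance, a one-line sketch using the timestamp from the first row above:

taxis.at_time('12:35:27')  # all rows with a dropoff at exactly 12:35:27, on any day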

Finally, head() and tail() limit us to a number of rows, but we may be interested in rows within the first/last 2 hours (or any other time interval) of the data, in which case we can use first() / last() (note that these methods are deprecated in newer versions of pandas in favor of building a Boolean mask on the index and filtering with loc[]):

taxis.first('1h')
pickup passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge elapsed_time cost_before_tip tip_pct fees avg_speed
dropoff
2019-10-23 07:52:09 2019-10-23 07:48:58 1 0.67 2 4.5 1.0 0.5 0.0 0.0 0.3 8.8 2.5 0 days 00:03:11 8.8 0.00000 4.3 12.628272
2019-10-23 08:03:16 2019-10-23 07:05:34 3 14.68 1 50.0 1.0 0.5 4.0 0.0 0.3 55.8 0.0 0 days 00:57:42 51.8 0.07722 1.8 15.265165
2019-10-23 08:36:05 2019-10-23 08:18:47 1 2.39 2 12.5 1.0 0.5 0.0 0.0 0.3 16.8 2.5 0 days 00:17:18 16.8 0.00000 4.3 8.289017

Tip: Available date/time offsets can be found in the pandas documentation here.

For the rest of this section, we will be working with a modified version of the TSA traveler throughput data (see the notes for how we made it):

tsa_melted_holiday_travel = pd.read_csv('_data/tsa_melted_holiday_travel.csv', parse_dates=True, index_col='date')
tsa_melted_holiday_travel
year travelers holiday
date
2019-01-01 2019 2126398.0 New Year's Day
2019-01-02 2019 2345103.0 New Year's Day
2019-01-03 2019 2202111.0 NaN
2019-01-04 2019 2150571.0 NaN
2019-01-05 2019 1975947.0 NaN
... ... ... ...
2021-05-10 2021 1657722.0 NaN
2021-05-11 2021 1315493.0 NaN
2021-05-12 2021 1424664.0 NaN
2021-05-13 2021 1743515.0 NaN
2021-05-14 2021 1716561.0 NaN

864 rows × 3 columns

Calculating change over time#

tsa_melted_holiday_travel.loc['2020'].assign(
    one_day_change=lambda x: x.travelers.diff(),
    seven_day_change=lambda x: x.travelers.diff(7),
).head(10)
year travelers holiday one_day_change seven_day_change
date
2020-01-01 2020 2311732.0 New Year's Day NaN NaN
2020-01-02 2020 2178656.0 New Year's Day -133076.0 NaN
2020-01-03 2020 2422272.0 NaN 243616.0 NaN
2020-01-04 2020 2210542.0 NaN -211730.0 NaN
2020-01-05 2020 1806480.0 NaN -404062.0 NaN
2020-01-06 2020 1815040.0 NaN 8560.0 NaN
2020-01-07 2020 2034472.0 NaN 219432.0 NaN
2020-01-08 2020 2072543.0 NaN 38071.0 -239189.0
2020-01-09 2020 1687974.0 NaN -384569.0 -490682.0
2020-01-10 2020 2183734.0 NaN 495760.0 -238538.0

Tip: To perform operations other than subtraction, take a look at the shift() method. It also makes it possible to perform operations across columns.
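
As a minimal sketch, diff(7) above is equivalent to subtracting a 7-row shift:

travelers = tsa_melted_holiday_travel.loc['2020'].travelers
travelers - travelers.shift(7)  # same result as travelers.diff(7)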

Resampling#

We can use resampling to aggregate time series data to a new frequency:

tsa_melted_holiday_travel['2019':'2021-Q1'].select_dtypes(include='number')\
    .resample('QE').agg(['sum', 'mean', 'std'])
year travelers
sum mean std sum mean std
date
2019-03-31 181710 2019.0 0.0 189281658.0 2.103130e+06 282239.618354
2019-06-30 183729 2019.0 0.0 221756667.0 2.436886e+06 212600.697665
2019-09-30 185748 2019.0 0.0 220819236.0 2.400209e+06 260140.242892
2019-12-31 185748 2019.0 0.0 211103512.0 2.294603e+06 260510.040655
2020-03-31 181800 2020.0 0.0 155354148.0 1.726157e+06 685094.277420
2020-06-30 183820 2020.0 0.0 25049083.0 2.752646e+05 170127.402046
2020-09-30 185840 2020.0 0.0 63937115.0 6.949686e+05 103864.705739
2020-12-31 185840 2020.0 0.0 77541248.0 8.428397e+05 170245.484185
2021-03-31 181890 2021.0 0.0 86094635.0 9.566071e+05 280399.809061

Window calculations#

  • Window calculations are similar to group by calculations except the group over which the calculation is performed isn’t static – it can move or expand.

  • Pandas provides functionality for constructing a variety of windows, including moving/rolling windows, expanding windows (e.g., the cumulative sum or mean up to the current date in a time series), and exponentially weighted moving windows (which weight closer observations more heavily than more distant ones).

  • We will only look at rolling and expanding calculations here.

Performing a window calculation is very similar to a group by calculation – we first define the window, and then we specify the aggregation:

tsa_melted_holiday_travel.loc['2020'].assign(
    **{
        '7D MA': lambda x: x.rolling('7D').travelers.mean(),
        'YTD mean': lambda x: x.expanding().travelers.mean()
      }
).head(10)
year travelers holiday 7D MA YTD mean
date
2020-01-01 2020 2311732.0 New Year's Day 2.311732e+06 2.311732e+06
2020-01-02 2020 2178656.0 New Year's Day 2.245194e+06 2.245194e+06
2020-01-03 2020 2422272.0 NaN 2.304220e+06 2.304220e+06
2020-01-04 2020 2210542.0 NaN 2.280800e+06 2.280800e+06
2020-01-05 2020 1806480.0 NaN 2.185936e+06 2.185936e+06
2020-01-06 2020 1815040.0 NaN 2.124120e+06 2.124120e+06
2020-01-07 2020 2034472.0 NaN 2.111313e+06 2.111313e+06
2020-01-08 2020 2072543.0 NaN 2.077144e+06 2.106467e+06
2020-01-09 2020 1687974.0 NaN 2.007046e+06 2.059968e+06
2020-01-10 2020 2183734.0 NaN 1.972969e+06 2.072344e+06

To understand what’s happening, it’s best to visualize the original data and the result, so here’s a sneak peek of plotting with pandas.

_ = tsa_melted_holiday_travel.loc['2020'].assign(
    **{
        '7D MA': lambda x: x.rolling('7D').travelers.mean(),
        'YTD mean': lambda x: x.expanding().travelers.mean()
      }
).plot(title='2020 TSA Traveler Throughput', ylabel='travelers', alpha=.8)
[Figure: line plot titled '2020 TSA Traveler Throughput' showing the daily travelers series together with the 7D MA and YTD mean columns]

Other window types follow the same pattern; for example, exponentially weighted windows are available through the ewm() method.
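
A minimal sketch with the TSA data (span=7 weights the most recent days most heavily):

tsa_melted_holiday_travel.loc['2020'].travelers.ewm(span=7).mean().head()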

Exercise 3#

Using the taxi trip data in the 2019_Yellow_Taxi_Trip_Data.csv file, resample the data to an hourly frequency based on the dropoff time. Calculate the total trip_distance, fare_amount, tolls_amount, and tip_amount, then find the 5 hours with the most tips.

Training, test and validation data#

Data-driven models fundamentally rely on held-back data to guard against over-fitting.

Over-fitting occurs when a model fits the training data very well but fails to generalize to unseen data – defeating the purpose of a predictive model.

This becomes particularly dangerous when the number of parameters is large, such as when training neural networks, because it is easy for these expressive models to fit almost any data.

Training, test and validation data#

We therefore need to keep some data back to test how well our model generalizes – this is our test data.

[Figure: workflow for holding back test data]
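
As a minimal sketch (assuming scikit-learn is installed; the arrays are toy placeholders), a hold-out test set can be carved off with train_test_split:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy targets
# hold back 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)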

Validation data#

Note that the inner cross-validation loop also requires some data that is (ideally) distinct from our training data – this is our validation data.

Validation data#

There are many ways to choose this data, but a common approach is k-fold cross-validation:

[Figure: k-fold cross-validation workflow]
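
A sketch of 5-fold cross-validation with scikit-learn's KFold (again on toy data):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy feature matrix
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # each row appears in exactly one validation fold
    print(f'fold {fold}: validation rows {val_idx}')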

Accounting for correlations and data balance#

Choosing appropriate splits can be complicated if your data is imbalanced or correlated in some way:

[Figure: splitting imbalanced or correlated data]

Accounting for correlations and data balance#

Ideally, you want to split the data so that the validation folds sample the different classes and groups evenly. scikit-learn provides many options for achieving this (see the docs); one of them is sketched below.
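
For example, a sketch with StratifiedKFold, which preserves the class proportions in every fold (the imbalanced labels here are toy data):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.array([0] * 8 + [1] * 2)    # imbalanced labels: 80% / 20%
skf = StratifiedKFold(n_splits=2)
for train_idx, val_idx in skf.split(X, y):
    print(y[val_idx])  # each validation fold keeps the 80/20 ratio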

Shuffling the data is often an easy fix, but for large datasets it may be necessary to shuffle just the indices rather than the data itself.

In general, be careful of, e.g., seasonality, long-term trends, and spatial covariances.

Golden rule: Treat your test data as sacred!