pandas Part I#

This document includes a sequence of notebooks to introduce data manipulation in Python using the pandas library.

Basic building blocks#

pandas works with tabular data (rows and columns) through Series, DataFrame objects.

import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

Source: NASA Open Data Portal

type(meteor)

A DataFrame is the basic “spreadsheet” or “table” used in Python. A DataFrame object is composed of one or more Series objects (columns), and indexed by Index (rows).

# head.() shows us the first several rows
meteor.head()
# investigate a column (note its type)
print(type(meteor.name))

meteor.name.head()
# investigate how the columns are labeled
meteor.columns
# investigate how the rows are indexed
meteor.index

DataFrame sources#

DataFrames can be created from reading a file, scraping the web, and/or API requests.

Reading from a file#

import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

API requests (more details later)#

import requests

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50000}  # Depending on the API, there may be a default limit of records one can obtain
)  
response  # A 200 exit code indicates success
<Response [200]>

Tip: A list of HTTP GET exit codes is available at https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

response.ok  # checks ok flag
True
# Extract data if request is successful
if response.ok:
    payload = response.json()
else:
    print(f'Request failed with exit code {response.status_code}.')
    payload = None
# Load into DataFrame
meteor_json = pd.DataFrame(payload)
meteor_json.head()
name id nametype recclass mass fall year reclat reclong geolocation :@computed_region_cbhk_fwbd :@computed_region_nnqa_25f4
0 Aachen 1 Valid L5 21 Fell 1880-01-01T00:00:00.000 50.775000 6.083330 {'latitude': '50.775', 'longitude': '6.08333'} NaN NaN
1 Aarhus 2 Valid H6 720 Fell 1951-01-01T00:00:00.000 56.183330 10.233330 {'latitude': '56.18333', 'longitude': '10.23333'} NaN NaN
2 Abee 6 Valid EH4 107000 Fell 1952-01-01T00:00:00.000 54.216670 -113.000000 {'latitude': '54.21667', 'longitude': '-113.0'} NaN NaN
3 Acapulco 10 Valid Acapulcoite 1914 Fell 1976-01-01T00:00:00.000 16.883330 -99.900000 {'latitude': '16.88333', 'longitude': '-99.9'} NaN NaN
4 Achiras 370 Valid L6 780 Fell 1902-01-01T00:00:00.000 -33.166670 -64.950000 {'latitude': '-33.16667', 'longitude': '-64.95'} NaN NaN
# Removing auto-computed columns
mask = meteor_json.columns.str.contains('@computed_region', regex=True)

columns_to_drop = meteor_json.columns[mask]
meteor_json = meteor_json.drop(columns=columns_to_drop)
# storing downloaded data into files
# meteor_json.to_csv('meteor.csv')  

Basic inspection#

What type of data are available in the dataframe? Are there missing data?#

meteor.info()

How much data are available?#

meteor.shape

Subsetting and indexing#

Effectively extracting data from a full dataset requires fluency in how the DataFrame can be subsetted and how it is indexed.

Calling a column by attributes (if valid)#

meteor.recclass

Calling a column by keys#

meteor['mass (g)']

Multiple columns by name#

meteor[['name', 'mass (g)']]
name mass (g)
0 Aachen 21.0
1 Aarhus 720.0
2 Abee 107000.0
3 Acapulco 1914.0
4 Achiras 780.0
... ... ...
45711 Zillah 002 172.0
45712 Zinder 46.0
45713 Zlin 3.3
45714 Zubkovsky 2167.0
45715 Zulu Queen 200.0

45716 rows × 2 columns

Selecting rows#

meteor[5:10]  # end-exclusive
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
5 Adhi Kot 379 Valid EH4 4239.0 Fell 1919.0 32.10000 71.80000 (32.1, 71.8)
6 Adzhi-Bogdo (stone) 390 Valid LL3-6 910.0 Fell 1949.0 44.83333 95.16667 (44.83333, 95.16667)
7 Agen 392 Valid H5 30000.0 Fell 1814.0 44.21667 0.61667 (44.21667, 0.61667)
8 Aguada 398 Valid L6 1620.0 Fell 1930.0 -31.60000 -65.23333 (-31.6, -65.23333)
9 Aguila Blanca 417 Valid L 1440.0 Fell 1920.0 -30.86667 -64.55000 (-30.86667, -64.55)

Indexing with .loc[], iloc[]#

  • .loc[] indexes by row labels

  • .iloc[] indexes by indices

meteor.loc[0:4, 'name':'mass (g)']
name id nametype recclass mass (g)
0 Aachen 1 Valid L5 21.0
1 Aarhus 2 Valid H6 720.0
2 Abee 6 Valid EH4 107000.0
3 Acapulco 10 Valid Acapulcoite 1914.0
4 Achiras 370 Valid L6 780.0
meteor.iloc[0:4, 0:5]
name id nametype recclass mass (g)
0 Aachen 1 Valid L5 21.0
1 Aarhus 2 Valid H6 720.0
2 Abee 6 Valid EH4 107000.0
3 Acapulco 10 Valid Acapulcoite 1914.0

Filtering or subsetting by condition#

Selection by condition can be performed by creating a Boolean mask with True/False values to specify which rows/columns to select.

# select records with heavy meteor (mass > 10^7) that are found (fall = 'Found')
mask = (meteor['mass (g)'] > 1e7) & (meteor.fall == 'Found')
mask
0        False
1        False
2        False
3        False
4        False
         ...  
45711    False
45712    False
45713    False
45714    False
45715    False
Length: 45716, dtype: bool

Note: Each condition is surrounded by parentheses, and we use bitwise operator (&, |, ~) instead of logical operators (and, or, not).

meteor[mask]
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
3455 Armanty 2335 Valid Iron, IIIE 28000000.0 Found 1898.0 47.00000 88.00000 (47.0, 88.0)
5016 Bacubirito 4919 Valid Iron, ungrouped 22000000.0 Found 1863.0 26.20000 -107.83333 (26.2, -107.83333)
5365 Campo del Cielo 5247 Valid Iron, IAB-MG 50000000.0 Found 1575.0 -27.46667 -60.58333 (-27.46667, -60.58333)
5370 Canyon Diablo 5257 Valid Iron, IAB-MG 30000000.0 Found 1891.0 35.05000 -111.03333 (35.05, -111.03333)
5373 Cape York 5262 Valid Iron, IIIAB 58200000.0 Found 1818.0 76.13333 -64.93333 (76.13333, -64.93333)
5468 Chupaderos 5363 Valid Iron, IIIAB 24300000.0 Found 1852.0 27.00000 -105.10000 (27.0, -105.1)
12613 Gibeon 10912 Valid Iron, IVA 26000000.0 Found 1836.0 -25.50000 18.00000 (-25.5, 18.0)
16392 Hoba 11890 Valid Iron, IVB 60000000.0 Found 1920.0 -19.58333 17.91667 (-19.58333, 17.91667)
22921 Mbosi 15456 Valid Iron, ungrouped 16000000.0 Found 1930.0 -9.11667 33.06667 (-9.11667, 33.06667)
26174 Morito 16745 Valid Iron, IIIAB 10100000.0 Found 1600.0 27.05000 -105.43333 (27.05, -105.43333)
26297 Mundrabilla 16852 Valid Iron, IAB-ung 24000000.0 Found 1911.0 -30.78333 127.55000 (-30.78333, 127.55)
38317 Willamette 24269 Valid Iron, IIIAB 15500000.0 Found 1902.0 45.36667 -122.58333 (45.36667, -122.58333)
# negation of a mask
meteor[~mask]

Note: Boolean masks can be used with loc[] and iloc[] as well.

Calculating summary statistics#

This section discusses preliminary calculations before conducting further data analysis.

How many of the meteorites were observed falling vs found?#

meteor.fall.value_counts()
fall
Found    44609
Fell      1107
Name: count, dtype: int64
meteor.fall.value_counts(normalize=True)
fall
Found    0.975785
Fell     0.024215
Name: proportion, dtype: float64

Behavior of mass of a meterorite?#

meteor['mass (g)'].mean()
meteor['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])
meteor['mass (g)'].max()
# sometimes helpful to locate the other information related to a particular entry
meteor['mass (g)'].idxmax()
meteor.loc[meteor['mass (g)'].idxmax()]  # note the "index" of max()
name                            Hoba
id                             11890
nametype                       Valid
recclass                   Iron, IVB
mass (g)                  60000000.0
fall                           Found
year                          1920.0
reclat                     -19.58333
reclong                     17.91667
GeoLocation    (-19.58333, 17.91667)
Name: 16392, dtype: object

How many unique classes are in this dataset?#

meteor.recclass.unique()
array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',
       'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an', 'LL5',
       'CI1', 'L/LL4', 'Eucrite-mmict', 'CV3', 'Ureilite-an',
       'Stone-uncl', 'L3', 'Angrite', 'LL6', 'L4', 'Aubrite',
       'Iron, IIAB', 'Iron, IAB-sLL', 'Iron, ungrouped', 'CM2', 'OC',
       'Mesosiderite-A1', 'LL4', 'C2-ung', 'LL3.8', 'Howardite',
       'Eucrite-pmict', 'Diogenite', 'LL3.15', 'LL3.9', 'Iron, IAB-MG',
       'H/L3.9', 'Iron?', 'Eucrite', 'H4-an', 'L/LL6', 'Iron, IIIAB',
       'H/L4', 'H4-5', 'L3.7', 'LL3.4', 'Martian (chassignite)', 'EL6',
       'H3.8', 'H3-5', 'H5-6', 'Mesosiderite', 'H5-7', 'L3-6', 'H4-6',
       'Ureilite', 'Iron, IID', 'Mesosiderite-A3/4', 'CO3.3', 'H3',
       'EH3/4-an', 'Iron, IIE', 'L/LL5', 'H3.7', 'CBa', 'H4/5', 'H3/4',
       'H?', 'H3-6', 'L3.4', 'Iron, IAB-sHL', 'L3.7-6', 'EH7-an', 'Iron',
       'CR2', 'CO3.2', 'K3', 'L5/6', 'CK4', 'Iron, IIE-an', 'L3.6',
       'LL3.2', 'Pallasite', 'CO3.5', 'Lodranite', 'Mesosiderite-A3',
       'L3-4', 'H5/6', 'Pallasite, PMG', 'Eucrite-cm', 'L5-6', 'CO3.6',
       'Martian (nakhlite)', 'LL3.6', 'C3-ung', 'H3-4', 'CO3.4', 'EH3',
       'Iron, IAB-ung', 'Winonaite', 'LL', 'Eucrite-br', 'Iron, IIF',
       'R3.8-6', 'L4-6', 'EH5', 'LL3.00', 'H3.4', 'Martian (shergottite)',
       'Achondrite-ung', 'LL3.3', 'C', 'H/L3.6', 'Iron, IIIAB-an', 'LL7',
       'Mesosiderite-B2', 'LL4-6', 'CO3.7', 'L/LL6-an',
       'Iron, IAB complex', 'Pallasite, PMG-an', 'H3.9/4', 'L3.8',
       'LL5-6', 'LL3.8-6', 'L3.9', 'L4-5', 'L3-5', 'LL4/5', 'L4/5',
       'H3.9', 'H3.6-6', 'H3.8-5', 'H3.8/4', 'H3.9-5', 'CH3', 'R3.8-5',
       'L3.9/4', 'E4', 'CO3', 'Chondrite-ung', 'H~5', 'H~6', 'L/LL3.10',
       'EL5', 'LL3', 'L~6', 'L~3', 'H~4', 'L(LL)3.5-3.7', 'Iron, IIIE-an',
       'H3.6', 'L3.4-3.7', 'L3.5', 'CM1/2', 'Martian (OPX)', 'Brachinite',
       'LL7(?)', 'LL6(?)', 'Eucrite-Mg rich', 'H3.5-4', 'EL3', 'R3.6',
       'H3.5', 'CM1', 'L/LL3', 'H7', 'L(?)3', 'L3.2', 'L3.7-3.9',
       'Mesosiderite-B1', 'Eucrite-unbr', 'LL3.7', 'CO3.0', 'LL3.5',
       'L3.7-4', 'CV3-an', 'Lunar (anorth)', 'L3.3', 'Iron, IAB-sLM',
       'Lunar', 'Iron, IC', 'Iron, IID-an', 'Iron, IIIE', 'Iron, IVA-an',
       'CK6', 'L3.1', 'CK5', 'H3.3', 'H3.7-6', 'E6', 'H3.0', 'H3.1',
       'L3.0', 'L/LL3.4', 'C6', 'LL3.0', 'Lunar (gabbro)', 'R4', 'C4',
       'Iron, IIG', 'Iron, IIC', 'C1-ung', 'H5-an', 'EH4/5', 'Iron, IIIF',
       'R3-6', 'Mesosiderite-B4', 'L6/7', 'Relict H', 'L-imp melt', 'CK3',
       'H3-an', 'Iron, IVB', 'R3.8', 'L~5', 'Mesosiderite-an',
       'Mesosiderite-A2', 'Pallasite, PES', 'C4-ung', 'Iron, IAB?',
       'Mesosiderite-A', 'R3.5-6', 'H3.9-6', 'Ureilite-pmict', 'LL~6',
       'CK4/5', 'EL4', 'Lunar (feldsp. breccia)', 'L3.9-6', 'H-an',
       'L/LL3-6', 'L/LL3-5', 'H/L3.5', 'H/L3', 'R3-4', 'CK3-an', 'LL4-5',
       'H/L6', 'L3/4', 'H-imp melt', 'CR', 'Chondrite-fusion crust',
       'Iron, IAB-sLH', 'H(L)3-an', 'L(LL)3', 'H(L)3', 'R3', 'L7',
       'CM-an', 'L/LL~6', 'L/LL~5', 'L~4', 'L/LL~4', 'LL(L)3', 'H3.2',
       'L-melt breccia', 'H6-melt breccia', 'H5-melt breccia',
       'H-melt rock', 'Eucrite-an', 'Lunar (bas/anor)', 'LL5/6', 'LL3/4',
       'H3.4/3.5', 'Lunar (basalt)', 'H/L5', 'H(5?)', 'LL-imp melt',
       'Mesosiderite?', 'H~4/5', 'L6-melt breccia', 'L3.5-3.7',
       'Iron, IIAB-an', 'L3.3-3.7', 'L3.2-3.6', 'L3.3-3.6',
       'Acapulcoite/Lodranite', 'Mesosiderite-B', 'CK5/6', 'L3.05', 'C2',
       'C4/5', 'L/LL3.2', 'Iron, IIIAB?', 'L3.5-5', 'L/LL(?)3', 'H4(?)',
       'Iron, IAB-sHH', 'Relict iron', 'EL4/5', 'L5-7', 'Diogenite-an',
       'L-melt rock', 'CR1', 'H5 ', 'L5 ', 'H4 ', 'L4 ', 'E', 'L6 ',
       'H3 ', 'LL6 ', 'H-metal', 'H6 ', 'L-metal', 'Relict OC', 'EH',
       'Mesosiderite-A4', 'L/LL5/6', 'H3.8-4', 'CBb', 'EL6/7', 'EL7',
       'CH/CBb', 'CO3.8', 'H/L~4', 'Mesosiderite-C2', 'R5', 'H4/6',
       'H3.7-5', 'LL3.7-6', 'H3.7/3.8', 'L3.7/3.8', 'EH-imp melt', 'R',
       'Fusion crust', 'Aubrite-an', 'R6', 'LL-melt rock', 'L3.5-3.9',
       'L3.2-3.5', 'L3.3-3.5', 'L3.0-3.7', 'E3-an', 'K', 'E3',
       'Acapulcoite/lodranite', 'CK4-an', 'L(LL)3.05', 'L3.10', 'CB',
       'Diogenite-olivine', 'EL-melt rock', 'EH6', 'Pallasite, ungrouped',
       'L/LL4/5', 'L3.8-an', 'Iron, IAB-an', 'C5/6-ung', 'CV2',
       'Iron, IC-an', 'Lunar (bas. breccia)', 'L3.8-6', 'R3/4', 'R3.9',
       'CK', 'LL3.10', 'R4/5', 'L3.8-5', 'Mesosiderite-C', 'Enst achon',
       'H/L3-4', 'L(H)3', 'LL6/7', 'LL3.1', 'OC3', 'R3.7', 'CO3 ', 'CH3 ',
       'LL~4', 'LL~4/5', 'L(LL)~4', 'H3.05', 'H3.10',
       'Impact melt breccia', 'LL3-5', 'H/L3.7', 'LL3-4', 'CK3/4',
       'Martian', 'CO3.1', 'Lunar (bas/gab brec)', 'Achondrite-prim',
       'LL<3.5', 'CK3.8', 'L/LL-melt rock', 'H6/7', 'EL6 ',
       'Iron, IAB-sHL-an', 'CM2-an', 'R3-5', 'L4-melt rock',
       'L6-melt rock', 'H/L4/5', 'EL3/4', 'H/L6-melt rock',
       'Enst achon-ung', 'L3-7', 'R3.4', 'LL3.05', 'LL4/6', 'LL3.8-4',
       'H3.15', 'C3.0-ung', 'LL-melt breccia', 'LL6-melt breccia',
       'L5-melt breccia', 'LL(L)3.1', 'LL6-an', 'L4-melt breccia',
       'Howardite-an', 'H4-melt breccia', 'Martian (basaltic breccia)',
       'L3-melt breccia', 'L~4-6', 'LL~5', 'R3.5-4', 'CR7',
       'H-melt breccia', 'Lunar (norite)', 'L3.00', 'H3.0-3.4', 'L/LL4-6',
       'CM', 'EH7', 'L4-an', 'E-an', 'H3.8/3.9', 'L3.9-5', 'H3.8-6',
       'H3.4-5', 'L3.0-3.9', 'L3.5-3.8', 'H3.2-3.7', 'L3.6-4',
       'Iron, IIE?', 'C3/4-ung', 'L/LL3.5', 'L/LL3.6/3.7', 'H/L4-5',
       'LL~3', 'Pallasite?', 'LL5-7', 'LL3.9/4', 'H3.8-an', 'CR-an',
       'L/LL5-6', 'L(LL)5', 'L(LL)6', 'LL3.1-3.5', 'E5', 'Lodranite-an',
       'H3.2-6', 'H(?)4', 'E5-an', 'H3.2-an', 'EH6-an', 'Stone-ung',
       'C1/2-ung', 'L/LL'], dtype=object)

General statistics#

The .describe() method includes numeric columns by default. Here we can force it to include all columns.

meteor.describe(include='all')
name id nametype recclass mass (g) fall year reclat reclong GeoLocation
count 45716 45716.000000 45716 45716 4.558500e+04 45716 45425.000000 38401.000000 38401.000000 38401
unique 45716 NaN 2 466 NaN 2 NaN NaN NaN 17100
top Aachen NaN Valid L6 NaN Found NaN NaN NaN (0.0, 0.0)
freq 1 NaN 45641 8285 NaN 44609 NaN NaN NaN 6214
mean NaN 26889.735104 NaN NaN 1.327808e+04 NaN 1991.828817 -39.122580 61.074319 NaN
std NaN 16860.683030 NaN NaN 5.749889e+05 NaN 25.052766 46.378511 80.647298 NaN
min NaN 1.000000 NaN NaN 0.000000e+00 NaN 860.000000 -87.366670 -165.433330 NaN
25% NaN 12688.750000 NaN NaN 7.200000e+00 NaN 1987.000000 -76.714240 0.000000 NaN
50% NaN 24261.500000 NaN NaN 3.260000e+01 NaN 1998.000000 -71.500000 35.666670 NaN
75% NaN 40656.750000 NaN NaN 2.026000e+02 NaN 2003.000000 0.000000 157.166670 NaN
max NaN 57458.000000 NaN NaN 6.000000e+07 NaN 2101.000000 81.166670 354.473330 NaN

Note: NaN values signify missing data. For instance, the fall column contains strings, so there is no value for mean; likewise, mass (g) is numeric, so we don’t have entries for the categorical summary statistics (unique, top, freq).

Practice: Yellow cab data#

The following command downloads a .parquet file containing NYC Yellow Taxi data, a common storage format for moderate to large datasets.

!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
taxi = pd.read_parquet('yellow_tripdata_2024-01.parquet')
  1. Examine the first five rows of the dataset.

  1. How much data are included in this dataset?

  1. Calculate summary statistics for the fare_amount, tolls_amount, and tip_amount. Do they add up to the total_amount?

  1. Find the trip that has the longest trip by distance (trip_distance).

  1. Compare the average total_amount for short versus long trips (short trip has trip_distance < 10). Make sure we do not include zero-distance trips.


Data cleaning#

We will cover some common transformations that facilitate data analysis, including rearranging columns, type conversion, and sorting.

minitaxi = taxi.sample(10000)  # down-sample our dataset for illustration
minitaxi.shape
minitaxi.head()

Dropping columns#

# drop all id columns and the store_and_fwd_flag column
mask = minitaxi.columns.str.contains('ID$|store_and_fwd_flag', regex=True)
columns_to_drop = minitaxi.columns[mask]
columns_to_drop
minitaxi = minitaxi.drop(columns=columns_to_drop)
minitaxi.head()

Renaming columns#

minitaxi = minitaxi.rename(
    columns={
        'tpep_pickup_datetime': 'pickup', 
        'tpep_dropoff_datetime': 'dropoff'
    }
)
minitaxi.columns

Examine the correct data types (type conversion)#

minitaxi.dtypes
minitaxi.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9524 entries, 597270 to 2512071
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   pickup                 9524 non-null   datetime64[us]
 1   dropoff                9524 non-null   datetime64[us]
 2   passenger_count        9524 non-null   float64       
 3   trip_distance          9524 non-null   float64       
 4   payment_type           9524 non-null   int64         
 5   fare_amount            9524 non-null   float64       
 6   extra                  9524 non-null   float64       
 7   mta_tax                9524 non-null   float64       
 8   tip_amount             9524 non-null   float64       
 9   tolls_amount           9524 non-null   float64       
 10  improvement_surcharge  9524 non-null   float64       
 11  total_amount           9524 non-null   float64       
 12  congestion_surcharge   9524 non-null   float64       
 13  Airport_fee            9524 non-null   float64       
dtypes: datetime64[us](2), float64(11), int64(1)
memory usage: 1.1 MB
# cast passenger_count to integers
minitaxi['passenger_count'] = minitaxi['passenger_count'].astype(int)
minitaxi.dtypes
pickup                   datetime64[us]
dropoff                  datetime64[us]
passenger_count                   int32
trip_distance                   float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
Airport_fee                     float64
dtype: object

Dropping rows with NA values#

minitaxi = minitaxi.dropna()
minitaxi.info()

Creating new columns#

Let’s calculate the following for each row:

  1. elapsed time of the trip

  2. the tip percentage

  3. the total taxes, tolls, fees, and surcharges

  4. the average speed of the taxi

minitaxi = minitaxi.assign(
    elapsed_time=lambda x: x.dropoff - x.pickup,
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip, 
    fees=lambda x: x.cost_before_tip - x.fare_amount, 
    avg_speed=lambda x: x.trip_distance.div(
        x.elapsed_time.dt.total_seconds() / 60 / 60
    )
)
minitaxi.head(2)

Notes:

  • We used lambda functions to 1) avoid typing taxis repeatedly and 2) be able to access the cost_before_tip and elapsed_time columns in the same method that we create them.

  • To create a single new column, we can also use df['new_col'] = <values>.

Sorting by values#

# sort by descending passenger count and pickups from earliest to latest
minitaxi.sort_values(['passenger_count', 'pickup'], ascending=[False, True]).head()
# pick out the 3 trips with largest timespan
minitaxi.nlargest(3, 'elapsed_time')

Working with index#

Currently the index is simply using the row numbers, but if we wish to work with the pickup times significantly, perhaps indexing by the datetime column is more effective.

minitaxi.index

Setting index#

minitaxi = minitaxi.set_index('pickup')
minitaxi.head(3)

Sorting by index#

# sorting by indices
minitaxi = minitaxi.sort_index()
minitaxi.head()
dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge Airport_fee elapsed_time cost_before_tip tip_pct fees avg_speed
pickup
2024-01-01 00:10:19 2024-01-01 00:22:58 1 1.92 1 13.5 1.0 0.5 3.70 0.0 1.0 22.20 2.5 0.0 0 days 00:12:39 18.5 0.200000 5.0 9.106719
2024-01-01 00:15:10 2024-01-01 00:22:51 1 1.64 1 10.0 1.0 0.5 3.00 0.0 1.0 18.00 2.5 0.0 0 days 00:07:41 15.0 0.200000 5.0 12.806941
2024-01-01 00:19:09 2024-01-01 00:41:03 3 9.12 1 38.7 1.0 0.5 8.74 0.0 1.0 52.44 2.5 0.0 0 days 00:21:54 43.7 0.200000 5.0 24.986301
2024-01-01 00:20:34 2024-01-01 00:25:54 1 0.80 1 6.5 3.5 0.5 2.30 0.0 1.0 13.80 2.5 0.0 0 days 00:05:20 11.5 0.200000 5.0 9.000000
2024-01-01 00:23:35 2024-01-01 00:49:13 1 3.08 1 23.3 1.0 0.5 7.08 0.0 1.0 35.38 2.5 0.0 0 days 00:25:38 28.3 0.250177 5.0 7.209363

Recall how we have used [0:4] to locate the rows indexed 0 to 4. Since we are indexing by datetime, we can select within time ranges.

Selecting by index#

# selecting the taxi rides in the first 6 hours of the new year
minitaxi['2024-01-01 00:00':'2024-01-01 06:00']
dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge Airport_fee elapsed_time cost_before_tip tip_pct fees avg_speed
pickup
2024-01-01 00:10:19 2024-01-01 00:22:58 1 1.92 1 13.5 1.0 0.5 3.70 0.0 1.0 22.20 2.5 0.0 0 days 00:12:39 18.5 0.200000 5.0 9.106719
2024-01-01 00:15:10 2024-01-01 00:22:51 1 1.64 1 10.0 1.0 0.5 3.00 0.0 1.0 18.00 2.5 0.0 0 days 00:07:41 15.0 0.200000 5.0 12.806941
2024-01-01 00:19:09 2024-01-01 00:41:03 3 9.12 1 38.7 1.0 0.5 8.74 0.0 1.0 52.44 2.5 0.0 0 days 00:21:54 43.7 0.200000 5.0 24.986301
2024-01-01 00:20:34 2024-01-01 00:25:54 1 0.80 1 6.5 3.5 0.5 2.30 0.0 1.0 13.80 2.5 0.0 0 days 00:05:20 11.5 0.200000 5.0 9.000000
2024-01-01 00:23:35 2024-01-01 00:49:13 1 3.08 1 23.3 1.0 0.5 7.08 0.0 1.0 35.38 2.5 0.0 0 days 00:25:38 28.3 0.250177 5.0 7.209363
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2024-01-01 04:53:26 2024-01-02 03:11:13 6 9.97 1 52.0 1.0 0.5 0.00 0.0 1.0 57.00 2.5 0.0 0 days 22:17:47 57.0 0.000000 5.0 0.447158
2024-01-01 05:06:58 2024-01-01 05:10:03 1 0.63 1 5.8 1.0 0.5 2.16 0.0 1.0 12.96 2.5 0.0 0 days 00:03:05 10.8 0.200000 5.0 12.259459
2024-01-01 05:16:59 2024-01-01 05:35:39 1 4.26 1 22.6 1.0 0.5 5.52 0.0 1.0 33.12 2.5 0.0 0 days 00:18:40 27.6 0.200000 5.0 13.692857
2024-01-01 05:17:26 2024-01-01 05:22:32 1 1.90 1 10.0 1.0 0.5 0.00 0.0 1.0 15.00 2.5 0.0 0 days 00:05:06 15.0 0.000000 5.0 22.352941
2024-01-01 05:29:34 2024-01-01 05:38:38 1 1.57 1 10.7 1.0 0.5 0.00 0.0 1.0 15.70 2.5 0.0 0 days 00:09:04 15.7 0.000000 5.0 10.389706

70 rows × 18 columns

Resetting index#

We can revert any specific column index back to row numbers, but notice that by setting and resetting index you lose the original row numbers.

minitaxi = minitaxi.reset_index()
minitaxi.head()
pickup dropoff passenger_count trip_distance payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge Airport_fee elapsed_time cost_before_tip tip_pct fees avg_speed
0 2024-01-01 00:10:19 2024-01-01 00:22:58 1 1.92 1 13.5 1.0 0.5 3.70 0.0 1.0 22.20 2.5 0.0 0 days 00:12:39 18.5 0.200000 5.0 9.106719
1 2024-01-01 00:15:10 2024-01-01 00:22:51 1 1.64 1 10.0 1.0 0.5 3.00 0.0 1.0 18.00 2.5 0.0 0 days 00:07:41 15.0 0.200000 5.0 12.806941
2 2024-01-01 00:19:09 2024-01-01 00:41:03 3 9.12 1 38.7 1.0 0.5 8.74 0.0 1.0 52.44 2.5 0.0 0 days 00:21:54 43.7 0.200000 5.0 24.986301
3 2024-01-01 00:20:34 2024-01-01 00:25:54 1 0.80 1 6.5 3.5 0.5 2.30 0.0 1.0 13.80 2.5 0.0 0 days 00:05:20 11.5 0.200000 5.0 9.000000
4 2024-01-01 00:23:35 2024-01-01 00:49:13 1 3.08 1 23.3 1.0 0.5 7.08 0.0 1.0 35.38 2.5 0.0 0 days 00:25:38 28.3 0.250177 5.0 7.209363

Practice: Meteorite indexing#

Using the meteor dataset,

  1. cast the year column to an integer column.

  2. create a new column indicating whether the meteorite was observed falling before 1970.

  3. set the index to the id column and extract all the rows with IDs between 10,036 and 10,040 (inclusive) with loc[].

  4. examine the year column to see if there are any data errors.

import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')
#meteor = meteor.dropna()
meteor['year'] = meteor['year'].astype(int)
meteor.info()
<class 'pandas.core.frame.DataFrame'>
Index: 38115 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         38115 non-null  object 
 1   id           38115 non-null  int64  
 2   nametype     38115 non-null  object 
 3   recclass     38115 non-null  object 
 4   mass (g)     38115 non-null  float64
 5   fall         38115 non-null  object 
 6   year         38115 non-null  int32  
 7   reclat       38115 non-null  float64
 8   reclong      38115 non-null  float64
 9   GeoLocation  38115 non-null  object 
dtypes: float64(3), int32(1), int64(1), object(5)
memory usage: 3.1+ MB
meteor = meteor.set_index('id')
meteor.set_index('id')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_26756\953505260.py in ?()
----> 1 meteor.set_index('id')

~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py in ?(self, keys, drop, append, inplace, verify_integrity)
   6118                     if not found:
   6119                         missing.append(col)
   6120 
   6121         if missing:
-> 6122             raise KeyError(f"None of {missing} are in the columns")
   6123 
   6124         if inplace:
   6125             frame = self

KeyError: "None of ['id'] are in the columns"
meteor.sort_index().loc[10036:10040]
<class 'pandas.core.frame.DataFrame'>
Index: 38115 entries, 1 to 30414
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         38115 non-null  object 
 1   nametype     38115 non-null  object 
 2   recclass     38115 non-null  object 
 3   mass (g)     38115 non-null  float64
 4   fall         38115 non-null  object 
 5   year         38115 non-null  int32  
 6   reclat       38115 non-null  float64
 7   reclong      38115 non-null  float64
 8   GeoLocation  38115 non-null  object 
dtypes: float64(3), int32(1), object(5)
memory usage: 2.8+ MB

For your reference, on the issue of indexing#

pandas 2.2 User Manual on Indexing