pandas Part I

`pandas` Part I#

This document includes a sequence of notebooks to introduce data manipulation in Python using the pandas library.

Basic building blocks#

pandas works with tabular data (rows and columns) through Series, DataFrame objects.

import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

Source: NASA Open Data Portal

type(meteor)

A DataFrame is the basic “spreadsheet” or “table” used in Python. A DataFrame object is composed of one or more Series objects (columns), and indexed by Index (rows).

# head.() shows us the first several rows
meteor.head()

# investigate a column (note its type)
print(type(meteor.name))

meteor.name.head()

# investigate how the columns are labeled
meteor.columns

# investigate how the rows are indexed
meteor.index

DataFrame sources#

DataFrames can be created from reading a file, scraping the web, and/or API requests.

Reading from a file#

import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

API requests (more details later)#

import requests

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50000}  # Depending on the API, there may be a default limit of records one can obtain
)  

response  # A 200 exit code indicates success

<Response [200]>

Tip: A list of HTTP GET exit codes is available at https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

response.ok  # checks ok flag

True

# Extract data if request is successful
if response.ok:
    payload = response.json()
else:
    print(f'Request failed with exit code {response.status_code}.')
    payload = None

# Load into DataFrame
meteor_json = pd.DataFrame(payload)
meteor_json.head()

	name	id	nametype	recclass	mass	fall	year	reclat	reclong	geolocation	:@computed_region_cbhk_fwbd	:@computed_region_nnqa_25f4
0	Aachen	1	Valid	L5	21	Fell	1880-01-01T00:00:00.000	50.775000	6.083330	{'latitude': '50.775', 'longitude': '6.08333'}	NaN	NaN
1	Aarhus	2	Valid	H6	720	Fell	1951-01-01T00:00:00.000	56.183330	10.233330	{'latitude': '56.18333', 'longitude': '10.23333'}	NaN	NaN
2	Abee	6	Valid	EH4	107000	Fell	1952-01-01T00:00:00.000	54.216670	-113.000000	{'latitude': '54.21667', 'longitude': '-113.0'}	NaN	NaN
3	Acapulco	10	Valid	Acapulcoite	1914	Fell	1976-01-01T00:00:00.000	16.883330	-99.900000	{'latitude': '16.88333', 'longitude': '-99.9'}	NaN	NaN
4	Achiras	370	Valid	L6	780	Fell	1902-01-01T00:00:00.000	-33.166670	-64.950000	{'latitude': '-33.16667', 'longitude': '-64.95'}	NaN	NaN

# Removing auto-computed columns
mask = meteor_json.columns.str.contains('@computed_region', regex=True)

columns_to_drop = meteor_json.columns[mask]

meteor_json = meteor_json.drop(columns=columns_to_drop)

# storing downloaded data into files
# meteor_json.to_csv('meteor.csv')  

Basic inspection#

What type of data are available in the dataframe? Are there missing data?#

meteor.info()

How much data are available?#

meteor.shape

Subsetting and indexing#

Effectively extracting data from a full dataset requires fluency in how the DataFrame can be subsetted and how it is indexed.

Calling a column by attributes (if valid)#

meteor.recclass

Calling a column by keys#

meteor['mass (g)']

Multiple columns by name#

meteor[['name', 'mass (g)']]

	name	mass (g)
0	Aachen	21.0
1	Aarhus	720.0
2	Abee	107000.0
3	Acapulco	1914.0
4	Achiras	780.0
...	...	...
45711	Zillah 002	172.0
45712	Zinder	46.0
45713	Zlin	3.3
45714	Zubkovsky	2167.0
45715	Zulu Queen	200.0

45716 rows × 2 columns

Selecting rows#

meteor[5:10]  # end-exclusive

	name	id	nametype	recclass	mass (g)	fall	year	reclat	reclong	GeoLocation
5	Adhi Kot	379	Valid	EH4	4239.0	Fell	1919.0	32.10000	71.80000	(32.1, 71.8)
6	Adzhi-Bogdo (stone)	390	Valid	LL3-6	910.0	Fell	1949.0	44.83333	95.16667	(44.83333, 95.16667)
7	Agen	392	Valid	H5	30000.0	Fell	1814.0	44.21667	0.61667	(44.21667, 0.61667)
8	Aguada	398	Valid	L6	1620.0	Fell	1930.0	-31.60000	-65.23333	(-31.6, -65.23333)
9	Aguila Blanca	417	Valid	L	1440.0	Fell	1920.0	-30.86667	-64.55000	(-30.86667, -64.55)

Indexing with `.loc[]`, `iloc[]`#

.loc[] indexes by row labels
.iloc[] indexes by indices

meteor.loc[0:4, 'name':'mass (g)']

	name	id	nametype	recclass	mass (g)
0	Aachen	1	Valid	L5	21.0
1	Aarhus	2	Valid	H6	720.0
2	Abee	6	Valid	EH4	107000.0
3	Acapulco	10	Valid	Acapulcoite	1914.0
4	Achiras	370	Valid	L6	780.0

meteor.iloc[0:4, 0:5]

	name	id	nametype	recclass	mass (g)
0	Aachen	1	Valid	L5	21.0
1	Aarhus	2	Valid	H6	720.0
2	Abee	6	Valid	EH4	107000.0
3	Acapulco	10	Valid	Acapulcoite	1914.0

Filtering or subsetting by condition#

Selection by condition can be performed by creating a Boolean mask with True/False values to specify which rows/columns to select.

# select records with heavy meteor (mass > 10^7) that are found (fall = 'Found')
mask = (meteor['mass (g)'] > 1e7) & (meteor.fall == 'Found')
mask

      False
      False
      False
      False
      False
         ...  
  False
  False
  False
  False
  False
Length: 45716, dtype: bool

Note: Each condition is surrounded by parentheses, and we use bitwise operator (&, |, ~) instead of logical operators (and, or, not).

meteor[mask]

	name	id	nametype	recclass	mass (g)	fall	year	reclat	reclong	GeoLocation
3455	Armanty	2335	Valid	Iron, IIIE	28000000.0	Found	1898.0	47.00000	88.00000	(47.0, 88.0)
5016	Bacubirito	4919	Valid	Iron, ungrouped	22000000.0	Found	1863.0	26.20000	-107.83333	(26.2, -107.83333)
5365	Campo del Cielo	5247	Valid	Iron, IAB-MG	50000000.0	Found	1575.0	-27.46667	-60.58333	(-27.46667, -60.58333)
5370	Canyon Diablo	5257	Valid	Iron, IAB-MG	30000000.0	Found	1891.0	35.05000	-111.03333	(35.05, -111.03333)
5373	Cape York	5262	Valid	Iron, IIIAB	58200000.0	Found	1818.0	76.13333	-64.93333	(76.13333, -64.93333)
5468	Chupaderos	5363	Valid	Iron, IIIAB	24300000.0	Found	1852.0	27.00000	-105.10000	(27.0, -105.1)
12613	Gibeon	10912	Valid	Iron, IVA	26000000.0	Found	1836.0	-25.50000	18.00000	(-25.5, 18.0)
16392	Hoba	11890	Valid	Iron, IVB	60000000.0	Found	1920.0	-19.58333	17.91667	(-19.58333, 17.91667)
22921	Mbosi	15456	Valid	Iron, ungrouped	16000000.0	Found	1930.0	-9.11667	33.06667	(-9.11667, 33.06667)
26174	Morito	16745	Valid	Iron, IIIAB	10100000.0	Found	1600.0	27.05000	-105.43333	(27.05, -105.43333)
26297	Mundrabilla	16852	Valid	Iron, IAB-ung	24000000.0	Found	1911.0	-30.78333	127.55000	(-30.78333, 127.55)
38317	Willamette	24269	Valid	Iron, IIIAB	15500000.0	Found	1902.0	45.36667	-122.58333	(45.36667, -122.58333)

# negation of a mask
meteor[~mask]

Note: Boolean masks can be used with loc[] and iloc[] as well.

Calculating summary statistics#

This section discusses preliminary calculations before conducting further data analysis.

How many of the meteorites were observed falling vs found?#

meteor.fall.value_counts()

fall
Found    44609
Fell      1107
Name: count, dtype: int64

meteor.fall.value_counts(normalize=True)

fall
Found    0.975785
Fell     0.024215
Name: proportion, dtype: float64

Behavior of mass of a meterorite?#

meteor['mass (g)'].mean()

meteor['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

meteor['mass (g)'].max()

# sometimes helpful to locate the other information related to a particular entry
meteor['mass (g)'].idxmax()
meteor.loc[meteor['mass (g)'].idxmax()]  # note the "index" of max()

name                            Hoba
id                             11890
nametype                       Valid
recclass                   Iron, IVB
mass (g)                  60000000.0
fall                           Found
year                          1920.0
reclat                     -19.58333
reclong                     17.91667
GeoLocation    (-19.58333, 17.91667)
Name: 16392, dtype: object

How many unique classes are in this dataset?#

meteor.recclass.unique()

array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',
       'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an', 'LL5',
       'CI1', 'L/LL4', 'Eucrite-mmict', 'CV3', 'Ureilite-an',
       'Stone-uncl', 'L3', 'Angrite', 'LL6', 'L4', 'Aubrite',
       'Iron, IIAB', 'Iron, IAB-sLL', 'Iron, ungrouped', 'CM2', 'OC',
       'Mesosiderite-A1', 'LL4', 'C2-ung', 'LL3.8', 'Howardite',
       'Eucrite-pmict', 'Diogenite', 'LL3.15', 'LL3.9', 'Iron, IAB-MG',
       'H/L3.9', 'Iron?', 'Eucrite', 'H4-an', 'L/LL6', 'Iron, IIIAB',
       'H/L4', 'H4-5', 'L3.7', 'LL3.4', 'Martian (chassignite)', 'EL6',
       'H3.8', 'H3-5', 'H5-6', 'Mesosiderite', 'H5-7', 'L3-6', 'H4-6',
       'Ureilite', 'Iron, IID', 'Mesosiderite-A3/4', 'CO3.3', 'H3',
       'EH3/4-an', 'Iron, IIE', 'L/LL5', 'H3.7', 'CBa', 'H4/5', 'H3/4',
       'H?', 'H3-6', 'L3.4', 'Iron, IAB-sHL', 'L3.7-6', 'EH7-an', 'Iron',
       'CR2', 'CO3.2', 'K3', 'L5/6', 'CK4', 'Iron, IIE-an', 'L3.6',
       'LL3.2', 'Pallasite', 'CO3.5', 'Lodranite', 'Mesosiderite-A3',
       'L3-4', 'H5/6', 'Pallasite, PMG', 'Eucrite-cm', 'L5-6', 'CO3.6',
       'Martian (nakhlite)', 'LL3.6', 'C3-ung', 'H3-4', 'CO3.4', 'EH3',
       'Iron, IAB-ung', 'Winonaite', 'LL', 'Eucrite-br', 'Iron, IIF',
       'R3.8-6', 'L4-6', 'EH5', 'LL3.00', 'H3.4', 'Martian (shergottite)',
       'Achondrite-ung', 'LL3.3', 'C', 'H/L3.6', 'Iron, IIIAB-an', 'LL7',
       'Mesosiderite-B2', 'LL4-6', 'CO3.7', 'L/LL6-an',
       'Iron, IAB complex', 'Pallasite, PMG-an', 'H3.9/4', 'L3.8',
       'LL5-6', 'LL3.8-6', 'L3.9', 'L4-5', 'L3-5', 'LL4/5', 'L4/5',
       'H3.9', 'H3.6-6', 'H3.8-5', 'H3.8/4', 'H3.9-5', 'CH3', 'R3.8-5',
       'L3.9/4', 'E4', 'CO3', 'Chondrite-ung', 'H~5', 'H~6', 'L/LL3.10',
       'EL5', 'LL3', 'L~6', 'L~3', 'H~4', 'L(LL)3.5-3.7', 'Iron, IIIE-an',
       'H3.6', 'L3.4-3.7', 'L3.5', 'CM1/2', 'Martian (OPX)', 'Brachinite',
       'LL7(?)', 'LL6(?)', 'Eucrite-Mg rich', 'H3.5-4', 'EL3', 'R3.6',
       'H3.5', 'CM1', 'L/LL3', 'H7', 'L(?)3', 'L3.2', 'L3.7-3.9',
       'Mesosiderite-B1', 'Eucrite-unbr', 'LL3.7', 'CO3.0', 'LL3.5',
       'L3.7-4', 'CV3-an', 'Lunar (anorth)', 'L3.3', 'Iron, IAB-sLM',
       'Lunar', 'Iron, IC', 'Iron, IID-an', 'Iron, IIIE', 'Iron, IVA-an',
       'CK6', 'L3.1', 'CK5', 'H3.3', 'H3.7-6', 'E6', 'H3.0', 'H3.1',
       'L3.0', 'L/LL3.4', 'C6', 'LL3.0', 'Lunar (gabbro)', 'R4', 'C4',
       'Iron, IIG', 'Iron, IIC', 'C1-ung', 'H5-an', 'EH4/5', 'Iron, IIIF',
       'R3-6', 'Mesosiderite-B4', 'L6/7', 'Relict H', 'L-imp melt', 'CK3',
       'H3-an', 'Iron, IVB', 'R3.8', 'L~5', 'Mesosiderite-an',
       'Mesosiderite-A2', 'Pallasite, PES', 'C4-ung', 'Iron, IAB?',
       'Mesosiderite-A', 'R3.5-6', 'H3.9-6', 'Ureilite-pmict', 'LL~6',
       'CK4/5', 'EL4', 'Lunar (feldsp. breccia)', 'L3.9-6', 'H-an',
       'L/LL3-6', 'L/LL3-5', 'H/L3.5', 'H/L3', 'R3-4', 'CK3-an', 'LL4-5',
       'H/L6', 'L3/4', 'H-imp melt', 'CR', 'Chondrite-fusion crust',
       'Iron, IAB-sLH', 'H(L)3-an', 'L(LL)3', 'H(L)3', 'R3', 'L7',
       'CM-an', 'L/LL~6', 'L/LL~5', 'L~4', 'L/LL~4', 'LL(L)3', 'H3.2',
       'L-melt breccia', 'H6-melt breccia', 'H5-melt breccia',
       'H-melt rock', 'Eucrite-an', 'Lunar (bas/anor)', 'LL5/6', 'LL3/4',
       'H3.4/3.5', 'Lunar (basalt)', 'H/L5', 'H(5?)', 'LL-imp melt',
       'Mesosiderite?', 'H~4/5', 'L6-melt breccia', 'L3.5-3.7',
       'Iron, IIAB-an', 'L3.3-3.7', 'L3.2-3.6', 'L3.3-3.6',
       'Acapulcoite/Lodranite', 'Mesosiderite-B', 'CK5/6', 'L3.05', 'C2',
       'C4/5', 'L/LL3.2', 'Iron, IIIAB?', 'L3.5-5', 'L/LL(?)3', 'H4(?)',
       'Iron, IAB-sHH', 'Relict iron', 'EL4/5', 'L5-7', 'Diogenite-an',
       'L-melt rock', 'CR1', 'H5 ', 'L5 ', 'H4 ', 'L4 ', 'E', 'L6 ',
       'H3 ', 'LL6 ', 'H-metal', 'H6 ', 'L-metal', 'Relict OC', 'EH',
       'Mesosiderite-A4', 'L/LL5/6', 'H3.8-4', 'CBb', 'EL6/7', 'EL7',
       'CH/CBb', 'CO3.8', 'H/L~4', 'Mesosiderite-C2', 'R5', 'H4/6',
       'H3.7-5', 'LL3.7-6', 'H3.7/3.8', 'L3.7/3.8', 'EH-imp melt', 'R',
       'Fusion crust', 'Aubrite-an', 'R6', 'LL-melt rock', 'L3.5-3.9',
       'L3.2-3.5', 'L3.3-3.5', 'L3.0-3.7', 'E3-an', 'K', 'E3',
       'Acapulcoite/lodranite', 'CK4-an', 'L(LL)3.05', 'L3.10', 'CB',
       'Diogenite-olivine', 'EL-melt rock', 'EH6', 'Pallasite, ungrouped',
       'L/LL4/5', 'L3.8-an', 'Iron, IAB-an', 'C5/6-ung', 'CV2',
       'Iron, IC-an', 'Lunar (bas. breccia)', 'L3.8-6', 'R3/4', 'R3.9',
       'CK', 'LL3.10', 'R4/5', 'L3.8-5', 'Mesosiderite-C', 'Enst achon',
       'H/L3-4', 'L(H)3', 'LL6/7', 'LL3.1', 'OC3', 'R3.7', 'CO3 ', 'CH3 ',
       'LL~4', 'LL~4/5', 'L(LL)~4', 'H3.05', 'H3.10',
       'Impact melt breccia', 'LL3-5', 'H/L3.7', 'LL3-4', 'CK3/4',
       'Martian', 'CO3.1', 'Lunar (bas/gab brec)', 'Achondrite-prim',
       'LL<3.5', 'CK3.8', 'L/LL-melt rock', 'H6/7', 'EL6 ',
       'Iron, IAB-sHL-an', 'CM2-an', 'R3-5', 'L4-melt rock',
       'L6-melt rock', 'H/L4/5', 'EL3/4', 'H/L6-melt rock',
       'Enst achon-ung', 'L3-7', 'R3.4', 'LL3.05', 'LL4/6', 'LL3.8-4',
       'H3.15', 'C3.0-ung', 'LL-melt breccia', 'LL6-melt breccia',
       'L5-melt breccia', 'LL(L)3.1', 'LL6-an', 'L4-melt breccia',
       'Howardite-an', 'H4-melt breccia', 'Martian (basaltic breccia)',
       'L3-melt breccia', 'L~4-6', 'LL~5', 'R3.5-4', 'CR7',
       'H-melt breccia', 'Lunar (norite)', 'L3.00', 'H3.0-3.4', 'L/LL4-6',
       'CM', 'EH7', 'L4-an', 'E-an', 'H3.8/3.9', 'L3.9-5', 'H3.8-6',
       'H3.4-5', 'L3.0-3.9', 'L3.5-3.8', 'H3.2-3.7', 'L3.6-4',
       'Iron, IIE?', 'C3/4-ung', 'L/LL3.5', 'L/LL3.6/3.7', 'H/L4-5',
       'LL~3', 'Pallasite?', 'LL5-7', 'LL3.9/4', 'H3.8-an', 'CR-an',
       'L/LL5-6', 'L(LL)5', 'L(LL)6', 'LL3.1-3.5', 'E5', 'Lodranite-an',
       'H3.2-6', 'H(?)4', 'E5-an', 'H3.2-an', 'EH6-an', 'Stone-ung',
       'C1/2-ung', 'L/LL'], dtype=object)

General statistics#

The .describe() method includes numeric columns by default. Here we can force it to include all columns.

meteor.describe(include='all')

	name	id	nametype	recclass	mass (g)	fall	year	reclat	reclong	GeoLocation
count	45716	45716.000000	45716	45716	4.558500e+04	45716	45425.000000	38401.000000	38401.000000	38401
unique	45716	NaN	2	466	NaN	2	NaN	NaN	NaN	17100
top	Aachen	NaN	Valid	L6	NaN	Found	NaN	NaN	NaN	(0.0, 0.0)
freq	1	NaN	45641	8285	NaN	44609	NaN	NaN	NaN	6214
mean	NaN	26889.735104	NaN	NaN	1.327808e+04	NaN	1991.828817	-39.122580	61.074319	NaN
std	NaN	16860.683030	NaN	NaN	5.749889e+05	NaN	25.052766	46.378511	80.647298	NaN
min	NaN	1.000000	NaN	NaN	0.000000e+00	NaN	860.000000	-87.366670	-165.433330	NaN
25%	NaN	12688.750000	NaN	NaN	7.200000e+00	NaN	1987.000000	-76.714240	0.000000	NaN
50%	NaN	24261.500000	NaN	NaN	3.260000e+01	NaN	1998.000000	-71.500000	35.666670	NaN
75%	NaN	40656.750000	NaN	NaN	2.026000e+02	NaN	2003.000000	0.000000	157.166670	NaN
max	NaN	57458.000000	NaN	NaN	6.000000e+07	NaN	2101.000000	81.166670	354.473330	NaN

Note: NaN values signify missing data. For instance, the fall column contains strings, so there is no value for mean; likewise, mass (g) is numeric, so we don’t have entries for the categorical summary statistics (unique, top, freq).

Practice: Yellow cab data#

The following command downloads a .parquet file containing NYC Yellow Taxi data, a common storage format for moderate to large datasets.

!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

taxi = pd.read_parquet('yellow_tripdata_2024-01.parquet')

Examine the first five rows of the dataset.

How much data are included in this dataset?

Calculate summary statistics for the fare_amount, tolls_amount, and tip_amount. Do they add up to the total_amount?

Find the trip that has the longest trip by distance (trip_distance).

Compare the average total_amount for short versus long trips (short trip has trip_distance < 10). Make sure we do not include zero-distance trips.

Data cleaning#

We will cover some common transformations that facilitate data analysis, including rearranging columns, type conversion, and sorting.

minitaxi = taxi.sample(10000)  # down-sample our dataset for illustration

minitaxi.shape

minitaxi.head()

Dropping columns#

# drop all id columns and the store_and_fwd_flag column
mask = minitaxi.columns.str.contains('ID$|store_and_fwd_flag', regex=True)
columns_to_drop = minitaxi.columns[mask]
columns_to_drop

minitaxi = minitaxi.drop(columns=columns_to_drop)
minitaxi.head()

Renaming columns#

minitaxi = minitaxi.rename(
    columns={
        'tpep_pickup_datetime': 'pickup', 
        'tpep_dropoff_datetime': 'dropoff'
    }
)
minitaxi.columns

Examine the correct data types (type conversion)#

minitaxi.dtypes

minitaxi.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9524 entries, 597270 to 2512071
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   pickup                 9524 non-null   datetime64[us]
 1   dropoff                9524 non-null   datetime64[us]
 2   passenger_count        9524 non-null   float64       
 3   trip_distance          9524 non-null   float64       
 4   payment_type           9524 non-null   int64         
 5   fare_amount            9524 non-null   float64       
 6   extra                  9524 non-null   float64       
 7   mta_tax                9524 non-null   float64       
 8   tip_amount             9524 non-null   float64       
 9   tolls_amount           9524 non-null   float64       
 10  improvement_surcharge  9524 non-null   float64       
 11  total_amount           9524 non-null   float64       
 12  congestion_surcharge   9524 non-null   float64       
 13  Airport_fee            9524 non-null   float64       
dtypes: datetime64[us](2), float64(11), int64(1)
memory usage: 1.1 MB

# cast passenger_count to integers
minitaxi['passenger_count'] = minitaxi['passenger_count'].astype(int)
minitaxi.dtypes

pickup                   datetime64[us]
dropoff                  datetime64[us]
passenger_count                   int32
trip_distance                   float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
Airport_fee                     float64
dtype: object

Dropping rows with NA values#

minitaxi = minitaxi.dropna()

minitaxi.info()

Creating new columns#

Let’s calculate the following for each row:

elapsed time of the trip
the tip percentage
the total taxes, tolls, fees, and surcharges
the average speed of the taxi

minitaxi = minitaxi.assign(
    elapsed_time=lambda x: x.dropoff - x.pickup,
    cost_before_tip=lambda x: x.total_amount - x.tip_amount,
    tip_pct=lambda x: x.tip_amount / x.cost_before_tip, 
    fees=lambda x: x.cost_before_tip - x.fare_amount, 
    avg_speed=lambda x: x.trip_distance.div(
        x.elapsed_time.dt.total_seconds() / 60 / 60
    )
)

minitaxi.head(2)

Notes:

We used lambda functions to 1) avoid typing taxis repeatedly and 2) be able to access the cost_before_tip and elapsed_time columns in the same method that we create them.
To create a single new column, we can also use df['new_col'] = <values>.

Sorting by values#

# sort by descending passenger count and pickups from earliest to latest
minitaxi.sort_values(['passenger_count', 'pickup'], ascending=[False, True]).head()

# pick out the 3 trips with largest timespan
minitaxi.nlargest(3, 'elapsed_time')

Working with index#

Currently the index is simply using the row numbers, but if we wish to work with the pickup times significantly, perhaps indexing by the datetime column is more effective.

minitaxi.index

Setting index#

minitaxi = minitaxi.set_index('pickup')
minitaxi.head(3)

Sorting by index#

# sorting by indices
minitaxi = minitaxi.sort_index()
minitaxi.head()

	dropoff	passenger_count	trip_distance	payment_type	fare_amount	extra	mta_tax	tip_amount	tolls_amount	improvement_surcharge	total_amount	congestion_surcharge	Airport_fee	elapsed_time	cost_before_tip	tip_pct	fees	avg_speed
pickup
2024-01-01 00:10:19	2024-01-01 00:22:58	1	1.92	1	13.5	1.0	0.5	3.70	0.0	1.0	22.20	2.5	0.0	0 days 00:12:39	18.5	0.200000	5.0	9.106719
2024-01-01 00:15:10	2024-01-01 00:22:51	1	1.64	1	10.0	1.0	0.5	3.00	0.0	1.0	18.00	2.5	0.0	0 days 00:07:41	15.0	0.200000	5.0	12.806941
2024-01-01 00:19:09	2024-01-01 00:41:03	3	9.12	1	38.7	1.0	0.5	8.74	0.0	1.0	52.44	2.5	0.0	0 days 00:21:54	43.7	0.200000	5.0	24.986301
2024-01-01 00:20:34	2024-01-01 00:25:54	1	0.80	1	6.5	3.5	0.5	2.30	0.0	1.0	13.80	2.5	0.0	0 days 00:05:20	11.5	0.200000	5.0	9.000000
2024-01-01 00:23:35	2024-01-01 00:49:13	1	3.08	1	23.3	1.0	0.5	7.08	0.0	1.0	35.38	2.5	0.0	0 days 00:25:38	28.3	0.250177	5.0	7.209363

Recall how we have used [0:4] to locate the rows indexed 0 to 4. Since we are indexing by datetime, we can select within time ranges.

Selecting by index#

# selecting the taxi rides in the first 6 hours of the new year
minitaxi['2024-01-01 00:00':'2024-01-01 06:00']

	dropoff	passenger_count	trip_distance	payment_type	fare_amount	extra	mta_tax	tip_amount	tolls_amount	improvement_surcharge	total_amount	congestion_surcharge	Airport_fee	elapsed_time	cost_before_tip	tip_pct	fees	avg_speed
pickup
2024-01-01 00:10:19	2024-01-01 00:22:58	1	1.92	1	13.5	1.0	0.5	3.70	0.0	1.0	22.20	2.5	0.0	0 days 00:12:39	18.5	0.200000	5.0	9.106719
2024-01-01 00:15:10	2024-01-01 00:22:51	1	1.64	1	10.0	1.0	0.5	3.00	0.0	1.0	18.00	2.5	0.0	0 days 00:07:41	15.0	0.200000	5.0	12.806941
2024-01-01 00:19:09	2024-01-01 00:41:03	3	9.12	1	38.7	1.0	0.5	8.74	0.0	1.0	52.44	2.5	0.0	0 days 00:21:54	43.7	0.200000	5.0	24.986301
2024-01-01 00:20:34	2024-01-01 00:25:54	1	0.80	1	6.5	3.5	0.5	2.30	0.0	1.0	13.80	2.5	0.0	0 days 00:05:20	11.5	0.200000	5.0	9.000000
2024-01-01 00:23:35	2024-01-01 00:49:13	1	3.08	1	23.3	1.0	0.5	7.08	0.0	1.0	35.38	2.5	0.0	0 days 00:25:38	28.3	0.250177	5.0	7.209363
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2024-01-01 04:53:26	2024-01-02 03:11:13	6	9.97	1	52.0	1.0	0.5	0.00	0.0	1.0	57.00	2.5	0.0	0 days 22:17:47	57.0	0.000000	5.0	0.447158
2024-01-01 05:06:58	2024-01-01 05:10:03	1	0.63	1	5.8	1.0	0.5	2.16	0.0	1.0	12.96	2.5	0.0	0 days 00:03:05	10.8	0.200000	5.0	12.259459
2024-01-01 05:16:59	2024-01-01 05:35:39	1	4.26	1	22.6	1.0	0.5	5.52	0.0	1.0	33.12	2.5	0.0	0 days 00:18:40	27.6	0.200000	5.0	13.692857
2024-01-01 05:17:26	2024-01-01 05:22:32	1	1.90	1	10.0	1.0	0.5	0.00	0.0	1.0	15.00	2.5	0.0	0 days 00:05:06	15.0	0.000000	5.0	22.352941
2024-01-01 05:29:34	2024-01-01 05:38:38	1	1.57	1	10.7	1.0	0.5	0.00	0.0	1.0	15.70	2.5	0.0	0 days 00:09:04	15.7	0.000000	5.0	10.389706

70 rows × 18 columns

Resetting index#

We can revert any specific column index back to row numbers, but notice that by setting and resetting index you lose the original row numbers.

minitaxi = minitaxi.reset_index()
minitaxi.head()

	pickup	dropoff	passenger_count	trip_distance	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	congestion_surcharge	elapsed_time	cost_before_tip	tip_pct	fees	avg_speed
0	2024-01-01 00:10:19	2024-01-01 00:22:58	1	1.92	1	13.5	1.0	0.5	3.70	1.0	22.20	2.5	0 days 00:12:39	18.5	0.200000	5.0	9.106719
1	2024-01-01 00:15:10	2024-01-01 00:22:51	1	1.64	1	10.0	1.0	0.5	3.00	1.0	18.00	2.5	0 days 00:07:41	15.0	0.200000	5.0	12.806941
2	2024-01-01 00:19:09	2024-01-01 00:41:03	3	9.12	1	38.7	1.0	0.5	8.74	1.0	52.44	2.5	0 days 00:21:54	43.7	0.200000	5.0	24.986301
3	2024-01-01 00:20:34	2024-01-01 00:25:54	1	0.80	1	6.5	3.5	0.5	2.30	1.0	13.80	2.5	0 days 00:05:20	11.5	0.200000	5.0	9.000000
4	2024-01-01 00:23:35	2024-01-01 00:49:13	1	3.08	1	23.3	1.0	0.5	7.08	1.0	35.38	2.5	0 days 00:25:38	28.3	0.250177	5.0	7.209363

Practice: Meteorite indexing#

Using the meteor dataset,

cast the year column to an integer column.
create a new column indicating whether the meteorite was observed falling before 1970.
set the index to the id column and extract all the rows with IDs between 10,036 and 10,040 (inclusive) with loc[].
examine the year column to see if there are any data errors.

import pandas as pd

meteor = pd.read_csv('../data/Meteorite_Landings.csv')

#meteor = meteor.dropna()
meteor['year'] = meteor['year'].astype(int)
meteor.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38115 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         38115 non-null  object 
 1   id           38115 non-null  int64  
 2   nametype     38115 non-null  object 
 3   recclass     38115 non-null  object 
 4   mass (g)     38115 non-null  float64
 5   fall         38115 non-null  object 
 6   year         38115 non-null  int32  
 7   reclat       38115 non-null  float64
 8   reclong      38115 non-null  float64
 9   GeoLocation  38115 non-null  object 
dtypes: float64(3), int32(1), int64(1), object(5)
memory usage: 3.1+ MB

meteor = meteor.set_index('id')

meteor.set_index('id')

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_26756\953505260.py in ?()
----> 1 meteor.set_index('id')

~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py in ?(self, keys, drop, append, inplace, verify_integrity)
   6118                     if not found:
   6119                         missing.append(col)
   6120 
   6121         if missing:
-> 6122             raise KeyError(f"None of {missing} are in the columns")
   6123 
   6124         if inplace:
   6125             frame = self

KeyError: "None of ['id'] are in the columns"

meteor.sort_index().loc[10036:10040]

<class 'pandas.core.frame.DataFrame'>
Index: 38115 entries, 1 to 30414
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         38115 non-null  object 
 1   nametype     38115 non-null  object 
 2   recclass     38115 non-null  object 
 3   mass (g)     38115 non-null  float64
 4   fall         38115 non-null  object 
 5   year         38115 non-null  int32  
 6   reclat       38115 non-null  float64
 7   reclong      38115 non-null  float64
 8   GeoLocation  38115 non-null  object 
dtypes: float64(3), int32(1), object(5)
memory usage: 2.8+ MB

For your reference, on the issue of indexing#

pandas 2.2 User Manual on Indexing