pandas
Part I#
This document includes a sequence of notebooks to introduce data manipulation in Python using the pandas
library.
Basic building blocks#
pandas
works with tabular data (rows and columns) through Series
, DataFrame
objects.
import pandas as pd
meteor = pd.read_csv('../data/Meteorite_Landings.csv')
Source: NASA Open Data Portal
type(meteor)
A DataFrame
is the basic “spreadsheet” or “table” used in Python. A DataFrame
object is composed of one or more Series
objects (columns), and indexed by Index
(rows).
# head.() shows us the first several rows
meteor.head()
# investigate a column (note its type)
print(type(meteor.name))
meteor.name.head()
# investigate how the columns are labeled
meteor.columns
# investigate how the rows are indexed
meteor.index
DataFrame sources#
DataFrame
s can be created from reading a file, scraping the web, and/or API requests.
Reading from a file#
import pandas as pd
meteor = pd.read_csv('../data/Meteorite_Landings.csv')
API requests (more details later)#
import requests
response = requests.get(
'https://data.nasa.gov/resource/gh4g-9sfh.json',
params={'$limit': 50000} # Depending on the API, there may be a default limit of records one can obtain
)
response # A 200 exit code indicates success
<Response [200]>
Tip: A list of HTTP GET exit codes is available at https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.
response.ok # checks ok flag
True
# Extract data if request is successful
if response.ok:
payload = response.json()
else:
print(f'Request failed with exit code {response.status_code}.')
payload = None
# Load into DataFrame
meteor_json = pd.DataFrame(payload)
meteor_json.head()
name | id | nametype | recclass | mass | fall | year | reclat | reclong | geolocation | :@computed_region_cbhk_fwbd | :@computed_region_nnqa_25f4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aachen | 1 | Valid | L5 | 21 | Fell | 1880-01-01T00:00:00.000 | 50.775000 | 6.083330 | {'latitude': '50.775', 'longitude': '6.08333'} | NaN | NaN |
1 | Aarhus | 2 | Valid | H6 | 720 | Fell | 1951-01-01T00:00:00.000 | 56.183330 | 10.233330 | {'latitude': '56.18333', 'longitude': '10.23333'} | NaN | NaN |
2 | Abee | 6 | Valid | EH4 | 107000 | Fell | 1952-01-01T00:00:00.000 | 54.216670 | -113.000000 | {'latitude': '54.21667', 'longitude': '-113.0'} | NaN | NaN |
3 | Acapulco | 10 | Valid | Acapulcoite | 1914 | Fell | 1976-01-01T00:00:00.000 | 16.883330 | -99.900000 | {'latitude': '16.88333', 'longitude': '-99.9'} | NaN | NaN |
4 | Achiras | 370 | Valid | L6 | 780 | Fell | 1902-01-01T00:00:00.000 | -33.166670 | -64.950000 | {'latitude': '-33.16667', 'longitude': '-64.95'} | NaN | NaN |
# Removing auto-computed columns
mask = meteor_json.columns.str.contains('@computed_region', regex=True)
columns_to_drop = meteor_json.columns[mask]
meteor_json = meteor_json.drop(columns=columns_to_drop)
# storing downloaded data into files
# meteor_json.to_csv('meteor.csv')
Basic inspection#
What type of data are available in the dataframe? Are there missing data?#
meteor.info()
How much data are available?#
meteor.shape
Subsetting and indexing#
Effectively extracting data from a full dataset requires fluency in how the DataFrame
can be subsetted and how it is indexed.
Calling a column by attributes (if valid)#
meteor.recclass
Calling a column by keys#
meteor['mass (g)']
Multiple columns by name#
meteor[['name', 'mass (g)']]
name | mass (g) | |
---|---|---|
0 | Aachen | 21.0 |
1 | Aarhus | 720.0 |
2 | Abee | 107000.0 |
3 | Acapulco | 1914.0 |
4 | Achiras | 780.0 |
... | ... | ... |
45711 | Zillah 002 | 172.0 |
45712 | Zinder | 46.0 |
45713 | Zlin | 3.3 |
45714 | Zubkovsky | 2167.0 |
45715 | Zulu Queen | 200.0 |
45716 rows × 2 columns
Selecting rows#
meteor[5:10] # end-exclusive
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
5 | Adhi Kot | 379 | Valid | EH4 | 4239.0 | Fell | 1919.0 | 32.10000 | 71.80000 | (32.1, 71.8) |
6 | Adzhi-Bogdo (stone) | 390 | Valid | LL3-6 | 910.0 | Fell | 1949.0 | 44.83333 | 95.16667 | (44.83333, 95.16667) |
7 | Agen | 392 | Valid | H5 | 30000.0 | Fell | 1814.0 | 44.21667 | 0.61667 | (44.21667, 0.61667) |
8 | Aguada | 398 | Valid | L6 | 1620.0 | Fell | 1930.0 | -31.60000 | -65.23333 | (-31.6, -65.23333) |
9 | Aguila Blanca | 417 | Valid | L | 1440.0 | Fell | 1920.0 | -30.86667 | -64.55000 | (-30.86667, -64.55) |
Indexing with .loc[]
, iloc[]
#
.loc[]
indexes by row labels.iloc[]
indexes by indices
meteor.loc[0:4, 'name':'mass (g)']
name | id | nametype | recclass | mass (g) | |
---|---|---|---|---|---|
0 | Aachen | 1 | Valid | L5 | 21.0 |
1 | Aarhus | 2 | Valid | H6 | 720.0 |
2 | Abee | 6 | Valid | EH4 | 107000.0 |
3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 |
4 | Achiras | 370 | Valid | L6 | 780.0 |
meteor.iloc[0:4, 0:5]
name | id | nametype | recclass | mass (g) | |
---|---|---|---|---|---|
0 | Aachen | 1 | Valid | L5 | 21.0 |
1 | Aarhus | 2 | Valid | H6 | 720.0 |
2 | Abee | 6 | Valid | EH4 | 107000.0 |
3 | Acapulco | 10 | Valid | Acapulcoite | 1914.0 |
Filtering or subsetting by condition#
Selection by condition can be performed by creating a Boolean mask with True/False values to specify which rows/columns to select.
# select records with heavy meteor (mass > 10^7) that are found (fall = 'Found')
mask = (meteor['mass (g)'] > 1e7) & (meteor.fall == 'Found')
mask
0 False
1 False
2 False
3 False
4 False
...
45711 False
45712 False
45713 False
45714 False
45715 False
Length: 45716, dtype: bool
Note: Each condition is surrounded by parentheses, and we use bitwise operator (&
, |
, ~
) instead of logical operators (and
, or
, not
).
meteor[mask]
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
3455 | Armanty | 2335 | Valid | Iron, IIIE | 28000000.0 | Found | 1898.0 | 47.00000 | 88.00000 | (47.0, 88.0) |
5016 | Bacubirito | 4919 | Valid | Iron, ungrouped | 22000000.0 | Found | 1863.0 | 26.20000 | -107.83333 | (26.2, -107.83333) |
5365 | Campo del Cielo | 5247 | Valid | Iron, IAB-MG | 50000000.0 | Found | 1575.0 | -27.46667 | -60.58333 | (-27.46667, -60.58333) |
5370 | Canyon Diablo | 5257 | Valid | Iron, IAB-MG | 30000000.0 | Found | 1891.0 | 35.05000 | -111.03333 | (35.05, -111.03333) |
5373 | Cape York | 5262 | Valid | Iron, IIIAB | 58200000.0 | Found | 1818.0 | 76.13333 | -64.93333 | (76.13333, -64.93333) |
5468 | Chupaderos | 5363 | Valid | Iron, IIIAB | 24300000.0 | Found | 1852.0 | 27.00000 | -105.10000 | (27.0, -105.1) |
12613 | Gibeon | 10912 | Valid | Iron, IVA | 26000000.0 | Found | 1836.0 | -25.50000 | 18.00000 | (-25.5, 18.0) |
16392 | Hoba | 11890 | Valid | Iron, IVB | 60000000.0 | Found | 1920.0 | -19.58333 | 17.91667 | (-19.58333, 17.91667) |
22921 | Mbosi | 15456 | Valid | Iron, ungrouped | 16000000.0 | Found | 1930.0 | -9.11667 | 33.06667 | (-9.11667, 33.06667) |
26174 | Morito | 16745 | Valid | Iron, IIIAB | 10100000.0 | Found | 1600.0 | 27.05000 | -105.43333 | (27.05, -105.43333) |
26297 | Mundrabilla | 16852 | Valid | Iron, IAB-ung | 24000000.0 | Found | 1911.0 | -30.78333 | 127.55000 | (-30.78333, 127.55) |
38317 | Willamette | 24269 | Valid | Iron, IIIAB | 15500000.0 | Found | 1902.0 | 45.36667 | -122.58333 | (45.36667, -122.58333) |
# negation of a mask
meteor[~mask]
Note: Boolean masks can be used with loc[]
and iloc[]
as well.
Calculating summary statistics#
This section discusses preliminary calculations before conducting further data analysis.
How many of the meteorites were observed falling vs found?#
meteor.fall.value_counts()
fall
Found 44609
Fell 1107
Name: count, dtype: int64
meteor.fall.value_counts(normalize=True)
fall
Found 0.975785
Fell 0.024215
Name: proportion, dtype: float64
Behavior of mass of a meterorite?#
meteor['mass (g)'].mean()
meteor['mass (g)'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])
meteor['mass (g)'].max()
# sometimes helpful to locate the other information related to a particular entry
meteor['mass (g)'].idxmax()
meteor.loc[meteor['mass (g)'].idxmax()] # note the "index" of max()
name Hoba
id 11890
nametype Valid
recclass Iron, IVB
mass (g) 60000000.0
fall Found
year 1920.0
reclat -19.58333
reclong 17.91667
GeoLocation (-19.58333, 17.91667)
Name: 16392, dtype: object
How many unique classes are in this dataset?#
meteor.recclass.unique()
array(['L5', 'H6', 'EH4', 'Acapulcoite', 'L6', 'LL3-6', 'H5', 'L',
'Diogenite-pm', 'Unknown', 'H4', 'H', 'Iron, IVA', 'CR2-an', 'LL5',
'CI1', 'L/LL4', 'Eucrite-mmict', 'CV3', 'Ureilite-an',
'Stone-uncl', 'L3', 'Angrite', 'LL6', 'L4', 'Aubrite',
'Iron, IIAB', 'Iron, IAB-sLL', 'Iron, ungrouped', 'CM2', 'OC',
'Mesosiderite-A1', 'LL4', 'C2-ung', 'LL3.8', 'Howardite',
'Eucrite-pmict', 'Diogenite', 'LL3.15', 'LL3.9', 'Iron, IAB-MG',
'H/L3.9', 'Iron?', 'Eucrite', 'H4-an', 'L/LL6', 'Iron, IIIAB',
'H/L4', 'H4-5', 'L3.7', 'LL3.4', 'Martian (chassignite)', 'EL6',
'H3.8', 'H3-5', 'H5-6', 'Mesosiderite', 'H5-7', 'L3-6', 'H4-6',
'Ureilite', 'Iron, IID', 'Mesosiderite-A3/4', 'CO3.3', 'H3',
'EH3/4-an', 'Iron, IIE', 'L/LL5', 'H3.7', 'CBa', 'H4/5', 'H3/4',
'H?', 'H3-6', 'L3.4', 'Iron, IAB-sHL', 'L3.7-6', 'EH7-an', 'Iron',
'CR2', 'CO3.2', 'K3', 'L5/6', 'CK4', 'Iron, IIE-an', 'L3.6',
'LL3.2', 'Pallasite', 'CO3.5', 'Lodranite', 'Mesosiderite-A3',
'L3-4', 'H5/6', 'Pallasite, PMG', 'Eucrite-cm', 'L5-6', 'CO3.6',
'Martian (nakhlite)', 'LL3.6', 'C3-ung', 'H3-4', 'CO3.4', 'EH3',
'Iron, IAB-ung', 'Winonaite', 'LL', 'Eucrite-br', 'Iron, IIF',
'R3.8-6', 'L4-6', 'EH5', 'LL3.00', 'H3.4', 'Martian (shergottite)',
'Achondrite-ung', 'LL3.3', 'C', 'H/L3.6', 'Iron, IIIAB-an', 'LL7',
'Mesosiderite-B2', 'LL4-6', 'CO3.7', 'L/LL6-an',
'Iron, IAB complex', 'Pallasite, PMG-an', 'H3.9/4', 'L3.8',
'LL5-6', 'LL3.8-6', 'L3.9', 'L4-5', 'L3-5', 'LL4/5', 'L4/5',
'H3.9', 'H3.6-6', 'H3.8-5', 'H3.8/4', 'H3.9-5', 'CH3', 'R3.8-5',
'L3.9/4', 'E4', 'CO3', 'Chondrite-ung', 'H~5', 'H~6', 'L/LL3.10',
'EL5', 'LL3', 'L~6', 'L~3', 'H~4', 'L(LL)3.5-3.7', 'Iron, IIIE-an',
'H3.6', 'L3.4-3.7', 'L3.5', 'CM1/2', 'Martian (OPX)', 'Brachinite',
'LL7(?)', 'LL6(?)', 'Eucrite-Mg rich', 'H3.5-4', 'EL3', 'R3.6',
'H3.5', 'CM1', 'L/LL3', 'H7', 'L(?)3', 'L3.2', 'L3.7-3.9',
'Mesosiderite-B1', 'Eucrite-unbr', 'LL3.7', 'CO3.0', 'LL3.5',
'L3.7-4', 'CV3-an', 'Lunar (anorth)', 'L3.3', 'Iron, IAB-sLM',
'Lunar', 'Iron, IC', 'Iron, IID-an', 'Iron, IIIE', 'Iron, IVA-an',
'CK6', 'L3.1', 'CK5', 'H3.3', 'H3.7-6', 'E6', 'H3.0', 'H3.1',
'L3.0', 'L/LL3.4', 'C6', 'LL3.0', 'Lunar (gabbro)', 'R4', 'C4',
'Iron, IIG', 'Iron, IIC', 'C1-ung', 'H5-an', 'EH4/5', 'Iron, IIIF',
'R3-6', 'Mesosiderite-B4', 'L6/7', 'Relict H', 'L-imp melt', 'CK3',
'H3-an', 'Iron, IVB', 'R3.8', 'L~5', 'Mesosiderite-an',
'Mesosiderite-A2', 'Pallasite, PES', 'C4-ung', 'Iron, IAB?',
'Mesosiderite-A', 'R3.5-6', 'H3.9-6', 'Ureilite-pmict', 'LL~6',
'CK4/5', 'EL4', 'Lunar (feldsp. breccia)', 'L3.9-6', 'H-an',
'L/LL3-6', 'L/LL3-5', 'H/L3.5', 'H/L3', 'R3-4', 'CK3-an', 'LL4-5',
'H/L6', 'L3/4', 'H-imp melt', 'CR', 'Chondrite-fusion crust',
'Iron, IAB-sLH', 'H(L)3-an', 'L(LL)3', 'H(L)3', 'R3', 'L7',
'CM-an', 'L/LL~6', 'L/LL~5', 'L~4', 'L/LL~4', 'LL(L)3', 'H3.2',
'L-melt breccia', 'H6-melt breccia', 'H5-melt breccia',
'H-melt rock', 'Eucrite-an', 'Lunar (bas/anor)', 'LL5/6', 'LL3/4',
'H3.4/3.5', 'Lunar (basalt)', 'H/L5', 'H(5?)', 'LL-imp melt',
'Mesosiderite?', 'H~4/5', 'L6-melt breccia', 'L3.5-3.7',
'Iron, IIAB-an', 'L3.3-3.7', 'L3.2-3.6', 'L3.3-3.6',
'Acapulcoite/Lodranite', 'Mesosiderite-B', 'CK5/6', 'L3.05', 'C2',
'C4/5', 'L/LL3.2', 'Iron, IIIAB?', 'L3.5-5', 'L/LL(?)3', 'H4(?)',
'Iron, IAB-sHH', 'Relict iron', 'EL4/5', 'L5-7', 'Diogenite-an',
'L-melt rock', 'CR1', 'H5 ', 'L5 ', 'H4 ', 'L4 ', 'E', 'L6 ',
'H3 ', 'LL6 ', 'H-metal', 'H6 ', 'L-metal', 'Relict OC', 'EH',
'Mesosiderite-A4', 'L/LL5/6', 'H3.8-4', 'CBb', 'EL6/7', 'EL7',
'CH/CBb', 'CO3.8', 'H/L~4', 'Mesosiderite-C2', 'R5', 'H4/6',
'H3.7-5', 'LL3.7-6', 'H3.7/3.8', 'L3.7/3.8', 'EH-imp melt', 'R',
'Fusion crust', 'Aubrite-an', 'R6', 'LL-melt rock', 'L3.5-3.9',
'L3.2-3.5', 'L3.3-3.5', 'L3.0-3.7', 'E3-an', 'K', 'E3',
'Acapulcoite/lodranite', 'CK4-an', 'L(LL)3.05', 'L3.10', 'CB',
'Diogenite-olivine', 'EL-melt rock', 'EH6', 'Pallasite, ungrouped',
'L/LL4/5', 'L3.8-an', 'Iron, IAB-an', 'C5/6-ung', 'CV2',
'Iron, IC-an', 'Lunar (bas. breccia)', 'L3.8-6', 'R3/4', 'R3.9',
'CK', 'LL3.10', 'R4/5', 'L3.8-5', 'Mesosiderite-C', 'Enst achon',
'H/L3-4', 'L(H)3', 'LL6/7', 'LL3.1', 'OC3', 'R3.7', 'CO3 ', 'CH3 ',
'LL~4', 'LL~4/5', 'L(LL)~4', 'H3.05', 'H3.10',
'Impact melt breccia', 'LL3-5', 'H/L3.7', 'LL3-4', 'CK3/4',
'Martian', 'CO3.1', 'Lunar (bas/gab brec)', 'Achondrite-prim',
'LL<3.5', 'CK3.8', 'L/LL-melt rock', 'H6/7', 'EL6 ',
'Iron, IAB-sHL-an', 'CM2-an', 'R3-5', 'L4-melt rock',
'L6-melt rock', 'H/L4/5', 'EL3/4', 'H/L6-melt rock',
'Enst achon-ung', 'L3-7', 'R3.4', 'LL3.05', 'LL4/6', 'LL3.8-4',
'H3.15', 'C3.0-ung', 'LL-melt breccia', 'LL6-melt breccia',
'L5-melt breccia', 'LL(L)3.1', 'LL6-an', 'L4-melt breccia',
'Howardite-an', 'H4-melt breccia', 'Martian (basaltic breccia)',
'L3-melt breccia', 'L~4-6', 'LL~5', 'R3.5-4', 'CR7',
'H-melt breccia', 'Lunar (norite)', 'L3.00', 'H3.0-3.4', 'L/LL4-6',
'CM', 'EH7', 'L4-an', 'E-an', 'H3.8/3.9', 'L3.9-5', 'H3.8-6',
'H3.4-5', 'L3.0-3.9', 'L3.5-3.8', 'H3.2-3.7', 'L3.6-4',
'Iron, IIE?', 'C3/4-ung', 'L/LL3.5', 'L/LL3.6/3.7', 'H/L4-5',
'LL~3', 'Pallasite?', 'LL5-7', 'LL3.9/4', 'H3.8-an', 'CR-an',
'L/LL5-6', 'L(LL)5', 'L(LL)6', 'LL3.1-3.5', 'E5', 'Lodranite-an',
'H3.2-6', 'H(?)4', 'E5-an', 'H3.2-an', 'EH6-an', 'Stone-ung',
'C1/2-ung', 'L/LL'], dtype=object)
General statistics#
The .describe()
method includes numeric columns by default. Here we can force it to include all columns.
meteor.describe(include='all')
name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | |
---|---|---|---|---|---|---|---|---|---|---|
count | 45716 | 45716.000000 | 45716 | 45716 | 4.558500e+04 | 45716 | 45425.000000 | 38401.000000 | 38401.000000 | 38401 |
unique | 45716 | NaN | 2 | 466 | NaN | 2 | NaN | NaN | NaN | 17100 |
top | Aachen | NaN | Valid | L6 | NaN | Found | NaN | NaN | NaN | (0.0, 0.0) |
freq | 1 | NaN | 45641 | 8285 | NaN | 44609 | NaN | NaN | NaN | 6214 |
mean | NaN | 26889.735104 | NaN | NaN | 1.327808e+04 | NaN | 1991.828817 | -39.122580 | 61.074319 | NaN |
std | NaN | 16860.683030 | NaN | NaN | 5.749889e+05 | NaN | 25.052766 | 46.378511 | 80.647298 | NaN |
min | NaN | 1.000000 | NaN | NaN | 0.000000e+00 | NaN | 860.000000 | -87.366670 | -165.433330 | NaN |
25% | NaN | 12688.750000 | NaN | NaN | 7.200000e+00 | NaN | 1987.000000 | -76.714240 | 0.000000 | NaN |
50% | NaN | 24261.500000 | NaN | NaN | 3.260000e+01 | NaN | 1998.000000 | -71.500000 | 35.666670 | NaN |
75% | NaN | 40656.750000 | NaN | NaN | 2.026000e+02 | NaN | 2003.000000 | 0.000000 | 157.166670 | NaN |
max | NaN | 57458.000000 | NaN | NaN | 6.000000e+07 | NaN | 2101.000000 | 81.166670 | 354.473330 | NaN |
Note: NaN
values signify missing data. For instance, the fall column contains strings, so there is no value for mean; likewise, mass (g) is numeric, so we don’t have entries for the categorical summary statistics (unique, top, freq).
Practice: Yellow cab data#
The following command downloads a .parquet
file containing NYC Yellow Taxi data, a common storage format for moderate to large datasets.
!curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
taxi = pd.read_parquet('yellow_tripdata_2024-01.parquet')
Examine the first five rows of the dataset.
How much data are included in this dataset?
Calculate summary statistics for the
fare_amount
,tolls_amount
, andtip_amount
. Do they add up to thetotal_amount
?
Find the trip that has the longest trip by distance (
trip_distance
).
Compare the average
total_amount
for short versus long trips (short trip hastrip_distance
< 10). Make sure we do not include zero-distance trips.
Data cleaning#
We will cover some common transformations that facilitate data analysis, including rearranging columns, type conversion, and sorting.
minitaxi = taxi.sample(10000) # down-sample our dataset for illustration
minitaxi.shape
minitaxi.head()
Dropping columns#
# drop all id columns and the store_and_fwd_flag column
mask = minitaxi.columns.str.contains('ID$|store_and_fwd_flag', regex=True)
columns_to_drop = minitaxi.columns[mask]
columns_to_drop
minitaxi = minitaxi.drop(columns=columns_to_drop)
minitaxi.head()
Renaming columns#
minitaxi = minitaxi.rename(
columns={
'tpep_pickup_datetime': 'pickup',
'tpep_dropoff_datetime': 'dropoff'
}
)
minitaxi.columns
Examine the correct data types (type conversion)#
minitaxi.dtypes
minitaxi.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9524 entries, 597270 to 2512071
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pickup 9524 non-null datetime64[us]
1 dropoff 9524 non-null datetime64[us]
2 passenger_count 9524 non-null float64
3 trip_distance 9524 non-null float64
4 payment_type 9524 non-null int64
5 fare_amount 9524 non-null float64
6 extra 9524 non-null float64
7 mta_tax 9524 non-null float64
8 tip_amount 9524 non-null float64
9 tolls_amount 9524 non-null float64
10 improvement_surcharge 9524 non-null float64
11 total_amount 9524 non-null float64
12 congestion_surcharge 9524 non-null float64
13 Airport_fee 9524 non-null float64
dtypes: datetime64[us](2), float64(11), int64(1)
memory usage: 1.1 MB
# cast passenger_count to integers
minitaxi['passenger_count'] = minitaxi['passenger_count'].astype(int)
minitaxi.dtypes
pickup datetime64[us]
dropoff datetime64[us]
passenger_count int32
trip_distance float64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
congestion_surcharge float64
Airport_fee float64
dtype: object
Dropping rows with NA values#
minitaxi = minitaxi.dropna()
minitaxi.info()
Creating new columns#
Let’s calculate the following for each row:
elapsed time of the trip
the tip percentage
the total taxes, tolls, fees, and surcharges
the average speed of the taxi
minitaxi = minitaxi.assign(
elapsed_time=lambda x: x.dropoff - x.pickup,
cost_before_tip=lambda x: x.total_amount - x.tip_amount,
tip_pct=lambda x: x.tip_amount / x.cost_before_tip,
fees=lambda x: x.cost_before_tip - x.fare_amount,
avg_speed=lambda x: x.trip_distance.div(
x.elapsed_time.dt.total_seconds() / 60 / 60
)
)
minitaxi.head(2)
Notes:
We used
lambda
functions to 1) avoid typing taxis repeatedly and 2) be able to access thecost_before_tip
andelapsed_time
columns in the same method that we create them.To create a single new column, we can also use
df['new_col'] = <values>
.
Sorting by values#
# sort by descending passenger count and pickups from earliest to latest
minitaxi.sort_values(['passenger_count', 'pickup'], ascending=[False, True]).head()
# pick out the 3 trips with largest timespan
minitaxi.nlargest(3, 'elapsed_time')
Working with index#
Currently the index is simply using the row numbers, but if we wish to work with the pickup times significantly, perhaps indexing by the datetime column is more effective.
minitaxi.index
Setting index#
minitaxi = minitaxi.set_index('pickup')
minitaxi.head(3)
Sorting by index#
# sorting by indices
minitaxi = minitaxi.sort_index()
minitaxi.head()
dropoff | passenger_count | trip_distance | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | elapsed_time | cost_before_tip | tip_pct | fees | avg_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pickup | ||||||||||||||||||
2024-01-01 00:10:19 | 2024-01-01 00:22:58 | 1 | 1.92 | 1 | 13.5 | 1.0 | 0.5 | 3.70 | 0.0 | 1.0 | 22.20 | 2.5 | 0.0 | 0 days 00:12:39 | 18.5 | 0.200000 | 5.0 | 9.106719 |
2024-01-01 00:15:10 | 2024-01-01 00:22:51 | 1 | 1.64 | 1 | 10.0 | 1.0 | 0.5 | 3.00 | 0.0 | 1.0 | 18.00 | 2.5 | 0.0 | 0 days 00:07:41 | 15.0 | 0.200000 | 5.0 | 12.806941 |
2024-01-01 00:19:09 | 2024-01-01 00:41:03 | 3 | 9.12 | 1 | 38.7 | 1.0 | 0.5 | 8.74 | 0.0 | 1.0 | 52.44 | 2.5 | 0.0 | 0 days 00:21:54 | 43.7 | 0.200000 | 5.0 | 24.986301 |
2024-01-01 00:20:34 | 2024-01-01 00:25:54 | 1 | 0.80 | 1 | 6.5 | 3.5 | 0.5 | 2.30 | 0.0 | 1.0 | 13.80 | 2.5 | 0.0 | 0 days 00:05:20 | 11.5 | 0.200000 | 5.0 | 9.000000 |
2024-01-01 00:23:35 | 2024-01-01 00:49:13 | 1 | 3.08 | 1 | 23.3 | 1.0 | 0.5 | 7.08 | 0.0 | 1.0 | 35.38 | 2.5 | 0.0 | 0 days 00:25:38 | 28.3 | 0.250177 | 5.0 | 7.209363 |
Recall how we have used [0:4]
to locate the rows indexed 0 to 4. Since we are indexing by datetime, we can select within time ranges.
Selecting by index#
# selecting the taxi rides in the first 6 hours of the new year
minitaxi['2024-01-01 00:00':'2024-01-01 06:00']
dropoff | passenger_count | trip_distance | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | elapsed_time | cost_before_tip | tip_pct | fees | avg_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pickup | ||||||||||||||||||
2024-01-01 00:10:19 | 2024-01-01 00:22:58 | 1 | 1.92 | 1 | 13.5 | 1.0 | 0.5 | 3.70 | 0.0 | 1.0 | 22.20 | 2.5 | 0.0 | 0 days 00:12:39 | 18.5 | 0.200000 | 5.0 | 9.106719 |
2024-01-01 00:15:10 | 2024-01-01 00:22:51 | 1 | 1.64 | 1 | 10.0 | 1.0 | 0.5 | 3.00 | 0.0 | 1.0 | 18.00 | 2.5 | 0.0 | 0 days 00:07:41 | 15.0 | 0.200000 | 5.0 | 12.806941 |
2024-01-01 00:19:09 | 2024-01-01 00:41:03 | 3 | 9.12 | 1 | 38.7 | 1.0 | 0.5 | 8.74 | 0.0 | 1.0 | 52.44 | 2.5 | 0.0 | 0 days 00:21:54 | 43.7 | 0.200000 | 5.0 | 24.986301 |
2024-01-01 00:20:34 | 2024-01-01 00:25:54 | 1 | 0.80 | 1 | 6.5 | 3.5 | 0.5 | 2.30 | 0.0 | 1.0 | 13.80 | 2.5 | 0.0 | 0 days 00:05:20 | 11.5 | 0.200000 | 5.0 | 9.000000 |
2024-01-01 00:23:35 | 2024-01-01 00:49:13 | 1 | 3.08 | 1 | 23.3 | 1.0 | 0.5 | 7.08 | 0.0 | 1.0 | 35.38 | 2.5 | 0.0 | 0 days 00:25:38 | 28.3 | 0.250177 | 5.0 | 7.209363 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-01-01 04:53:26 | 2024-01-02 03:11:13 | 6 | 9.97 | 1 | 52.0 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 57.00 | 2.5 | 0.0 | 0 days 22:17:47 | 57.0 | 0.000000 | 5.0 | 0.447158 |
2024-01-01 05:06:58 | 2024-01-01 05:10:03 | 1 | 0.63 | 1 | 5.8 | 1.0 | 0.5 | 2.16 | 0.0 | 1.0 | 12.96 | 2.5 | 0.0 | 0 days 00:03:05 | 10.8 | 0.200000 | 5.0 | 12.259459 |
2024-01-01 05:16:59 | 2024-01-01 05:35:39 | 1 | 4.26 | 1 | 22.6 | 1.0 | 0.5 | 5.52 | 0.0 | 1.0 | 33.12 | 2.5 | 0.0 | 0 days 00:18:40 | 27.6 | 0.200000 | 5.0 | 13.692857 |
2024-01-01 05:17:26 | 2024-01-01 05:22:32 | 1 | 1.90 | 1 | 10.0 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 15.00 | 2.5 | 0.0 | 0 days 00:05:06 | 15.0 | 0.000000 | 5.0 | 22.352941 |
2024-01-01 05:29:34 | 2024-01-01 05:38:38 | 1 | 1.57 | 1 | 10.7 | 1.0 | 0.5 | 0.00 | 0.0 | 1.0 | 15.70 | 2.5 | 0.0 | 0 days 00:09:04 | 15.7 | 0.000000 | 5.0 | 10.389706 |
70 rows × 18 columns
Resetting index#
We can revert any specific column index back to row numbers, but notice that by setting and resetting index you lose the original row numbers.
minitaxi = minitaxi.reset_index()
minitaxi.head()
pickup | dropoff | passenger_count | trip_distance | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee | elapsed_time | cost_before_tip | tip_pct | fees | avg_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2024-01-01 00:10:19 | 2024-01-01 00:22:58 | 1 | 1.92 | 1 | 13.5 | 1.0 | 0.5 | 3.70 | 0.0 | 1.0 | 22.20 | 2.5 | 0.0 | 0 days 00:12:39 | 18.5 | 0.200000 | 5.0 | 9.106719 |
1 | 2024-01-01 00:15:10 | 2024-01-01 00:22:51 | 1 | 1.64 | 1 | 10.0 | 1.0 | 0.5 | 3.00 | 0.0 | 1.0 | 18.00 | 2.5 | 0.0 | 0 days 00:07:41 | 15.0 | 0.200000 | 5.0 | 12.806941 |
2 | 2024-01-01 00:19:09 | 2024-01-01 00:41:03 | 3 | 9.12 | 1 | 38.7 | 1.0 | 0.5 | 8.74 | 0.0 | 1.0 | 52.44 | 2.5 | 0.0 | 0 days 00:21:54 | 43.7 | 0.200000 | 5.0 | 24.986301 |
3 | 2024-01-01 00:20:34 | 2024-01-01 00:25:54 | 1 | 0.80 | 1 | 6.5 | 3.5 | 0.5 | 2.30 | 0.0 | 1.0 | 13.80 | 2.5 | 0.0 | 0 days 00:05:20 | 11.5 | 0.200000 | 5.0 | 9.000000 |
4 | 2024-01-01 00:23:35 | 2024-01-01 00:49:13 | 1 | 3.08 | 1 | 23.3 | 1.0 | 0.5 | 7.08 | 0.0 | 1.0 | 35.38 | 2.5 | 0.0 | 0 days 00:25:38 | 28.3 | 0.250177 | 5.0 | 7.209363 |
Practice: Meteorite indexing#
Using the meteor
dataset,
cast the
year
column to an integer column.create a new column indicating whether the meteorite was observed falling before 1970.
set the index to the id column and extract all the rows with IDs between 10,036 and 10,040 (inclusive) with
loc[]
.examine the
year
column to see if there are any data errors.
import pandas as pd
meteor = pd.read_csv('../data/Meteorite_Landings.csv')
#meteor = meteor.dropna()
meteor['year'] = meteor['year'].astype(int)
meteor.info()
<class 'pandas.core.frame.DataFrame'>
Index: 38115 entries, 0 to 45715
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 38115 non-null object
1 id 38115 non-null int64
2 nametype 38115 non-null object
3 recclass 38115 non-null object
4 mass (g) 38115 non-null float64
5 fall 38115 non-null object
6 year 38115 non-null int32
7 reclat 38115 non-null float64
8 reclong 38115 non-null float64
9 GeoLocation 38115 non-null object
dtypes: float64(3), int32(1), int64(1), object(5)
memory usage: 3.1+ MB
meteor = meteor.set_index('id')
meteor.set_index('id')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_26756\953505260.py in ?()
----> 1 meteor.set_index('id')
~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py in ?(self, keys, drop, append, inplace, verify_integrity)
6118 if not found:
6119 missing.append(col)
6120
6121 if missing:
-> 6122 raise KeyError(f"None of {missing} are in the columns")
6123
6124 if inplace:
6125 frame = self
KeyError: "None of ['id'] are in the columns"
meteor.sort_index().loc[10036:10040]
<class 'pandas.core.frame.DataFrame'>
Index: 38115 entries, 1 to 30414
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 38115 non-null object
1 nametype 38115 non-null object
2 recclass 38115 non-null object
3 mass (g) 38115 non-null float64
4 fall 38115 non-null object
5 year 38115 non-null int32
6 reclat 38115 non-null float64
7 reclong 38115 non-null float64
8 GeoLocation 38115 non-null object
dtypes: float64(3), int32(1), object(5)
memory usage: 2.8+ MB