Intro to EDA

Published

April 7, 2026

Exploratory data analysis, or EDA, is a standard practice prior to any data manipulation and analysis.

Recall that data engineering is primarily about data preparation to serve smooth and effective data analysis. Exploratory data analysis generally refers to the step of understanding the data:

This document primarily deals with the first two items.

Goals

In the exploratory phase, the summaries and plots are for the people working behind the scenes, not for an outside audience.

The main goals here are:

  • capture main message
  • (relatively) quick exploration across many summaries (including plots)
  • not intended for a client or presentation

What does this translate to, technically?

  • each summary should have meaningful information
  • label your plots

Data summary

As a starting point, simply looking at the data is worthwhile. Some common questions to consider are the following:

  1. General dataset info: size, dtypes
  2. Missing values?
  3. Duplicate data?
  4. Continuous variables
  5. Categorical variables
  6. Bivariate relationships
  7. Potential data quality issues, e.g., inconsistency, special NA characters
# !pip install pandas seaborn matplotlib
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns
The origin of sns.

Earthquake dataset

Source Link

# load the earthquake dataset into a DataFrame
earthquake = pd.read_csv('https://raw.githubusercontent.com/mosesyhc/de300-2026wi/refs/heads/main/datasets/Canadian-Earthquakes-2010-2019.csv')
# take a glimpse of the data
earthquake.head()
   magnitude_codelist  magnitude  magnitude_type  date                      place                       depth  latitude   longitude    OBJECTID  longitude_geom  latitude_geom
0  <2                  1.7        ML              2010-01-01T00:16:49+0000  81 km NE  of Seattle        0.0    48.192001  -121.677002  1         -121.677315     48.191706
1  2                   2.2        MN              2010-01-01T00:52:50+0000  86 km NW from Maniwaki      18.0   47.028999  -76.583000   2         -76.583303      47.028909
2  <2                  1.8        MN              2010-01-01T03:21:58+0000  21 km NW from Mont-Laurier  18.0   46.651001  -75.734001   3         -75.733902      46.650809
3  <2                  1.5        MN              2010-01-01T04:14:51+0000  CHARLEVOIX SEISMIC ZONE     13.0   47.740002  -69.741997   4         -69.742000      47.740210
4  <2                  1.6        ML              2010-01-01T04:15:17+0000  83 km W   of Gold R.        11.6   49.500999  -127.222000  5         -127.222216     49.500705
# view a summary of the full data
earthquake.info()
<class 'pandas.DataFrame'>
RangeIndex: 44561 entries, 0 to 44560
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   magnitude_codelist  44561 non-null  str    
 1   magnitude           44561 non-null  float64
 2   magnitude_type      44462 non-null  str    
 3   date                44561 non-null  str    
 4   place               44561 non-null  str    
 5   depth               44561 non-null  float64
 6   latitude            44561 non-null  float64
 7   longitude           44561 non-null  float64
 8   OBJECTID            44561 non-null  int64  
 9   longitude_geom      44561 non-null  float64
 10  latitude_geom       44561 non-null  float64
dtypes: float64(6), int64(1), str(4)
memory usage: 5.8 MB
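The non-null counts above suggest that only magnitude_type has missing entries (44462 of 44561 non-null). The sketch below shows one way to count missing values per column and to look for placeholder strings that hide missing values. The `demo` frame is a made-up stand-in so the snippet runs on its own, and the `special_na` list is an assumption for illustration; the same calls should apply to `earthquake`.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the earthquake data (values are made up)
demo = pd.DataFrame({
    'magnitude': [1.7, 2.2, 1.8, 1.5],
    'magnitude_type': ['ML', None, 'MN', 'unknown'],
    'depth': [0.0, 18.0, np.nan, 13.0],
})

# number and share of missing values per column
n_missing = demo.isna().sum()
share_missing = demo.isna().mean()

# placeholder strings that may hide missing values
# (assumed list for illustration; adjust to what the data actually contains)
special_na = ['', 'NA', 'N/A', 'unknown', '-999']
n_special = demo['magnitude_type'].isin(special_na).sum()

print(n_missing)
print(share_missing.round(2))
print('special NA strings in magnitude_type:', n_special)
```

If placeholder strings do turn up, `pd.read_csv` accepts an `na_values` argument so they can be read as missing from the start.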
# check for duplicated rows (also ask whether duplicates make sense here)
# the boolean result can be passed to .loc to select the flagged rows
earthquake.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
44556    False
44557    False
44558    False
44559    False
44560    False
Length: 44561, dtype: bool
# duplicates
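The boolean Series returned by `duplicated()` is hard to read at this length, so it usually helps to count the flags and to pull out the flagged rows with `.loc`. A minimal sketch follows, using a made-up `demo` frame (hypothetical values) in place of `earthquake`; the choice of `subset` columns is only an example.

```python
import pandas as pd

# hypothetical stand-in for the earthquake data (values are made up)
demo = pd.DataFrame({
    'date': ['2010-01-01', '2010-01-01', '2010-01-02', '2010-01-03'],
    'place': ['A', 'A', 'B', 'A'],
    'magnitude': [1.7, 1.7, 2.2, 1.7],
})

# duplicated() flags every repeat of an earlier row; sum() counts the flags
n_dup = demo.duplicated().sum()

# keep=False flags all copies, so .loc shows each duplicated group in full
dup_rows = demo.loc[demo.duplicated(keep=False)]

# duplicates judged on a subset of columns only (here: same place and magnitude)
n_dup_subset = demo.duplicated(subset=['place', 'magnitude']).sum()

print(n_dup, n_dup_subset)
print(dup_rows)
```

Whether a repeat is an error depends on the column: in the summary of the full data, the most frequent date appears 3 times, which could be distinct events recorded at the same second or repeated records.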
# a quick numerical summary 
earthquake.describe(include='all')
        magnitude_codelist  magnitude     magnitude_type  date                      place                    depth         latitude      longitude     OBJECTID      longitude_geom  latitude_geom
count   44561               44561.000000  44462           44561                     44561                    44561.000000  44561.000000  44561.000000  44561.000000  44561.000000    44561.000000
unique  6                   NaN           7               44481                     18639                    NaN           NaN           NaN           NaN           NaN             NaN
top     <2                  NaN           ML              2018-07-02T04:08:13+0000  CHARLEVOIX SEISMIC ZONE  NaN           NaN           NaN           NaN           NaN             NaN
freq    19764               NaN           29509           3                         1250                     NaN           NaN           NaN           NaN           NaN             NaN
mean    NaN                 2.134070      NaN             NaN                       NaN                      12.852194     53.351863     -118.953322   22281.000000  -118.953299     53.351830
std     NaN                 0.828096      NaN             NaN                       NaN                      9.963145      6.214464      23.696484     12863.797009  23.696493       6.214465
min     NaN                 -1.400000     NaN             NaN                       NaN                      -0.500000     40.808998     -148.811005   1.000000      -148.810526     40.808509
25%     NaN                 1.600000      NaN             NaN                       NaN                      5.000000      49.169998     -132.427994   11141.000000  -132.427618     49.170009
50%     NaN                 2.100000      NaN             NaN                       NaN                      10.000000     52.137001     -129.671997   22281.000000  -129.672016     52.136507
75%     NaN                 2.700000      NaN             NaN                       NaN                      18.000000     56.514999     -121.947998   33421.000000  -121.948318     56.515206
max     NaN                 7.700000      NaN             NaN                       NaN                      214.000000    82.608002     -39.320000    44561.000000  -39.319968      82.607812
# checks for possible statistical assumption(s)
import scipy.stats as sps

sps.normaltest(earthquake['magnitude'])
NormaltestResult(statistic=np.float64(1597.6658124732553), pvalue=np.float64(0.0))
# extract only numeric variables
earthquake.select_dtypes(include='number')
# for example, the Shapiro-Wilk normality test
sps.shapiro(earthquake['magnitude'])
UserWarning: scipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is 44561.
ShapiroResult(statistic=np.float64(0.9877396508026011), pvalue=np.float64(1.179774985585813e-49))
# for example, another normality test
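One possible choice for a further normality test is the Jarque-Bera test, which is based on sample skewness and kurtosis; it is swapped in here as an illustration and is not necessarily the test intended in these notes. The sketch runs it on two made-up samples (stand-ins for `earthquake['magnitude']`) so that the snippet is self-contained.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)
# made-up samples standing in for earthquake['magnitude']
normal_sample = rng.normal(loc=2.1, scale=0.8, size=2000)
skewed_sample = rng.exponential(scale=2.0, size=2000)

# Jarque-Bera compares sample skewness and kurtosis with those of a normal distribution
res_normal = sps.jarque_bera(normal_sample)
res_skewed = sps.jarque_bera(skewed_sample)

print(f'normal sample: statistic={res_normal.statistic:.2f}, p-value={res_normal.pvalue:.3g}')
print(f'skewed sample: statistic={res_skewed.statistic:.2f}, p-value={res_skewed.pvalue:.3g}')
```

With tens of thousands of rows, tests like these tend to reject normality even for small departures, so a histogram or Q-Q plot is a useful companion to the p-value.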
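# pairwise correlation between the numeric columns
import numpy as np
earthquake_num = earthquake.select_dtypes('number')
# rowvar=False treats each column as a variable; the default (rowvar=True) would correlate rows
np.corrcoef(earthquake_num, rowvar=False)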

Data visualization

sns.set(context='talk', style='ticks')  # simply for aesthetics
sns.set_palette('magma')
%matplotlib inline 

# earthquake = earthquake.sample(n=500)  # (if too slow) for illustration purposes
# histogram for continuous variables using pandas built-in plots 
# relative frequency? ...
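A possible sketch for the two histogram prompts above: counts per bin with the pandas built-in plot, and relative frequencies obtained by weighting each observation by 1/n. The `demo` data are randomly generated stand-ins for `earthquake['magnitude']`, and the bin count is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# made-up magnitudes standing in for earthquake['magnitude']
demo = pd.DataFrame({'magnitude': rng.normal(2.1, 0.8, size=1000)})

fig, (ax_count, ax_rel) = plt.subplots(1, 2, figsize=(12, 4))

# counts per bin, using the pandas built-in plot
demo['magnitude'].plot.hist(bins=30, ax=ax_count)
ax_count.set(xlabel='magnitude', ylabel='count', title='Counts')

# relative frequency: weight each observation by 1/n so the bar heights sum to 1
weights = np.ones(len(demo)) / len(demo)
ax_rel.hist(demo['magnitude'], bins=30, weights=weights)
ax_rel.set(xlabel='magnitude', ylabel='relative frequency', title='Relative frequency')

fig.tight_layout()
```

Note that `density=True` is a different normalization: it scales the bars so that their total area, not their total height, equals 1.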
# histogram of magnitudes by group
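For a histogram by group, one option is to overlay one histogram per level of a categorical column on shared bins. The sketch below groups made-up magnitudes by a made-up `magnitude_type`; using that column as the grouping variable is an assumption, and seaborn's `histplot` with `hue=` is an alternative to the explicit loop.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# made-up data standing in for the earthquake columns of the same names
demo = pd.DataFrame({
    'magnitude': np.concatenate([rng.normal(1.8, 0.6, 600), rng.normal(2.6, 0.7, 400)]),
    'magnitude_type': ['ML'] * 600 + ['MN'] * 400,
})

# shared bin edges so the groups are comparable bar for bar
bins = np.linspace(demo['magnitude'].min(), demo['magnitude'].max(), 31)

fig, ax = plt.subplots(figsize=(8, 4))
for name, group in demo.groupby('magnitude_type'):
    # density=True normalizes each group separately, so a large group does not dominate
    ax.hist(group['magnitude'], bins=bins, density=True, alpha=0.5, label=name)

ax.set(xlabel='magnitude', ylabel='density', title='Magnitude by magnitude type')
ax.legend(title='magnitude type')
```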
# other types of plots
# counts for categorical variables
# barplots by group
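For the categorical prompts above, `value_counts` and `pd.crosstab` give the counts, and the pandas built-in bar plots draw them. This is a sketch on a tiny made-up `demo` frame; pairing `magnitude_type` with `magnitude_codelist` is an illustrative choice, not something prescribed by the notes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical stand-in for two categorical columns of the earthquake data
demo = pd.DataFrame({
    'magnitude_type': ['ML', 'ML', 'MN', 'ML', 'MN', 'MW', 'ML', 'MN'],
    'magnitude_codelist': ['<2', '2', '<2', '<2', '3', '4', '2', '<2'],
})

# counts per category; dropna=False keeps missing values as their own category
type_counts = demo['magnitude_type'].value_counts(dropna=False)

# counts of one categorical variable within each level of another
by_group = pd.crosstab(demo['magnitude_type'], demo['magnitude_codelist'])

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 4))
type_counts.plot.bar(ax=ax_left)
ax_left.set(xlabel='magnitude type', ylabel='count')

# one bar per (type, code) combination, grouped by magnitude type
by_group.plot.bar(ax=ax_right)
ax_right.set(xlabel='magnitude type', ylabel='count')
ax_right.legend(title='magnitude code')
fig.tight_layout()
```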
# bivariate plots
# bivariate plots (log-log)
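A sketch of a bivariate scatter plot on linear and on log-log axes. Log axes need strictly positive values, and the numerical summary shows minima of -0.5 for depth and -1.4 for magnitude, so the log-log panel filters first. The `demo` data are randomly generated stand-ins, and pairing depth with magnitude is only an example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# made-up data standing in for earthquake[['depth', 'magnitude']]
demo = pd.DataFrame({
    'depth': rng.gamma(shape=2.0, scale=6.0, size=500),
    'magnitude': rng.normal(2.1, 0.8, size=500),
})
# mimic the non-positive minima seen in the numerical summary
demo.loc[0, ['depth', 'magnitude']] = [-0.5, -1.4]

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(12, 4))

demo.plot.scatter(x='depth', y='magnitude', alpha=0.3, ax=ax_lin)
ax_lin.set(xlabel='depth', ylabel='magnitude', title='linear axes')

# keep only rows where both variables are strictly positive
positive = demo[(demo['depth'] > 0) & (demo['magnitude'] > 0)]
positive.plot.scatter(x='depth', y='magnitude', alpha=0.3, ax=ax_log)
ax_log.set_xscale('log')
ax_log.set_yscale('log')
ax_log.set(xlabel='depth', ylabel='magnitude', title='log-log axes')

fig.tight_layout()
print(f'dropped {len(demo) - len(positive)} non-positive rows for the log-log panel')
```

Reporting how many rows the filter drops keeps the plot honest about what it leaves out.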
# pairwise plots  (time-consuming)
# another pairwise plot by group

In-class activity

Referring to the following figure, choose two subfigures to reproduce with the earthquake dataset.

(In case you need this) Jupyter notebook setup

Visit https://docs.jupyter.org/en/latest/install/notebook-classic.html for guidance on setting up Jupyter Notebook.


Note: These notes are adapted from a blog post on Tom’s Blog.