Using Pandas for Complex Data Instead of Spreadsheets
The Pain of Using Spreadsheets
Spreadsheets are great, but they can become a pain when you are dealing with complex data:
- Calculations are often not reproducible.
- Data can be overwritten in the spreadsheet.
- Data cleaning may overwrite the original data.
- Sharing spreadsheets is difficult.
- Combining data from multiple spreadsheets is difficult.
- Spreadsheets often demonstrate poor performance.
- Large datasets are not handled well.

Pandas to the Rescue
Fortunately, we have Pandas to help us munge data in Python.
- Pandas is one of the most powerful open-source libraries in Python for analyzing and manipulating data.
- The library was created in 2008 at AQR Capital, when Wes McKinney was looking for a high-performance, flexible tool for quantitative analysis of financial data.
- Etymology: the name derives from "panel data", a term for multidimensional structured datasets.
Why Pandas is Great
- Python + Pandas = a strong combination, both for small experiments and for large-scale production systems that analyze data to support smarter decisions.
- High-performance data structures:
  - Series (1D labeled vectors)
  - DataFrame (2D structures similar to spreadsheets)
  - Panel (collection of DataFrames as 3D labeled arrays; deprecated and later removed from pandas, with MultiIndex DataFrames as the usual replacement)
- Built-in time series functionality, which is a must for financial and quantitative analysis
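The structures listed above can be sketched in a few lines. This is a minimal illustration, not part of the original activity: the names, values, and dates below are made up, and `resample` stands in as one example of the time series functionality.

```python
import pandas as pd

# Series: a 1D labeled vector (illustrative values)
prices = pd.Series([84.33, 879.95, 907.58], index=["a", "b", "c"], name="SalePrice")

# DataFrame: a 2D table whose columns share one row index
sales = pd.DataFrame(
    {"FullName": ["Ann", "Bob", "Cy"], "SalePrice": [84.33, 879.95, 907.58]}
)

# Time series: a DatetimeIndex enables date-based operations such as resampling
daily = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
two_day_totals = daily.resample("2D").sum()  # sum values in 2-day bins

print(prices["b"])      # look up a value by its label
print(sales.shape)      # (rows, columns)
print(two_day_totals)
```

Label-based lookup (`prices["b"]`) is the main difference from a plain Python list: values are addressed by index label rather than by position alone.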
Reading a CSV file as a Pandas DataFrame
Use the following CSV files for this activity: sales.csv and sales_no_header.csv (both read from ../Resources in the cells below).
In [1]:
# Initial imports
import pandas as pd
from pathlib import Path
Step 1: Create a path to the file
In [2]:
csvpath = Path("../Resources/sales.csv")
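As a small aside on Step 1, a `Path` object can also be assembled from parts and inspected before it is handed to Pandas. The sketch below assumes the same `../Resources` folder used in this notebook; the variable `resources_dir` is a hypothetical helper name.

```python
from pathlib import Path

# Build the path from parts with the "/" operator instead of string concatenation
resources_dir = Path("..") / "Resources"  # hypothetical helper variable
csvpath = resources_dir / "sales.csv"

print(csvpath.name)      # file name component
print(csvpath.suffix)    # file extension
print(csvpath.exists())  # True only if the file is found relative to the working directory
```

Checking `exists()` first can make a wrong working directory easier to spot than the error raised later by `pd.read_csv`.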
Step 2: Read the CSV into a DataFrame using Pandas
In [3]:
sales_dataframe = pd.read_csv(csvpath)
sales_dataframe.head()
Out [3]:
|  | FullName | Email | Address | Zip | CreditCard | SalePrice |
|---|---|---|---|---|---|---|
| 0 | Elwanda White | alyre2036@live.com | 352 Lakeshore Mall | 9236 | 5327 0855 9720 7055 | 84.33 |
| 1 | Lyndon Elliott | arrowy1873@outlook.com | 1234 Avery Plaza | 1330 | 3717 498777 19636 | 879.95 |
| 2 | Daisey Sellers | toucan2024@outlook.com | 469 Elwood Street | 7631 | 3758 579477 35734 | 907.58 |
| 3 | Issac Reeves | asarin1958@gmail.com | 565 Phelps Field | 81168 | 4400 0380 4162 1622 | 545.88 |
| 4 | Bradford Kinney | mibound1801@yandex.com | 853 Mission Rock Freeway | 41721 | 3712 263405 60178 | 517.49 |
Reading a CSV with no Header
In [4]:
# Without a header in the CSV, Pandas will use the first row as the header
csvpath = Path("../Resources/sales_no_header.csv")
sales_data = pd.read_csv(csvpath)
sales_data.head()
Out [4]:
|  | Elwanda White | alyre2036@live.com | 352 Lakeshore Mall | 9236 | 5327 0855 9720 7055 | 84.33 |
|---|---|---|---|---|---|---|
| 0 | Lyndon Elliott | arrowy1873@outlook.com | 1234 Avery Plaza | 1330 | 3717 498777 19636 | 879.95 |
| 1 | Daisey Sellers | toucan2024@outlook.com | 469 Elwood Street | 7631 | 3758 579477 35734 | 907.58 |
| 2 | Issac Reeves | asarin1958@gmail.com | 565 Phelps Field | 81168 | 4400 0380 4162 1622 | 545.88 |
| 3 | Bradford Kinney | mibound1801@yandex.com | 853 Mission Rock Freeway | 41721 | 3712 263405 60178 | 517.49 |
| 4 | Fermina Cobb | kingfisher2013@live.com | 929 Prague Trail | 16625 | 2351 7156 8193 8639 | 889.95 |
In [5]:
# Read the file again with header=None so the first row is kept as data
sales_data = pd.read_csv(csvpath, header=None)
sales_data.head()
Out [5]:
|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | Elwanda White | alyre2036@live.com | 352 Lakeshore Mall | 9236 | 5327 0855 9720 7055 | 84.33 |
| 1 | Lyndon Elliott | arrowy1873@outlook.com | 1234 Avery Plaza | 1330 | 3717 498777 19636 | 879.95 |
| 2 | Daisey Sellers | toucan2024@outlook.com | 469 Elwood Street | 7631 | 3758 579477 35734 | 907.58 |
| 3 | Issac Reeves | asarin1958@gmail.com | 565 Phelps Field | 81168 | 4400 0380 4162 1622 | 545.88 |
| 4 | Bradford Kinney | mibound1801@yandex.com | 853 Mission Rock Freeway | 41721 | 3712 263405 60178 | 517.49 |
In [6]:
# Rewrite the column names
columns = ["Full Name", "Email", "Address", "Zip Code", "Credit Card Number", "Sale Price"]
sales_data.columns = columns
sales_data.head()
Out [6]:
|  | Full Name | Email | Address | Zip Code | Credit Card Number | Sale Price |
|---|---|---|---|---|---|---|
| 0 | Elwanda White | alyre2036@live.com | 352 Lakeshore Mall | 9236 | 5327 0855 9720 7055 | 84.33 |
| 1 | Lyndon Elliott | arrowy1873@outlook.com | 1234 Avery Plaza | 1330 | 3717 498777 19636 | 879.95 |
| 2 | Daisey Sellers | toucan2024@outlook.com | 469 Elwood Street | 7631 | 3758 579477 35734 | 907.58 |
| 3 | Issac Reeves | asarin1958@gmail.com | 565 Phelps Field | 81168 | 4400 0380 4162 1622 | 545.88 |
| 4 | Bradford Kinney | mibound1801@yandex.com | 853 Mission Rock Freeway | 41721 | 3712 263405 60178 | 517.49 |
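The two steps above (reading with `header=None`, then assigning column names) can also be combined in one call through the `names` parameter of `pd.read_csv`. The sketch below is an alternative, not the notebook's own method; it reads two records copied from the output above out of an in-memory buffer so that it runs without the Resources folder.

```python
import io
import pandas as pd

# Two records in the same layout as sales_no_header.csv (copied from the output above)
raw = io.StringIO(
    "Elwanda White,alyre2036@live.com,352 Lakeshore Mall,9236,5327 0855 9720 7055,84.33\n"
    "Lyndon Elliott,arrowy1873@outlook.com,1234 Avery Plaza,1330,3717 498777 19636,879.95\n"
)

columns = ["Full Name", "Email", "Address", "Zip Code", "Credit Card Number", "Sale Price"]

# header=None keeps the first record as data; names= labels the columns in the same call
sales_data = pd.read_csv(raw, header=None, names=columns)
print(sales_data.head())
```

Naming the columns at read time avoids an intermediate DataFrame with integer column labels.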
In [7]:
# Generate summary statistics
sales_data.describe()
Out [7]:
|  | Zip Code | Sale Price |
|---|---|---|
| count | 100.000000 | 100.000000 |
| mean | 40952.160000 | 533.007200 |
| std | 30207.118496 | 275.531072 |
| min | 555.000000 | 29.720000 |
| 25% | 11109.750000 | 328.507500 |
| 50% | 40033.500000 | 536.110000 |
| 75% | 65834.750000 | 767.885000 |
| max | 99877.000000 | 998.760000 |
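Note that `describe()` treats Zip Code as a number, so its mean and standard deviation carry no real meaning. One possible remedy, sketched here under assumptions, is to read that column as text with the `dtype` parameter of `pd.read_csv`. The records below are simplified to three columns, and the leading zero in the first zip code is an assumption made for illustration, not a value taken from the sales files.

```python
import io
import pandas as pd

# Simplified, illustrative records; the leading zero in "09236" is assumed
raw = io.StringIO(
    "Elwanda White,09236,84.33\n"
    "Issac Reeves,81168,545.88\n"
)

columns = ["Full Name", "Zip Code", "Sale Price"]

# Reading Zip Code as a string preserves leading zeros and keeps it out of numeric summaries
sales_data = pd.read_csv(raw, header=None, names=columns, dtype={"Zip Code": str})

summary = sales_data.describe()  # by default only numeric columns are summarised
print(summary)
print(sales_data.loc[0, "Zip Code"])
```

With the zip code stored as text, the summary covers only Sale Price, the one column where averages and quartiles are informative.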