Reading Financial Statements with Python Pandas
It’s that time of year again: the fiscal year is ending, and it’s time to begin the financial spring cleaning! Within a month, auditors will be camped out at your investment firm, inspecting everyone’s trades and the company’s end-of-year financial statements. All of the firm’s traders are under a lot of pressure to finalize their portfolio earnings and deliver them to their managers. That is, all traders except you.

You automated your end-of-year financial reporting last week, and now you’re using the pipeline to help Harold with his reports. Before loading Harold’s stock ticker data into Pandas, you open the Excel file he sent you to look at the quality of the data. You realize that Harold has not subscribed to any data quality standards and that the data is a mess.
For this activity, use Pandas to clean Harold’s portfolio data to get it fit for use.
Starter file

Harold's stock data is a mess! Help him clean up his data before the auditors arrive!

# Initial imports
import pandas as pd
from pathlib import Path

Load CSV data into Pandas using read_csv
Identify the number of rows and columns (shape) in the DataFrame.
Preview the DataFrame using head to visually ensure data has been loaded in correctly.
Identify the number of records in the DataFrame, and compare it with the number of rows in the original file.
Identify null records
Drop Null Records
Validate nulls have been dropped
Default null ebitda values to 0. Then, validate no records are null for ebitda.
Drop Duplicates
Challenge
Preview price field using the head function.
Clean price Series by replacing $
Confirm data type of price
Cast price Series as float and then validate using dtype
Instructions
Using the starter file and Harold’s financial data, complete the following steps.
- Load CSV data into Pandas using read_csv.
- Identify the number of rows and columns in the DataFrame, otherwise known as its shape/structure.
- Preview the DataFrame using head to visually ensure data has been loaded in correctly.
- Identify the number of records in the DataFrame, and compare it with the number of rows in the original file.
- Identify null records by calculating the average percent of nulls for each Series. Hint: This step will require the mean function.
- Drop null records.
- Validate all nulls have been dropped by calculating the sum of values that are null.
- Default null ebitda values to 0.
- Check that there are no null ebitda values using the sum function.
- Remove duplicate rows.
Challenge
Now that nulls and duplicates have been wrangled, clean up the data a little more by removing the $ currency symbols from the price field. Then, use the astype function to cast price to a float, and validate using dtype.
Hint
Pandas offers a replace function that can be executed against a Series; see the pandas Series.str.replace documentation for details.
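A minimal, hypothetical example of that replace call on a string-typed Series (passing regex=False so the "$" is treated literally rather than as a regex end-of-string anchor):

```python
import pandas as pd

# Prices read from a CSV often arrive as strings with currency symbols.
prices = pd.Series(["$222.89", "$56.27", "108.48"])

# Strip the literal "$" from each value.
cleaned = prices.str.replace("$", "", regex=False)

print(cleaned.tolist())               # ['222.89', '56.27', '108.48']
print(cleaned.astype("float").dtype)  # float64
```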
Solution
Harold's stock data is a mess! Help him clean up his data before the auditors arrive!
In [1]: # Initial imports
import pandas as pd
from pathlib import Path

Load CSV data into Pandas using read_csv
In [2]: csv_path = Path("../../Resources/stock_data.csv")
csv_data = pd.read_csv(csv_path)

Identify the number of rows and columns (shape) in the DataFrame.
In [3]: csv_data.shape
Out [3]: (504, 14)

Generate a sample of the data to visually ensure data has been loaded in correctly.
In [4]: csv_data.head()
Out [4]:

Identify the number of records in the DataFrame, and compare it with the number of rows in the original file.
In [5]: csv_data.count()
Out [5]: symbol 504
name 502
sector 501
price 500
price_per_earnings 497
dividend_yield 499
earnings_per_share 498
52_week_low 500
52_week_high 500
market_cap 500
ebitda 492
price_per_sales 500
price_per_book 492
sec_filings 500
dtype: int64

Identify null records
In [6]: csv_data.isnull().mean() * 100
Out [6]: symbol 0.000000
name 0.396825
sector 0.595238
price 0.793651
price_per_earnings 1.388889
dividend_yield 0.992063
earnings_per_share 1.190476
52_week_low 0.793651
52_week_high 0.793651
market_cap 0.793651
ebitda 2.380952
price_per_sales 0.793651
price_per_book 2.380952
sec_filings 0.793651
dtype: float64

Drop Null Records
In [7]: csv_data = csv_data.dropna().copy()

Validate nulls have been dropped
In [8]: csv_data.isnull().sum()
Out [8]: symbol 0
name 0
sector 0
price 0
price_per_earnings 0
dividend_yield 0
earnings_per_share 0
52_week_low 0
52_week_high 0
market_cap 0
ebitda 0
price_per_sales 0
price_per_book 0
sec_filings 0
dtype: int64

Default null ebitda values to 0. Then, validate no records are null for ebitda.
In [9]: csv_data["ebitda"] = csv_data["ebitda"].fillna(0)
csv_data["ebitda"].isnull().sum()
Out [9]: 0

Drop Duplicates
In [10]: csv_data = csv_data.drop_duplicates().copy()
Challenge
Sample price field
In [11]: csv_data["price"].head(10)
Out [11]: 0 $222.89
2 56.27
3 108.48
5 108.48
6 185.16
7 109.63
10 178
11 179.11
14 152.8
15 62.49
Name: price, dtype: object

Clean price Series by replacing $
In [12]: csv_data["price"] = csv_data["price"].str.replace("$", "", regex=False)
csv_data["price"].head(10)
Out [12]: 0 222.89
2 56.27
3 108.48
5 108.48
6 185.16
7 109.63
10 178
11 179.11
14 152.8
15 62.49
Name: price, dtype: object

Confirm data type of price
In [13]: csv_data["price"].dtype
Out [13]: dtype('O')

Cast price Series as float
In [14]: csv_data["price"] = csv_data["price"].astype('float')
csv_data["price"].dtype
Out [14]: dtype('float64')
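As a variation on the solution above (not part of the original activity), pd.to_numeric with errors="coerce" performs the same cast but turns any value that still won't parse into NaN instead of raising, which makes stray bad rows easy to spot:

```python
import pandas as pd

# One value here is deliberately unparseable.
prices = pd.Series(["$222.89", "178", "n/a"])

as_float = pd.to_numeric(
    prices.str.replace("$", "", regex=False),  # strip the literal "$"
    errors="coerce",                           # unparseable values become NaN
)

print(as_float.dtype)           # float64
print(as_float.isnull().sum())  # 1 value failed to parse
```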