Reading Financial Statements with Python Pandas
It’s that time of year again: the fiscal year is ending, and it’s time to begin the financial spring cleaning! Within a month, auditors will be camped out at your investment firm, inspecting everyone’s trades and the company’s end-of-year financial statements. All of the firm’s traders are under a lot of pressure to finalize their portfolio earnings and deliver them to their managers. That is, all traders except you.

You automated your end-of-year financial reporting last week, and now you’re using the pipeline to help Harold with his reports. Before loading Harold’s stock ticker data into Pandas, you open the Excel file he sent you to look at the quality of the data. You realize that Harold has not subscribed to any data quality standards and that the data is a mess.
For this activity, use Pandas to clean Harold’s portfolio data to get it fit for use.
Starter file

Harold's stock data is a mess! Help him clean up his data before the auditors arrive!

# Initial imports
import pandas as pd
from pathlib import Path

Load CSV data into Pandas using read_csv
Identify the number of rows and columns (shape) in the DataFrame.
Preview the DataFrame using head to visually ensure data has been loaded in correctly.
Identify the number of records in the DataFrame, and compare it with the number of rows in the original file.
Identify null records
Drop Null Records
Validate nulls have been dropped
Default null ebitda values to 0. Then, validate no records are null for ebitda.
Drop Duplicates
Challenge
Preview price field using the head function.
Clean price Series by replacing $
Confirm data type of price
Cast price Series as float and then validate using dtype
Instructions
Using the starter file and Harold’s financial data, complete the following steps.
- Load CSV data into Pandas using read_csv.
- Identify the number of rows and columns in the DataFrame, otherwise known as its shape/structure.
- Preview the DataFrame using head to visually ensure data has been loaded in correctly.
- Identify the number of records in the DataFrame, and compare it with the number of rows in the original file.
- Identify null records by calculating the average percent of nulls for each Series. Hint: This step will require the mean function.
- Drop null records.
- Validate all nulls have been dropped by calculating the sum of values that are null.
- Default null ebitda values to 0.
- Check that there are no null ebitda values using the sum function.
- Remove duplicate rows.
Challenge
Now that nulls and duplicates have been wrangled, clean up the data a little more by removing the $ currency symbols from the price field. Then, use the astype function to cast price to a float, and validate using dtype.
Hint
Pandas offers a replace function that can be executed against a Series; see the pandas Series.str.replace documentation for details.
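A minimal, hypothetical example of that replace call on a string-typed Series (passing regex=False so the "$" is treated literally rather than as a regex end-of-string anchor):

```python
import pandas as pd

# Prices read from a CSV often arrive as strings with currency symbols.
prices = pd.Series(["$222.89", "$56.27", "108.48"])

# Strip the literal "$" from each value.
cleaned = prices.str.replace("$", "", regex=False)

print(cleaned.tolist())               # ['222.89', '56.27', '108.48']
print(cleaned.astype("float").dtype)  # float64
```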
Solution
Harold's stock data is a mess! Help him clean up his data before the auditors arrive!
In [1]: # Initial imports
import pandas as pd
from pathlib import Path

Load CSV data into Pandas using read_csv
In [2]: csv_path = Path("../../Resources/stock_data.csv")
csv_data = pd.read_csv(csv_path)

Identify the number of rows and columns (shape) in the DataFrame.
In [3]: csv_data.shape
Out [3]: (504, 14)

Generate a sample of the data to visually ensure data has been loaded in correctly.
In [4]: csv_data.head()
Out [4]:

Identify the number of records in the DataFrame, and compare it with the number of rows in the original file.
In [5]: csv_data.count()
Out [5]: symbol 504
name 502
sector 501
price 500
price_per_earnings 497
dividend_yield 499
earnings_per_share 498
52_week_low 500
52_week_high 500
market_cap 500
ebitda 492
price_per_sales 500
price_per_book 492
sec_filings 500
dtype: int64

Identify null records
In [6]: csv_data.isnull().mean() * 100
Out [6]: symbol 0.000000
name 0.396825
sector 0.595238
price 0.793651
price_per_earnings 1.388889
dividend_yield 0.992063
earnings_per_share 1.190476
52_week_low 0.793651
52_week_high 0.793651
market_cap 0.793651
ebitda 2.380952
price_per_sales 0.793651
price_per_book 2.380952
sec_filings 0.793651
dtype: float64

Drop Null Records
In [7]: csv_data = csv_data.dropna().copy()

Validate nulls have been dropped
In [8]: csv_data.isnull().sum()
Out [8]: symbol 0
name 0
sector 0
price 0
price_per_earnings 0
dividend_yield 0
earnings_per_share 0
52_week_low 0
52_week_high 0
market_cap 0
ebitda 0
price_per_sales 0
price_per_book 0
sec_filings 0
dtype: int64

Default null ebitda values to 0. Then, validate no records are null for ebitda.
In [9]: csv_data["ebitda"] = csv_data["ebitda"].fillna(0)
csv_data["ebitda"].isnull().sum()
Out [9]: 0

Drop Duplicates
In [10]: csv_data = csv_data.drop_duplicates().copy()
Challenge
Sample price field
In [11]: csv_data["price"].head(10)
Out [11]: 0 $222.89
2 56.27
3 108.48
5 108.48
6 185.16
7 109.63
10 178
11 179.11
14 152.8
15 62.49
Name: price, dtype: object

Clean price Series by replacing $
In [12]: csv_data["price"] = csv_data["price"].str.replace("$", "", regex=False)
csv_data["price"].head(10)
Out [12]: 0 222.89
2 56.27
3 108.48
5 108.48
6 185.16
7 109.63
10 178
11 179.11
14 152.8
15 62.49
Name: price, dtype: object

Confirm data type of price
In [13]: csv_data["price"].dtype
Out [13]: dtype('O')

Cast price Series as float
In [14]: csv_data["price"] = csv_data["price"].astype('float')
csv_data["price"].dtype
Out [14]: dtype('float64')
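As a variation on the solution above (not part of the original activity), pd.to_numeric with errors="coerce" performs the same cast but turns any value that still won't parse into NaN instead of raising, which makes stray bad rows easy to spot:

```python
import pandas as pd

# One value here is deliberately unparseable.
prices = pd.Series(["$222.89", "178", "n/a"])

as_float = pd.to_numeric(
    prices.str.replace("$", "", regex=False),  # strip the literal "$"
    errors="coerce",                           # unparseable values become NaN
)

print(as_float.dtype)           # float64
print(as_float.isnull().sum())  # 1 value failed to parse
```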