Data Cleaning with Python and Pandas

Description

Data Scientists can spend up to 80% of their time cleaning data, but it's necessary work to ensure a well-functioning machine learning model.

I’ll be working with a small real estate dataset and cleaning it of various inconsistencies. Although the dataset is small, it demonstrates many real-world problems Data Scientists encounter on a regular basis.

Raw Data

Below is a table of the raw data as it’s stored in the CSV file.

PID        ST_NUM  ST_NAME     OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT
100001000  104     pUTNAM      Y             3             1         1000
100002000  197     LEXINGTON   N             3             1.5       --
100003000          LEXINGTON   N             n/a           1         850
100004000  201     BERKELEY    12            1             NaN       700
           203     BERKELEY    Y             3             2         1600
100006000  207     berkeley    Y             NA            1         800
100007000  NA      WASHINGTON                2             HURLEY    950
100008000  213     TREMONT     Y             --            1
100009000  215     TREMONT     Y             na            2         1800

Let’s start by asking the following questions:

  • What are the features?
  • What are the expected types (int, float, string, boolean)?
  • Are there obvious missing data values (values that Pandas can detect)?
  • Are there other types of missing data that aren’t so obvious (can’t easily be detected with Pandas)?

Loading Data

We’ll start by getting a quick feel for the data, looking at the first few rows.

# Importing libraries
import pandas as pd
import numpy as np

# Read csv file into a pandas dataframe
df = pd.read_csv("property data.csv")

# Take a look at the data
df
           PID  ST_NUM     ST_NAME OWN_OCCUPIED NUM_BEDROOMS NUM_BATH   SQ_FT
0  100001000.0   104.0      pUTNAM            Y            3        1    1000
1  100002000.0   197.0   LEXINGTON            N            3      1.5      --
2  100003000.0     NaN   LEXINGTON            N          NaN        1     850
3  100004000.0   201.0    BERKELEY           12            1      NaN     700
4          NaN   203.0    BERKELEY            Y            3        2    1600
5  100006000.0   207.0    berkeley            Y          NaN        1     800
6  100007000.0     NaN  WASHINGTON          NaN            2   HURLEY     950
7  100008000.0   213.0     TREMONT            Y           --        1     NaN
8  100009000.0   215.0     TREMONT            Y           na        2    1800
df.columns
Index(['PID', 'ST_NUM', 'ST_NAME', 'OWN_OCCUPIED', 'NUM_BEDROOMS', 'NUM_BATH',
       'SQ_FT'],
      dtype='object')

Now we can start to answer our first question: what are the features? It is easy to infer them from the column names.

  • PID: Property ID
  • ST_NUM: Street number
  • ST_NAME: Street name
  • OWN_OCCUPIED: Is the residence owner occupied
  • NUM_BEDROOMS: Number of bedrooms
  • NUM_BATH: Number of bathrooms
  • SQ_FT: Square footage of the property

We can also answer, what are the expected types?

  • PID: float or int (some sort of numeric type)
  • ST_NUM: float or int (some sort of numeric type)
  • ST_NAME: string
  • OWN_OCCUPIED: string where Y (“Yes”) or N (“No”)
  • NUM_BEDROOMS: float or int (some sort of numeric type)
  • NUM_BATH: float or int (some sort of numeric type)
  • SQ_FT: float or int (some sort of numeric type)

When we look at what the actual data types are, some of them don’t match.

df.dtypes
PID             float64
ST_NUM          float64
ST_NAME          object
OWN_OCCUPIED     object
NUM_BEDROOMS     object
NUM_BATH         object
SQ_FT            object
dtype: object

Part of the reason for this is that NaN is a float data type; when it gets mixed into a column of strings, Pandas stores the whole column as the generic object type.
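
You can see this directly in plain Python:

# NaN itself is a float, which is why a column holding it can't stay a pure string column
print(type(np.nan))
<class 'float'>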

An example of this is the “Owner Occupied” column.

# Check type of each record
df['OWN_OCCUPIED'].map(type)
0      <class 'str'>
1      <class 'str'>
2      <class 'str'>
3      <class 'str'>
4      <class 'str'>
5      <class 'str'>
6    <class 'float'>
7      <class 'str'>
8      <class 'str'>
Name: OWN_OCCUPIED, dtype: object

It’s important to note that row 4 has a value of 12 but is categorized as a string data type. This isn’t a big deal since we’ll be altering that value anyway, but it’s good to note that some of the numerics in the dataset may need to be converted at the end to get the desired data types.
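
If we did want to convert such a column at the end, the cast is a one-liner. A minimal sketch (it would fail if run right now, while strings like “na” are still in the column):

# Once the stray values are handled, cast the column to the expected numeric type
df['NUM_BEDROOMS'] = df['NUM_BEDROOMS'].astype(float)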

Let’s start digging into answering the next two questions on detecting missing values.

Standard Missing Values

What missing values can we detect with Pandas? Let’s take a look at our raw dataset again.

PID        ST_NUM  ST_NAME     OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH  SQ_FT
100001000  104     pUTNAM      Y             3             1         1000
100002000  197     LEXINGTON   N             3             1.5       --
100003000          LEXINGTON   N             n/a           1         850
100004000  201     BERKELEY    12            1             NaN       700
           203     BERKELEY    Y             3             2         1600
100006000  207     berkeley    Y             NA            1         800
100007000  NA      WASHINGTON                2             HURLEY    950
100008000  213     TREMONT     Y             --            1
100009000  215     TREMONT     Y             na            2         1800

In the “Street Number” column there is an empty cell in the 3rd row and an “NA” value in the 7th row.

These are clearly both missing values. Let’s see how Pandas deals with these.

# Looking at the ST_NUM column
print(df['ST_NUM'])
print("\n") # line break for readability
print(df['ST_NUM'].isnull())
0    104.0
1    197.0
2      NaN
3    201.0
4    203.0
5    207.0
6      NaN
7    213.0
8    215.0
Name: ST_NUM, dtype: float64


0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
Name: ST_NUM, dtype: bool

We can see that Pandas has filled in both the blank and “NA” records with NaN. Using the isnull() method confirms that both were recognized as missing values, since the boolean responses are True.

These two missing records were handled by Pandas with no issues. Now let’s take a look at some types that Pandas won’t recognize.

Non-standard Missing Values

Sometimes there are cases where the missing values have different formats.

Let’s take a look at raw values of the NUM_BEDROOMS column as an example.

NUM_BEDROOMS
3
3
n/a
1
3
NA
2
--
na

Here there are four missing values.

  • n/a
  • NA
  • --
  • na

From the previous section, we know that Pandas recognizes “NA” (and “n/a”) on its own, but what about the other values?

# Looking at the NUM_BEDROOMS column
print(df['NUM_BEDROOMS'])
print("\n") # line break for readability
print(df['NUM_BEDROOMS'].isnull())
0      3
1      3
2    NaN
3      1
4      3
5    NaN
6      2
7     --
8     na
Name: NUM_BEDROOMS, dtype: object


0    False
1    False
2     True
3    False
4    False
5     True
6    False
7    False
8    False
Name: NUM_BEDROOMS, dtype: bool

Pandas was able to recognize “NA” and “n/a” as missing values. However, the “--” and “na” formats were not recognized.

This can occur when multiple users manually enter data and each one has their own preference for marking a field as null.

One way to detect these various formats is to put them into a list. Then when we import the data, Pandas will recognize them as missing values.

# Making a list of missing value types
missing_values = ["na", "--"]
df = pd.read_csv("property data.csv", na_values=missing_values)

Let’s take a look at what the NUM_BEDROOMS column looks like now.

# Take a look at the NUM_BEDROOMS column
print(df['NUM_BEDROOMS'])
print('\n')
print(df['NUM_BEDROOMS'].isnull())
0    3.0
1    3.0
2    NaN
3    1.0
4    3.0
5    NaN
6    2.0
7    NaN
8    NaN
Name: NUM_BEDROOMS, dtype: float64


0    False
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8     True
Name: NUM_BEDROOMS, dtype: bool

By creating and referencing a list of the different formats that represent missing values, we were able to get Pandas to recognize them as such. This also takes care of the “--” value seen in the 2nd row of the “square feet” (SQ_FT) column.

In larger datasets these may be harder to find, but as you go through the data you can keep adding them to the missing_values list, as sketched below.
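
One way to hunt for them in a larger dataset is to list every value in a column that doesn’t parse as a number. The suspect_tokens helper below is hypothetical, not part of the original walkthrough; run against the dataset as originally loaded, it would surface “na” and “--”:

# Hypothetical helper: show unique values that are present but don't parse
# as numbers, so candidate missing-value markers stand out
def suspect_tokens(series):
    unparseable = pd.to_numeric(series, errors='coerce').isnull() & series.notnull()
    return series[unparseable].unique()

print(suspect_tokens(df['NUM_BEDROOMS']))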

Unexpected Missing Values

So far we’ve gone through standard and non-standard types of missing values. What do we do when we encounter an unexpected type?

For example, if we expect a feature to be a string but a record holds a numeric type, then it can be treated as missing, since it doesn’t give us usable information.

An example of this is in the “owner occupied” column where there is a value of 12 in row 4.

OWN_OCCUPIED
Y
N
N
12
Y
Y
 
Y
Y

We already know the blank space in row 7 will be detected as a missing value by Pandas. Let’s confirm this real quick.

print(df['OWN_OCCUPIED'])
print('\n')
print(df['OWN_OCCUPIED'].isnull())
0      Y
1      N
2      N
3     12
4      Y
5      Y
6    NaN
7      Y
8      Y
Name: OWN_OCCUPIED, dtype: object


0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
Name: OWN_OCCUPIED, dtype: bool

The value in row 4 is the number 12. We expect and require the response for Owner Occupied to be a string of either Y or N.

Since we don’t have any information to help us determine whether the number 12 should correspond to a Y or an N, we will label it as a missing value.

We’ll use a for loop to change any integer values into missing values.

# Detecting numbers: if an entry parses as an integer, treat it as missing
counter = 0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)                                  # succeeds for entries like 12
        df.loc[counter, 'OWN_OCCUPIED'] = np.nan  # so mark them as missing
    except ValueError:
        pass                                      # 'Y'/'N' raise ValueError and are kept (so does NaN)
    counter += 1

Let’s double check that it worked:

print(df['OWN_OCCUPIED'])
print("\n")
print(df['OWN_OCCUPIED'].isnull())
0      Y
1      N
2      N
3    NaN
4      Y
5      Y
6    NaN
7      Y
8      Y
Name: OWN_OCCUPIED, dtype: object


0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
Name: OWN_OCCUPIED, dtype: bool
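
As an aside, the same result can be had without a loop. A sketch using the where method, which replaces values with NaN wherever the condition is False:

# Keep only the values we consider valid; everything else becomes NaN
df['OWN_OCCUPIED'] = df['OWN_OCCUPIED'].where(df['OWN_OCCUPIED'].isin(['Y', 'N']))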

Success! Now we can use the same for loop, with a small alteration, on the “Number of baths” column to mark the value “HURLEY” as missing.

NUM_BATH
1
1.5
1
NaN
2
1
HURLEY
1
2

# Detecting strings: convert anything that parses as a number to a float
counter = 0
for row in df['NUM_BATH']:
    try:
        df.loc[counter, 'NUM_BATH'] = float(row)  # numeric strings and NaN pass through
    except ValueError:
        pass                                      # 'HURLEY' fails and stays a string for now
    counter += 1

# Anything still stored as a string ('HURLEY') gets marked as missing
df['NUM_BATH'] = df['NUM_BATH'].apply(lambda x: np.nan if isinstance(x, str) else x)
# Double check
print(df['NUM_BATH'])
print("\n")
print(df['NUM_BATH'].isnull())
0    1.0
1    1.5
2    1.0
3    NaN
4    2.0
5    1.0
6    NaN
7    1.0
8    2.0
Name: NUM_BATH, dtype: float64


0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
Name: NUM_BATH, dtype: bool
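
Both of these cleanup loops can also be collapsed into a single vectorized call. A sketch:

# errors='coerce' turns anything unparseable (like 'HURLEY') into NaN
# and returns a proper float column in one step
df['NUM_BATH'] = pd.to_numeric(df['NUM_BATH'], errors='coerce')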

Summarizing Missing Values

Now that our missing values are correctly labeled, we can summarize them in a few different ways.

First let’s take a look at our dataset again.

df
           PID  ST_NUM     ST_NAME OWN_OCCUPIED  NUM_BEDROOMS  NUM_BATH   SQ_FT
0  100001000.0   104.0      pUTNAM            Y           3.0       1.0  1000.0
1  100002000.0   197.0   LEXINGTON            N           3.0       1.5     NaN
2  100003000.0     NaN   LEXINGTON            N           NaN       1.0   850.0
3  100004000.0   201.0    BERKELEY          NaN           1.0       NaN   700.0
4          NaN   203.0    BERKELEY            Y           3.0       2.0  1600.0
5  100006000.0   207.0    berkeley            Y           NaN       1.0   800.0
6  100007000.0     NaN  WASHINGTON          NaN           2.0       NaN   950.0
7  100008000.0   213.0     TREMONT            Y           NaN       1.0     NaN
8  100009000.0   215.0     TREMONT            Y           NaN       2.0  1800.0

Now let’s take a look at the total number of missing values for each feature.

# Total missing values for each feature
df.isnull().sum()
PID             1
ST_NUM          2
ST_NAME         0
OWN_OCCUPIED    2
NUM_BEDROOMS    4
NUM_BATH        2
SQ_FT           2
dtype: int64

There are times when we want to do a quick check to see if there are any missing values in the entire dataset.

# Any missing values?
df.isnull().values.any()
True

We can also get the total count of missing values.

# Total number of missing values
df.isnull().sum().sum()
13
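
A related view that is often handy is the share of missing values per column:

# Percentage of missing values for each feature
df.isnull().mean() * 100
PID             11.111111
ST_NUM          22.222222
ST_NAME          0.000000
OWN_OCCUPIED    22.222222
NUM_BEDROOMS    44.444444
NUM_BATH        22.222222
SQ_FT           22.222222
dtype: float64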

Now that we’ve identified and correctly labeled the missing values, it’s time to deal with them.

Replacing

The easiest way to deal with missing values is to simply drop the record; however, you could be throwing away valuable information if the other columns are populated for that record.
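
For completeness, dropping looks like this; a sketch only, since we won’t discard any rows here:

# Drop every row that has at least one missing value (not applied to our df)
df_complete = df.dropna()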

We’ll be doing some basic imputations here, but for real-life models a detailed statistical approach to missing data, coupled with good engineering judgement, should be used.

We could choose to fill in all missing values with a single value.

# Replace missing values with a single number
print(df['ST_NUM'])
df['ST_NUM'] = df['ST_NUM'].fillna(125)  # assign back rather than using inplace on a column slice
print('\n')
print(df['ST_NUM'])
0    104.0
1    197.0
2      NaN
3    201.0
4    203.0
5    207.0
6      NaN
7    213.0
8    215.0
Name: ST_NUM, dtype: float64


0    104.0
1    197.0
2    125.0
3    201.0
4    203.0
5    207.0
6    125.0
7    213.0
8    215.0
Name: ST_NUM, dtype: float64
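
fillna also accepts a dict mapping column names to fill values, which is handy when treating several columns at once. A sketch with illustrative values, not applied to our dataframe:

# Fill different columns with different values in a single call (illustrative only)
df_filled = df.fillna({'ST_NUM': 125, 'SQ_FT': df['SQ_FT'].median()})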

More likely, we would use a location-based imputation, filling a specific cell based on what we know about that record.

# Location based replacement
print(df['NUM_BEDROOMS'])
df.loc[2,'NUM_BEDROOMS'] = 2
print('\n')
print(df['NUM_BEDROOMS'])
0    3.0
1    3.0
2    NaN
3    1.0
4    3.0
5    NaN
6    2.0
7    NaN
8    NaN
Name: NUM_BEDROOMS, dtype: float64


0    3.0
1    3.0
2    2.0
3    1.0
4    3.0
5    NaN
6    2.0
7    NaN
8    NaN
Name: NUM_BEDROOMS, dtype: float64
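
Another location-aware option is to propagate a neighboring value, though that only makes sense when adjacent rows are actually related. A sketch, shown but not kept:

# Forward-fill: copy the last valid observation downward
print(df['NUM_BEDROOMS'].ffill())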

A widely used method of replacing missing values is to use a median.

# Replace using the median
print(df['NUM_BATH'])
median = df['NUM_BATH'].median()
df['NUM_BATH'] = df['NUM_BATH'].fillna(median)  # assign back rather than using inplace
print('\n')
print(df['NUM_BATH'])
0    1.0
1    1.5
2    1.0
3    NaN
4    2.0
5    1.0
6    NaN
7    1.0
8    2.0
Name: NUM_BATH, dtype: float64


0    1.0
1    1.5
2    1.0
3    1.0
4    2.0
5    1.0
6    1.0
7    1.0
8    2.0
Name: NUM_BATH, dtype: float64
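
A median makes no sense for a categorical column like OWN_OCCUPIED; a comparably basic option there is the most frequent value, the mode. A sketch:

# Fill a categorical column with its most common value ('Y' in this dataset)
most_common = df['OWN_OCCUPIED'].mode()[0]
df['OWN_OCCUPIED'] = df['OWN_OCCUPIED'].fillna(most_common)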

These are very basic methods of handling missing data. There are much more sophisticated approaches that I don’t go into here, since the right choice depends on many factors. Each dataset comes with its own nuances that the Data Scientist has to learn to navigate and make determinations on.

Miscellaneous Cleaning

Oftentimes when dealing with name-type columns you’ll encounter words that are capitalized, lowercase, or a mix of both. These columns aren’t always used; however, if you’re looking to extract some information from them, some cleaning is required.

For the “Street Name” column, all records are uppercase except row 6, where the value is all lowercase (“berkeley”), and row 1, where the first letter is lowercase and the rest uppercase (“pUTNAM”).

We’ll choose to standardize all values to uppercase.

df['ST_NAME']
0        pUTNAM
1     LEXINGTON
2     LEXINGTON
3      BERKELEY
4      BERKELEY
5      berkeley
6    WASHINGTON
7       TREMONT
8       TREMONT
Name: ST_NAME, dtype: object
df['ST_NAME'] = df['ST_NAME'].apply(lambda x: x.upper())
df['ST_NAME']
0        PUTNAM
1     LEXINGTON
2     LEXINGTON
3      BERKELEY
4      BERKELEY
5      BERKELEY
6    WASHINGTON
7       TREMONT
8       TREMONT
Name: ST_NAME, dtype: object
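
As an aside, pandas’ string accessor does the same job and, unlike the plain lambda, passes missing values through instead of raising. A sketch:

# Vectorized equivalent; .str.upper() leaves NaN entries untouched
df['ST_NAME'] = df['ST_NAME'].str.upper()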

Conclusion

Messy data is inevitable. Data cleaning is part of the process in any data science project.

Although the dataset used here was very small, these techniques can easily be applied to much larger datasets.