Mastering Data Cleaning in Pandas: Cheat Sheet

A quick data cleaning cheat sheet for busy data analysts.

Introduction

Data is generated everywhere nowadays. Everyday devices such as phones, tablets, PCs, and smartwatches produce an enormous amount of it, and with the spread of Internet of Things (IoT) devices, our fridges, TVs, and air conditioners generate data too. More and more people use services over the internet, which creates even more data, and by the time that data lands on a data analyst's lap it is often very dirty (here, dirty means poorly formatted, missing values, or wrong data types). See the survey below.

The survey makes it clear that data scientists and analysts spend most of their project time on data cleaning (nearly 60% of the time). That is why data cleaning is not just a necessity; it is the most important part of a data analyst's or data scientist's job. In this article, I share a cheat sheet for data cleaning that will help my fellow learners get clean data for their projects.

Benefits of clean data

Clean data is more important than your robust ML/DL models. Clean data is critical to the success of data science and data analysis.

If you feed in dirt, you will get dirt

It is well known that if you feed yourself dirt, no matter how much you eat, you will get ill. The same holds for data. Some of the benefits of clean data are as follows:

  • Accurate analysis - Clean data ensures that the analysis performed is accurate, reliable, and consistent. It helps to avoid errors, inconsistencies, and discrepancies that could lead to incorrect conclusions.

  • Time savings - Cleaning dirty data can be a time-consuming and tedious process. By starting with clean data, you can save a significant amount of time in the analysis process and focus on deriving insights from the data.

  • Better decision-making - Clean data leads to better decision-making. It helps to identify trends, patterns, and correlations that might otherwise go unnoticed. With clean data, you can make data-driven decisions with confidence.

  • Improved efficiency - Clean data allows for efficient data processing, modeling, and visualization. It helps to reduce data processing time, minimize errors, and improve overall efficiency.

  • Increased productivity - Clean data enables data scientists and analysts to focus on analyzing data rather than cleaning it. This helps to increase productivity and enables faster turnaround times for analysis.

  • Better collaboration - Clean data facilitates better collaboration between team members. It helps to avoid misunderstandings and disagreements that can arise from inconsistent or inaccurate data.

Data Cleaning Cheat Sheet

Importing pandas

import pandas as pd

Reading Data from file

df = pd.read_csv('your_filename.csv')

Handling missing values
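
A minimal sketch of the usual options for missing values: count them, drop them, or fill them. The DataFrame and its `age` and `city` columns are hypothetical placeholders, and whether to drop or fill depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are placeholders.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill the gaps instead, using the median for numbers
# and the most frequent value (mode) for text.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```

Dropping is simplest but throws information away; filling keeps every row at the cost of introducing estimated values.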

Handling duplicates
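
A short sketch of finding and removing duplicate rows, assuming a hypothetical DataFrame with `id` and `name` columns.

```python
import pandas as pd

# Hypothetical sample data; column names are placeholders.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Asha", "Ben", "Ben", "Asha"],
})

# Flag rows that are exact copies of an earlier row.
dup_mask = df.duplicated()

# Drop exact duplicates, keeping the first occurrence.
deduped = df.drop_duplicates()

# Judge duplicates on a subset of columns only.
unique_names = df.drop_duplicates(subset=["name"], keep="first")
```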

Handling Outliers
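
One common way to handle outliers is the interquartile range (IQR) rule, sketched below. The `salary` column is a hypothetical placeholder, and the 1.5 multiplier is a convention rather than a fixed rule, so adjust it to your data.

```python
import pandas as pd

# Hypothetical sample data with one obvious outlier.
df = pd.DataFrame({"salary": [30.0, 32.0, 35.0, 31.0, 33.0, 500.0]})

# Build the IQR fences.
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: keep only the rows that fall inside the fences.
no_outliers = df[df["salary"].between(lower, upper)]

# Option 2: keep every row but cap extreme values at the fences.
capped = df["salary"].clip(lower=lower, upper=upper)
```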

Handling string data
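
A small sketch of typical string clean-up with the `.str` accessor: trimming whitespace, collapsing repeated spaces, and normalising case. The `name` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with messy spacing and casing.
df = pd.DataFrame({"name": ["  alice SMITH ", "BOB  jones", "carol-lee"]})

# Trim the ends, collapse runs of whitespace, then title-case.
df["name_clean"] = (
    df["name"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

# Check for a literal substring (regex=False treats "-" as plain text).
has_hyphen = df["name_clean"].str.contains("-", regex=False)
```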

Handling date
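
A sketch of parsing date strings and extracting parts from them, assuming a hypothetical `order_date` column stored as text in `YYYY-MM-DD` form. With `errors="coerce"`, invalid dates become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; the second value is not a real date.
df = pd.DataFrame({"order_date": ["2023-01-15", "2023-02-30", "2023-03-05"]})

# Convert text to datetimes; values that cannot be parsed become NaT.
df["order_date"] = pd.to_datetime(
    df["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Pull out useful parts with the .dt accessor.
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["weekday"] = df["order_date"].dt.day_name()
```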

Renaming Column
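
A sketch of two ways to rename columns: one at a time with `rename`, or all at once by normalising the header text. The column names here are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with inconsistent headers.
df = pd.DataFrame({"First Name": ["Asha"], "AGE ": [30]})

# Rename specific columns with a mapping; returns a new DataFrame.
renamed = df.rename(columns={"First Name": "first_name"})

# Normalise every header: trim, lower-case, and use underscores.
df.columns = (
    df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)
```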

Filtering Data
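
A sketch of common row filters using boolean masks, `isin`, and `query`. The `city` and `sales` columns are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Mumbai", "Delhi"],
    "sales": [120, 80, 200, 150],
})

# Single condition.
high_sales = df[df["sales"] > 100]

# Multiple conditions: wrap each in parentheses and combine with & or |.
delhi_high = df[(df["city"] == "Delhi") & (df["sales"] > 100)]

# Match against a list of allowed values.
metros = df[df["city"].isin(["Delhi", "Mumbai"])]

# The same two-condition filter written as a query string.
same_with_query = df.query("sales > 100 and city == 'Delhi'")
```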

Sorting Data
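
A sketch of sorting by one column or by several, assuming hypothetical `city` and `sales` columns.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Delhi"],
    "sales": [120, 80, 150],
})

# Sort by one column, largest first.
by_sales = df.sort_values("sales", ascending=False)

# Sort by city A-Z, then by sales largest first within each city,
# and rebuild the index so it runs 0, 1, 2 again.
by_city_then_sales = (
    df.sort_values(["city", "sales"], ascending=[True, False])
    .reset_index(drop=True)
)
```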

Removing Columns
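
A sketch of removing columns by name and removing columns that are entirely empty. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; "notes" is completely empty.
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["Asha", "Ben"],
    "notes": [None, None],
    "tmp": [0, 0],
})

# Drop columns by name; returns a new DataFrame.
trimmed = df.drop(columns=["tmp"])

# Drop only the columns in which every value is missing.
no_empty = df.dropna(axis="columns", how="all")
```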

Handling Numerical data
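
A sketch of two frequent numerical clean-up steps: converting text to numbers and rescaling to the 0-1 range (min-max scaling). The `price` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: numbers stored as text, one unusable value.
df = pd.DataFrame({"price": ["1,200", "350.5", "n/a", "80"]})

# Remove thousands separators, then convert; bad values become NaN.
df["price"] = pd.to_numeric(
    df["price"].str.replace(",", "", regex=False), errors="coerce"
)

# Min-max scaling: smallest value maps to 0, largest to 1.
price_min = df["price"].min()
price_max = df["price"].max()
df["price_scaled"] = (df["price"] - price_min) / (price_max - price_min)
```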

Handling Categorical Columns
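
A sketch of two common encodings for categorical columns: ordered integer codes for categories with a natural order, and one-hot (dummy) columns for those without. The `size` and `color` columns, and the order chosen for `size`, are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "red", "green"],
})

# Ordered categories: declare the order, then read off integer codes.
size_order = ["small", "medium", "large"]
df["size"] = pd.Categorical(df["size"], categories=size_order, ordered=True)
df["size_code"] = df["size"].cat.codes

# Unordered categories: one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["color"], prefix="color")
```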

Conclusion

Data cleaning is a crucial step in becoming a better data analyst or data scientist.

An eye for detail

When you see the phrase "an eye for detail" in the job criteria section, take it seriously if you are applying for either of the positions above. So, hone this valuable skill. Thank you for reading.

If my article helps you, please share it with your friends and connections.

Don't forget to follow me on Hashnode. You can also follow me on Twitter, LinkedIn, and Medium to get all my future blogs and posts.