Mastering Data Cleaning in Pandas: Cheat Sheet
A quick data cleaning cheat sheet for busy data analysts.
Introduction
Data is generated everywhere nowadays. Look around and you will see an enormous amount of data coming from everyday devices such as phones, tablets, PCs, and smartwatches, and with the advance of Internet of Things (IoT) devices, our fridges, TVs, and ACs generate data too. More and more people use services over the internet, which creates even more data, and when that data lands in a data analyst's lap, it is usually dirty (here, dirty means improperly formatted, containing missing values, or stored with the wrong data types). See the survey below.
From the above picture, it is clear that data scientists and analysts spend most of their project time on data cleaning (nearly 60% of the time). That's why data cleaning is not just a necessity; it is the most important part of a data analyst's or data scientist's job. In this article, I share a cheat sheet for data cleaning that will help my fellow learners get clean data for their projects.
Benefits of clean data
Clean data is more important than your robust ML/DL models. Clean data is critical to the success of data science and data analysis.
If you feed dirt, you will get dirt
It is well known that if you feed yourself dirt, you will get ill no matter how much you eat. The same holds for the data you feed into an analysis or a model. Some of the benefits of clean data are as follows:
Accurate analysis - Clean data ensures that the analysis performed is accurate, reliable, and consistent. It helps to avoid errors, inconsistencies, and discrepancies that could lead to incorrect conclusions.
Time savings - Cleaning dirty data can be a time-consuming and tedious process. By starting with clean data, you can save a significant amount of time in the analysis process and focus on deriving insights from the data.
Better decision-making - Clean data leads to better decision-making. It helps to identify trends, patterns, and correlations that might otherwise go unnoticed. With clean data, you can make data-driven decisions with confidence.
Improved efficiency - Clean data allows for efficient data processing, modeling, and visualization. It helps to reduce data processing time, minimize errors, and improve overall efficiency.
Increased productivity - Clean data enables data scientists and analysts to focus on analyzing data rather than cleaning it. This helps to increase productivity and enables faster turnaround times for analysis.
Better collaboration - Clean data facilitates better collaboration between team members. It helps to avoid misunderstandings and disagreements that can arise from inconsistent or inaccurate data.
Cheatsheet of Data Cleaning
Importing pandas
import pandas as pd
Reading Data from file
df = pd.read_csv('your_filename.csv')
Handling missing values
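A minimal sketch of common ways to handle missing values, assuming a small made-up DataFrame (the column names `name` and `age` are hypothetical, not from any real dataset). It counts missing values, drops incomplete rows, and fills gaps per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dan"],
    "age": [25.0, np.nan, 31.0, 40.0],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill each column with a value that suits it
filled = df.fillna({"name": "Unknown", "age": df["age"].mean()})
```

Whether to drop or fill depends on how much data is missing and how the column will be used; filling with the mean is only one possible choice.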
Handling duplicates
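One possible way to find and remove duplicate rows, sketched on a made-up DataFrame (the `id` and `city` columns are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample data with one fully repeated row
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Pune", "Delhi", "Delhi", "Pune"],
})

# True for every row that repeats an earlier row exactly
dup_mask = df.duplicated()

# Remove rows that are exact duplicates, keeping the first occurrence
deduped = df.drop_duplicates()

# Treat rows as duplicates when only the chosen columns match
unique_cities = df.drop_duplicates(subset=["city"], keep="first")
```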
Handling Outliers
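There are many ways to treat outliers; the sketch below uses the interquartile range (IQR) rule, which is a common convention rather than the only option. The `value` column and the 1.5 multiplier are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; 95 is an obvious outlier
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95]})

# Compute the IQR fences
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: keep only the rows inside the fences
no_outliers = df[df["value"].between(lower, upper)]

# Option 2: cap extreme values at the fences instead of dropping them
capped = df["value"].clip(lower=lower, upper=upper)
```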
Handling string data
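A short sketch of typical string clean-up with the `.str` accessor, assuming a hypothetical `name` column with inconsistent spacing and casing.

```python
import pandas as pd

# Hypothetical sample data with messy whitespace and casing
df = pd.DataFrame({"name": ["  alice SMITH ", "BOB  jones", "carol-lee"]})

df["name"] = (
    df["name"]
    .str.strip()                             # remove leading/trailing spaces
    .str.replace(r"\s+", " ", regex=True)    # collapse repeated whitespace
    .str.title()                             # normalize casing
)

# Flag rows whose value contains a literal hyphen
has_hyphen = df["name"].str.contains("-", regex=False)
```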
Handling date
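A minimal sketch of parsing dates and extracting parts, assuming a hypothetical `order_date` column stored as text in `YYYY-MM-DD` form. Invalid dates are turned into `NaT` here instead of raising an error; that is a choice, not a requirement.

```python
import pandas as pd

# Hypothetical sample data; the second value is not a real calendar date
df = pd.DataFrame({"order_date": ["2023-01-15", "2023-02-30", "2023-03-05"]})

# Convert text to datetime; unparseable values become NaT
df["order_date"] = pd.to_datetime(
    df["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Pull out commonly used date parts
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["weekday"] = df["order_date"].dt.day_name()
```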
Renaming Column
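Columns can be renamed one at a time or all at once. The sketch below shows both, using made-up column names (`First Name`, `AGE `) that stand in for whatever your file contains.

```python
import pandas as pd

# Hypothetical sample data with inconsistent column names
df = pd.DataFrame({"First Name": ["Ann"], "AGE ": [25]})

# Rename a specific column with a mapping
df = df.rename(columns={"First Name": "first_name"})

# Normalize every column name: trim, lowercase, spaces to underscores
df.columns = (
    df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)
```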
Filtering Data
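A few common filtering patterns, sketched on a made-up DataFrame (the `name`, `age`, and `city` columns and the threshold of 18 are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "age": [25, 17, 40, 15],
    "city": ["Pune", "Pune", "Delhi", "Mumbai"],
})

# Single condition
adults = df[df["age"] >= 18]

# Multiple conditions: wrap each in parentheses and combine with & or |
adults_in_pune = df[(df["age"] >= 18) & (df["city"] == "Pune")]

# Match against a list of allowed values
selected = df[df["city"].isin(["Pune", "Delhi"])]

# The same two-condition filter written with query()
via_query = df.query("age >= 18 and city == 'Pune'")
```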
Sorting Data
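A minimal sketch of sorting by one or several columns, assuming hypothetical `name`, `dept`, and `salary` columns.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "dept": ["HR", "IT", "HR"],
    "salary": [50, 70, 60],
})

# Sort by one column, highest first
by_salary = df.sort_values("salary", ascending=False)

# Sort by department (A to Z), then by salary (highest first),
# and rebuild the index so it runs 0, 1, 2, ...
by_dept_salary = (
    df.sort_values(["dept", "salary"], ascending=[True, False])
    .reset_index(drop=True)
)
```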
Removing Columns
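Columns can be removed by dropping the unwanted ones or by selecting only the ones to keep. The column names below (`id`, `name`, `temp_notes`, `not_there`) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["Ann", "Bob"],
    "temp_notes": ["x", "y"],
})

# Drop a column by name
trimmed = df.drop(columns=["temp_notes"])

# Ignore names that do not exist instead of raising an error
safe = df.drop(columns=["temp_notes", "not_there"], errors="ignore")

# Alternative: keep only the columns you need
kept = df[["id", "name"]]
```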
Handling Numerical data
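Numeric columns often arrive as text. The sketch below converts them, assuming hypothetical `price` and `qty` columns; filling bad values with the median is just one reasonable choice.

```python
import pandas as pd

# Hypothetical sample data; "abc" cannot be parsed as a number
df = pd.DataFrame({
    "price": ["10.5", "20", "abc", "7.256"],
    "qty": ["1", "2", "3", "4"],
})

# Convert text to numbers; unparseable values become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Fill the gaps with the median, then round to two decimals
df["price"] = df["price"].fillna(df["price"].median()).round(2)

# Convert a clean text column straight to integers
df["qty"] = df["qty"].astype(int)
```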
Handling Categorical Columns
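A short sketch of two common ways to encode a categorical column, assuming a hypothetical `size` column with the ordered levels S, M, L. Which encoding to use depends on the model or analysis that follows.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Store the column as an ordered categorical type
df["size"] = pd.Categorical(
    df["size"], categories=["S", "M", "L"], ordered=True
)

# Integer codes that follow the category order (S=0, M=1, L=2)
df["size_code"] = df["size"].cat.codes

# One-hot encoding: one indicator column per category
dummies = pd.get_dummies(df["size"], prefix="size")
```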
Conclusion
Data cleaning is a crucial skill for becoming a better Data Analyst or Data Scientist.
An eye for detail
When you see the phrase "an eye for detail" in the job criteria section, take it seriously if you are applying for either of the above positions. So hone this valuable skill. Thank you for reading.
If my article helps you, please share it with your friends and connections.
Don't forget to follow me on Hashnode. You can also follow me on Twitter, LinkedIn, and Medium to get all my future blogs and posts.