Photo by UX Indonesia on Unsplash
Pandas Data Analysis CheatSheet
A perfect cheat sheet for your data analysis in Pandas.
Table of contents
- Introduction
- Importing Pandas
- Reading Data
- Descriptive Statistics
- Looking at the head and tail of the data
- Shape and Columns of data
- Indexing and selecting Data
- Filtering Data
- Sorting Data
- Applying Functions
- Reshaping Data
- Cleaning Data
- Combining Data
- Quick Visualizing Data
- Aggregating Data
- Convert column to DateTime
- Saving Data
- Conclusion
Introduction
Pandas is the most important library for data scientists and Data Analysis. Pandas is a powerful data manipulation library for Python, used for data cleaning, transformation, and analysis. With its intuitive and efficient data structures, Pandas enables users to easily work with and analyze large datasets, making it a go-to tool for data professionals and enthusiasts alike.
In this blog post, I will try to create a cheat sheet for Data analysts and Data Scientists. It will be a handy tool for quick exploration of datasets and analysis.
Importing Pandas
import pandas as pd
Reading Data
data = pd.read_csv('filename.csv') # Reading CSV files
data = pd.read_excel('filename.xlsx') # Reading Excel files
You can also load many other files such as Parquet, arrow, etc.
Descriptive Statistics
data.describe()
Looking at the head and tail of the data
# Show the first 20 rows
data.head()
# Show the last 20 rows
data.tail()
# YOu can also give the rows as an argument
data.head(5) # will return only first 5 rows
Shape and Columns of data
# returns the number of rows and columns
data.shape
# Return the column names
data.columns
# You can create a list from the column names
col_name = list(data.columns)
Indexing and selecting Data
# select a column
data['col_name']
or
data.col_name
# Select multiple columns
data[['col1', 'col2']]
# select col with location
# [[]] return a DataFrame
data.loc[['col1', 'col2']]
# select data using index number
data.iloc[[0]] # first row
data.iloc[[0,1]] # first two row
# you can use slice
data.iloc[:3]
# or you can use lambda function
# select all the even rows
data.iloc[lambda x: x.index % 2 == 0]
Filtering Data
# selecting row based on condition
data[data['column_name'] > value]
# Selects rows where column value is in a list
data[data['column_name'].isin([value1, value2])]
# Selects rows where column value contains text
data[data['column_name'].str.contains('text')]
Sorting Data
# Sorts data by column_name in ascending order
data.sort_values('column_name', ascending=True)
# Sorts data by multiple columns in different order
data.sort_values(['col1', 'col2'], ascending=[True, False])
Applying Functions
# Applies function to column and returns new column
data['new_col'] = data['column_name'].apply(function)
# Applies function to row and returns new column
data['new_col'] = data.apply(lambda row: function(row['col1'], row['col2']), axis=1)
Reshaping Data
# Creates a pivot table
data.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean')
# Melts data from wide to long format
pd.melt(data, id_vars=['col1'], value_vars=['col2', 'col3'])
Cleaning Data
data.dropna() # Removes rows with missing values
data.fillna(value) # Fills missing values with a given value
data.drop_duplicates() # Removes duplicate rows
data.replace(to_replace, value) # Replaces values in the data
Combining Data
pd.concat([data1, data2], axis=0) # Concatenates data vertically
pd.concat([data1, data2], axis=1) # Concatenates data horizontally
pd.merge(data1, data2, on='column_name') # Merges data based on a common column
Quick Visualizing Data
data.plot() # Plots data
data.plot(kind='bar') # Plots a bar chart
data.plot(kind='box') # Plots a box plot
data.plot(kind='hist') # Plots a histogram
data.plot(kind='scatter', x='col1', y='col2') # Plots a scatter plot
Aggregating Data
# Aggregates data by column_name and returns sum of col1 and mean of col2
data.groupby('column_name').agg({'col1': 'sum', 'col2': 'mean'})
Convert column to DateTime
# Converts column to datetime format
data['date_column'] = pd.to_datetime(data['date_column'])
Saving Data
data.to_csv('filename.csv', index=False) # Saves data to a CSV file
data.to_excel('filename.xlsx', index=False) # Saves data to an Excel file
Conclusion
Pandas is an incredibly versatile and powerful tool for data analysis in Python. With its robust data structures and numerous functions, pandas make it easy to clean, manipulate, and analyze large datasets. By understanding and mastering these techniques, you can unleash the full potential of pandas and become a more effective data analyst. I hope this blog has been informative and valuable in your journey toward mastering data analysis with pandas. Thank you for reading, and I wish you the best of luck in your data analysis endeavors!