
Welcome to Day 13 of Learning Python for Data Science! Today, we’re focusing on three powerful tools in Pandas that can take your data analysis skills to the next level: Pivot, Apply, and Regular Expressions (RegEx). These features allow for advanced data transformation, customization, and pattern matching. Pivot tables help you restructure and summarize data effortlessly, Apply lets you run custom functions across rows or columns for flexible processing, and RegEx enables sophisticated string pattern searching and cleaning. Mastering these tools will give you greater control and insight when working with complex datasets in real-world projects.
Day 12 of Learning Python for Data Science – Pandas
What is a Pivot Table in Pandas?
In data analysis, a pivot table is a valuable tool for transforming or summarizing datasets. In pandas, the pivot()
and pivot_table()
functions provide the ability to reorganize and explore data with ease — enabling better analysis, similar to Excel’s pivot table functionality.
When Should You Use pivot()
vs pivot_table()
?
pivot()
is used when the index/column combination is unique and doesn’t require aggregation.pivot_table()
is used when duplicate entries exist, and an aggregation function (likesum
,mean
) is needed to resolve them.
Example: Using pivot()
and pivot_table()
import pandas as pd
# Sample DataFrame
data = {
'Employee': ['Alice', 'Bob', 'Alice', 'Bob'],
'Department': ['HR', 'HR', 'IT', 'IT'],
'Hours': [5, 6, 7, 8]
}
df = pd.DataFrame(data)
Using pivot()
– requires unique combinations
df.pivot(index='Employee', columns='Department', values='Hours')
Output:
Department HR IT
Employee
Alice 5 7
Bob 6 8
Using pivot_table()
– handles duplicates and performs aggregation
df.pivot_table(index='Employee', columns='Department', values='Hours', aggfunc='sum')
Key Parameters of pivot_table()
Parameter | Description |
---|---|
index | Column(s) to use as row index |
columns | Column(s) to use as columns |
values | Column(s) to aggregate |
aggfunc | Aggregation function to apply (default is mean ) |
fill_value | Value to replace missing entries (NaN) |
Additional Example: Handling Missing Values with fill_value
df.pivot_table(index='Employee', columns='Department', values='Hours', aggfunc='sum', fill_value=0)
This ensures you don’t have NaN
in your output.
Use Cases
- Analyzing total working hours per employee across departments
- Generating monthly summaries (e.g., sales, productivity)
- Group-wise aggregation and reporting in dashboards
Apply Function in Pandas
What is Apply Function in Pandas?
The apply()
function in pandas allows you to apply a function along an axis of the DataFrame (either rows or columns). It’s widely used for row-wise or column-wise transformations without writing for-loops.
Syntax
DataFrame.apply(func, axis=0)
func
: The function to applyaxis
: 0 for columns (default), 1 for rows
Example
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Score1': [85, 90, 95],
'Score2': [80, 88, 92]
}
df = pd.DataFrame(data)
print(df)
# Output:
# Name Score1 Score2
# 0 Alice 85 80
# 1 Bob 90 88
# 2 Charlie 95 92
def average(row):
return (row['Score1'] + row['Score2']) / 2
# Apply row-wise function
df['Average'] = df.apply(average, axis=1)
print(df)
Output:
Name Score1 Score2 Average
0 Alice 85 80 82.5
1 Bob 90 88 89.0
2 Charlie 95 92 93.5
When to Use:
- Calculating row-wise statistics
- Complex conditional logic per row or column
- Wrapping custom functions on DataFrame
Basics of Regex in Python (re
module)
What is Regex?
Regex (Regular Expressions) is a powerful tool for searching and manipulating strings based on patterns. Python provides support for regex via the built-in re
module.
Common Functions in re
Module with Examples
Search for pattern anywhere in string
import re
text = "I have 2 apples"
result = re.search(r'\d+', text)
print(result.group())
Output:
re.search(): 2
Match pattern only at the start of string
text2 = "2023 Report Released"
result = re.match(r'\d+', text2)
print(result.group())
Output:
2023
Find all matching patterns
sentence = "The order numbers are 123, 456 and 789."
results = re.findall(r'\d+', sentence)
print(results)
Output:
['123', '456', '789']
Replace matched patterns with a string
messy = "Error: 404 Not Found"
cleaned = re.sub(r'\d+', 'XXX', messy)
print(cleaned)
Output:
Error: XXX Not Found
When to Use:
- Text extraction and cleaning
- Pattern-based validation
- Log file or message parsing
Common Regex Symbols and Their Meaning
Symbol | Description |
---|---|
. | Matches any character except newline |
^ | Start of string |
$ | End of string |
* | 0 or more repetitions |
+ | 1 or more repetitions |
? | 0 or 1 repetition or makes quantifier non-greedy |
{n} | Exactly n repetitions |
{n,} | n or more repetitions |
{n,m} | Between n and m repetitions |
[] | Matches one character in the set |
[^] | Matches one character not in the set |
\d | Digit [0-9] |
\D | Non-digit |
\w | Alphanumeric [a-zA-Z0-9_] |
\W | Non-alphanumeric |
\s | Whitespace |
\S | Non-whitespace |
() | Capturing group |
(?:) | Non-capturing group |
Practice
- Find the phone number. text = “My phone number is a 9876543210 986 047.”
- Identify valid email addresses from a list. emails = [‘[email protected]’, ‘bob_at_example.com’, ‘[email protected]’]
- Extract all hashtags from a sentence. text = “I love #Python and #MachineLearning!”
- Replace all dates in the format YYYY-MM-DD with [DATE]. text = “The meeting is on 2024-12-22.”
- Extract phone numbers from a string. text = “Call me at 9876543210 or at 1234567890.”
- Create a DataFrame with sales data and pivot it.
- Write a function to categorize temperatures (
Cold
,Warm
,Hot
) and apply it to a DataFrame. - Write a regex to extract email IDs from a string.
- Calculate total pay for employees given hours worked and hourly rate.
- Create a pivot table showing the average salary by department.
- Count the number of employees in each city using a pivot table.
- Create a pivot table showing total sales by product.
- Create a pivot table showing the average age by gender and department.
- Create a pivot table with multiple aggregation functions (mean and sum) for sales by region.
- Create a pivot table showing percentage contribution of each product in overall sales.
- Create a pivot table with custom aggregation (e.g., standard deviation of scores by subject).
- Use margins in a pivot table to include row and column totals.
- Create a pivot table with a hierarchical index (multi-level rows and columns).
- Filter a pivot table to show only values where the average sales exceed a threshold.
- Use
.apply()
to calculate the length of strings in a Name column. - Use
.apply()
to convert a list of dates from string todatetime
. - Apply a function to standardize a column (z-score).
- Use row-wise
.apply()
to calculate total marks from multiple score columns. - Apply a function that flags customers as “High” or “Low” value based on total purchase.
- Use
.apply()
to extract the domain name from an email address column. - Apply a lambda function that returns different values based on multiple conditions (e.g., risk score).
- Use
.apply()
on grouped data to calculate the range of each group. - Create a new column that computes weighted average of other columns using
.apply()
. - Combine
.apply()
withnp.where
or nestedif-else
logic for complex row-wise classification. - Use
re.search()
to extract the first number from a sentence. - Use
re.match()
to check if a string starts with a capital letter. - Use
re.findall()
to extract all email addresses from a paragraph. - Extract all phone numbers of pattern
xxx-xxx-xxxx
from text usingre.findall()
. - Replace all digits in a string with
#
usingre.sub()
. - Extract all hashtags from a tweet using regex.
- Validate if a given string is a valid date in format
YYYY-MM-DD
. - Extract all words that start with capital letters and are more than 5 characters long.
- Remove all special characters except alphabets and numbers from a text string.
- Use regex to split a sentence into words ignoring punctuation.
We hope this article was helpful for you and you learned a lot about data science from it. If you have friends or family members who would find it helpful, please share it to them or on social media.
Join our social media for more.
Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science
Also Read
- Day 13 of Learning Python for Data Science: Mastering Pivot, Apply and RegEx
- Practice day 12 of Learning Python for Data Science
- Day 12 of Learning Python for Data Science – Pandas
- Day 10 Of Learning Python for Data Science – NumPy Array In Python
- Day 9 of Learning Python for Data Science – Queries Related To Functions In Python
Hi, I am Vishal Jaiswal, I have about a decade of experience of working in MNCs like Genpact, Savista, Ingenious. Currently i am working in EXL as a senior quality analyst. Using my writing skills i want to share the experience i have gained and help as many as i can.