Blog

Day 13 of Learning Python for Data Science: Mastering Pivot, Apply and RegEx

Welcome to Day 13 of Learning Python for Data Science! Today, we’re focusing on three powerful tools in Pandas that can take your data analysis skills to the next level: Pivot, Apply, and Regular Expressions (RegEx). These features allow for advanced data transformation, customization, and pattern matching. Pivot tables help you restructure and summarize data effortlessly, Apply lets you run custom functions across rows or columns for flexible processing, and RegEx enables sophisticated string pattern searching and cleaning. Mastering these tools will give you greater control and insight when working with complex datasets in real-world projects.

Day 12 of Learning Python for Data Science – Pandas

What is a Pivot Table in Pandas?

In data analysis, a pivot table is a valuable tool for transforming or summarizing datasets. In pandas, the pivot() and pivot_table() functions provide the ability to reorganize and explore data with ease — enabling better analysis, similar to Excel’s pivot table functionality.

When Should You Use `pivot()` vs `pivot_table()`?

pivot() is used when the index/column combination is unique and doesn’t require aggregation.
pivot_table() is used when duplicate entries exist, and an aggregation function (like sum, mean) is needed to resolve them.

Example: Using `pivot()` and `pivot_table()`

import pandas as pd

# Sample DataFrame
data = {
    'Employee': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Department': ['HR', 'HR', 'IT', 'IT'],
    'Hours': [5, 6, 7, 8]
}
df = pd.DataFrame(data)

Using `pivot()` – requires unique combinations

df.pivot(index='Employee', columns='Department', values='Hours')

Output:

Department  HR  IT
Employee          
Alice        5   7
Bob          6   8

Using `pivot_table()` – handles duplicates and performs aggregation

df.pivot_table(index='Employee', columns='Department', values='Hours', aggfunc='sum')

Key Parameters of `pivot_table()`

Parameter	Description
`index`	Column(s) to use as row index
`columns`	Column(s) to use as columns
`values`	Column(s) to aggregate
`aggfunc`	Aggregation function to apply (default is `mean`)
`fill_value`	Value to replace missing entries (NaN)

Additional Example: Handling Missing Values with `fill_value`

df.pivot_table(index='Employee', columns='Department', values='Hours', aggfunc='sum', fill_value=0)

This ensures you don’t have NaN in your output.

Use Cases

Analyzing total working hours per employee across departments
Generating monthly summaries (e.g., sales, productivity)
Group-wise aggregation and reporting in dashboards

Apply Function in Pandas

What is Apply Function in Pandas?

The apply() function in pandas allows you to apply a function along an axis of the DataFrame (either rows or columns). It’s widely used for row-wise or column-wise transformations without writing for-loops.

Syntax

DataFrame.apply(func, axis=0)

func: The function to apply
axis: 0 for columns (default), 1 for rows

Example

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score1': [85, 90, 95],
    'Score2': [80, 88, 92]
}
df = pd.DataFrame(data)
print(df)

# Output:
#       Name  Score1  Score2
# 0    Alice      85      80
# 1      Bob      90      88
# 2  Charlie      95      92

def average(row):
    return (row['Score1'] + row['Score2']) / 2

# Apply row-wise function
df['Average'] = df.apply(average, axis=1)
print(df)

Output:

      Name  Score1  Score2  Average
0    Alice      85      80     82.5
1      Bob      90      88     89.0
2  Charlie      95      92     93.5

When to Use:

Calculating row-wise statistics
Complex conditional logic per row or column
Wrapping custom functions on DataFrame

Basics of Regex in Python (`re` module)

What is Regex?

Regex (Regular Expressions) is a powerful tool for searching and manipulating strings based on patterns. Python provides support for regex via the built-in re module.

Common Functions in `re` Module with Examples

Search for pattern anywhere in string

import re

text = "I have 2 apples"
result = re.search(r'\d+', text)
print(result.group())

Output:

re.search(): 2

Match pattern only at the start of string

text2 = "2023 Report Released"
result = re.match(r'\d+', text2)
print(result.group())

Output:

Find all matching patterns

sentence = "The order numbers are 123, 456 and 789."
results = re.findall(r'\d+', sentence)
print(results)

Output:

['123', '456', '789']

Replace matched patterns with a string

messy = "Error: 404 Not Found"
cleaned = re.sub(r'\d+', 'XXX', messy)
print(cleaned)

Output:

Error: XXX Not Found

When to Use:

Text extraction and cleaning
Pattern-based validation
Log file or message parsing

Common Regex Symbols and Their Meaning

Symbol	Description
`.`	Matches any character except newline
`^`	Start of string
`$`	End of string
`*`	0 or more repetitions
`+`	1 or more repetitions
`?`	0 or 1 repetition or makes quantifier non-greedy
`{n}`	Exactly n repetitions
`{n,}`	n or more repetitions
`{n,m}`	Between n and m repetitions
`[]`	Matches one character in the set
`[^]`	Matches one character not in the set
`\d`	Digit `[0-9]`
`\D`	Non-digit
`\w`	Alphanumeric `[a-zA-Z0-9_]`
`\W`	Non-alphanumeric
`\s`	Whitespace
`\S`	Non-whitespace
`()`	Capturing group
`(?:)`	Non-capturing group

Practice

Find the phone number. text = “My phone number is a 9876543210 986 047.”
Identify valid email addresses from a list. emails = [‘alice@example.com’, ‘bob_at_example.com’, ‘charlie@example.co.uk’]
Extract all hashtags from a sentence. text = “I love #Python and #MachineLearning!”
Replace all dates in the format YYYY-MM-DD with [DATE]. text = “The meeting is on 2024-12-22.”
Extract phone numbers from a string. text = “Call me at 9876543210 or at 1234567890.”
Create a DataFrame with sales data and pivot it.
Write a function to categorize temperatures (Cold, Warm, Hot) and apply it to a DataFrame.
Write a regex to extract email IDs from a string.
Calculate total pay for employees given hours worked and hourly rate.
Create a pivot table showing the average salary by department.
Count the number of employees in each city using a pivot table.
Create a pivot table showing total sales by product.
Create a pivot table showing the average age by gender and department.
Create a pivot table with multiple aggregation functions (mean and sum) for sales by region.
Create a pivot table showing percentage contribution of each product in overall sales.
Create a pivot table with custom aggregation (e.g., standard deviation of scores by subject).
Use margins in a pivot table to include row and column totals.
Create a pivot table with a hierarchical index (multi-level rows and columns).
Filter a pivot table to show only values where the average sales exceed a threshold.
Use .apply() to calculate the length of strings in a Name column.
Use .apply() to convert a list of dates from string to datetime.
Apply a function to standardize a column (z-score).
Use row-wise .apply() to calculate total marks from multiple score columns.
Apply a function that flags customers as “High” or “Low” value based on total purchase.
Use .apply() to extract the domain name from an email address column.
Apply a lambda function that returns different values based on multiple conditions (e.g., risk score).
Use .apply() on grouped data to calculate the range of each group.
Create a new column that computes weighted average of other columns using .apply().
Combine .apply() with np.where or nested if-else logic for complex row-wise classification.
Use re.search() to extract the first number from a sentence.
Use re.match() to check if a string starts with a capital letter.
Use re.findall() to extract all email addresses from a paragraph.
Extract all phone numbers of pattern xxx-xxx-xxxx from text using re.findall().
Replace all digits in a string with # using re.sub().
Extract all hashtags from a tweet using regex.
Validate if a given string is a valid date in format YYYY-MM-DD.
Extract all words that start with capital letters and are more than 5 characters long.
Remove all special characters except alphabets and numbers from a text string.
Use regex to split a sentence into words ignoring punctuation.

We hope this article was helpful for you and you learned a lot about data science from it. If you have friends or family members who would find it helpful, please share it to them or on social media.

Join our social media for more.

Join us on Telegram

Follow our WhatsApp Channel

Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science

Also Read

Vishal

Hi, I am Vishal Jaiswal, I have about a decade of experience of working in MNCs like Genpact, Savista, Ingenious. Currently i am working in EXL as a senior quality analyst. Using my writing skills i want to share the experience i have gained and help as many as i can.

Spread the love