Day 15 of Learning Python for Data Science
Welcome to Day 15 of our Python for Data Science journey!
On Day 15, we dived deeper into EDA (Exploratory Data Analysis) using Matplotlib and Pandas, working with real-world employee data.
Day 14 of Learning Python for Data Science: Mastering Data Visualization with Seaborn
Let’s walk through what we covered, along with key observations and learnings.
We started by loading an Employee dataset:
emp = pd.read_csv('/kaggle/input/employee-data/employee_data.csv')
emp.head()
emp.describe(include='all')
emp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 1000 non-null int64
1 Age 1000 non-null int64
2 Gender 1000 non-null object
3 Department 1000 non-null object
4 Salary 1000 non-null float64
5 Years_of_Experience 1000 non-null int64
6 Performance_Score 1000 non-null int64
7 Work_Location 1000 non-null object
8 Joining_Date 1000 non-null object
9 Hours_Worked_Per_Week 1000 non-null int64
dtypes: float64(1), int64(5), object(4)
memory usage: 78.2+ KB
ID
, Name
, Age
, Department
, Salary
, Joining_Date
, and Performance_Score
.Department
) and some were numeric (Age
, Salary
, Performance_Score
).Joining_Date
was converted to a datetime format to enable time-based visualizations.emp['Joining_Date'] = pd.to_datetime(emp['Joining_Date'], errors='coerce')
emp['Joining_Date'] = pd.to_datetime(emp['Joining_Date'])
We dropped the ID
column as it had no meaningful use for our analysis, and converted the Joining_Date to date time format.
# Filter for 2011 joiners
filtered_df = emp[emp['Joining_Date'].dt.year == 2011]
# Sort for line plot
df_sorted = filtered_df.sort_values(by='Joining_Date')
# Plot
plt.figure(figsize=(12,6))
plt.plot(df_sorted['Joining_Date'], df_sorted['Salary'], color='orange', marker='o')
plt.xlabel('Joining Date')
plt.ylabel('Salary')
plt.title('Salary Trend in 2011')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Sort the data by date
df_sorted = emp.sort_values(by='Joining_Date')
# Extract year from the joining date
df_sorted['year'] = df_sorted['Joining_Date'].dt.year
# Group by year and calculate average salary
yearly_data = df_sorted.groupby('year')['Salary'].mean().reset_index()
# Plot the trend
plt.figure(figsize=(7, 5))
plt.plot(yearly_data['year'], yearly_data['Salary'], color='green', marker='*', linewidth=2)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Salary', fontsize=12)
plt.title('📈 Average Salary Trend Over the Years', fontsize=14)
plt.xticks(yearly_data['year'])
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
We wanted to understand if salaries differed by department.
dep_grouped = emp.groupby('Department')['Salary'].mean().reset_index()
plt.figure(figsize=(12,6))
plt.bar(dep_grouped['Department'], dep_grouped['Salary'], color='green')
plt.xlabel('Department')
plt.ylabel('Salary')
plt.title('Department wise average salary')
plt.show()
# Ensure Joining_Date is properly formatted
emp['Joining_Date'] = pd.to_datetime(emp['Joining_Date'], errors='coerce')
emp.dropna(subset=['Joining_Date'], inplace=True)
# Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=emp, x='Age', y='Salary', hue='Department', palette='Set2', alpha=0.8)
# Customizations
plt.title(' Age vs Salary by Department', fontsize=14)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Salary', fontsize=12)
plt.legend(title='Department', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
This plot is too noisy to draw and clear conclusion, from this scatter plot, we observe:
Helps understand how salaries are spread across employees.
plt.figure(figsize=(10,5))
sns.histplot(emp['Salary'], kde=True, color='skyblue')
plt.title('Distribution of Salaries')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.show()
The salary distribution is approximately symmetric, not heavily right-skewed as initially assumed.
Most employees earn salaries in the moderate range — there’s a clear central peak, likely indicating the most common salary band.
The curve does not show multiple distinct peaks (no bimodal behavior), implying there aren’t separate salary groups or clusters based on role levels.
There are a few outliers on both ends, but they are not dominant enough to heavily influence the shape — overall, the distribution appears reasonably balanced.
A clean way to compare salary ranges, medians, and detect outliers.
plt.figure(figsize=(12,6))
sns.boxplot(x='Department', y='Salary', data=emp)
plt.title('Salary Distribution by Department')
plt.xticks(rotation=45)
plt.show()
Let’s explore if salary increases with age.
plt.figure(figsize=(10,6))
sns.scatterplot(x='Age', y='Salary', data=emp, hue='Department')
plt.title('Age vs Salary by Department')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
This plot needs refinement for clearer observations.
plt.figure(figsize=(10,5))
sns.countplot(x='Department', data=emp, palette='viridis')
plt.title('Number of Employees in Each Department')
plt.xticks(rotation=45)
plt.show()
We will continue this exploration on the next day!
✅ How to clean and prepare data (e.g., dropping unnecessary columns, converting data types).
✅ How to plot Line charts to observe time-based trends.
✅ How to create Bar charts for categorical comparisons.
✅ How to interpret simple EDA visualizations to derive business insights.
✅ How to observe average salary patterns across departments and over time.
We hope this article was helpful for you and you learned a lot about data science from it. If you have friends or family members who would find it helpful, please share it to them or on social media.
Join our social media for more.
Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science Python for Data Science
Hi, I am Vishal Jaiswal, I have about a decade of experience of working in MNCs like Genpact, Savista, Ingenious. Currently i am working in EXL as a senior quality analyst. Using my writing skills i want to share the experience i have gained and help as many as i can.
Welcome to Day 14 of our Python for Data Science journey! Today, we explored Seaborn,…
Welcome to Day 13 of Learning Python for Data Science! Today, we’re focusing on three…
Test your understanding of Python Data Structure, which we learned in our previous lesson of…
Welcome to Day 12 of Learning Python for Data Science. Today, we’ll dive into Pandas,…
NumPy Array in Python is a powerful library for numerical computing in Python. It provides…
Welcome to Day 9 of Learning Python for Data Science. Today we will explore comprehensions,…