Exercises with Solutions | Data Analysis in Python for Complete Beginners

Exercise 1: Data Visualization

Goal: Visualize the relationship between age and salary using a scatter plot.

Instructions:

Use matplotlib or seaborn.
Color the points by department.
Add axis labels and a title.

Question:

What trend do you observe between age and salary across departments?

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of age vs. salary, colored by department
# (assumes a DataFrame df with 'age', 'salary', and 'department' columns)
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='age', y='salary', hue='department', alpha=0.7)
plt.title('Age vs Salary by Department')
plt.xlabel('Age')
plt.ylabel('Salary ($)')
plt.show()
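The trend seen in the plot can also be checked numerically. The following is a minimal sketch, not part of the original solution: it computes the Pearson correlation between age and salary within each department. The DataFrame demo_df is a hypothetical, synthetic stand-in for df, and its salary column is constructed to rise with age, so the positive correlations it prints illustrate the technique only and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic stand-in for the exercise's DataFrame `df` (illustration only).
rng = np.random.default_rng(0)
n = 150
demo_df = pd.DataFrame({
    "age": rng.integers(22, 60, size=n),
    "department": rng.choice(["IT", "Finance", "HR"], size=n),
})
# Salary is built to increase with age, plus random noise (an assumption of this sketch).
demo_df["salary"] = 30_000 + 1_000 * demo_df["age"] + rng.normal(0, 5_000, size=n)

# Pearson correlation between age and salary within each department.
corr_by_dept = {
    dept: group["age"].corr(group["salary"])
    for dept, group in demo_df.groupby("department")
}
for dept, r in corr_by_dept.items():
    print(f"{dept}: r = {r:.2f}")
```

With the real data, replace demo_df with df. A correlation near +1 in a department suggests salary rises with age there, while a value near 0 suggests little linear relationship.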

Exercise 2: Independent t-test

Goal: Test whether the average salaries of the IT and Finance departments differ significantly.

Instructions:

Extract salary data for both departments.
Use scipy.stats.ttest_ind().
Print the t-statistic and p-value.

Question:

Is there a statistically significant difference at α = 0.05?

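from scipy import stats

# Salaries for each department
it_salary = df.loc[df['department'] == 'IT', 'salary']
finance_salary = df.loc[df['department'] == 'Finance', 'salary']

# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_val = stats.ttest_ind(it_salary, finance_salary)
print(f"T-statistic: {t_stat:.3f}, P-value: {p_val:.4f}")

# Decision at alpha = 0.05
alpha = 0.05
print("Significant difference" if p_val < alpha else "No significant difference")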
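By default, ttest_ind assumes both groups have the same variance. If the two departments differ in spread or sample size, Welch's t-test (equal_var=False) is a common alternative. The sketch below is an illustration under stated assumptions, not the original solution: it_demo and finance_demo are hypothetical synthetic samples, and the manual formula is included only to show what the statistic measures.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic salary samples (illustration only, not the exercise data).
rng = np.random.default_rng(1)
it_demo = rng.normal(70_000, 8_000, size=40)
finance_demo = rng.normal(72_000, 12_000, size=35)

# Welch's t-test: does not assume equal variances in the two groups.
t_welch, p_welch = stats.ttest_ind(it_demo, finance_demo, equal_var=False)

# The same statistic by hand: difference in means divided by its standard error.
std_err = np.sqrt(
    it_demo.var(ddof=1) / it_demo.size + finance_demo.var(ddof=1) / finance_demo.size
)
t_manual = (it_demo.mean() - finance_demo.mean()) / std_err

print(f"Welch t-statistic: {t_welch:.3f} (manual: {t_manual:.3f}), p-value: {p_welch:.4f}")
```

On the real data, pass it_salary and finance_salary instead of the synthetic samples.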

Exercise 3: Logistic Regression

Goal: Predict whether an employee has a high salary (above median) using age and years_employed.

Instructions:

Create a binary column high_salary (1 = above median, 0 = at or below median).
Split the data into training/testing sets.
Fit a logistic regression model and check accuracy.

Question:

How well does the model predict high-salary employees based on age and years employed?

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create target variable
median_salary = df['salary'].median()
df['high_salary'] = (df['salary'] > median_salary).astype(int)

# Prepare features and labels
X = df[['age', 'years_employed']]
y = df['high_salary']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"Model accuracy: {accuracy_score(y_test, y_pred):.2f}")
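Accuracy alone is hard to interpret without a point of comparison. Because the target is split at the median, a model that always predicts the same class should score close to 0.5. The following sketch, which is an addition and not part of the original solution, compares logistic regression against scikit-learn's DummyClassifier baseline and prints a confusion matrix. The DataFrame demo_df is hypothetical synthetic data in which salary is built to depend on both features, so the numbers it produces are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical synthetic stand-in for `df` (illustration only).
rng = np.random.default_rng(42)
n = 300
age = rng.integers(22, 60, size=n)
years_employed = np.clip(age - 22 - rng.integers(0, 10, size=n), 0, None)
salary = 30_000 + 800 * age + 1_500 * years_employed + rng.normal(0, 6_000, size=n)
demo_df = pd.DataFrame({"age": age, "years_employed": years_employed, "salary": salary})
demo_df["high_salary"] = (demo_df["salary"] > demo_df["salary"].median()).astype(int)

X = demo_df[["age", "years_employed"]]
y = demo_df["high_salary"]
# stratify=y keeps the class balance the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, y_pred)
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_test, y_pred)

print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Model accuracy: {model_acc:.2f}")
print(cm)
```

If the model's accuracy on the real data is not clearly above the baseline, age and years_employed carry little information about who earns above the median.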