
Step 19

In this step, we’ll explore Pandas, a powerful Python library for data manipulation and analysis. Pandas lets you load, inspect, transform, and summarize datasets in a few lines of code, which is essential for behavioral scientists working with real-world data.

We’ll use a CSV file named happiness correlation data-2.csv, which you can download below. Each row represents data from one participant, with columns capturing various aspects like age, work hours, GPA, life satisfaction, and more.

0. Download the Dataset

Click this link to download the data

Familiar?

This data was pulled from Stats 2002, a course at UC. If it looks familiar, it's probably because you've seen it before!

IMPORTANT: Make sure to place the downloaded CSV file in the same directory as your notebook or script!

1. Getting Started with Pandas

Installing and Importing Pandas

To use Pandas, ensure it’s installed in your Python environment. You can install it by running:

!pip install pandas

%pip install

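If the code above doesn't work, try %pip install pandas instead of !pip install pandas. In a Jupyter notebook, %pip installs the package into the environment that the notebook's kernel is actually using.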

Then, import Pandas at the beginning of your notebook or script:

import pandas as pd
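A quick way to confirm that the installation and import worked is to print the installed version. This is only a sanity check; the version number you see will depend on your environment.

```python
import pandas as pd

# If this line prints without an ImportError, Pandas is installed
# in the environment your notebook or script is running in.
print("pandas version:", pd.__version__)
```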

Loading the Data

Load the CSV file into a DataFrame, which is Pandas’ primary data structure for handling data tables.

# Load the data from the current directory
file_path = 'happiness correlation data-2.csv'
df = pd.read_csv(file_path)

Viewing the Data

Use head() to see the first few rows and get a feel for the structure.

df.head()
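Besides head(), the shape and columns attributes give a fast overview of a DataFrame's size and layout. The sketch below uses a few made-up rows (not the real dataset) read from an in-memory string, so it runs without the CSV file; demo_df and csv_text are hypothetical names.

```python
import io
import pandas as pd

# Made-up rows standing in for the real CSV (values are illustrative only).
csv_text = """age,hours_work_week,gpa,life_satisfaction,desire_to_achieve,number_drinks,stress
21,10,3.4,6,5,2,4
19,0,3.9,7,6,0,3
23,25,2.8,4,4,5,6
"""
demo_df = pd.read_csv(io.StringIO(csv_text))

print(demo_df.shape)          # (number of rows, number of columns)
print(list(demo_df.columns))  # column names in file order
print(demo_df.head(2))        # head(n) limits the preview to the first n rows
```

With the real file, the same calls work on df after pd.read_csv(file_path).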

2. Exploring the Dataset

This dataset has columns capturing the following participant information:

  • age: Participant's age
  • hours_work_week: Hours worked per week
  • gpa: Participant's GPA
  • life_satisfaction: Self-reported life satisfaction score
  • desire_to_achieve: Self-reported desire to achieve
  • number_drinks: Number of alcoholic drinks consumed per week
  • stress: Self-reported stress level

Basic Data Information

To get a quick summary of the dataset, including column names, data types, and non-null counts (which reveal any missing values):

df.info()

To get basic descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column:

df.describe()
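Two follow-up checks are often useful alongside these summaries: counting missing values per column, and reading the median directly. The example below is a minimal sketch on a hypothetical mini-dataset with one missing GPA, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with one missing GPA (not the real data).
demo_df = pd.DataFrame({
    "age": [21, 19, 23, 20],
    "gpa": [3.4, np.nan, 2.8, 3.0],
})

# Count missing values in each column.
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# describe() reports the median as its "50%" row; median() returns it directly.
print(demo_df["gpa"].median())
print(demo_df.describe().loc["50%", "gpa"])
```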

3. Analyzing Specific Columns

Calculating the Mean Age of Participants

Let’s calculate the average age of participants.

mean_age = df['age'].mean()
print("Average Age:", mean_age)
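One detail worth knowing: mean() skips missing values by default, so the average is taken over the participants who actually reported an age. A small sketch with made-up ages illustrates this.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry (not the real data).
ages = pd.Series([20, 22, np.nan, 24], name="age")

mean_age = ages.mean()  # the NaN is skipped: (20 + 22 + 24) / 3
print("Average Age:", round(mean_age, 1))

# count() shows how many non-missing values went into the mean.
print("Non-missing values used:", ages.count())
```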

Distribution of Life Satisfaction Scores

To understand the distribution of life_satisfaction scores, we can use Pandas to plot a histogram (this requires the matplotlib library).

import matplotlib.pyplot as plt

df['life_satisfaction'].plot(kind='hist', title='Life Satisfaction Distribution')
plt.xlabel('Life Satisfaction')
plt.show()

Exploring Correlations

We may want to see how different variables relate to each other. For example, are work hours correlated with stress?

correlation = df[['hours_work_week', 'stress']].corr()
print("Correlation between hours worked and stress:\n", correlation)
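Calling corr() on two columns returns a 2x2 matrix. If only the single coefficient is needed, calling corr() on one column with the other as its argument returns just that number (Pearson's r by default). The data below is invented purely for illustration.

```python
import pandas as pd

# Invented values for illustration only.
demo_df = pd.DataFrame({
    "hours_work_week": [0, 10, 20, 30, 40],
    "stress": [2, 3, 3, 5, 6],
})

# Full correlation matrix, as in the tutorial.
matrix = demo_df[["hours_work_week", "stress"]].corr()

# Single coefficient between the two columns.
r = demo_df["hours_work_week"].corr(demo_df["stress"])
print(round(r, 3))
```

The value of r matches the off-diagonal entry of the matrix; the diagonal entries are always 1 because each column correlates perfectly with itself.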

Grouping Data

We can group data to find insights, such as the average stress level at each level of desire_to_achieve.

avg_stress_by_achievement = df.groupby('desire_to_achieve')['stress'].mean()
print("Stress by Desire to Achieve:\n", avg_stress_by_achievement)
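A group mean is easier to interpret when you also know how many participants fall into each group. One possible approach is agg(), which computes several statistics per group at once. The values below are made up for the sketch.

```python
import pandas as pd

# Made-up scores for illustration only.
demo_df = pd.DataFrame({
    "desire_to_achieve": [3, 3, 5, 5, 5, 7],
    "stress": [4, 6, 2, 3, 4, 5],
})

# Mean stress and group size for each desire_to_achieve level.
summary = demo_df.groupby("desire_to_achieve")["stress"].agg(["mean", "count"])
print(summary)
```

A mean based on a single participant (as for level 7 here) deserves much less weight than one based on many.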

4. Data Cleaning and Manipulation

Calculating Letter Grades

Add a new column to the dataset that assigns a letter grade based on each participant's GPA:

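def calculate_letter_grade(gpa):
    # A missing GPA fails every comparison below and would fall through
    # to 'F', so return a missing value for it instead.
    if pd.isna(gpa):
        return None
    if gpa >= 3.7:
        return 'A'
    elif gpa >= 3.0:
        return 'B'
    elif gpa >= 2.0:
        return 'C'
    elif gpa >= 1.0:
        return 'D'
    else:
        return 'F'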

df['letter_grade'] = df['gpa'].apply(calculate_letter_grade)
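As an alternative to apply() with a custom function, pd.cut can bin a numeric column in one call. The sketch below reuses the same cutoffs (3.7, 3.0, 2.0, 1.0) on made-up GPAs; it is one possible approach, not the method used in this step.

```python
import numpy as np
import pandas as pd

# Made-up GPAs, including boundary values and one missing entry.
gpas = pd.Series([3.9, 3.7, 3.2, 2.0, 1.5, 0.4, np.nan], name="gpa")

letter_grade = pd.cut(
    gpas,
    bins=[-np.inf, 1.0, 2.0, 3.0, 3.7, np.inf],
    labels=["F", "D", "C", "B", "A"],
    right=False,  # each bin includes its lower edge, so [3.7, inf) is 'A'
)
print(letter_grade.tolist())
```

A missing GPA stays missing in the result rather than being assigned a grade.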

Create a Bar Graph of Letter Grades

To visualize the distribution of letter grades, we can create a bar graph:

grade_counts = df['letter_grade'].value_counts()
grade_counts.plot(kind='bar', title='Letter Grade Distribution')
plt.xlabel('Letter Grade')
plt.ylabel('Count')
plt.show()
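By default, value_counts() sorts by frequency, so the bars may not appear in A-to-F order, and grades nobody received are left out. One way to handle both, sketched here on made-up grades, is to reindex the counts before plotting; grade_order is a hypothetical helper name.

```python
import pandas as pd

# Made-up letter grades for illustration only.
grades = pd.Series(["B", "A", "C", "B", "F", "B"], name="letter_grade")

# Put the counts in grade order and show 0 for grades with no participants.
grade_order = ["A", "B", "C", "D", "F"]
grade_counts = grades.value_counts().reindex(grade_order, fill_value=0)
print(grade_counts)
```

The reordered grade_counts can then be passed to plot(kind='bar') exactly as before.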

5. Saving Processed Data

After adding a new column, it’s often useful to save the processed dataset. Here’s how to save it to a new CSV file:

df.to_csv('letter_grades_added_happiness_data.csv', index=False)
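Passing index=False keeps the row index out of the file, so reading it back does not produce an extra unnamed column. The round trip below is a small sketch that writes a made-up DataFrame to a temporary directory; the file name demo_output.csv is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up rows for illustration only.
demo_df = pd.DataFrame({"gpa": [3.8, 2.5], "letter_grade": ["A", "C"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_output.csv")
    demo_df.to_csv(out_path, index=False)  # write without the row index
    reloaded = pd.read_csv(out_path)       # read the file back in

# The reloaded data has the same columns and number of rows as the original.
print(list(reloaded.columns))
print(reloaded.shape)
```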

Summary

In this step, you learned how to:

  • Load a CSV file into a Pandas DataFrame
  • Explore the data using basic summary and statistical methods
  • Analyze specific columns and the relationships between them
  • Add a derived column (letter grades) and visualize its distribution
  • Save processed data to a new CSV file

Pandas is a powerful tool for data analysis in Python, allowing you to work with datasets efficiently and discover meaningful insights.