Step 19¶

In this step, we’ll explore Pandas, a powerful Python library for data manipulation and analysis. Pandas provides tools to explore, manipulate, and analyze datasets efficiently, which is essential for behavioral scientists working with real-world data.

We’ll use a CSV file named happiness correlation data-2.csv, which you can download below. Each row represents data from one participant, with columns capturing various aspects like age, work hours, GPA, life satisfaction, and more.

0. Download the Dataset¶

Click this link to download the data

Familiar?

This data was pulled from Stats 2002, a course at UC. If it looks familiar, that's probably because you've seen it before!

IMPORTANT: Make sure to place the downloaded CSV file in the same directory as your notebook or script!

1. Getting Started with Pandas¶

Installing and Importing Pandas¶

To use Pandas, ensure it’s installed in your Python environment. You can install it by running:

!pip install pandas

%pip install

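If the code above doesn't work, try %pip install pandas instead of !pip install pandas. In a notebook, %pip installs the package into the environment your kernel is actually running in.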

Then, import Pandas at the beginning of your notebook or script:

import pandas as pd

Loading the Data¶

Load the CSV file into a DataFrame, which is Pandas’ primary data structure for handling data tables.

# Load the data from the current directory
file_path = 'happiness correlation data-2.csv'
df = pd.read_csv(file_path)
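If the CSV is not in the working directory, pd.read_csv raises a FileNotFoundError. As a minimal sketch, the check below wraps the load in a helper that reports where Python looked. The helper name load_csv_checked is hypothetical and not part of Pandas.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(file_path):
    """Load a CSV into a DataFrame, with a clearer error if the file is missing.

    Hypothetical helper for illustration; it only wraps pd.read_csv.
    """
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(
            f"Could not find {path.name!r} (looked from {Path.cwd()}). "
            "Place the CSV in the same directory as your notebook or script."
        )
    return pd.read_csv(path)
```

You could then call it as df = load_csv_checked('happiness correlation data-2.csv').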

Viewing the Data¶

Use head() to see the first five rows (the default) and get a feel for the structure.

df.head()
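Beyond head(), a few other attributes help you get oriented. The sketch below uses a small made-up DataFrame (named toy here, with invented values) in place of the real dataset, so it runs without the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the course dataset; the values are made up.
toy = pd.DataFrame({
    "age": [19, 22, 20, 25],
    "gpa": [3.8, 3.1, 2.4, 3.5],
    "stress": [4, 7, 5, 6],
})

n_rows, n_cols = toy.shape         # (number of rows, number of columns)
column_names = list(toy.columns)   # column labels as a plain list
first_two = toy.head(2)            # head(n) returns the first n rows
last_row = toy.tail(1)             # tail(n) returns the last n rows

print(n_rows, n_cols)
print(column_names)
print(first_two)
```

The same calls work on df once the real file is loaded.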

2. Exploring the Dataset¶

This dataset has columns capturing the following participant information:

age: Participant's age
hours_work_week: Hours worked per week
gpa: Participant's GPA
life_satisfaction: Self-reported life satisfaction score
desire_to_achieve: Self-reported desire to achieve
number_drinks: Number of alcoholic drinks consumed per week
stress: Self-reported stress level

Basic Data Information¶

To get a quick summary of the dataset, including column names, data types, and the number of non-null values in each column (which reveals any missing values):

df.info()
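info() reports non-null counts, so missing values have to be inferred by comparing against the row count. A more direct count, sketched here on a made-up DataFrame with a few deliberately missing entries, combines isna() with sum():

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries (np.nan); values are made up.
toy = pd.DataFrame({
    "age": [19, 22, np.nan, 25],
    "gpa": [3.8, np.nan, np.nan, 3.5],
    "stress": [4, 7, 5, 6],
})

# isna() marks each missing cell as True; sum() counts the Trues per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```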

To get basic descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column:

df.describe()
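The result of describe() is itself a DataFrame, so individual statistics can be looked up by row label. In the sketch below (made-up values again), the row labeled 50% is the median, which matches what median() returns.

```python
import pandas as pd

# Hypothetical stand-in for the course dataset; the values are made up.
toy = pd.DataFrame({
    "age": [19, 22, 20, 25],
    "gpa": [3.8, 3.1, 2.4, 3.5],
})

summary = toy.describe()

# Rows of the summary are labeled by statistic name.
stat_names = list(summary.index)
median_from_describe = summary.loc["50%", "gpa"]
median_direct = toy["gpa"].median()

print(stat_names)
print(median_from_describe, median_direct)
```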

3. Analyzing Specific Columns¶

Calculating the Mean Age of Participants¶

Let’s calculate the average age of participants.

mean_age = df['age'].mean()
print("Average Age:", mean_age)
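The raw mean often prints with many decimal places. As a small sketch, an f-string can round it for display. Note that mean() skips missing values by default, which the made-up ages below illustrate.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value; the numbers are made up.
ages = pd.Series([19, 22, np.nan, 25], name="age")

# mean() ignores NaN by default (skipna=True), so this averages 19, 22, 25.
mean_age = ages.mean()

# Format to one decimal place for display.
formatted = f"Average age: {mean_age:.1f}"
print(formatted)
```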

Distribution of Life Satisfaction Scores¶

To understand the distribution of life_satisfaction scores, we can use Pandas to plot a histogram (this requires the matplotlib library).

import matplotlib.pyplot as plt

df['life_satisfaction'].plot(kind='hist', title='Life Satisfaction Distribution')
plt.xlabel('Life Satisfaction')
plt.show()
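To read the numbers behind a histogram rather than eyeballing the bars, the bin counts can be computed directly. The sketch below uses numpy.histogram on made-up scores, with three bins chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical life satisfaction scores; the values are made up.
scores = pd.Series([1, 2, 2, 3, 4, 5, 5, 5, 6, 7], name="life_satisfaction")

# counts[i] is the number of scores between edges[i] and edges[i + 1].
# Every bin is half-open [low, high) except the last, which includes its upper edge.
counts, edges = np.histogram(scores, bins=3)

print(counts)
print(edges)
```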

Exploring Correlations¶

We may want to see how different variables relate to each other. For example, are work hours correlated with stress?

correlation = df[['hours_work_week', 'stress']].corr()
print("Correlation between hours worked and stress:\n", correlation)
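Calling corr() on two columns returns a 2x2 correlation matrix. If only the single coefficient is needed, Series.corr gives it directly. The sketch below shows both on made-up numbers; Pearson correlation is the default method in each case.

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
toy = pd.DataFrame({
    "hours_work_week": [10, 20, 30, 40],
    "stress": [3, 5, 4, 8],
})

# Full 2x2 matrix: the diagonal is 1.0, the off-diagonal cells hold r.
matrix = toy[["hours_work_week", "stress"]].corr()

# Single Pearson coefficient between the two columns.
r = toy["hours_work_week"].corr(toy["stress"])

print(matrix)
print(round(r, 3))
```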

Grouping Data¶

We can group data to find insights, such as the average stress level at each level of desire_to_achieve.

avg_stress_by_achievement = df.groupby('desire_to_achieve')['stress'].mean()
print("Stress by Desire to Achieve:\n", avg_stress_by_achievement)
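A group mean is easier to interpret alongside the number of participants in each group, since a mean based on one or two people is not very stable. As a sketch on made-up data, agg can compute both at once:

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
toy = pd.DataFrame({
    "desire_to_achieve": [1, 1, 2, 2, 3],
    "stress": [4, 6, 5, 7, 8],
})

# One row per desire_to_achieve level, with the mean stress and group size.
stress_summary = toy.groupby("desire_to_achieve")["stress"].agg(["mean", "count"])
print(stress_summary)
```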

4. Data Cleaning and Manipulation¶

Calculating Letter Grades¶

Add a new column to the dataset that assigns a letter grade based on each participant's GPA:

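def calculate_letter_grade(gpa):
    # A missing GPA (NaN) fails every comparison below and would fall
    # through to 'F', so return a missing grade instead.
    if pd.isna(gpa):
        return None
    if gpa >= 3.7:
        return 'A'
    elif gpa >= 3.0:
        return 'B'
    elif gpa >= 2.0:
        return 'C'
    elif gpa >= 1.0:
        return 'D'
    else:
        return 'F'

# Apply the function to every value in the gpa column
df['letter_grade'] = df['gpa'].apply(calculate_letter_grade)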
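As an alternative sketch, the same cutoffs can be expressed with pd.cut, which bins a numeric column in one call. This is not the approach used above, just another way to get the same labels. The GPA values are made up, and right=False makes each bin include its lower edge, so 3.7 counts as an A and 3.0 as a B.

```python
import numpy as np
import pandas as pd

# Hypothetical GPAs chosen to hit each cutoff; the values are made up.
gpas = pd.Series([3.9, 3.7, 3.0, 2.5, 1.0, 0.4], name="gpa")

# Bin edges from lowest to highest; labels line up with the gaps between edges.
bins = [-np.inf, 1.0, 2.0, 3.0, 3.7, np.inf]
labels = ["F", "D", "C", "B", "A"]

# right=False gives intervals of the form [low, high).
letter_grades = pd.cut(gpas, bins=bins, labels=labels, right=False)
print(letter_grades.tolist())
```

With pd.cut, a missing GPA yields a missing grade rather than a letter.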

Create a Bar Graph of Letter Grades¶

To visualize the distribution of letter grades, we can create a bar graph:

# value_counts() orders by frequency; sort_index() puts the bars in grade order
grade_counts = df['letter_grade'].value_counts().sort_index()
grade_counts.plot(kind='bar', title='Letter Grade Distribution')
plt.xlabel('Letter Grade')
plt.ylabel('Count')
plt.show()
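Raw counts depend on the sample size, so proportions are sometimes easier to compare. As a sketch on made-up grades, value_counts(normalize=True) returns the share of each grade instead of the count:

```python
import pandas as pd

# Hypothetical letter grades; the values are made up.
grades = pd.Series(["A", "B", "B", "C", "A", "B"], name="letter_grade")

# normalize=True converts counts to fractions that sum to 1.
grade_shares = grades.value_counts(normalize=True).sort_index()
print(grade_shares)
```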

5. Saving Processed Data¶

After adding a new column, it’s often useful to save the processed dataset. Here’s how to save it to a new CSV file:

df.to_csv('letter_grades_added_happiness_data.csv', index=False)
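A quick way to confirm a save worked is to read the file back and compare it with what was written. The sketch below does this with a made-up two-row DataFrame and a temporary directory, so it leaves no files behind. The file name toy_with_grades.csv is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical processed data; the values are made up.
toy = pd.DataFrame({"gpa": [3.8, 2.4], "letter_grade": ["A", "C"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_with_grades.csv"
    # index=False leaves the row index out of the file, so reloading
    # does not produce an extra unnamed column.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_row_count = len(reloaded) == len(toy)
print(same_columns, same_row_count)
```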

Summary¶

In this step, you practiced:

Loading a CSV file into a Pandas DataFrame
Exploring the data using basic summary and statistical methods
Analyzing specific columns and relationships between them
Adding a derived column (letter grades) and visualizing its distribution
Saving processed data to a new CSV file

Pandas is a powerful tool for data analysis in Python, allowing you to work with datasets efficiently and discover meaningful insights.