Step 16¶
Introduction to Data Visualization¶
In this step, we'll start exploring data visualization using Python, specifically with the matplotlib
library and numpy
for data creation. Data visualization is an essential tool in data analysis—it allows you to better understand your data by uncovering patterns, trends, and outliers. In scientific research, visualizing data is especially valuable, helping to interpret experiment results, compare groups, and communicate findings effectively.
We'll cover a few essential plot types: line plots, bar plots, histograms, and scatter plots. These visualizations are foundational, and you’ll likely use them to analyze data in future projects.
Getting Started: Setting Up Matplotlib and Numpy¶
To start, make sure you have matplotlib
installed. If you haven’t installed it yet, you can install it by running:
!pip install matplotlib
The !
The !
at the beginning of the command is used in Jupyter notebooks to things like installation commands - this will install matplotlib if it's not already installed.
Import matplotlib
and numpy
with the following aliases:
import matplotlib.pyplot as plt
import numpy as np
Did your import not work?
If you're having import issues, try using the following code to install the necessary packages:
%pip install matplotlib
%
is used in Jupyter notebooks to run commands in the terminal, and !
is used to run commands in the shell - depending on your environment, you may need to use one or the other.
If you're still having issues, try restarting VSCode and running the import commands again.
1. Line Plot¶
Use Case¶
Line plots are great for visualizing changes over time or across experimental trials. Let’s start by creating a basic line plot with synthetic data.
Creating a Basic Line Plot¶
We’ll create an array of data points and visualize them:
# Generate data
x = np.arange(0, 10, 1) # x-axis: from 0 to 9
y = np.random.randint(1, 10, size=10) # y-axis: random integers
# Create line plot
plt.plot(x, y, label="Random Data", color="blue", linestyle="--", marker="o")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Basic Line Plot")
plt.legend()
plt.show()
Plots a as a Debugging Tool
Just like when you use a print statement, when you run plt.show()
, you should see the plot appear in your notebook - this is great for instantaneous feedback.
Think about how you can use this to debug your data - if you're not getting the results you expect, you can visualize your data to visually inspect it, and see if something's wrong. As scientists, we build up intuitions for the "shape" of our data, and visualizing it can help us understand what's going on.
Breaking it down:¶
plt.plot(x, y)
: This function creates a line plot withx
on the x-axis andy
on the y-axis.plt.xlabel()
: Sets the label for the x-axis.plt.ylabel()
: Sets the label for the y-axis.plt.title()
: Sets the title of the plot.plt.legend()
: Displays the legend on the plot.plt.show()
: Displays the plot.
You can check out the plot
documentation for matplotlib here.
Customizing the Line Plot¶
Try adjusting colors, line styles, and adding markers to make the plot more informative.
Here is a web link to the named colors of matplotlib, it can be a lot of fun and very useful to experiment with different colors.
Colors from Hex Codes
You can also use hex codes to specify colors in matplotlib. For example, color="#FF5733"
would give you a specific shade of orange.
Something that I personally love to do to find unique combinations of hex code colors is to use PokePalettes, a website that generates color palettes based on Pokemon. Check it out here!
To replace a color in a plot with a hex code, you can use the following syntax:
plt.plot(x, y, color="#FF5733")
2. Bar Plot¶
Use Case¶
Bar plots are useful for comparing categorical data, like response counts across different participant groups.
Creating a Basic Bar Plot¶
Here, we’ll create a bar plot with categorical labels on the x-axis:
# Data
categories = ["Group A", "Group B", "Group C"]
values = [15, 35, 25]
# Create bar plot
plt.bar(categories, values, color="green")
plt.xlabel("Groups")
plt.ylabel("Values")
plt.title("Bar Plot of Categorical Data")
plt.show()
You can see the bar
documentation for matplotlib here.
Customizing the Bar Plot¶
Experiment with changing colors, adjusting bar width, and adding values on top of each bar to emphasize comparisons.
Fun Example
You can change the way the bars are displayed, their orientation, and other properties. For example, you can create a horizontal bar plot by using plt.barh()
instead of plt.bar()
.
plt.barh(categories, values, color="green")
3. Histogram¶
Use Case¶
Histograms are ideal for visualizing the distribution of a dataset, such as reaction times or score distributions.
Creating a Basic Histogram¶
Using Numpy, we’ll generate a normal distribution and plot it as a histogram.
# Generate data with a normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)
# Create histogram
plt.hist(data, bins=30, color="purple", edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Normally Distributed Data")
plt.show()
Importance of Bin Size¶
Histograms are great data visualization tools, but it's important to understand how the bin size can be adjusted to change the level of detail in the plot, sometimes altering the interpretation of the data.
You can adjust the number of bins by changing the bins
parameter in the hist()
function. Functionally, the number of bins determines the number of intervals (or groups) the data is divided into, the more bins, the more detailed the histogram.
Customizing the Histogram¶
Experiment with changing colors, adjusting the number of bins, and adding labels to make the histogram more informative.
4. Scatter Plot¶
Use Case¶
Scatter plots are useful for showing relationships between two continuous variables, like age and reaction time.
Creating a Basic Scatter Plot¶
Here, we’ll create two correlated datasets and plot them against each other.
# Generate data
x = np.random.rand(100)
y = x * 2 + np.random.normal(0, 0.1, 100)
# Create scatter plot
plt.scatter(x, y, color="red", s=20, label="Data Points")
plt.xlabel("X Variable")
plt.ylabel("Y Variable")
plt.title("Scatter Plot Example")
plt.legend()
plt.show()
Customizing the Scatter Plot¶
Change point size, colors, and add legends to enhance interpretability.
Fun Example
You can change the shape of the markers in a scatter plot. For example, you can use triangles instead of circles by specifying marker="^"
.
plt.scatter(x, y, color="red", s=20, label="Data Points", marker="^")
This is particularly useful when you have multiple datasets on the same plot and want to differentiate them.
5. Combining Multiple Plots in a Single Figure¶
Use Case¶
Sometimes, it’s useful to compare multiple visualizations side-by-side. Using subplots, you can arrange multiple plots in one figure.
Creating Subplots¶
Below is an example of creating a line plot and a histogram in a 1x2 grid layout.
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Scatter Plot
axes[0].scatter(x, y, color="blue", label="Scatter Plot")
axes[0].set_title("Scatter Plot")
axes[0].set_xlabel("X-axis")
axes[0].set_ylabel("Y-axis")
axes[0].legend()
# Histogram
axes[1].hist(data, bins=20, color="orange", edgecolor="black")
axes[1].set_title("Histogram")
axes[1].set_xlabel("Value")
axes[1].set_ylabel("Frequency")
# Adjust layout
plt.tight_layout()
plt.show()
Notice how axes[0]
and axes[1]
are used to access the individual subplots. This allows you to customize each subplot independently.
A quick note on the importance of clear labels¶
When creating visualizations, it's crucial to provide clear labels for the axes, titles, and legends. These labels help viewers understand the data being presented and the context of the visualization. Always ensure that your visualizations are labeled appropriately to convey the intended message effectively.
Your goal should be for the plot to make sense independent of the context; such that if you were to show a data visualization depicting the relationship between age and reaction time, someone who had never seen the data before could understand what the plot is showing by reading the labels, axes, and title.
Coding Exercise: Custom Visualizations¶
In this exercise, you’ll work with a dataset generated using numpy
. Your goal is to explore different ways to visualize this data and think about what insights each visualization type can reveal. You’ll create at least two different visualizations of the dataset, using different plot types.
A. Generate the Data¶
Run the code below to generate a synthetic (fake) dataset. This dataset simulates reaction times and accuracy scores from a fictional psychology experiment with 200 participants.
import numpy as np
# Setting random seed for reproducibility
np.random.seed(42)
# Generate participant IDs (1 through 200)
participant_ids = np.arange(1, 201)
# Generate random reaction times (in milliseconds) with a mean of 500 ms and std deviation of 50 ms
reaction_times = np.random.normal(loc=500, scale=50, size=200)
# Generate random accuracy scores (out of 100) with a mean of 75 and std deviation of 10
accuracy_scores = np.random.normal(loc=75, scale=10, size=200)
# Print first few values for each array
print("Participant IDs:", participant_ids[:5])
print("Reaction Times:", reaction_times[:5])
print("Accuracy Scores:", accuracy_scores[:5])
B. Review your data visualization options¶
- Line Plot: Consider visualizing reaction times across participants to see how response times vary.
- Histogram: A histogram of reaction times or accuracy scores can help show the distribution of these values.
- Scatter Plot: A scatter plot of accuracy vs. reaction time can show any correlation between speed and accuracy.
- Bar Plot: You could create a bar plot of average reaction times or accuracy for groups of participants (e.g., participants 1–50, 51–100).
C. Create Your Visualizations¶
Using matplotlib
, create at least two different visualizations of the data. Use the examples above or come up with your own! Think about:
- What insights does each plot type reveal?
- How can you customize your plots to make them more informative?
Example Starter Code
Below is an example of how you might start with a histogram. Remember to create your own additional plot(s) and customize as needed!
import matplotlib.pyplot as plt
# Example Histogram for Reaction Times
plt.hist(reaction_times, bins=20, color="skyblue", edgecolor="black")
plt.xlabel("Reaction Time (ms)")
plt.ylabel("Frequency")
plt.title("Distribution of Reaction Times")
plt.show()
# Continue with your own visualizations!
D. Reflect¶
After creating your visualizations, reflect on the following:
- What insights did each visualization reveal about the data?
- Which visualization do you think was most effective, and why?
Review¶
In this step, you learned:
- How to create and customize line plots, bar plots, histograms, and scatter plots.
- Basic customization techniques to make plots more readable and informative.
- How to use subplots to compare multiple visualizations side-by-side.
Data visualization is a powerful skill that will help you analyze and interpret data effectively in your research and academic work.