Welcome to Basic Visualizations! We’re going to go over some code on how to create our first visualizations using our SDR 2023 Data.
Let’s load the tidyverse, janitor, and here libraries, as we’ll be using functions from each of these packages.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(here)
library(janitor)
library(vembedr)
Next, let’s read in our data and assign it to the variable sdr_data.
We use the same functions as we did in the previous lesson: read_csv() and here().
sdr_data <- read_csv(here("data/SDR-2023-Data.csv"))
Use the clean_names() function from the janitor package to clean up the column names of our dataframe.
sdr_data <- sdr_data %>%
  clean_names()
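To see what clean_names() does, here is a small sketch on a toy data frame. The column headers below are hypothetical stand-ins written the way a raw spreadsheet might have them; the real CSV headers may differ.

```r
library(tibble)
library(janitor)

# Toy data frame with spreadsheet-style headers (hypothetical, not the real CSV).
toy_raw <- tibble(
  `Country` = c("A", "B"),
  `Goal 3 Score` = c(71.2, 64.5),
  `Regions used for the SDR` = c("OECD", "LAC")
)

# clean_names() converts the headers to lowercase snake_case.
toy_clean <- clean_names(toy_raw)
names(toy_clean)
# "country" "goal_3_score" "regions_used_for_the_sdr"
```

Names in this style are easier to type in code because they contain no spaces or capital letters.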
ggplot()
We’re going to be using the ggplot() function frequently to create our visualizations. ggplot2 is a powerful visualization package for R that allows a lot of customization when creating plots. Here are some basic concepts and inputs in ggplot:
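As a rough sketch of the pieces the plots in this lesson rely on, the toy example below labels the three inputs that appear in every ggplot() call here: the data, the aesthetic mapping, and the geometry. The data frame and its column names are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data, only to show the moving parts.
toy <- data.frame(
  x_score = c(40, 55, 70, 85),
  y_score = c(35, 60, 65, 90)
)

p <- ggplot(data = toy,                      # 1. the data frame to plot
            aes(x = x_score, y = y_score)) + # 2. which columns map to which axes
  geom_point()                               # 3. the geometry (here, points)

p
```

Layers are joined with +, which is how later examples add things like facets and themes to a plot.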
When should we use a scatterplot?
Scatterplots are helpful for visualizing the relationship between two continuous variables. For instance, we can use them to identify patterns or trends between two variables, spot clusters or groups, or compare groups of points. Let’s review the From Data to Viz resource to learn more.
Let’s make our first plot!
In the following code, we’re calling the ggplot() function, passing sdr_data to the data argument, and adding geom_point() to make a scatter plot where each point is a country and its location on the graph is based on the country’s SDG 3 and SDG 4 scores.
ggplot(data = sdr_data, aes(x = goal_3_score, y = goal_4_score)) +
  geom_point()
## Warning: Removed 27 rows containing missing values (`geom_point()`).
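The warning means that rows with a missing value (NA) in either plotted column were left off the graph. As a minimal sketch, assuming a toy data frame with made-up values in place of sdr_data, we can count such rows ourselves:

```r
# Toy stand-in for sdr_data (hypothetical values).
toy <- data.frame(
  goal_3_score = c(80.1, NA, 55.3, 62.0, NA),
  goal_4_score = c(90.5, 70.2, NA, 58.8, NA)
)

# A row is dropped from the scatter plot if either coordinate is missing.
n_dropped <- sum(is.na(toy$goal_3_score) | is.na(toy$goal_4_score))
n_dropped
# 3
```

Running the same check on the real goal_3_score and goal_4_score columns should show how many countries are missing from the plot.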
Awesome! What if we wanted to color the points/countries based on their region? Let’s map the region column to the color argument inside aes() to do so.
ggplot(data = sdr_data, aes(x = goal_3_score, y = goal_4_score, color = regions_used_for_the_sdr)) +
  geom_point()
## Warning: Removed 27 rows containing missing values (`geom_point()`).
Interpretation: Many countries in Sub-Saharan Africa seem to be in the lower left quadrant of our graph, meaning they aren’t doing as well in good health and well-being (SDG 3) and quality education (SDG 4) compared to the rest of the world. There seems to be a cluster of countries that belong to the OECD region in the top right quadrant, meaning they’re doing better on these goals.
You may be thinking, what is OECD?
The regions_used_for_the_sdr column groups countries by geographic region, except for members of the OECD (Organisation for Economic Co-operation and Development), which are grouped together.
Two major differences between OECD and non-OECD countries are the amount of primary energy they consume and their population growth. OECD member countries tend to be wealthier and use considerably more primary energy per capita.
Let’s create one more scatter plot!
So far we have only colored by the categorical variable region.
We can also color by continuous variables like the other SDG Scores.
ggplot(data = sdr_data, aes(x= goal_3_score, y = goal_4_score, color = goal_5_score)) +
  geom_point()
## Warning: Removed 27 rows containing missing values (`geom_point()`).
Cool! It seems that the darkest blue points (countries with the lowest SDG 5 scores) tend to have mid-range or low SDG 3 / SDG 4 scores as well. The lightest blue points (countries with the highest SDG 5 scores) tend to have high SDG 3 / SDG 4 scores as well. However, there are several countries that do not follow this trend.
Create a scatterplot using two different SDG scores. Then, color the points by using a third SDG score. In a couple of sentences, describe what you see.
When should we use a bar chart?
A bar chart is great for comparing a numeric variable by group.
ggplot(sdr_data, aes(x= goal_3_score, y = country)) +
  geom_bar(stat = "identity")
## Warning: Removed 27 rows containing missing values (`position_stack()`).
Hmm… it looks like the country names are hard to read - there are too many of them! Let’s try to look at a specific region.
To do this, let’s use the filter() function we learned about in our last lesson.
lac_sdr_data <- sdr_data %>%
  filter(regions_used_for_the_sdr == "LAC") # filtering for LAC countries only
Now let’s make a bar chart from our new dataframe, lac_sdr_data:
ggplot(lac_sdr_data, aes(x = goal_3_score, y = country)) +
  geom_bar(stat = "identity")
## Warning: Removed 6 rows containing missing values (`position_stack()`).
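A side note on the bar chart code: geom_bar(stat = "identity") draws each bar at the value stored in the data, and ggplot2 also provides geom_col() as a shorthand for the same thing. The sketch below compares the two on a toy data frame with made-up countries and scores.

```r
library(ggplot2)

# Hypothetical toy data standing in for lac_sdr_data.
toy <- data.frame(
  country = c("A", "B", "C"),
  goal_3_score = c(60, 75, 82)
)

# Bar length taken directly from goal_3_score.
p_bar <- ggplot(toy, aes(x = goal_3_score, y = country)) +
  geom_bar(stat = "identity")

# Same chart written with the geom_col() shorthand.
p_col <- ggplot(toy, aes(x = goal_3_score, y = country)) +
  geom_col()
```

Either form works for the charts in this lesson; which one to use is mostly a matter of taste.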
As we discussed in the ggplot() section, let’s try faceting. Since regions_used_for_the_sdr is a categorical variable, we can create a separate plot for each region.
To do this, we add a + to our code and use the facet_wrap() function.
We separate the plots by region, so we put our region column regions_used_for_the_sdr into the facet_wrap() function.
Then we set scales = "free_y" so that each panel lists only the countries in its own region on the y-axis.
We also adjust the y-axis text size to make everything fit nicely.
ggplot(sdr_data, aes(x = goal_3_score, y = country)) +
  geom_bar(stat = "identity") +
  facet_wrap(~regions_used_for_the_sdr, scales = "free_y") + # faceting for regions
  theme(axis.text.y = element_text(size = 4)) # adjusting text size of y-axis values
## Warning: Removed 27 rows containing missing values (`position_stack()`).
Interpretation: As we can see in the Oceania region, most countries do not have the data needed to calculate an SDG 3 score. Sub-Saharan Africa appears to have some of the lowest Goal 3 scores, while the OECD has some of the highest.
Create a new bar chart with countries only from the MENA region.
When do we use a histogram?
Histograms are awesome at visualizing distributions of a single continuous variable.
ggplot(sdr_data, aes(x = goal_4_score)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 27 rows containing non-finite values (`stat_bin()`).
Each bar/bin represents a range of SDG 4 scores. There are 30 bars/bins, so each bar/bin covers a range of about 3.3 points.
The height of each bar/bin represents the number of countries that fall into that SDG 4 score range.
About 37 countries fall into the 97-100 SDG 4 score range. This is the most common range.
About 19 countries fall into the 94-97 SDG 4 score range, and so on.
This is great for looking at the distribution of Goal 4 scores, but let’s explore further by adding another layer to the visualization.
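The stat_bin() message earlier suggests picking a binwidth ourselves. As a minimal sketch, assuming a toy vector of made-up scores, we can set binwidth and boundary so the bins line up with round numbers, then look at the counts ggplot2 computes for each bin:

```r
library(ggplot2)

# Hypothetical toy scores, chosen so none sits exactly on a bin edge.
toy <- data.frame(goal_4_score = c(52, 58, 61, 67, 73, 78, 91, 95, 98, 99))

# binwidth = 10 makes each bin 10 points wide;
# boundary = 0 lines the bin edges up with multiples of 10.
p_hist <- ggplot(toy, aes(x = goal_4_score)) +
  geom_histogram(binwidth = 10, boundary = 0)

# layer_data() returns the values ggplot2 computed for the layer,
# including each bin's edges (xmin, xmax) and its count.
bins <- layer_data(p_hist)[, c("xmin", "xmax", "count")]
bins
```

In this toy data the 90-100 bin holds four values, so it would be the tallest bar, which is the same way we read the tallest bar in the real histogram.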
ggplot(sdr_data, aes(x = goal_4_score, fill=regions_used_for_the_sdr)) +
  geom_histogram(bins = 33, color = "black")
## Warning: Removed 27 rows containing non-finite values (`stat_bin()`).
Interpretation: Countries that fall in the higher end of the Goal 4 spectrum belong to the OECD region, while countries in Sub-Saharan Africa lie towards the lower end.
Make a histogram for a different SDG score. Then, fill by region. In a couple of sentences, describe any interesting trends or what you notice from your new histogram.