Clustering: Are there groups of countries based on SDR 2023 Scores? How many groups are there? Which countries fall into which group?
Random Forest Regression: Which of the 98 SDG Indicators in the 2023 Sustainable Development Report are the strongest predictors of maternal mortality?
We cannot perform 1 and 2 above with missing data (NA values in the SDR 2023 Scores). Imputation: How can we take an educated guess on missing data?
Loading in our Libraries
Read in 2023 Sustainable Development Data with read_csv() and here()
sdr_data <- read_csv(here("data/SDR-2023-Data.csv"))
Clean column names
sdr_data <- sdr_data %>%
clean_names()
Create a subsetted dataframe with only the country column and columns containing the text “normalized score”
sdr_data_normalized_scores <- sdr_data %>%
select(country, contains("normalized_score"))
https://epirhandbook.com/en/missing-data.html
Types of missing data Here are three general types of missing data:
Missing Completely at Random (MCAR). This means that there is no relationship between the probability of data being missing and any of the other variables in your data. The probability of being missing is the same for all cases This is a rare situation.
Missing at Random (MAR). This name is actually a bit misleading as MAR means that your data is missing in a systematic, predictable way based on the other information you have. For example, maybe every observation in our dataset with a missing value for fever was actually not recorded because every patient with chills and and aches was just assumed to have a fever so their temperature was never taken. If true, we could easily predict that every missing observation with chills and aches has a fever as well and use this information to impute our missing data. In practice, this is more of a spectrum. Maybe if a patient had both chills and aches they were more likely to have a fever as well if they didn’t have their temperature taken, but not always. This is still predictable even if it isn’t perfectly predictable. This is a common type of missing data
Missing not at Random (MNAR). Sometimes, this is also called Not Missing at Random (NMAR). This assumes that the probability of a value being missing is NOT systematic or predictable using the other information we have but also isn’t missing randomly. In this situation data is missing for unknown reasons or for reasons you don’t have any information about. For example, in our dataset maybe information on age is missing because some very elderly patients either don’t know or refuse to say how old they are. In this situation, missing data on age is related to the value itself (and thus isn’t random) and isn’t predictable based on the other information we have. MNAR is complex and often the best way of dealing with this is to try to collect more data or information about why the data is missing rather than attempt to impute it.
In general, imputing MCAR data is often fairly simple, while MNAR is very challenging if not impossible. Many of the common data imputation methods assume MAR.
How many cells do we have in the dataframe: sdr_data_normalized_scores?
206 rows by 99 columns
206 * 99
## [1] 20394
How many of those cells are NA?
sum(is.na(sdr_data_normalized_scores))
## [1] 3965
Not bad, ~19 % of our data is missing
Do some columns/variables have more missing data than others? Or are NA’s distributed equally across columns/variables?
gg_miss_var(sdr_data_normalized_scores, show_pct = TRUE) +
theme(axis.text.y = element_text(size = 8))
Do some countries have more missing data than others?
sdr_data_normalized_scores_longer <- sdr_data_normalized_scores %>%
pivot_longer(cols = !country)
missing_data_by_country <- sdr_data_normalized_scores_longer %>%
group_by(country) %>%
miss_var_summary() %>%
arrange(desc(pct_miss))
missing_data_by_country
## # A tibble: 412 × 4
## # Groups: country [206]
## country variable n_miss pct_miss
## <chr> <chr> <int> <dbl>
## 1 Andorra value 98 100
## 2 Antigua and Barbuda value 98 100
## 3 Dominica value 98 100
## 4 Eritrea value 98 100
## 5 Micronesia, Fed. Sts. value 98 100
## 6 Guinea-Bissau value 98 100
## 7 Equatorial Guinea value 98 100
## 8 Grenada value 98 100
## 9 Kiribati value 98 100
## 10 St. Kitts and Nevis value 98 100
## # ℹ 402 more rows
27 countries in the 2023 SDR have NA values for every single normalized score. What are they?
completely_na_countries <- missing_data_by_country$country[missing_data_by_country$pct_miss == 100]
completely_na_countries
## [1] "Andorra" "Antigua and Barbuda"
## [3] "Dominica" "Eritrea"
## [5] "Micronesia, Fed. Sts." "Guinea-Bissau"
## [7] "Equatorial Guinea" "Grenada"
## [9] "Kiribati" "St. Kitts and Nevis"
## [11] "Libya" "St. Lucia"
## [13] "Liechtenstein" "Monaco"
## [15] "Marshall Islands" "Nauru"
## [17] "Palau" "Korea, Dem. Rep."
## [19] "Solomon Islands" "San Marino"
## [21] "Seychelles" "Timor-Leste"
## [23] "Tonga" "Tuvalu"
## [25] "St. Vincent and the Grenadines" "Vanuatu"
## [27] "Samoa"
Unfortunately, we cannot keep these countries in our analysis due to them not having any data for any of the sdg scores. The challenge of missing data in these countries is important, and needs to be acknowledged in this kind of work
sdr_data_normalized_scores_no_na_countries <- sdr_data_normalized_scores %>%
filter(!country %in% completely_na_countries)
With the new list of 179 countries (few of them are regions) let’s take another look at missing data by variable/column
We are only going to keep the variables/columns where less than 20% of the data is missing, so let’s add a line to the plot to indicate the cutoff
gg_miss_var(sdr_data_normalized_scores_no_na_countries, show_pct = TRUE) +
theme(axis.text.y = element_text(size = 8)) +
geom_hline(yintercept = 20, color = "steelblue", linetype = "dashed")
Create a new dataframe that drops the columns/variables with >20% of the data missing
sdr_data_normalized_scores_less_na <- sdr_data_normalized_scores_no_na_countries %>%
select(where(~ sum(is.na(.))/length(.) <= 0.2))
Impute the missing data using a random forest method from the missRanger package
sdr_data_imputed <- missRanger(sdr_data_normalized_scores_less_na)
Are there groups of countries based on SDR 2023 Scores? How many groups are there? Which countries fall into which group?
Review this (excellent UC Business Analytics guide to clustering)[https://uc-r.github.io/kmeans_clustering]
The functions we are going to use to cluster require that the column being clustered (country), based on the information in all of the other columns (sdg scores) is actually the rownames of the dataframe
sdr_data_imputed <- sdr_data_imputed %>%
remove_rownames %>%
column_to_rownames(var="country")
The function below assists us in determining the number of groups or clusters based on our data
fviz_nbclust(sdr_data_imputed, kmeans, method = "silhouette")
The kmeans() function clusters countries into groups (int this case 2, visualized above) based on 84 SDG scores
k2 <- kmeans(sdr_data_imputed, centers = 2)
We can visualize the outputs of the kmeans function with fviz_cluster()
fviz_cluster(k2, data = sdr_data_imputed) +
theme_minimal()
You can view the countries and their assigned cluster in a dataframe with the code below
country_clusters <- as.data.frame(k2$cluster)
Which of the 98 SDG Indicators in the 2023 Sustainable Development Report are the strongest predictors of maternal mortality?
To answer this question, we train a random forest model to make predictions of maternal mortality scores for each country and then ask the model which SDG Scores were the most important/influential in making it’s prediction
rf_matmort <- randomForest(normalized_score_sdg3_matmort ~ .,
data = sdr_data_imputed,
importance = TRUE)
Our model is relatively accurate with a mean error rate of ~ 10
The “% Var explained” value of 80.83 indicates that approximately 80.83% of the variance in the target variable (normalized_score_sdg3_matmort) is explained by the predictor variables included in the model.
rf_matmort
##
## Call:
## randomForest(formula = normalized_score_sdg3_matmort ~ ., data = sdr_data_imputed, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 27
##
## Mean of squared residuals: 102.2408
## % Var explained: 80.97
Now we can look at the variables most important in making the predictions
importance_df <- as.data.frame(rf_matmort$importance)
Let’s take the top 10 and graph them
importance_df_top_10 <- importance_df %>%
rownames_to_column(var = "variable") %>%
slice_max(n = 10, order_by = `%IncMSE`)
ggplot(importance_df_top_10, aes(x = `%IncMSE`, y = reorder(variable, `%IncMSE`))) +
geom_bar(stat = "identity", fill = "steelblue", color = "black") +
theme_minimal() +
labs(title = "Most Important Variables in Predicting Maternal Mortality",
subtitle = "Top 10",
y = "SDG Indicator",
x = "Feature Importance (% Increase in Mean Squared Error)")
Lastly, we want to know how these important variables influence the prediction: positively or negatively?
pdp::partial(rf_matmort, pred.var = "normalized_score_sdg6_sanita", plot = TRUE)
How to interpret:
yhat is the models prediction of maternal mortality scores
the x axis is the SDG 6 sanitation score
Once the SDG 6 sanitation score reaches 40 the model starts to predict higher maternal mortality scores
Remeber, this is a general trend for all countries in the dataset
Now that we saw how one variable influences the models prediction, how about 2 variables?
pd <- pdp::partial(rf_matmort, pred.var = c("normalized_score_sdg6_sanita", "normalized_score_sdg3_u5mort"))
# Default PDP
plotPartial(pd)
How to interpret:
the y axis is SDG 3 under 5 mortality score
the the x axis is the SDG 6 sanitation score
the color is the models prediction of SDG 3 maternal mortality scores
Edit this code chunk k2 <- kmeans(sdr_data_imputed, centers = 2)
under the silhouette plot to increase the number of clusters
fviz_cluster(k2, data = sdr_data_imputed) + theme_minimal()
to see the new visualization with the increased number of clustersEdit this code chunk rf_matmort <- randomForest(normalized_score_sdg3_matmort ~ ., data = sdr_data_imputed, importance = TRUE)
to predict a different indicator score (any indicator other than normalized_score_sdg3_matmort)
Rename the random forest model from rf_matmort to something that more accurately represents the model (base the name on the indicator you are predicting)
Run the remainder of the code chunks below to see the different outputs