This page will explore the Pearson correlations between variables at the indicator level. Note that indicators have already been recoded as necessary such that they all point in the desirable direction. For example, the negative correlation between carbon fluxes (CO2 emissions from agriculture) and total quantity food products (animal and crop sales) means that as food production increases (good), carbon fluxes increase (bad).
These negative correlations between indicators in different dimensions are to be expected. More problematic are the positive correlations between indicators in different dimensions, like operations diversification and value-added markets. This might suggest there are omitted variables, or that certain facets of the food system are being double-counte
We are using only the latest time point for each metric available. Dimensions are represented by colors, and divided by black lines in the matrix. Hovering over the diagram will show a tooltip with the correlation coefficient and p-value.
1 Correlation Matrix
Code
# Load indicator data.final_scores <-readRDS('data/state_score_iterations.rds')get_str(final_scores)# Get filtered frame subset to be able to color indicators by dimension laterfiltered_frame <-readRDS('data/filtered_frame.rds')inds_and_dims <- filtered_frame %>%select(indicator, dimension) %>%unique() %>%mutate(color =case_when( dimension =='economics'~'royalblue', dimension =='environment'~'darkgreen', dimension =='health'~'orange', dimension =='production'~'darkred', dimension =='social'~'black' ))color_map <-setNames(inds_and_dims$color, inds_and_dims$indicator)# Pull out minmax geo indicators only. Also use states only, no aggregatesminmax_geo_indicators <- final_scores$raw_minmax_geometric$indicator_scores %>%filter(! state %in%c('US_mean', 'US_median', 'NE_median', 'NE_mean'))get_str(minmax_geo_indicators)# Make a correlation matrix using all the selected variablesmat <- minmax_geo_indicators %>%select(-state) %>%as.matrix()# Get correlationscor <-rcorr(mat, type ='pearson')# Melt correlation values and rename columnscor_r <- reshape::melt(cor$r) %>%setNames(c('var_1', 'var_2', 'value'))# Save p valuescor_p <-melt(cor$P)p.value <- cor_p$value# Make heatmap with custom text aesthetic for tooltipplot_out <- cor_r %>%ggplot(aes(var_1, var_2, fill = value, text =paste0('Var 1: ', var_1, '\n','Var 2: ', var_2, '\n','Correlation: ', format(round(value, 3), nsmall =3), '\n','P-Value: ', format(round(p.value, 3), nsmall =3) ))) +geom_tile() +scale_fill_gradient2(low ="#762a83", mid ="white", high ="#1b7837", midpoint =0 ) +geom_hline(yintercept =5.5) +geom_hline(yintercept =14.5) +geom_hline(yintercept =23.5) +geom_hline(yintercept =31.5) +geom_vline(xintercept =5.5) +geom_vline(xintercept =14.5) +geom_vline(xintercept =23.5) +geom_vline(xintercept =31.5) +theme(axis.text.x =element_text(hjust =1, angle =45 ) ) +labs(x =NULL,y =NULL,fill =NULL )plot_out
Code
# Save this for preso# preso_matrix <- list(mat, color_map)saveRDS(mat, 'preso/data/correlation_data.rds')
We have many significant correlations between indicators, but we probably don’t care too much about weak correlations. Let’s isolate the correlations that are significant and > 0.5. These are the ones that might suggest we are double-counting certain aspects of the food system.
Code
# Isolate all significant correlationsget_str(cor_r)# Save p valuescor_p <-melt(cor$P)p.value <- cor_p$value# Add p values to dataframe with correlationscor_r$p <- cor_p$valueget_str(cor_r)# filter for correlations over 0.5sig <- cor_r %>%rowwise() %>%mutate(pair =paste(sort(c(var_1, var_2)), collapse ="_")) %>%ungroup() %>%distinct(pair, .keep_all =TRUE) %>%select(-pair) %>%filter(!is.na(p), abs(value) >0.5)# Clean up columns for tablesig <- sig %>%mutate(value =abs(value),across(where(is.numeric), ~format(round(.x, 3), nsmall =3)) ) %>%setNames(c('Indicator 1', 'Indicator 2', 'Correlation', 'P Value'))get_str(sig)# Get reactable table for next celltable_out <-get_reactable(sig)# But while we're at it, count how many times each indicator appearscor_counts <-c(sig[[1]], sig[[2]]) %>%table() %>%sort(decreasing =TRUE) %>%as.data.frame() %>%setNames(c('indicator', 'correlations'))get_str(cor_counts)# Clean it up for presoframework <-readRDS('data/filtered_frame.rds') %>%select(indicator, index, dimension)cor_table <- cor_counts %>%left_join(framework) %>%select(indicator, index, dimension, correlations) %>%unique() %>%filter(correlations >0) %>%setNames(c(str_to_title(names(.))))get_str(cor_table)# Save this for preso saveRDS(cor_table, 'preso/data/correlation_counts.rds')
The wealth/income distribution indicator (economics) is correlating strongly with several indicators, some from the economics dimension and some from health. Note that there are several metrics in that indicator related to median earnings, which might be a proxy for gdp per capita. Now that I look at this, it might be worth including gdp per capita at least as a control variable to see how much fo the variation it accounts for.
It looks like all the indicators from the carbon index (embodied, fluxes, stocks) correlate with one another, which makes enough sense. I imagine that one shouldn’t be too much of a problem if they are being aggregated at the index level anyway.
Forest health and carbon stocks are currently quite highly correlated, but this is because the metrics for carbon stocks are not ideal. The metrics for carbon stocks and forest health all come from the same TreeMap dataset. I suspect that if we include a better set of metrics for carbon stocks, this won’t be a such a problem.
Value-added markets and operations diversification are all using a very similar set of metrics as well. They mostly come from NASS, and it would be worth digging into the NASS docs to see how whether value-added sales might overlap with agritourism, direct to consumer sales, or local marketing channel sales.
As for what to do about highly correlating indicators in general:
They could be reworked to use metrics that don’t lead to indicator correlations. This might be hard to do because the reality may be that these dimensions are highly correlated.
They could be weighted in their respective dimensions to account for the correlations. This might be done with PCA loadings or by expert opinion.
We could also leave them as is. This would mean potentially double-counting certain aspects, but may be a reasonable approximation of reality.