Exploring methods of transforming, rescaling, and aggregating data into indicator, index, and dimension scores. Using the OECD handbook on composite indices as a guide (OECD 2008).
Geometric means for aggregate scores across four dimensions
2 Imputation
First, check how much missing data there are. If it is within reason, use missForest algorithm to impute missing data (Stekhoven and Bühlmann 2012). This is particularly good at handling MAR data, and does a decent job at handling MNAR data and non-linear relationships as well. If less than 5% of data are missing, just about any method for handling it is reasonable, even listwise deletion (Beaujean 2013).
Code
metrics_df <-readRDS('data/metrics_df.rds')get_str(metrics_df)# Check for missing dataskimr::skim(metrics_df)mis_dat <-sum(is.na(metrics_df))/(nrow(metrics_df)*(ncol(metrics_df) -1)) *100mis_dat <-round(mis_dat, 3)# Change fips from column to rowname so we can impute without losing itmetrics_df <- metrics_df %>%column_to_rownames('fips')get_str(metrics_df)# Impute missing variablesset.seed(42)mf_out <- metrics_df %>%missForest(ntree =200,mtry =10,verbose =FALSE,variablewise =FALSE )get_str(mf_out)# Extract OOB error(oob <- mf_out$OOBerror)# Check missing againskimr::skim(mf_out$ximp)# Looks good# Save just imputed dataimp_dat <- mf_out$ximp
We had 0.785% missing data, which is rather little, and gives us flexibility in handling it. The Out-of-Bag (OOB) error, quantified by the normalized residual mean squared error (NRMSE) the missForest imputation algorithm was 0.
3 Transformations
Transforming distributions and dealing with outliers. Three options.
Raw Values
Just as it sounds - do nothing. More interpretable, but vulnerable to outliers that throw off the distribution of scores.
Winsorization
Shift extreme values to the 5th and 95th percentiles. Does not reward overperformance in any one area, but is rather heavy-handed in reshaping the distribution and leads to information loss.
Box-Cox transformations are non-linear transformations that use an optimal value of lambda to make the distribution as normal as possible. This has some strengths in that the data are easier to work with in further analyses. It also effectively pulls outliers inward toward the center of the distribution. However, it also changes relationships between the variables, so it will distort any bivariate correlations.
Min-maxing scales all the data from 0 to 1 by subtracting the minimum value of each variable from all cases and dividing by the range of all cases in the variable. It is rather intuitive, as 1 is the best score, and 0 is the worst. This is a linear transformation, so the relationships between the values should not change.
Z-scores are stardized to have a mean of 0 and a standard deviation of 1. Larger numbers are better, but there are no caps on the highest or lowest values. A value of 2 would mean that it is 2 standard deviations greater than the mean. Again, this is a linear transformation, so relationships between variables should not change.
Where \(x^t_qc\) is the metric \(q\) for state \(c\) at time \(t\).
Code
get_str(imp_dat)# List of resultsscaled <-list()# Z-scores, one for each transformationscaled$zscore <-map(transformed, \(trans) { trans %>%mutate(across(everything(),~as.numeric(scale(.x, scale =TRUE, center =TRUE)) ))})get_str(scaled$zscore)# Min Maxmin_max <-function(x) { (x -min(x)) / (max(x) -min(x))}scaled$minmax <-map(transformed, \(trans) { trans %>%mutate(across(everything(), min_max))})get_str(scaled$minmax) # Rank order from lowest to highest value for each var. We are coding this such# that higher ranks are better. So 51 should have the highest/best value and# rank 1 should have the worst.scaled$rank <-map(transformed, \(trans) {map(names(trans), \(col_name) { trans %>%rownames_to_column('fips') %>% dplyr::select(fips, col_name) %>%mutate(!!sym(col_name) :=dense_rank(.data[[col_name]])) }) %>%reduce(full_join) %>%column_to_rownames('fips')})get_str(scaled$rank)# Checkmap(scaled, get_str)# Unlist these so instead of 3x3 it is 1x9scaled <- purrr::list_flatten(scaled, name_spec ="{inner}_{outer}")names(scaled)get_str(scaled)
5 Directional Values
Here, we are assuming that each metric has a direction that is more sustainable than the opposite. Either more of it is better, or less of it is better. This is rather problematic in that just about any metric becomes negative with too much or too little of it. What might make more sense in the long run would be to consult the expertise of our teams and develop targets or acceptable ranges for some metrics once they are settled. Still, just about every sustainability indicator framework does some variation of this one-way value system (Schneider et al. 2023; Béné et al. 2019; Nicoletti 2000; Jacobi et al. 2020; Gómez-Limón and Sanchez-Fernandez 2010).
Alas, for now we will invert variables in each of the transformed datasets as necessary so that larger numbers are more sustainable, and smaller numbers are less sustainable. The table below shows this assignment in the desirable column. For a handful of variables (vacancy rate, mean farm size, etc.) I was not comfortable assigning one direction as better than the other, so I have removed them from the refined framework.
Code
# Check variable names(vars <-names(scaled[[1]]))# Higher numbers should be better. Reverse metrics that are the opposite, # where lower numbers are better. Only listing reverse here - metrics are # implicitly better with larger numbers otherwise.reverse <-c('unemploymentRate','gini','lowBirthweight','teenBirths','uninsured','incomeInequality','childrenInSingleParentHouseholds','injuryDeaths','airPollutionParticulateMatter','drinkingWaterViolations','severeHousingProblems','prematureAgeAdjustedMortality','infantMortality','frequentPhysicalDistress','frequentMentalDistress','diabetesPrevalence','hivPrevalence','limitedAccessToHealthyFoods','drugOverdoseDeaths','disconnectedYouth','residentialSegregationBlackWhite','suicides','motorVehicleCrashDeaths','severeHousingCostBurden','schoolSegregation','childCareCostBurden','wicPercEligible', # Iffy on this one'droughtMeanPercArea','pctAtRiskAnimalSpp','pctAtRiskPlantSpp','pctAtRiskBeeSpp','pctAtRiskOrchidSpp','pctAtRiskEcosystems','expChemicalPct','ageProducers', # Could be better to use age diversity?'waterIrrSrcOffFarmExp','waterIrrSrcOffFarmExpPerAcreFt','CH4FromAg','N2OFromAg','CO2FromAg','propAreaFsaSecDisasters','totalCapConsNoDwellings','totalIntExpRealEstateNoDwellings','totalIncomeInsuranceIndemnities','totalIncomeInsuranceIndemnitiesFederal','totalValueEmergPayments','totalValueOtherAdHocEmergPayments','totalValueDairyMarginProtPayments','totalValueAllLossCoveragePayments','totalValueAgRiskCoveragePayments','totalCapExpBldgsLandNoDwellings','alcoholImpairedDrivingDeaths')# Iffy: landValPF, landValPerAcre - in good column for now, but unclear# indemnities and emergency payments - in bad column for now, but more access# coud be good?# Some are unclear - without clear direction, better to remove:remove <-c('vacancyRate','expHiredLaborPercOpExp','acresPF','medianAcresPF','importsTopFive')## Remove the unclear ones from all three datasets# Then for each transformation, flip values in a way that makes sensevalued <-imap(scaled, \(df, method) { df %>%select(-all_of(remove)) %>%mutate(across(all_of(reverse), ~case_when(str_detect(method, 'rank') ~max(.x) - .x +1,str_detect(method, 'zscore') ~ .x *-1,str_detect(method, 'minmax') ~1- .x,.default =NA )),across(everything(), as.numeric) )})map(valued, get_str)# Comparechecklist <-list(scaled, valued)map(checklist, ~ .x$raw_rank[[1]])map(checklist, ~ .x$winsor_zscore[[1]])map(checklist, ~ .x$raw_minmax[[1]])# Save this as our 'normalized data' that we use for building scoressaveRDS(valued, 'data/valued_rescaled_metrics.rds')
Code
## Show table of which metrics were set in which directionsm_data <-readRDS('data/sm_data.rds')meta <- sm_data$metadata# Reactable table showing var, metric, source, and directiontable <- meta %>% dplyr::filter(variable_name %in%names(imp_dat)) %>%mutate(desirable =case_when( variable_name %in% reverse ~'Lower', variable_name %in% remove ~'Removed',.default ='Higher' )) %>%select( metric, variable_name, dimension, index, indicator, desirable, definition, source )get_str(table)# Save this desirable direction table for presosaveRDS(table, 'preso/data/desirable_directions_table.rds')
Here we are combining values in each indicator, index, and dimension using both arithmetic and geometric means (OECD 2008). Arithmetic means are fully compensable, in that a strong score in one area can make up for a weak score in another. Geometric means are only somewhat compensable - it effectively applies a penalty for unbalanced scores. We might also consider PCA here, but it does not do terribly well with strong a prior hypotheses like we have.
Note that we are using some functions to automate this process. They can be found in dev/get_aggregations.R.
Indicator aggregation:
Code
# We need to attach these back to framework from metadata# Filter frame from earlier down to our current metrics# We are also removing the 'remove' metrics without clear directional valuesframe <-readRDS('data/frame.rds')filtered_frame <- frame %>% dplyr::filter(variable_name %in%names(valued[[1]])) %>% dplyr::select(variable_name, indicator, index, dimension)get_str(filtered_frame)# Save this for later - use in regression and variable selection saveRDS(filtered_frame, 'data/filtered_frame.rds')# Using modular functions here to do each step. # See dev/get_aggregations.R for detailsindicator_scores <-get_agg_indicators( valued, filtered_frame, aggregation ='both')get_str(indicator_scores)
Index aggregation:
Code
# For each set of indicator scores, calculate index scores# Using custom modular functionindex_scores <-get_agg_indices(indicator_scores, filtered_frame)get_str(index_scores)
Here, we organize arithmetic and geometric means for each level of the framework (indicator, index, dimension) in a way that is easier to work with. We also add means and medians for all US states as well as New England states that we can use as points of comparison.
Code
state_key <-readRDS('data/state_key.rds')get_str(indicator_scores, 4)get_str(index_scores, 4)get_str(dimension_scores, 4)# Want to end up with 18 versions: 3 transforms, 3 rescalings, 2 aggregations# Put them all together in one list to work with# Actually switching to modular function here too:# Run whole process with functionfinal_scores <-get_all_aggregations(normed_data = valued,framework = filtered_frame,state_key = state_key,aggregation ='both')names(final_scores)get_str(final_scores)# Now we have a list of 18 iterations# Save this for use elsewheresaveRDS(final_scores, 'data/state_score_iterations.rds')
This gives us a list of 18 elements, one for each combination of transformation method, rescaling method, and aggregation method.
Adamu Demelash, Sewareg, and Esubalew Abate Alemu. 2024. “Measuring Food System Sustainability in Ethiopia: Towards a Multi-Dimensional Perspective.”Ecological Indicators 161 (April): 111991. https://doi.org/10.1016/j.ecolind.2024.111991.
Béné, Christophe, Steven D. Prager, Harold A. E. Achicanoy, Patricia Alvarez Toro, Lea Lamotte, Camila Bonilla, and Brendan R. Mapes. 2019. “Global Map and Indicators of Food System Sustainability.”Scientific Data 6 (1): 279. https://doi.org/10.1038/s41597-019-0301-5.
Bickel, Peter J., and Kjell A. Doksum. 1981. “An Analysis of Transformations Revisited.”Journal of the American Statistical Association 76 (374): 296–311. https://doi.org/10.1080/01621459.1981.10477649.
Gómez-Limón, José A., and Gabriela Sanchez-Fernandez. 2010. “Empirical Evaluation of Agricultural Sustainability Using Composite Indicators.”Ecological Economics 69 (5): 1062–75. https://doi.org/10.1016/j.ecolecon.2009.11.027.
Jacobi, Johanna, Stellah Mukhovi, Aymara Llanque, Markus Giger, Adriana Bessa, Christophe Golay, Chinwe Ifejika Speranza, et al. 2020. “A New Understanding and Evaluation of Food Sustainability in Six Different Food Systems in Kenya and Bolivia.”Scientific Reports 10 (1): 19145. https://doi.org/10.1038/s41598-020-76284-y.
Nicoletti, Giuseppe. 2000. “Summary Indicators of Product Market Regulation with an Extension to Employment Protection Legislation.” {{OECD Economics Department Working Papers}} 226. Vol. 226. OECD Economics Department Working Papers. https://doi.org/10.1787/215182844604.
OECD. 2008. Handbook on Constructing Composite Indicators: Methodology and User Guide. Paris: Organisation for Economic Co-operation and Development.
Schneider, Kate R., Jessica Fanzo, Lawrence Haddad, Mario Herrero, Jose Rosero Moncayo, Anna Herforth, Roseline Remans, et al. 2023. “The State of Food Systems Worldwide in the Countdown to 2030.”Nature Food 4 (12): 1090–110. https://doi.org/10.1038/s43016-023-00885-9.
Stekhoven, Daniel J., and Peter Bühlmann. 2012. “MissForest—Non-Parametric Missing Value Imputation for Mixed-Type Data.”Bioinformatics 28 (1): 112–18. https://doi.org/10.1093/bioinformatics/btr597.