Metric Aggregation

1 Introduction

Here we explore methods of transforming, rescaling, and aggregating data into indicator, index, and dimension scores, using the OECD handbook on constructing composite indicators as a guide (OECD 2008).

  • Some examples to pull from:
    • Schneider et al. (2023)
      • rank order comparisons only
      • compare to global weighted means by groups based on GDP
      • min-max scaling to show distance from global groups
    • Béné et al. (2019)
      • Box-Cox for the most skewed indicators (skew > 2), then min-max
      • geometric means for environmental and economic dimensions
      • arithmetic means for social and food dimensions
      • geometric mean for combining all four dimensions into one
    • Nicoletti (2000)
      • Used normalized square loadings (indicator weights) to weight each indicator
    • Gómez-Limón and Sanchez-Fernandez (2010)
      • min-max normalization
      • compared several different methods of aggregation
        • mostly correlated, no big differences
        • weighted sum of indicators
        • product of weighted indicators
        • multicriteria function based on distance to an ideal point
      • Weighting - did both PCA and analytic hierarchy process
      • Validation (identifying important factors) - double-censored tobit with the index as the dependent variable and indicators as independent variables
    • Adamu Demelash and Abate Alemu (2024)
      • Normalization with distance to reference
        • reference determined by quartile analysis
        • not affected by outliers, extreme values
      • Equal weighting
      • Linear aggregation
      • Additive method for indicators within dimensions
      • Geometric means for aggregate scores across four dimensions

2 Imputation

First, check how much missing data there are. If it is within reason, use the missForest algorithm to impute missing data (Stekhoven and Bühlmann 2012). It is particularly good at handling missing-at-random (MAR) data, and does a decent job with missing-not-at-random (MNAR) data and non-linear relationships as well. If less than 5% of data are missing, just about any method of handling it is reasonable, even listwise deletion (Beaujean 2013).

Code
metrics_df <- readRDS('data/metrics_df.rds')
get_str(metrics_df)

# Check for missing data
skimr::skim(metrics_df)
mis_dat <- sum(is.na(metrics_df))/(nrow(metrics_df)*(ncol(metrics_df) - 1)) * 100 
mis_dat <- round(mis_dat, 3)

# Change fips from column to rowname so we can impute without losing it
metrics_df <- metrics_df %>% 
  column_to_rownames('fips')
get_str(metrics_df)

# Impute missing variables
set.seed(42)
mf_out <- metrics_df %>%
  missForest(
    ntree = 200,
    mtry = 10,
    verbose = FALSE,
    variablewise = FALSE
  )
get_str(mf_out)

# Extract OOB error
(oob <- mf_out$OOBerror)

# Check missing again
skimr::skim(mf_out$ximp)
# Looks good

# Save just imputed data
imp_dat <- mf_out$ximp

We had 0.785% missing data, which is rather little and gives us flexibility in handling it. The out-of-bag (OOB) error of the missForest imputation, quantified by the normalized root mean squared error (NRMSE), was 0.
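For reference, missForest's NRMSE is defined in Stekhoven and Bühlmann (2012) as

\[\begin{equation} \mathrm{NRMSE} = \sqrt{\frac{\mathrm{mean}\left((X^{true} - X^{imp})^2\right)}{\mathrm{var}\left(X^{true}\right)}} \end{equation}\]

where \(X^{true}\) is the complete data matrix and \(X^{imp}\) the imputed one, with the mean and variance computed over the continuous missing values only. missForest estimates this from its out-of-bag predictions, so a value of 0 indicates essentially perfect imputation.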

3 Transformations

Transforming distributions and dealing with outliers. Three options.

Raw Values

Just as it sounds - do nothing. More interpretable, but vulnerable to outliers that throw off the distribution of scores.

Winsorization

Shifts extreme values to the 5th and 95th percentiles. This does not reward overperformance in any one area, but it is rather heavy-handed in reshaping the distribution and loses information in the tails.
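As a minimal sketch of the idea (a hypothetical helper; the pipeline code below uses DescTools::Winsorize, which caps at these same percentiles by default):

Code
# Hypothetical sketch of winsorization: cap each variable at its 5th and
# 95th percentiles so extreme values cannot dominate the distribution
winsorize_sketch <- function(x, probs = c(0.05, 0.95)) {
  bounds <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, bounds[[1]]), bounds[[2]])
}

# The outlier 100 is pulled down to the 95th percentile (23.05 here)
winsorize_sketch(c(1:19, 100))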

Box Cox (Bickel and Doksum 1981)

Box-Cox transformations are non-linear transformations that use an optimal value of \(\lambda\) to make each distribution as normal as possible. This makes the data easier to work with in further analyses, and it effectively pulls outliers inward toward the center of the distribution. However, it also changes relationships between variables, so it will distort bivariate correlations.

\[\begin{equation} {\rm For}\ \lambda \neq 0,\quad f_\lambda(x) = \frac{\operatorname{sign}(x)\,|x|^{\lambda} - 1}{\lambda} \end{equation}\]

\[\begin{equation} {\rm For}\ \lambda = 0,\quad f_0(x) = \log(x) \end{equation}\]

Code
# List of results of transformations
transformed <- list()
get_str(imp_dat)

# Raw - just leave it as is
transformed$raw <- imp_dat

# Box Cox. If there are negatives, shifting to remove them
transformed$boxcox <- imp_dat %>% 
  mutate(across(everything(), ~ {
    if (any(.x <= 0)) {
      shift <- abs(min(.x)) + 1
      vals <- .x + shift
    } else {
      vals <- .x
    }
    optimal_lambda <- forecast::BoxCox.lambda(vals, method = 'loglik')
    print(paste0('Optimal lambda: ', optimal_lambda))
    out <- forecast::BoxCox(vals, lambda = optimal_lambda)
    return(out)
  }))
get_str(transformed$boxcox)    

# Winsorization
transformed$winsor <- imp_dat %>% 
  mutate(across(everything(), DescTools::Winsorize))
get_str(transformed$winsor)

# Check
map(transformed, get_str)

4 Rescaling

We are rescaling our data using three methods: rank order, min-max, and Z-scores.

Rank Order

Yields nice clean distributions, but loses a lot of information. Used by Schneider et al. (2023) in the Food Systems Countdown to 2030.
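Note that the ranking code later in this section uses dplyr::dense_rank, which assigns tied values the same rank with no gaps, so the top rank can fall below 51 (the states plus DC) when there are ties:

Code
# dense_rank gives ties the same rank and leaves no gaps in the sequence
dplyr::dense_rank(c(10, 20, 20, 30))
# Returns: 1 2 2 3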

Min Max (OECD 2008)

Min-max scaling rescales each variable to the range 0 to 1 by subtracting its minimum value and dividing by its range. It is rather intuitive, as 1 is the best score and 0 is the worst. This is a linear transformation, so the relationships between the values should not change.

\[\begin{equation} I^t_{qc} = \frac{x^t_{qc} - \min_c(x^{t_0}_q)}{\max_c(x^{t_0}_q) - \min_c(x^{t_0}_q)} \end{equation}\]

Where \(x^t_{qc}\) is the metric \(q\) for state \(c\) at time \(t\).

Z-Scores (OECD 2008)

Z-scores are standardized to have a mean of 0 and a standard deviation of 1. Larger numbers are better, but there is no cap on the highest or lowest values: a value of 2 means the state is 2 standard deviations above the mean. Again, this is a linear transformation, so relationships between variables should not change.

\[\begin{equation} I^t_{qc} = \frac{x^t_{qc}-x^t_{qc=\overline{c}}}{\sigma^t_{qc=\overline{c}}} \end{equation}\]

Where \(x^t_{qc}\) is the metric \(q\) for state \(c\) at time \(t\); the numerator subtracts the mean across states and the denominator is the standard deviation across states.

Code
get_str(imp_dat)

# List of results
scaled <- list()

# Z-scores, one for each transformation
scaled$zscore <- map(transformed, \(trans) {
  trans %>% 
    mutate(across(
      everything(),
      ~ as.numeric(scale(.x, scale = TRUE, center = TRUE))
    ))
})
get_str(scaled$zscore)

# Min Max
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
scaled$minmax <- map(transformed, \(trans) {
  trans %>% 
    mutate(across(everything(), min_max))
})
get_str(scaled$minmax)  

# Rank order from lowest to highest value for each var. We are coding this such
# that higher ranks are better. So 51 should have the highest/best value and
# rank 1 should have the worst.
scaled$rank <- map(transformed, \(trans) {
  map(names(trans), \(col_name) {
    trans %>% 
      rownames_to_column('fips') %>% 
      dplyr::select(fips, all_of(col_name)) %>% 
      mutate(!!sym(col_name) := dense_rank(.data[[col_name]]))
  }) %>% 
    reduce(full_join, by = 'fips') %>% 
    column_to_rownames('fips')
})
get_str(scaled$rank)

# Check
map(scaled, get_str)

# Unlist these so instead of 3x3 it is 1x9
scaled <- purrr::list_flatten(scaled, name_spec = "{inner}_{outer}")
names(scaled)
get_str(scaled)

5 Directional Values

Here, we are assuming that each metric has a direction that is more sustainable than the opposite: either more of it is better, or less of it is better. This is rather problematic, in that just about any metric becomes undesirable with too much or too little of it. What might make more sense in the long run would be to consult the expertise of our teams and develop targets or acceptable ranges for some metrics once they are settled. Still, just about every sustainability indicator framework uses some variation of this one-way value system (Schneider et al. 2023; Béné et al. 2019; Nicoletti 2000; Jacobi et al. 2020; Gómez-Limón and Sanchez-Fernandez 2010).

Alas, for now we will invert variables in each of the transformed datasets as necessary so that larger numbers are more sustainable, and smaller numbers are less sustainable. The table below shows this assignment in the desirable column. For a handful of variables (vacancy rate, mean farm size, etc.) I was not comfortable assigning one direction as better than the other, so I have removed them from the refined framework.

Code
# Check variable names
(vars <- names(scaled[[1]]))

# Higher numbers should be better. Reverse metrics that are the opposite, 
# where lower numbers are better. Only listing reverse here - metrics are 
# implicitly better with larger numbers otherwise.
reverse <- c(
  'unemploymentRate',
  'gini',
  'lowBirthweight',
  'teenBirths',
  'uninsured',
  'incomeInequality',
  'childrenInSingleParentHouseholds',
  'injuryDeaths',
  'airPollutionParticulateMatter',
  'drinkingWaterViolations',
  'severeHousingProblems',
  'prematureAgeAdjustedMortality',
  'infantMortality',
  'frequentPhysicalDistress',
  'frequentMentalDistress',
  'diabetesPrevalence',
  'hivPrevalence',
  'limitedAccessToHealthyFoods',
  'drugOverdoseDeaths',
  'disconnectedYouth',
  'residentialSegregationBlackWhite',
  'suicides',
  'motorVehicleCrashDeaths',
  'severeHousingCostBurden',
  'schoolSegregation',
  'childCareCostBurden',
  'wicPercEligible', # Iffy on this one
  'droughtMeanPercArea',
  'pctAtRiskAnimalSpp',
  'pctAtRiskPlantSpp',
  'pctAtRiskBeeSpp',
  'pctAtRiskOrchidSpp',
  'pctAtRiskEcosystems',
  'expChemicalPct',
  'ageProducers', # Could be better to use age diversity?
  'waterIrrSrcOffFarmExp',
  'waterIrrSrcOffFarmExpPerAcreFt',
  'CH4FromAg',
  'N2OFromAg',
  'CO2FromAg',
  'propAreaFsaSecDisasters',
  'totalCapConsNoDwellings',
  'totalIntExpRealEstateNoDwellings',
  'totalIncomeInsuranceIndemnities',
  'totalIncomeInsuranceIndemnitiesFederal',
  'totalValueEmergPayments',
  'totalValueOtherAdHocEmergPayments',
  'totalValueDairyMarginProtPayments',
  'totalValueAllLossCoveragePayments',
  'totalValueAgRiskCoveragePayments',
  'totalCapExpBldgsLandNoDwellings',
  'alcoholImpairedDrivingDeaths' 
)

# Iffy: landValPF, landValPerAcre - in good column for now, but unclear.
# Indemnities and emergency payments - in bad column for now, but more
# access could be good?

# Some are unclear - without clear direction, better to remove:
remove <- c(
  'vacancyRate',
  'expHiredLaborPercOpExp',
  'acresPF',
  'medianAcresPF',
  'importsTopFive'
)

## Remove the unclear ones from all nine datasets
# Then for each rescaling method, flip values in a way that makes sense
valued <- imap(scaled, \(df, method) {
  df %>% 
    select(-all_of(remove)) %>% 
    mutate(
      across(all_of(reverse), ~ case_when(
        str_detect(method, 'rank') ~ max(.x) - .x + 1,
        str_detect(method, 'zscore') ~ .x * -1,
        str_detect(method, 'minmax') ~ 1 - .x,
        .default = NA
      )),
      across(everything(), as.numeric)
    )
})
map(valued, get_str)

# Compare
checklist <- list(scaled, valued)
map(checklist, ~ .x$raw_rank[[1]])
map(checklist, ~ .x$winsor_zscore[[1]])
map(checklist, ~ .x$raw_minmax[[1]])

# Save this as our 'normalized data' that we use for building scores
saveRDS(valued, 'data/valued_rescaled_metrics.rds')
Code
## Show table of which metrics were set in which direction
sm_data <- readRDS('data/sm_data.rds')
meta <- sm_data$metadata

# Reactable table showing var, metric, source, and direction
table <- meta %>% 
  dplyr::filter(variable_name %in% names(imp_dat)) %>% 
  mutate(desirable = case_when(
    variable_name %in% reverse ~ 'Lower',
    variable_name %in% remove ~ 'Removed',
    .default = 'Higher'
  )) %>% 
  select(
    metric, 
    variable_name, 
    dimension,
    index,
    indicator,
    desirable, 
    definition, 
    source
  )
get_str(table)

# Save this desirable direction table for preso
saveRDS(table, 'preso/data/desirable_directions_table.rds')
Code
# Make reactable table
table %>% 
  get_reactable(
    defaultPageSize = 5,
    columns = list(
      'definition' = colDef(
        minWidth = 150
      ),
      'source' = colDef(
        minWidth = 150
      )
    )
  )

6 Aggregation

Here we are combining values into indicator, index, and dimension scores using both arithmetic and geometric means (OECD 2008). Arithmetic means are fully compensable, in that a strong score in one area can make up for a weak score in another. Geometric means are only partially compensable: they effectively apply a penalty for unbalanced scores. For example, min-max scores of 0.9 and 0.1 have an arithmetic mean of 0.5 but a geometric mean of 0.3. We might also consider PCA here, but it does not do terribly well with strong a priori hypotheses like ours.

Note that we are using some functions to automate this process. They can be found in dev/get_aggregations.R.
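As a rough sketch of the core step (not the actual implementation in dev/get_aggregations.R), assuming the tidyverse is loaded, a wide data frame of valued metrics with fips rownames, and a crosswalk like filtered_frame mapping variable_name to indicator:

Code
# Rough sketch of the indicator aggregation step. This hypothetical helper
# pivots metrics long, joins the framework crosswalk, and takes a grouped
# arithmetic or geometric mean per state and indicator.
agg_indicators_sketch <- function(df, frame, type = c('arithmetic', 'geometric')) {
  type <- match.arg(type)
  df %>%
    rownames_to_column('fips') %>%
    pivot_longer(-fips, names_to = 'variable_name') %>%
    left_join(frame, by = 'variable_name') %>%
    summarise(
      score = if (type == 'arithmetic') {
        mean(value, na.rm = TRUE)
      } else {
        # Geometric mean penalizes unbalanced scores, but assumes strictly
        # positive values, so it suits min-max or rank scales better than
        # z-scores without a shift
        exp(mean(log(value), na.rm = TRUE))
      },
      .by = c(fips, indicator)
    ) %>%
    pivot_wider(names_from = indicator, values_from = score)
}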

Indicator aggregation:

Code
# We need to attach these back to framework from metadata
# Filter frame from earlier down to our current metrics
# We are also removing the 'remove' metrics without clear directional values
frame <- readRDS('data/frame.rds')
filtered_frame <- frame %>% 
  dplyr::filter(variable_name %in% names(valued[[1]])) %>% 
  dplyr::select(variable_name, indicator, index, dimension)
get_str(filtered_frame)

# Save this for later - use in regression and variable selection 
saveRDS(filtered_frame, 'data/filtered_frame.rds')

# Using modular functions here to do each step. 
# See dev/get_aggregations.R for details
indicator_scores <- get_agg_indicators(
  valued, 
  filtered_frame, 
  aggregation = 'both'
)
get_str(indicator_scores)

Index aggregation:

Code
# For each set of indicator scores, calculate index scores
# Using custom modular function
index_scores <- get_agg_indices(indicator_scores, filtered_frame)
get_str(index_scores)

Dimension aggregation:

Code
# Custom function
dimension_scores <- get_agg_dimensions(index_scores, filtered_frame)
get_str(dimension_scores)

7 Wrangle

Here, we organize arithmetic and geometric means for each level of the framework (indicator, index, dimension) in a way that is easier to work with. We also add means and medians for all US states as well as New England states that we can use as points of comparison.
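As a hypothetical sketch of those comparison rows (the actual work happens inside get_all_aggregations(); this assumes a long score table with fips and score columns and, for illustration only, that state_key carries a logical new_england column):

Code
# Hypothetical sketch: append US-wide and New England summary rows to a
# long table of scores. Column names here are assumptions for illustration.
add_comparison_rows <- function(scores, state_key) {
  ne_fips <- state_key$fips[state_key$new_england]
  bind_rows(
    scores,
    summarise(scores, fips = 'US mean', score = mean(score, na.rm = TRUE)),
    summarise(scores, fips = 'US median', score = median(score, na.rm = TRUE)),
    scores %>%
      filter(fips %in% ne_fips) %>%
      summarise(fips = 'NE mean', score = mean(score, na.rm = TRUE))
  )
}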

Code
state_key <- readRDS('data/state_key.rds')

get_str(indicator_scores, 4)
get_str(index_scores, 4)
get_str(dimension_scores, 4)

# Want to end up with 18 versions: 3 transforms, 3 rescalings, 2 aggregations
# Put them all together in one list to work with

# Actually switching to modular function here too:
# Run whole process with function
final_scores <- get_all_aggregations(
  normed_data = valued,
  framework = filtered_frame,
  state_key = state_key,
  aggregation = 'both'
)
names(final_scores)
get_str(final_scores)
# Now we have a list of 18 iterations

# Save this for use elsewhere
saveRDS(final_scores, 'data/state_score_iterations.rds')

This gives us a list of 18 elements, one for each combination of transformation method, rescaling method, and aggregation method.


8 References

Adamu Demelash, Sewareg, and Esubalew Abate Alemu. 2024. “Measuring Food System Sustainability in Ethiopia: Towards a Multi-Dimensional Perspective.” Ecological Indicators 161 (April): 111991. https://doi.org/10.1016/j.ecolind.2024.111991.
Beaujean, A. Alexander. 2013. “Factor Analysis Using R.” https://doi.org/10.7275/Z8WR-4J42.
Béné, Christophe, Steven D. Prager, Harold A. E. Achicanoy, Patricia Alvarez Toro, Lea Lamotte, Camila Bonilla, and Brendan R. Mapes. 2019. “Global Map and Indicators of Food System Sustainability.” Scientific Data 6 (1): 279. https://doi.org/10.1038/s41597-019-0301-5.
Bickel, Peter J., and Kjell A. Doksum. 1981. “An Analysis of Transformations Revisited.” Journal of the American Statistical Association 76 (374): 296–311. https://doi.org/10.1080/01621459.1981.10477649.
Gómez-Limón, José A., and Gabriela Sanchez-Fernandez. 2010. “Empirical Evaluation of Agricultural Sustainability Using Composite Indicators.” Ecological Economics 69 (5): 1062–75. https://doi.org/10.1016/j.ecolecon.2009.11.027.
Jacobi, Johanna, Stellah Mukhovi, Aymara Llanque, Markus Giger, Adriana Bessa, Christophe Golay, Chinwe Ifejika Speranza, et al. 2020. “A New Understanding and Evaluation of Food Sustainability in Six Different Food Systems in Kenya and Bolivia.” Scientific Reports 10 (1): 19145. https://doi.org/10.1038/s41598-020-76284-y.
Nicoletti, Giuseppe. 2000. “Summary Indicators of Product Market Regulation with an Extension to Employment Protection Legislation.” OECD Economics Department Working Papers 226. Paris: OECD Publishing. https://doi.org/10.1787/215182844604.
OECD. 2008. Handbook on Constructing Composite Indicators: Methodology and User Guide. Paris: Organisation for Economic Co-operation and Development.
Schneider, Kate R., Jessica Fanzo, Lawrence Haddad, Mario Herrero, Jose Rosero Moncayo, Anna Herforth, Roseline Remans, et al. 2023. “The State of Food Systems Worldwide in the Countdown to 2030.” Nature Food 4 (12): 1090–110. https://doi.org/10.1038/s43016-023-00885-9.
Stekhoven, Daniel J., and Peter Bühlmann. 2012. “MissForest—Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1): 112–18. https://doi.org/10.1093/bioinformatics/btr597.