library(mcbarycenter)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Introduction

The mcbarycenter package implements a method for estimating the quantile function of the Wasserstein barycenter using the marginal-constructed barycenter (MCB) procedure. Briefly, the estimator can be used in sparse sampling settings, in which each distribution is observed only through a potentially small i.i.d. sample. The estimator works by first estimating a series of binomial mixing distributions using a beta parametric specification, a spline parametric specification, or the NPMLE. It then combines those mixing distributions to estimate the Wasserstein barycenter.

For more details, please see Peng, Stijven et al. (arXiv, 2026).

The mcbarycenter package estimates the quantile function of the Wasserstein barycenter using either the empirical barycenter or the MCB estimator. In addition, the package provides tools for calculating quantile differences, graphing results, and carrying out hypothesis tests to compare barycenters across groups of distributions.

Some functionality depends on external packages. In particular, REBayes and Rmosek can be used to make the NPMLE workflow run faster. To implement the spline mixing distribution estimator, we used code from deconvolveR, but rewrote the relevant function to make it faster for this package’s use case.

The examples below use the bundled sparse_dist_data data set.

data("sparse_dist_data")
head(sparse_dist_data)
#>   group id      value
#> 1     0  1 -0.5277575
#> 2     0  1 -0.8679693
#> 3     0  1  1.9976595
#> 4     0  1  0.6019881
#> 5     0  1  2.6935115
#> 6     0  2  0.8427657

Note that the data set must be in long format, with one column specifying the ids of the distributions and another column specifying the i.i.d. samples from those distributions.

Graphing individual empirical quantiles

To visualize the empirical quantile curve for each individual distribution, we can use graph_individual_quantiles(). This helper starts from a raw data frame in long format, computes empirical quantiles separately for each id, and then plots the resulting curves on a common quantile grid.

When interactive = TRUE (the default), the function returns an interactive plotly widget. The curves are visible by default, and clicking a line highlights the selected individual. Double-clicking resets the plot. Because the graph can become crowded when there are many ids, the example below focuses on the first 10 individuals.

first_10_df <- sparse_dist_data %>% 
  filter(id %in% 1:10)

graph_individual_quantiles(first_10_df, 
                           id_col = "id", 
                           val_col = "value")

If a static figure is preferred, set interactive = FALSE to return a ggplot object instead.

Estimating the Wasserstein barycenter quantile function

Empirical Wasserstein barycenter

The empbary() function computes empirical quantiles for each subject and then averages those quantiles pointwise within each group.

Arguments:

  • df: the input data frame.
  • id_col: the column identifying subjects or groups.
  • val_col: the column containing the numeric observations.
  • weight_col: an optional column containing group weights; if omitted, the empirical barycenter is unweighted.
  • alpha_grid: the grid of quantile levels to estimate.
  • quantile_type: the type argument passed to stats::quantile().

Returns:

  • An empbary_result object containing the estimated quantile curve in res, an estimated covariance matrix in cov, and the standardized input data in data.
  • This objected can be plotted with the function graph_quantiles().

Example:

sample_grp0 <- subset(sparse_dist_data, group == 0)
sample_grp1 <- subset(sparse_dist_data, group == 1)

emp_res_grp0 <- empbary(
  df = sample_grp0,
  id_col = "id",
  val_col = "value"
)

emp_res_grp1 <- empbary(
  df = sample_grp1,
  id_col = "id",
  val_col = "value"
)

head(emp_res_grp0$res)
#>   quantile  estimate         se     ci_lo     ci_hi
#> 1     0.01 -1.230124 0.05797421 -1.343751 -1.116496
#> 2     0.02 -1.230124 0.05797421 -1.343751 -1.116496
#> 3     0.03 -1.230124 0.05797421 -1.343751 -1.116496
#> 4     0.04 -1.230124 0.05797421 -1.343751 -1.116496
#> 5     0.05 -1.230124 0.05797421 -1.343751 -1.116496
#> 6     0.06 -1.230124 0.05797421 -1.343751 -1.116496
head(emp_res_grp1$res)
#>   quantile  estimate         se     ci_lo      ci_hi
#> 1     0.01 -1.050571 0.05949284 -1.167175 -0.9339675
#> 2     0.02 -1.050571 0.05949284 -1.167175 -0.9339675
#> 3     0.03 -1.050571 0.05949284 -1.167175 -0.9339675
#> 4     0.04 -1.050571 0.05949284 -1.167175 -0.9339675
#> 5     0.05 -1.050571 0.05949284 -1.167175 -0.9339675
#> 6     0.06 -1.050571 0.05949284 -1.167175 -0.9339675

Marginal-constructed barycenter

The mcbary() function estimates the barycenter quantile function using the marginal-corrected barycenter procedure.

Arguments:

  • df, id_val, val_col: same arguments as in empbary() function.
  • method: the mixture estimation method; options include "spline", "npmle", "raw", and "beta".
  • x_grid: the threshold grid to use when cutpoints is not supplied.
  • cutpoints: the number of evenly spaced grid points constructed from the observed data range; either cutpoints or x_grid must be supplied.
  • bootstrap_samples: the number of bootstrap resamples used for uncertainty estimation.
  • alpha_grid: the grid of quantile levels to estimate.
  • use_midpoint: whether to use interval midpoints when constructing the quantile distribution.
  • estimate_endpoints: whether to estimate the endpoint support regions.
  • use_isotonic_dist: whether to enforce monotonicity on the estimated distribution of quantiles before converting them to quantile distributions.
  • use_isotonic_output: whether to enforce monotonicity on the final estimated quantile curve and confidence bounds.
  • weight_col: an optional column containing id-level weights.
  • progress: whether to display a bootstrap progress bar.
  • ci_level: the confidence level used for reported intervals.
  • ...: additional arguments passed to the mixture estimation routine.

Returns:

  • An mcbary_result object containing the estimated quantile curve and uncertainty summaries in res, the bootstrap covariance matrix in cov, the estimated mixing distributions in mixtures, the chosen estimation method in method, and the standardized input data in data.
mcb_res_grp0 <- mcbary(
  df = sample_grp0,
  id_col = "id",
  val_col = "value",
  method = "beta",
  cutpoints = 100,
  use_midpoint = TRUE,
  estimate_endpoints = TRUE,
  bootstrap_samples = 20,
  progress = FALSE
)

mcb_res_grp1 <- mcbary(
  df = sample_grp1,
  id_col = "id",
  val_col = "value",
  method = "beta",
  cutpoints = 100,
  use_midpoint = TRUE,
  estimate_endpoints = TRUE,
  bootstrap_samples = 20,
  progress = FALSE
)

head(mcb_res_grp0$res)
#>   quantile  estimate estimate_bs        se     ci_lo     ci_hi pct_ci_lo
#> 1     0.01 -2.320901   -2.453698 0.2916539 -2.892532 -1.749270 -2.840639
#> 2     0.02 -2.104384   -2.061285 0.1612055 -2.420341 -1.788427 -2.352138
#> 3     0.03 -1.951799   -1.917535 0.1429517 -2.231979 -1.671618 -2.195089
#> 4     0.04 -1.831607   -1.807628 0.1365658 -2.099271 -1.563943 -2.105406
#> 5     0.05 -1.730841   -1.708127 0.1303675 -1.986357 -1.475326 -2.012876
#> 6     0.06 -1.642907   -1.616832 0.1007392 -1.840352 -1.445462 -1.832649
#>   pct_ci_hi
#> 1 -2.038460
#> 2 -1.835889
#> 3 -1.735626
#> 4 -1.653020
#> 5 -1.558585
#> 6 -1.479050
head(mcb_res_grp1$res)
#>   quantile  estimate estimate_bs         se     ci_lo     ci_hi pct_ci_lo
#> 1     0.01 -2.101027   -2.213573 0.25839620 -2.607474 -1.594580 -2.655272
#> 2     0.02 -1.906815   -1.923476 0.14148063 -2.184112 -1.629518 -2.172275
#> 3     0.03 -1.765184   -1.776654 0.12166094 -2.003635 -1.526733 -2.007182
#> 4     0.04 -1.652430   -1.653460 0.11990177 -1.887433 -1.417427 -1.874286
#> 5     0.05 -1.558018   -1.544372 0.11405772 -1.781567 -1.334469 -1.762633
#> 6     0.06 -1.476152   -1.464006 0.09534207 -1.663019 -1.289285 -1.629568
#>   pct_ci_hi
#> 1 -1.876289
#> 2 -1.682424
#> 3 -1.574713
#> 4 -1.457719
#> 5 -1.374835
#> 6 -1.297677

Graphing tools

The package includes helpers for plotting both the estimated barycenter quantiles and the intermediate mixing distributions.

The graph below displays the empirical barycenters and the MCB estimators for groups 0 and 1.

graph_quantiles(
  emp_grp0 = emp_res_grp0,
  emp_grp1 = emp_res_grp1,
  mcb_grp0 = mcb_res_grp0,
  mcb_grp1 = mcb_res_grp1
)

graph_mixtures(mcb_res_grp0)

Comparing Wasserstein barycenters

To compare barycenters between groups of observations, we can use the quantile_diff() function, which calculates the quantile differences between the two groups.

For example, we can compute the difference between the estimated MCB quantile functions for groups 0 and 1, and then graph the resulting quantile difference curve.

quantile_diff_res <- quantile_diff(mcb_res_grp0, mcb_res_grp1)

head(quantile_diff_res)
#>   quantile   estimate        se      ci_lo     ci_hi
#> 1     0.01 -0.2198736 0.3896544 -0.9835823 0.5438351
#> 2     0.02 -0.1975685 0.2144854 -0.6179521 0.2228151
#> 3     0.03 -0.1866141 0.1877141 -0.5545269 0.1812987
#> 4     0.04 -0.1791770 0.1817323 -0.5353659 0.1770118
#> 5     0.05 -0.1728231 0.1732191 -0.5123263 0.1666800
#> 6     0.06 -0.1667549 0.1387029 -0.4386076 0.1050979
graph_quantiles(mcb_difference = quantile_diff_res)

To compare barycenters across samples, we first estimate each barycenter on the same quantile grid and then apply equalbary_test(). Here we compare the two groups in sparse_dist_data.

comparison <- equalbary_test(mcb_res_grp0, mcb_res_grp1, B = 1000, seed = 1)

comparison$sup_norm$pval
#> [1] 0.011
comparison$int_sq_norm$pval
#> [1] 0
comparison$hotelling$pval
#> [1] 0.1248716