mcbarycenter packagelibrary(mcbarycenter)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
The mcbarycenter package implements a method for
estimating the quantile function of the Wasserstein barycenter using the
marginal-constructed barycenter (MCB) procedure. Briefly, the estimator
can be used in sparse sampling settings, in which each distribution is
observed only through a potentially small i.i.d. sample. The estimator
works by first estimating a series of binomial mixing distributions
using a beta parametric specification, a spline parametric
specification, or the NPMLE. It then combines those mixing distributions
to estimate the Wasserstein barycenter.
For more details, please see Peng, Stijven et al. (arXiv, 2026).
The mcbarycenter package estimates the quantile function
of the Wasserstein barycenter using either the empirical barycenter or
the MCB estimator. In addition, the package provides tools for
calculating quantile differences, graphing results, and carrying out
hypothesis tests to compare barycenters across groups of
distributions.
Some functionality depends on external packages. In particular,
REBayes and Rmosek can be used to make the
NPMLE workflow run faster. To implement the spline mixing distribution
estimator, we used code from deconvolveR, but rewrote the
relevant function to make it faster for this package’s use case.
The examples below use the bundled sparse_dist_data data
set.
data("sparse_dist_data")
head(sparse_dist_data)
#> group id value
#> 1 0 1 -0.5277575
#> 2 0 1 -0.8679693
#> 3 0 1 1.9976595
#> 4 0 1 0.6019881
#> 5 0 1 2.6935115
#> 6 0 2 0.8427657
Note that the data set must be in long format, with one column specifying the ids of the distributions and another column specifying the i.i.d. samples from those distributions.
To visualize the empirical quantile curve for each individual
distribution, we can use graph_individual_quantiles(). This
helper starts from a raw data frame in long format, computes empirical
quantiles separately for each id, and then plots the
resulting curves on a common quantile grid.
When interactive = TRUE (the default), the function
returns an interactive plotly widget. The curves are
visible by default, and clicking a line highlights the selected
individual. Double-clicking resets the plot. Because the graph can
become crowded when there are many ids, the example below focuses on the
first 10 individuals.
first_10_df <- sparse_dist_data %>%
filter(id %in% 1:10)
graph_individual_quantiles(first_10_df,
id_col = "id",
val_col = "value")
If a static figure is preferred, set interactive = FALSE
to return a ggplot object instead.
The empbary() function computes empirical quantiles for
each subject and then averages those quantiles pointwise within each
group.
Arguments:
df: the input data frame.id_col: the column identifying subjects or groups.val_col: the column containing the numeric
observations.weight_col: an optional column containing group
weights; if omitted, the empirical barycenter is unweighted.alpha_grid: the grid of quantile levels to
estimate.quantile_type: the type argument passed to
stats::quantile().Returns:
empbary_result object containing the estimated
quantile curve in res, an estimated covariance matrix in
cov, and the standardized input data in
data.graph_quantiles().Example:
sample_grp0 <- subset(sparse_dist_data, group == 0)
sample_grp1 <- subset(sparse_dist_data, group == 1)
emp_res_grp0 <- empbary(
df = sample_grp0,
id_col = "id",
val_col = "value"
)
emp_res_grp1 <- empbary(
df = sample_grp1,
id_col = "id",
val_col = "value"
)
head(emp_res_grp0$res)
#> quantile estimate se ci_lo ci_hi
#> 1 0.01 -1.230124 0.05797421 -1.343751 -1.116496
#> 2 0.02 -1.230124 0.05797421 -1.343751 -1.116496
#> 3 0.03 -1.230124 0.05797421 -1.343751 -1.116496
#> 4 0.04 -1.230124 0.05797421 -1.343751 -1.116496
#> 5 0.05 -1.230124 0.05797421 -1.343751 -1.116496
#> 6 0.06 -1.230124 0.05797421 -1.343751 -1.116496
head(emp_res_grp1$res)
#> quantile estimate se ci_lo ci_hi
#> 1 0.01 -1.050571 0.05949284 -1.167175 -0.9339675
#> 2 0.02 -1.050571 0.05949284 -1.167175 -0.9339675
#> 3 0.03 -1.050571 0.05949284 -1.167175 -0.9339675
#> 4 0.04 -1.050571 0.05949284 -1.167175 -0.9339675
#> 5 0.05 -1.050571 0.05949284 -1.167175 -0.9339675
#> 6 0.06 -1.050571 0.05949284 -1.167175 -0.9339675
The mcbary() function estimates the barycenter quantile
function using the marginal-corrected barycenter procedure.
Arguments:
df, id_val, val_col: same
arguments as in empbary() function.method: the mixture estimation method; options include
"spline", "npmle", "raw", and
"beta".x_grid: the threshold grid to use when
cutpoints is not supplied.cutpoints: the number of evenly spaced grid points
constructed from the observed data range; either cutpoints
or x_grid must be supplied.bootstrap_samples: the number of bootstrap resamples
used for uncertainty estimation.alpha_grid: the grid of quantile levels to
estimate.use_midpoint: whether to use interval midpoints when
constructing the quantile distribution.estimate_endpoints: whether to estimate the endpoint
support regions.use_isotonic_dist: whether to enforce monotonicity on
the estimated distribution of quantiles before converting them to
quantile distributions.use_isotonic_output: whether to enforce monotonicity on
the final estimated quantile curve and confidence bounds.weight_col: an optional column containing id-level
weights.progress: whether to display a bootstrap progress
bar.ci_level: the confidence level used for reported
intervals....: additional arguments passed to the mixture
estimation routine.Returns:
mcbary_result object containing the estimated
quantile curve and uncertainty summaries in res, the
bootstrap covariance matrix in cov, the estimated mixing
distributions in mixtures, the chosen estimation method in
method, and the standardized input data in
data.mcb_res_grp0 <- mcbary(
df = sample_grp0,
id_col = "id",
val_col = "value",
method = "beta",
cutpoints = 100,
use_midpoint = TRUE,
estimate_endpoints = TRUE,
bootstrap_samples = 20,
progress = FALSE
)
mcb_res_grp1 <- mcbary(
df = sample_grp1,
id_col = "id",
val_col = "value",
method = "beta",
cutpoints = 100,
use_midpoint = TRUE,
estimate_endpoints = TRUE,
bootstrap_samples = 20,
progress = FALSE
)
head(mcb_res_grp0$res)
#> quantile estimate estimate_bs se ci_lo ci_hi pct_ci_lo
#> 1 0.01 -2.320901 -2.453698 0.2916539 -2.892532 -1.749270 -2.840639
#> 2 0.02 -2.104384 -2.061285 0.1612055 -2.420341 -1.788427 -2.352138
#> 3 0.03 -1.951799 -1.917535 0.1429517 -2.231979 -1.671618 -2.195089
#> 4 0.04 -1.831607 -1.807628 0.1365658 -2.099271 -1.563943 -2.105406
#> 5 0.05 -1.730841 -1.708127 0.1303675 -1.986357 -1.475326 -2.012876
#> 6 0.06 -1.642907 -1.616832 0.1007392 -1.840352 -1.445462 -1.832649
#> pct_ci_hi
#> 1 -2.038460
#> 2 -1.835889
#> 3 -1.735626
#> 4 -1.653020
#> 5 -1.558585
#> 6 -1.479050
head(mcb_res_grp1$res)
#> quantile estimate estimate_bs se ci_lo ci_hi pct_ci_lo
#> 1 0.01 -2.101027 -2.213573 0.25839620 -2.607474 -1.594580 -2.655272
#> 2 0.02 -1.906815 -1.923476 0.14148063 -2.184112 -1.629518 -2.172275
#> 3 0.03 -1.765184 -1.776654 0.12166094 -2.003635 -1.526733 -2.007182
#> 4 0.04 -1.652430 -1.653460 0.11990177 -1.887433 -1.417427 -1.874286
#> 5 0.05 -1.558018 -1.544372 0.11405772 -1.781567 -1.334469 -1.762633
#> 6 0.06 -1.476152 -1.464006 0.09534207 -1.663019 -1.289285 -1.629568
#> pct_ci_hi
#> 1 -1.876289
#> 2 -1.682424
#> 3 -1.574713
#> 4 -1.457719
#> 5 -1.374835
#> 6 -1.297677
The package includes helpers for plotting both the estimated barycenter quantiles and the intermediate mixing distributions.
The graph below displays the empirical barycenters and the MCB estimators for groups 0 and 1.
graph_quantiles(
emp_grp0 = emp_res_grp0,
emp_grp1 = emp_res_grp1,
mcb_grp0 = mcb_res_grp0,
mcb_grp1 = mcb_res_grp1
)
graph_mixtures(mcb_res_grp0)
To compare barycenters between groups of observations, we can use the
quantile_diff() function, which calculates the quantile
differences between the two groups.
For example, we can compute the difference between the estimated MCB quantile functions for groups 0 and 1, and then graph the resulting quantile difference curve.
quantile_diff_res <- quantile_diff(mcb_res_grp0, mcb_res_grp1)
head(quantile_diff_res)
#> quantile estimate se ci_lo ci_hi
#> 1 0.01 -0.2198736 0.3896544 -0.9835823 0.5438351
#> 2 0.02 -0.1975685 0.2144854 -0.6179521 0.2228151
#> 3 0.03 -0.1866141 0.1877141 -0.5545269 0.1812987
#> 4 0.04 -0.1791770 0.1817323 -0.5353659 0.1770118
#> 5 0.05 -0.1728231 0.1732191 -0.5123263 0.1666800
#> 6 0.06 -0.1667549 0.1387029 -0.4386076 0.1050979
graph_quantiles(mcb_difference = quantile_diff_res)
To compare barycenters across samples, we first estimate each
barycenter on the same quantile grid and then apply
equalbary_test(). Here we compare the two groups in
sparse_dist_data.
comparison <- equalbary_test(mcb_res_grp0, mcb_res_grp1, B = 1000, seed = 1)
comparison$sup_norm$pval
#> [1] 0.011
comparison$int_sq_norm$pval
#> [1] 0
comparison$hotelling$pval
#> [1] 0.1248716