The MultiMlearn package estimates individualized treatment rules (ITRs) under the setting of multicategory treatments. It is designed for data from observational studies (such as electronic health records), but it can also be used for randomized controlled trials.

The MultiMlearn package implements a matched learning (M-learning) model that uses the one-versus-one approach to decompose the multicategory treatment comparison into a set of binary classification problems.

In addition, the model can incorporate identified latent subgroup information to alleviate confounding effects among subjects and improve the performance of M-learning. A coding example for subgroup identification can be found in this GitHub repository, although subgroup identification is not currently implemented as a function of the MultiMlearn package.

Documentation

A simulation study illustrating the necessity of latent subgroup information can be found in this supplementary material. The following explanatory example uses a simulation scenario similar to that described in Section S.1.

Installation

Install the development version using the devtools package:

devtools::install_github("jitonglou/MultiMlearn")

Usage

Load the packages:
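Judging from the function calls in this example, you will need at least the following (a minimal sketch; dplyr supplies the %>% pipeline and caret the cross-validated tuning):

library(MultiMlearn)
library(dplyr) # filter, select, mutate, left_join, group_by, summarize
library(caret) # train, trainControl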

Simulate a dataset

Following the settings (equations (1), (2), (5), and (6)) in Sections S.1.1 and S.1.2 of the supplementary material, we simulate a dataset of N=200 subjects. The covariate vector \(\mathbf{x}_i\) is set to be p=20 dimensional, and we set the number of treatments to K=4 and the number of latent groups to J=4.

rseed = 0
set.seed(rseed)

simdata = simulate_data(
   N = 200, p = 20, K = 4, J = 4,
   propensity_func = pi.true, 
   main_func = mu.true, 
   interaction_func = delta.true
)
# ?simulate_data # explanations of function arguments and the columns in the returned data frame 
# ?pi.true # true propensity model used in the supplementary material
# ?mu.true # true main effects used in the supplementary material
# ?delta.true # true interaction effects used in the supplementary material

table(treatment=simdata$treatment, group=simdata$cluster) # contingency table of observed treatment vs group
#>              group
#> treatment      1  2  3  4
#>   treatment_1 11 12 15 12
#>   treatment_2  9 10 14  8
#>   treatment_3 17 18 11  8
#>   treatment_4 13 15 11 16

Estimate propensity and prognostic scores

Next, we show how to estimate propensity and prognostic scores using random forest models. We take subjects in group 1 as an example:

## extract features for subjects in group 1
group = 1
data_covariate = simdata %>%
  filter(cluster == group) %>%
  select(ID, cluster, reward, treatment, starts_with("feature"))
n_feature = ncol(data_covariate) - 4 # number of features; the other 4 columns are ID, cluster, reward, treatment

Prognostic scores

Train a random forest model to estimate prognostic scores. We select the tuning parameters (mtry and ntree) by 10-fold cross-validation with 3 repeats.

# ?rfcv2 # a customized function for using cross-validation to train random forests
## tuning parameters
metric = "RMSE"
tunegrid = expand.grid(.mtry=seq(1,n_feature/3, 1), .ntree=seq(500,2000,500))
control = trainControl(method="repeatedcv", number=10, repeats=3)
## train model (around 1 minute)
set.seed(rseed)
prognostic_fit = train(
  reward~., data=data_covariate %>% select(-ID, -cluster, -treatment),
  method=rfcv2("Regression"), metric=metric,
  tuneGrid=tunegrid, trControl=control
)
print(prognostic_fit)
#> 50 samples
#> 20 predictors
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 3 times) 
#> Summary of sample sizes: 44, 44, 46, 44, 46, 45, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  ntree  RMSE      Rsquared   MAE      
#>   1      500   1.263909  0.2374684  1.0026379
#>   1     1000   1.265846  0.2393138  1.0031806
#>   1     1500   1.265537  0.2690098  0.9996474
#>   1     2000   1.263070  0.2842502  1.0005525
#>   2      500   1.261495  0.2689129  1.0001293
#>   2     1000   1.259356  0.2693475  0.9959385
#>   2     1500   1.269358  0.2825926  1.0054877
#>   2     2000   1.266920  0.2703527  1.0049415
#>   3      500   1.265764  0.2692819  1.0019152
#>   3     1000   1.263893  0.2760030  1.0050117
#>   3     1500   1.263505  0.2996385  0.9989403
#>   3     2000   1.262860  0.2807268  1.0022529
#>   4      500   1.263106  0.3021249  1.0038525
#>   4     1000   1.261288  0.3009510  1.0047467
#>   4     1500   1.257289  0.2793284  1.0022257
#>   4     2000   1.261132  0.2931092  1.0017244
#>   5      500   1.268665  0.3075777  1.0083188
#>   5     1000   1.264453  0.2865868  1.0082328
#>   5     1500   1.260397  0.3052117  1.0038140
#>   5     2000   1.263131  0.3047422  1.0095299
#>   6      500   1.261791  0.3289496  1.0074004
#>   6     1000   1.264546  0.2682642  1.0127456
#>   6     1500   1.263643  0.2895696  1.0096727
#>   6     2000   1.261639  0.2971942  1.0095640
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 4 and ntree = 1500.
## estimated prognostic scores
sub_prog = predict(
  prognostic_fit, newdata=data_covariate %>% select(-ID, -cluster, -treatment)
)

Propensity scores

Train a random forest model to estimate propensity scores. We select the tuning parameters (mtry and ntree) by 10-fold cross-validation with 3 repeats.

## tuning parameters
metric = "Kappa"
tunegrid = expand.grid(.mtry=seq(1,sqrt(n_feature), 1), .ntree=seq(500,2000,500))
control = trainControl(method="repeatedcv", number=10, repeats=3)
## train model (around 1 minute)
set.seed(rseed)
propensity_fit = train(
  treatment~., data=data_covariate %>% select(-ID, -cluster, -reward),
  method=rfcv2("Classification"), metric=metric,
  tuneGrid=tunegrid, trControl=control
)
print(propensity_fit)
#> 50 samples
#> 20 predictors
#>  4 classes: 'treatment_1', 'treatment_2', 'treatment_3', 'treatment_4' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 3 times) 
#> Summary of sample sizes: 46, 46, 45, 46, 45, 44, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  ntree  Accuracy   Kappa        
#>   1      500   0.2327778  -0.0593179571
#>   1     1000   0.2572222  -0.0343456824
#>   1     1500   0.2683333  -0.0192242028
#>   1     2000   0.2616667  -0.0329631332
#>   2      500   0.2394444  -0.0452919214
#>   2     1000   0.2433333  -0.0381937624
#>   2     1500   0.2627778  -0.0181223106
#>   2     2000   0.2422222  -0.0399217290
#>   3      500   0.2694444   0.0046937196
#>   3     1000   0.2816667   0.0200164949
#>   3     1500   0.2750000   0.0118664062
#>   3     2000   0.2638889  -0.0074735034
#>   4      500   0.2622222  -0.0008315157
#>   4     1000   0.2611111  -0.0084169858
#>   4     1500   0.2761111   0.0126289334
#>   4     2000   0.2694444   0.0028106787
#> 
#> Kappa was used to select the optimal model using the largest value.
#> The final values used for the model were mtry = 3 and ntree = 1000.
## estimated propensity scores
sub_prop = predict(
  propensity_fit, newdata=data_covariate %>% select(-ID, -cluster, -reward), type = "prob"
)

Estimate ITRs using the proposed M-learning model

We perform a nested cross-validation (over the tuning parameters) of weighted support vector machines (SVMs) to fit the M-learning model and estimate ITRs.

Create the final dataset and compute distance between subjects

Incorporate the estimated propensity and prognostic scores into the feature set, and compute the Euclidean distance between subjects based on these features.

## final data frame
data_feature = data.frame(
  data_covariate %>% mutate(reward_res = reward),
  sub_prop %>% select(-levels(simdata$treatment)[1]), # remove the propensity scores of the first treatment to avoid collinearity among the propensity scores
  prog = sub_prog
)

## compute the Euclidean distance between features of subjects
dist_feature = data_feature %>%
  select(-ID, -cluster, -treatment, -reward, -reward_res) %>%
  dist() %>%
  as.matrix()
summary(c(dist_feature))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.000   3.498   3.869   3.810   4.235   6.276

## Create an index data frame to select the corresponding row of the distance matrix
idx_data_feature = data.frame(ID=data_feature$ID, index=1:nrow(data_feature)) 

Fit the M-learning model

We split the data into 3 folds by setting nfolds_outer=3. In each test fold, the tuning parameter of the SVMs (cost of constraints violation) is selected from \(\{2^k:k=0,\pm1,\ldots,\pm8\}\) using another layer of 3-fold cross-validation (nfolds_inner=3).

## User inputs
ksvm.grid = expand.grid(C = 2^(-8:8)) # a data frame of the tuning parameter
nfolds_outer = 3 # number of folds in the outer layer of the nested cross-validation
nfolds_inner = 3 # number of folds in the inner layer of the nested cross-validation
g_func = function(x){abs(x)} # weights on the outcome in support vector machines
max_size = 1 # maximum size of the matched set

## Fit M-learning model (around 1 minute)
set.seed(rseed)
# ?mlearn.wsvm.cv # explanation of arguments
ITR = mlearn.wsvm.cv(
  data=data_feature, idx=idx_data_feature,
  trts=levels(data_feature$treatment), max_size=max_size,
  delta=max(dist_feature), dist_mat=dist_feature, g_func=g_func,
  kernel="rbfdot", kpar="automatic",
  nfolds_outer=nfolds_outer, nfolds_inner=nfolds_inner, tuneGrid=ksvm.grid, propensity=sub_prop,
  foldid_outer=NULL
)
#> [1] "Outer fold 1 has 33 training samples."
#> [1] "Outer fold 1 is done."
#> [1] "Outer fold 2 has 33 training samples."
#> [1] "Outer fold 2 is done."
#> [1] "Outer fold 3 has 34 training samples."
#> [1] "Outer fold 3 is done."
summary(ITR)
#>              Length Class      Mode   
#> fit           3     -none-     list   
#> foldid_outer 50     -none-     numeric
#> prediction    5     data.frame list
head(ITR$prediction) # "vote" represents the recommended treatments for the subjects. "treatment" represents the observed treatments.
#>   ID    reward   treatment        vote fold
#> 1  2 2.7538025 treatment_1 treatment_1    2
#> 2  5 0.3607558 treatment_3 treatment_1    1
#> 3  7 3.5833142 treatment_1 treatment_1    3
#> 4 14 0.8700615 treatment_2 treatment_1    1
#> 5 15 0.4640756 treatment_2 treatment_1    1
#> 6 16 2.5734038 treatment_3 treatment_1    2

Summarize results

We extract the treatments recommended by the ITR, and then calculate the empirical value function (EVF) and misclassification rate as in equations (3) and (4) of the supplementary material.
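In our notation (which may differ from the supplementary material's), the EVF computed by evf.func below is the inverse-propensity-weighted average reward among subjects whose observed treatment agrees with the recommendation:

\[
\widehat{\mathrm{EVF}}(d) = \frac{\sum_{i} (R_i/\hat{\pi}_i)\,\mathbb{1}\{A_i = d(\mathbf{x}_i)\}}{\sum_{i} (1/\hat{\pi}_i)\,\mathbb{1}\{A_i = d(\mathbf{x}_i)\}},
\]

where \(R_i\) is the reward, \(A_i\) the observed treatment, and \(\hat{\pi}_i\) the propensity score of subject \(i\) (the true propensity pi_true in this simulation).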

## create a function to calculate EVF
evf.func = function(reward, pi, x, y){
  # reward: observed rewards; pi: propensity scores of the observed treatments
  # x: observed treatments; y: treatments recommended by the rule
  flag = (x==y) # subjects whose observed treatment matches the recommendation
  return(sum((reward/pi)[flag])/sum((1/pi)[flag]))
}

## EVF and misclassification rate of observed and recommended treatments
summary_evf = ITR$prediction %>%
  left_join(simdata %>% select(ID, pi_true, contains("opt")), # join the simdata to use true propensity score and optimal treatment information
            by = "ID") %>%
  group_by(fold) %>%
  summarize(
    evf_obs = evf.func(reward, pi_true, treatment, treatment),
    mis_obs = 1-sum(treatment_opt == treatment)/n(),
    evf_rec = evf.func(reward, pi_true, treatment, vote),
    mis_rec = 1-sum(treatment_opt == vote)/n(),
    n = n() # sample size in each fold
  ) %>%
  as.data.frame

summary(summary_evf)
#>       fold        evf_obs          mis_obs          evf_rec          mis_rec 
#>  Min.   :1.0   Min.   :0.8164   Min.   :0.7059   Min.   :0.8785   Min.   :0  
#>  1st Qu.:1.5   1st Qu.:0.9893   1st Qu.:0.7592   1st Qu.:1.4472   1st Qu.:0  
#>  Median :2.0   Median :1.1622   Median :0.8125   Median :2.0159   Median :0  
#>  Mean   :2.0   Mean   :1.1692   Mean   :0.7806   Mean   :2.1850   Mean   :0  
#>  3rd Qu.:2.5   3rd Qu.:1.3456   3rd Qu.:0.8180   3rd Qu.:2.8382   3rd Qu.:0  
#>  Max.   :3.0   Max.   :1.5289   Max.   :0.8235   Max.   :3.6604   Max.   :0  
#>        n        
#>  Min.   :16.00  
#>  1st Qu.:16.50  
#>  Median :17.00  
#>  Mean   :16.67  
#>  3rd Qu.:17.00  
#>  Max.   :17.00

Compared with the observed treatments, the treatments recommended by M-learning have a higher median EVF (2.0159 vs. 1.1622) and a lower median misclassification rate (0 vs. 0.8125). To reproduce the results in Table S1 and Figure S1 in Section S.1.3 of the supplementary material, you need to run the above scripts for groups 2 to 4 and repeat the whole procedure 100 times using rseed=1,...,100, ntree=seq(500,5000,500), ksvm.grid = expand.grid(C = 2^(-15:15)), nfolds_outer = 5, and nfolds_inner = 5. The replications can be run in parallel.
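For instance, the replications could be distributed over cores with the base parallel package. Here run_replication is a hypothetical wrapper, not a MultiMlearn function: it would contain the workflow above (simulate data, estimate scores for each group, fit M-learning, summarize) and return summary_evf.

library(parallel)

run_replication = function(rseed) {
  # hypothetical wrapper: set.seed(rseed), simulate the data, estimate
  # propensity/prognostic scores for groups 1 to 4, fit M-learning,
  # and return the summary_evf data frame
}

## run 100 replications on 4 cores (mclapply forks; on Windows use parLapply instead)
results = mclapply(1:100, run_replication, mc.cores = 4)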