A flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction
You can install the development version of pipeML from
GitHub with:
# install.packages("pak")
pak::pkg_install("VeraPancaldiLab/pipeML")

pipeML is a flexible and leakage-aware machine learning framework for
R designed for predictive modeling in high-dimensional biological data.
The package integrates all key steps of the machine learning workflow —
feature selection, model training, validation, prediction, and
interpretation — into a single reproducible pipeline.
A key design goal of pipeML is to support fold-aware feature
construction, allowing features that depend on the dataset
(e.g. enrichment scores, correlation-based features, or network-derived
features) to be recomputed within each cross-validation fold. This
prevents information leakage and ensures reliable performance
estimation.
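The idea of fold-aware recomputation can be illustrated with a minimal base R sketch. It does not use pipeML itself: `scale_within_fold` is a hypothetical helper, and z-scoring stands in for any dataset-dependent feature. The point is that the statistics are estimated on the training rows of each fold only and then applied to the held-out rows.

```r
# Hypothetical helper (not part of pipeML): z-score features using statistics
# estimated on the training rows only, then apply them to the held-out rows.
scale_within_fold <- function(X, test_idx) {
  train <- X[-test_idx, , drop = FALSE]
  test  <- X[test_idx, , drop = FALSE]
  mu    <- colMeans(train)        # training-fold means
  sigma <- apply(train, 2, sd)    # training-fold standard deviations
  list(
    train = sweep(sweep(train, 2, mu, "-"), 2, sigma, "/"),
    test  = sweep(sweep(test,  2, mu, "-"), 2, sigma, "/")
  )
}

set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100,
            dimnames = list(NULL, paste0("f", 1:5)))

# Assign each of the 100 samples to one of 5 equally sized folds at random
fold_id <- sample(rep(1:5, length.out = nrow(X)))

# Recompute the scaled features separately for every fold
folds <- lapply(1:5, function(k) scale_within_fold(X, which(fold_id == k)))
```

Computing `mu` and `sigma` once on the full matrix would let the held-out samples influence the features seen during training, which is the leakage this design avoids.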
The framework is designed to integrate naturally with R/Bioconductor workflows, making it particularly suitable for omics and biomedical machine learning applications.
Figure 1. General structure of the pipeML machine learning
pipeline.
- Integrated pipeline for feature selection, model training, validation, prediction, and interpretation
- Custom cross-validation fold construction
- Support for fold-aware feature recomputation
- Prevents information leakage when using dataset-dependent features
- Repeated and stratified k-fold cross-validation
- Leave-one-dataset-out (LODO) evaluation for cross-cohort generalization
- Boruta-based feature selection
- Optional correlation-based feature filtering
- Automatic optimization based on:
  - AUROC
  - AUPRC
  - Accuracy
- SHAP-based feature importance
- Variable importance summaries
- Performance visualization (ROC and PR curves)
- Model stacking
- Multi-core support for faster model training and cross-validation
- Users can define custom fold construction functions
- These functions can receive a bestTune argument after hyperparameter optimization to retrain models on the full training dataset.
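As an illustration of the leave-one-dataset-out scheme listed above, the following base R sketch builds one train/test split per cohort. `lodo_splits` and the `cohort` vector are hypothetical names used only for this example; they are not pipeML functions or arguments.

```r
# Hypothetical helper (not part of pipeML): for each dataset label, hold out
# all samples from that dataset and train on the remaining ones.
lodo_splits <- function(dataset) {
  dataset <- as.character(dataset)
  idx <- seq_along(dataset)
  lapply(split(idx, dataset), function(test_idx) {
    list(train = setdiff(idx, test_idx), test = test_idx)
  })
}

# Six samples coming from three cohorts
cohort <- c("A", "A", "B", "B", "B", "C")
splits <- lodo_splits(cohort)

# Inspect the split that holds out cohort "B"
splits$B
```

Each element of `splits` is named after the held-out cohort, so performance can be reported per cohort to assess cross-cohort generalization.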
For classification tasks, pipeML implements a diverse set of
classification algorithms that are benchmarked on the fly, making
extensive use of the R package caret:
- Bagged classification trees
- Random forests
- C5.0 decision trees
- Regularized logistic regression (elastic net)
- k-nearest neighbors (KNN)
- Classification and regression trees (CART)
- Lasso regression
- Ridge regression
- Support vector machines with linear and radial kernels
- Extreme Gradient Boosting (XGBoost)
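For readers unfamiliar with caret, the sketch below shows how two of the listed model families can be compared under the same repeated cross-validation scheme using caret directly. It is a generic caret example on the built-in `iris` data, not a description of pipeML's internal code; the choice of the `"knn"` and `"rpart"` methods and of the dataset is an assumption made for illustration.

```r
library(caret)

# Two-class problem derived from iris (setosa dropped)
df <- droplevels(iris[iris$Species != "setosa", ])

# 5-fold cross-validation repeated twice
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# Fit each method with the same seed so the resampling folds match
methods <- c("knn", "rpart")
fits <- lapply(setNames(methods, methods), function(m) {
  set.seed(42)
  train(Species ~ ., data = df, method = m, trControl = ctrl)
})

# Side-by-side summary of resampled accuracy and kappa
summary(resamples(fits))
```

Resetting the seed before each `train()` call keeps the resampling indices identical across methods, so the comparison is made on the same folds.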
For time-to-event outcomes, pipeML implements a unified survival
modeling framework based on the parsnip and workflows ecosystems,
enabling consistent training, hyperparameter tuning, and evaluation
across multiple survival model families.
- Cox proportional hazards model
- Elastic net–regularized Cox regression
- Parametric accelerated failure time (AFT) models
- Conditional inference survival trees
- Bagged CART survival models
- Random survival forests
- Gradient boosting for censored outcomes
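As a point of reference for the first model in this list, here is a minimal Cox proportional hazards fit. It uses the survival package directly rather than the parsnip and workflows interface that pipeML builds on, and the `lung` dataset and the covariates `age` and `sex` are chosen purely for illustration.

```r
library(survival)

# Cox proportional hazards model on the lung cancer data shipped with survival
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Harrell's concordance index, a common discrimination metric for survival models
cidx <- concordance(fit)$concordance
print(cidx)
```

The concordance index plays a role for time-to-event models similar to the one AUROC plays for binary classification: 0.5 corresponds to random ranking and 1 to perfect ranking of event times.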
Below are basic examples showing how to use pipeML.
For a detailed tutorial, see Get started.
library(pipeML)

# X_train, y_train, X_test, y_test, and data are user-supplied objects.

# Train and validate classification models with 5-fold cross-validation
# repeated 10 times, optimizing AUROC
res <- compute_features.training.ML(features_train = X_train,
                                    target_var = y_train,
                                    task_type = "classification",
                                    trait.positive = "1",
                                    metric = "AUROC",
                                    k_folds = 5,
                                    n_rep = 10,
                                    return = FALSE)

# Predict on the test data with the trained model
pred <- compute_prediction(model = res$Model,
                           test_data = X_test,
                           target_var = y_test,
                           task_type = "classification",
                           trait.positive = "1",
                           return = FALSE)

# Pass training and test features in a single call, using 2 cores
res <- compute_features.ML(features_train = X_train,
                           features_test = X_test,
                           coldata = data,
                           task_type = "classification",
                           trait = "target",
                           trait.positive = "1",
                           metric = "AUROC",
                           k_folds = 5,
                           n_rep = 10,
                           ncores = 2)

If you encounter any problems or have questions about the package, we encourage you to open an issue here. We’ll do our best to assist you!
pipeML was developed by Marcelo Hurtado under the supervision of Vera
Pancaldi, as part of the Pancaldi team. Marcelo is currently the
primary maintainer of this package.
If you use pipeML in a scientific publication, we would appreciate a
citation:
Hurtado M, Pancaldi V (2026). pipeML: A flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction. R package version 0.0.1, https://verapancaldilab.github.io/pipeML