pipeML

A flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction.

Installation

You can install the development version of pipeML from GitHub with:

# install.packages("pak")
pak::pkg_install("VeraPancaldiLab/pipeML")

Description

pipeML is a flexible, leakage-aware machine learning framework for R, designed for predictive modeling on high-dimensional biological data. The package integrates the key steps of the machine learning workflow (feature selection, model training, validation, prediction, and interpretation) into a single reproducible pipeline.

A key design goal of pipeML is to support fold-aware feature construction, allowing features that depend on the dataset (e.g. enrichment scores, correlation-based features, or network-derived features) to be recomputed within each cross-validation fold. This prevents information leakage and ensures reliable performance estimation.
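The idea of fold-aware feature construction can be illustrated with a minimal base R sketch. It does not use the pipeML API: fit_scaler() and apply_scaler() are hypothetical helpers, and simple standardization stands in for any dataset-dependent feature. The point is that the transformation is learned from the training rows of each fold only and then applied to the held-out rows.

```r
# Hypothetical helpers (not part of pipeML): learn a dataset-dependent
# transformation on the training rows only, then apply it to any rows.
fit_scaler <- function(x_train) {
  list(center = colMeans(x_train), scale = apply(x_train, 2, sd))
}

apply_scaler <- function(x, scaler) {
  centered <- sweep(x, 2, scaler$center, "-")
  sweep(centered, 2, scaler$scale, "/")
}

set.seed(1)
x <- matrix(rnorm(60 * 4), nrow = 60,
            dimnames = list(NULL, paste0("f", 1:4)))
fold_id <- sample(rep(1:5, length.out = nrow(x)))  # 5 folds of 12 samples

fold_features <- lapply(1:5, function(k) {
  test_rows <- fold_id == k
  # Parameters are estimated without ever seeing the held-out fold
  scaler <- fit_scaler(x[!test_rows, , drop = FALSE])
  list(
    train = apply_scaler(x[!test_rows, , drop = FALSE], scaler),
    test  = apply_scaler(x[test_rows, , drop = FALSE], scaler)
  )
})
```

Computing the scaler once on all 60 samples and then splitting would let the held-out samples influence the training features, which is the kind of leakage the fold-wise recomputation avoids.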

The framework is designed to integrate naturally with R/Bioconductor workflows, making it particularly suitable for omics and biomedical machine learning applications.

Figure 1. General structure of the pipeML machine learning pipeline.

Key Features

End-to-end ML workflow

Integrated pipeline for feature selection, model training, validation, prediction, and interpretation

Leakage-aware validation

Custom cross-validation fold construction
Support for fold-aware feature recomputation
Prevents information leakage when using dataset-dependent features

Flexible model evaluation

Repeated and stratified k-fold cross-validation
Leave-one-dataset-out (LODO) evaluation for cross-cohort generalization
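As a rough illustration of the two evaluation schemes above, the following base R sketch builds stratified fold labels and leave-one-dataset-out splits. The helper names make_stratified_folds() and make_lodo_splits() are hypothetical and are not pipeML functions; the package's own fold construction may differ.

```r
# Stratified folds: assign fold labels within each class so that class
# proportions are roughly preserved in every fold.
make_stratified_folds <- function(y, k) {
  fold_id <- integer(length(y))
  for (cls in unique(as.character(y))) {
    idx <- which(y == cls)
    fold_id[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold_id
}

# Leave-one-dataset-out: each cohort is held out once as the test set.
make_lodo_splits <- function(dataset) {
  cohorts <- unique(dataset)
  splits <- lapply(cohorts, function(d) {
    list(train = which(dataset != d), test = which(dataset == d))
  })
  names(splits) <- cohorts
  splits
}

set.seed(42)
y <- factor(rep(c("0", "1"), times = c(30, 20)))
dataset <- rep(c("cohortA", "cohortB", "cohortC"), length.out = 50)

fold_id <- make_stratified_folds(y, k = 5)
lodo <- make_lodo_splits(dataset)

table(fold_id, y)            # class counts per fold
lengths(lodo[["cohortA"]])   # train/test sizes when cohortA is held out
```

Repeated cross-validation amounts to calling make_stratified_folds() several times with different random seeds and averaging the resulting performance estimates.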

Feature selection

Boruta-based feature selection
Optional correlation-based feature filtering
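To make the correlation-based filtering step concrete, here is a minimal base R sketch of one common approach: a greedy filter that keeps a feature only if its absolute correlation with every already-kept feature is below a cutoff. The function filter_correlated() and the 0.9 cutoff are assumptions for illustration; pipeML's own filter may use a different rule or default.

```r
# Hypothetical greedy filter (not the pipeML implementation): walk through
# the columns in order and drop any feature that is highly correlated with
# a feature that has already been kept.
filter_correlated <- function(x, cutoff = 0.9) {
  cor_mat <- abs(cor(x))
  keep <- character(0)
  for (feature in colnames(x)) {
    if (length(keep) == 0 || all(cor_mat[feature, keep] < cutoff)) {
      keep <- c(keep, feature)
    }
  }
  keep
}

set.seed(7)
a <- rnorm(100)
x <- cbind(a = a,
           a_copy = a + rnorm(100, sd = 0.01),  # near-duplicate of a
           b = rnorm(100))                      # independent feature

kept <- filter_correlated(x, cutoff = 0.9)
kept
```

Like any dataset-dependent step, such a filter should be fitted on the training portion of each fold rather than on the full dataset.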

Hyperparameter tuning

Automatic optimization based on:
- AUROC
- AUPRC
- Accuracy
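For readers unfamiliar with these metrics, the sketch below computes AUROC and accuracy in base R. It only illustrates how the metrics are defined and is not pipeML's internal implementation; the function names auroc() and accuracy() and the 0.5 threshold are assumptions.

```r
# AUROC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive sample scores higher than a random negative one.
auroc <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so tied scores count as one half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Accuracy at a fixed probability threshold.
accuracy <- function(scores, labels, positive, threshold = 0.5) {
  mean((scores >= threshold) == (labels == positive))
}

labels <- c("1", "1", "0", "1", "0", "0")
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

auc_value <- auroc(scores, labels, positive = "1")     # 8 of 9 pairs ordered correctly
acc_value <- accuracy(scores, labels, positive = "1")  # 5 of 6 samples correct
```

AUROC depends only on how the scores rank the samples, whereas accuracy depends on the chosen threshold, which is one reason the optimization metric can change which hyperparameters are selected.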

Model interpretation

SHAP-based feature importance
Variable importance summaries
Performance visualization (ROC and PR curves)

Ensemble learning

Model stacking

Parallel computing

Multi-core support for faster model training and cross-validation

Custom workflows

Users can define custom fold construction functions
These functions can receive a bestTune argument after hyperparameter optimization to retrain models on the full training dataset.

Supported Machine Learning Methods

Classification algorithms:

For classification tasks, pipeML benchmarks a diverse set of algorithms on the fly, relying extensively on the R package caret.

Bagged classification trees
Random forests
C5.0 decision trees
Regularized logistic regression (elastic net)
k-nearest neighbors (KNN)
Classification and regression trees (CART)
Lasso regression
Ridge regression
Support vector machines with linear and radial kernels
Extreme Gradient Boosting (XGBoost)

Survival algorithms:

For time-to-event outcomes, pipeML implements a unified survival modeling framework based on the parsnip and workflows ecosystems, enabling consistent training, hyperparameter tuning, and evaluation across multiple survival model families.

Cox proportional hazards model
Elastic net–regularized Cox regression
Parametric accelerated failure time (AFT) models
Conditional inference survival trees
Bagged CART survival models
Random survival forests
Gradient boosting for censored outcomes

General usage

Below are basic examples showing how to use pipeML.

For a detailed tutorial, see the Get started guide on the package website (https://verapancaldilab.github.io/pipeML).

library(pipeML)

Training models

# X_train and y_train are user-supplied training features and outcome labels
res <- compute_features.training.ML(features_train = X_train,
                                    target_var = y_train,
                                    task_type = "classification",
                                    trait.positive = "1",
                                    metric = "AUROC",
                                    k_folds = 5,
                                    n_rep = 10,
                                    return = FALSE)

Predicting on new data

# Apply the trained model (res$Model) to held-out features and labels
pred <- compute_prediction(model = res$Model,
                           test_data = X_test,
                           target_var = y_test,
                           task_type = "classification",
                           trait.positive = "1",
                           return = FALSE)

Training and Testing Workflow

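# Train and test in one call; `data` is the user-supplied table passed as
# coldata, and trait gives the name of the outcome variable
res <- compute_features.ML(features_train = X_train,
                           features_test = X_test,
                           coldata = data,
                           task_type = "classification",
                           trait = "target",
                           trait.positive = "1",
                           metric = "AUROC",
                           k_folds = 5,
                           n_rep = 10,
                           ncores = 2)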

Issues

If you encounter any problems or have questions about the package, we encourage you to open an issue on the GitHub repository (VeraPancaldiLab/pipeML). We’ll do our best to assist you!

Authors

pipeML was developed by Marcelo Hurtado under the supervision of Vera Pancaldi and is part of the Pancaldi team. Marcelo is currently the primary maintainer of the package.

Citing pipeML

If you use pipeML in a scientific publication, we would appreciate a citation:

Hurtado M, Pancaldi V (2026). pipeML: A flexible and modular machine learning framework designed to support leakage-free model training through custom cross-validation fold construction. R package version 0.0.1, https://verapancaldilab.github.io/pipeML
