metabolomics-workflow.RmdThis vignette demonstrates a complete metabolomics analysis workflow using MetaboStatR, from raw data import to final results export. We’ll work through a realistic example that covers all major steps in metabolomics data analysis.
For this tutorial, we’ll analyze a hypothetical clinical study examining metabolic differences between:
The study includes:
The first step is to examine your raw data and ensure it meets MetaboStatR’s format requirements.
MetaboStatR expects data in a specific format with 8 metadata rows followed by metabolite data:
# Your CSV file should look like this:
# Row 1: Sample names (must be unique)
# Row 2: Subject ID (for paired/longitudinal studies)
# Row 3: Technical replicate IDs
# Row 4: Group assignments
# Row 5: Batch numbers
# Row 6: Injection sequence
# Row 7: Normalization values
# Row 8: Response variable (for regression)
# Row 9+: Metabolite data
# Load and perform initial quality check
data_qc <- perform_DataQualityCheck("path/to/your/metabolomics_data.csv")
# This function will:
# - Check data format and structure
# - Identify potential issues
# - Generate summary statistics
# View the structure of imported data
str(data_qc)The perform_DataQualityCheck() function returns a list
containing:
# Example of what data_qc contains:
names(data_qc)
# [1] "raw_data" # Original data matrix
# [2] "metadata_summary" # Sample information (first 8 rows)
# [3] "processing_log" # Metabolite abundance matrix
# [4] "quality_checked_data # The quality checked dataThe preprocessing pipeline in MetaboStatR follows a carefully designed sequence to ensure optimal data quality.
The 10-step preprocessing pipeline:
# Use default preprocessing parameters (recommended for most studies)
processed_data <- perform_PreprocessingPeakData(
raw_data = data_qc
)
# View preprocessing summary
print(processed_data$preprocessing_summary)For advanced users, parameters can be customized:
# Example with custom parameters
processed_data <- perform_PreprocessingPeakData(
raw_data = data_qc,
# Step 1: Outlier removal
outliers = c("Sample1, Sample2, Sample3"),
# Step 2: Feature filtering
filterMissing = 0.2, # Remove features missing in >20% samples in all group (since "filterMissing_by_group" = TRUE)
filterMissing_by_group = TRUE,
# Step 3: Missing value imputation
denMissing = 5, # 1/5 of minimum value
# Step 4: Drift/batch correction
driftBatchCorrection = TRUE,
# Step 6: Normalization
dataNormalize = "Normalization", # Use values from row 7
# Step 7: Transformation
dataTransform = "log10", # or "vsn", "sqrt", "log2", "sqrt", "cbrt"
# Step 8: Scaling
dataScalePCA = "meanSD", # mean centering + unit variance (for PCA and other analyses)
dataScalePLS = "mean2SD", # pareto scaling (exclusive for PLS-related analyses)
# Step 9: RSD filtering
filterMaxRSD = 30, # Remove features with RSD > 30%
filterMaxRSD_by = "EQC", # Perform it specifically on "EQC" QC samples
# Step 10: Variance filtering
filterMaxVarSD = 10 # Remove low-variance features
)This should output a plot of six (6) random features of the before- and after the data preprocessing steps. This plot is specific to drift- and batch correction.
# Compare before and after preprocessing
plot_BeforeAfter(
data = processed_data,
scaled = "PCA",
group_by = "Sample", # Plot the samples or features/metabolites using "Features"
n_random_samples = 30L, # Plot 30 random sample
n_random_features = 30L, # Plot 30 random features/metabolites
seed = NULL # Keep it random
)PCA provides an overview of data structure, sample clustering, and potential batch effects.
# Perform PCA with multiple component comparisons
pca_results <- perform_PCA(
processed_data,
# Compare different PC combinations
pc_combinations = list(
c(1, 2), # PC1 vs PC2
c(1, 3), # PC1 vs PC3
c(2, 3) # PC2 vs PC3
)
)
# Display all PCA plots together
display_AllPlots(pca_results)Orthogonal Partial Least Squares Discriminant Analysis identifies metabolites that discriminate between groups.
# Perform OPLS-DA with group arrangement (severe → control)
plsda_results <- perform_PLS(
data = processed_data,
method = "oplsda",
arrangeLevels = c("Severe Case", "Less Severe Case", "Control")
)
# View OPLS-DA summary
print(plsda_results$model_summary)Calculate effect sizes and magnitude of metabolite changes between groups.
# Calculate fold changes (severe → control arrangement)
fc_results <- perform_FoldChange(
data = processed_data,
arrangeLevels = c("Severe Case", "Less Severe Case", "Control")
)One of the results of this analysis are the fold changes and the log2 of them. It will just be a matter of grabbing them them using the dollar sign ($).
Perform comparative analysis with appropriate statistical tests and multiple testing correction. This step will automatically detect how many groups you have. For 2 groups, it will perform t-tests and/or Wilcoxon Sum Rank test; for 3 or more groups, it will perform Analysis of Variance (ANOVA) or Kruskal-Wallis test. The choice depends on the assumption tests.
# Perform statistical analysis with Benjamini-Hochberg correction
stats_results <- perform_ComparativeAnalysis(
data = processed_data,
adjust_p_method = "BH" # False Discovery Rate control
)
# View statistical summary
print("Statistical Analysis Summary:")
print(stats_results$summary_statistics)Assess the discriminatory power of individual metabolites using Area Under ROC Curve.
# Calculate AUROC for all metabolites (control → severe arrangement)
auroc_results <- perform_AUROC(
data_PP = processed_data, # Preprocessed data
data_DR = plsda_results, # OPLS-DA results
data_FC = fc_results, # Fold change results
data_CA = stats_results, # Statistical results
arrangeLevels = c("Control", "Less Severe Case", "Severe Case")
)This function also outputs the top 5 (or as set by you) features/metabolites with the highest AUC values.
Build predictive models using regularized regression methods. This package primarily provides 2 regularized regression methods, elastic net regression and Least Absolute Shrinkage and Selection Operator (LASSO).
# Perform Elastic Net regression
regression_results <- perform_Regression(
data = processed_data,
method = "enet" # Elastic Net (combines Ridge and LASSO)
)
# Perform LASSO regression for feature selection
lasso_results <- perform_Regression(
data = processed_data,
method = "lasso"
)Create publication-ready plots to visualize your results.
# Create volcano plot combining fold changes and statistical significance
plot_Volcano(
PPData = processed_data,
FCData = fc_results,
CAData = stats_results,
arrangeLevels = c("Severe Case", "Less Severe Case", "Control"),
fcUP = 2, # Upper fold change threshold
fcDown = 0.5, # Lower fold change threshold
adjpvalue = 0.05 # Adjusted p-value threshold
)Export all data frames and visualizations for further analysis or publication.
# Export all analysis results to Excel workbooks
perform_Export2Excel(
results = list(
processed_data,
pca_results,
plsda_results,
fc_results,
stats_results,
auroc_results,
regression_results
),
folder_name = "MetaboStatR_Results",
file_name = c(
"01_ProcessedData",
"02_PCA_Analysis",
"03_OPLSDA_Analysis",
"04_FoldChange_Analysis",
"05_Statistical_Analysis",
"06_AUROC_Analysis",
"07_Regression_Analysis"
)
)
# Export all plots to PNG files
perform_ExportPlots2Image(
results = list(
processed_data,
pca_results,
plsda_results,
fc_results,
stats_results,
auroc_results,
regression_results
),
folder_name = "MetaboStatR_Plots",
file_prefix = c(
"ProcessedData",
"PCA",
"OPLSDA",
"FoldChange",
"Statistics",
"AUROC",
"Regression"
),
image_format = "png",
width = 15,
height = 12,
dpi = 300
)If OPLS-DA Q2 < 0.4 or regression accuracy < 0.7:
This function is not yet available. To be added.
# Perform pathway enrichment analysis
pathway_results <- perform_PathwayAnalysis(
significant_features = significant_features,
fold_changes = fc_results$log2_fold_changes,
database = "KEGG" # or "BioCyc", "Reactome"
)
sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=English_Philippines.utf8 LC_CTYPE=English_Philippines.utf8
#> [3] LC_MONETARY=English_Philippines.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_Philippines.utf8
#>
#> time zone: Asia/Manila
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] knitr_1.50 ggplot2_3.5.2 dplyr_1.1.4
#> [4] MetaboStatR_0.0.0.9000
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 rstudioapi_0.17.1
#> [3] jsonlite_2.0.0 shape_1.4.6.1
#> [5] MultiAssayExperiment_1.32.0 magrittr_2.0.3
#> [7] farver_2.1.2 rmarkdown_2.29
#> [9] fs_1.6.6 zlibbioc_1.52.0
#> [11] ragg_1.4.0 vctrs_0.6.5
#> [13] MultiDataSet_1.34.0 base64enc_0.1-3
#> [15] htmltools_0.5.8.1 S4Arrays_1.6.0
#> [17] cellranger_1.1.0 SparseArray_1.6.0
#> [19] Formula_1.2-5 pROC_1.18.5
#> [21] caret_7.0-1 sass_0.4.10
#> [23] parallelly_1.45.0 bslib_0.9.0
#> [25] htmlwidgets_1.6.4 desc_1.4.3
#> [27] plyr_1.8.9 plotly_4.11.0
#> [29] lubridate_1.9.4 cachem_1.1.0
#> [31] qqman_0.1.9 igraph_2.1.4
#> [33] lifecycle_1.0.4 iterators_1.0.14
#> [35] ropls_1.40.0 pkgconfig_2.0.3
#> [37] webshot2_0.1.2 Matrix_1.7-0
#> [39] R6_2.6.1 fastmap_1.2.0
#> [41] GenomeInfoDbData_1.2.13 MatrixGenerics_1.18.1
#> [43] future_1.67.0 digest_0.6.37
#> [45] colorspace_2.1-1 rARPACK_0.11-0
#> [47] ps_1.9.1 patchwork_1.3.1
#> [49] S4Vectors_0.44.0 RSpectra_0.16-2
#> [51] textshaping_1.0.1 Hmisc_5.2-3
#> [53] ellipse_0.5.0 GenomicRanges_1.58.0
#> [55] timechange_0.3.0 httr_1.4.7
#> [57] abind_1.4-8 compiler_4.4.1
#> [59] withr_3.0.2 htmlTable_2.4.3
#> [61] backports_1.5.0 BiocParallel_1.40.2
#> [63] MASS_7.3-60.2 lava_1.8.1
#> [65] DelayedArray_0.32.0 corpcor_1.6.10
#> [67] chromote_0.5.1 ModelMetrics_1.2.2.2
#> [69] tools_4.4.1 foreign_0.8-86
#> [71] future.apply_1.20.0 nnet_7.3-19
#> [73] glue_1.8.0 promises_1.3.3
#> [75] nlme_3.1-164 grid_4.4.1
#> [77] checkmate_2.3.2 cluster_2.1.6
#> [79] reshape2_1.4.4 generics_0.1.4
#> [81] recipes_1.3.1 gtable_0.3.6
#> [83] class_7.3-22 websocket_1.4.4
#> [85] tidyr_1.3.1 data.table_1.17.8
#> [87] XVector_0.46.0 BiocGenerics_0.52.0
#> [89] ggrepel_0.9.6 foreach_1.5.2
#> [91] pillar_1.11.0 stringr_1.5.1
#> [93] limma_3.62.2 later_1.4.2
#> [95] splines_4.4.1 lattice_0.22-6
#> [97] survival_3.6-4 tidyselect_1.2.1
#> [99] mixOmics_6.30.0 gridExtra_2.3
#> [101] IRanges_2.40.0 SummarizedExperiment_1.36.0
#> [103] stats4_4.4.1 xfun_0.52
#> [105] Biobase_2.66.0 statmod_1.5.0
#> [107] hardhat_1.4.1 timeDate_4041.110
#> [109] matrixStats_1.5.0 pheatmap_1.0.13
#> [111] stringi_1.8.7 UCSC.utils_1.2.0
#> [113] lazyeval_0.2.2 yaml_2.3.10
#> [115] evaluate_1.0.4 codetools_0.2-20
#> [117] tibble_3.3.0 cli_3.6.5
#> [119] rpart_4.1.23 systemfonts_1.2.3
#> [121] processx_3.8.6 jquerylib_0.1.4
#> [123] Rcpp_1.1.0 GenomeInfoDb_1.42.3
#> [125] readxl_1.4.5 globals_0.18.0
#> [127] parallel_4.4.1 pkgdown_2.1.3
#> [129] gower_1.0.2 calibrate_1.7.7
#> [131] listenv_0.9.1 glmnet_4.1-10
#> [133] viridisLite_0.4.2 ipred_0.9-15
#> [135] scales_1.4.0 prodlim_2025.04.28
#> [137] purrr_1.1.0 crayon_1.5.3
#> [139] rlang_1.1.6This completes the comprehensive MetaboStatR workflow. The analysis pipeline provides a robust foundation for metabolomics biomarker discovery and should be adapted based on your specific research questions and study design.
For additional support or advanced applications, please refer to the other vignettes or contact the package maintainer.