Complete Metabolomics Analysis Workflow

This vignette demonstrates a complete metabolomics analysis workflow using MetaboStatR, from raw data import to final results export. We’ll work through a realistic example that covers all major steps in metabolomics data analysis.

Study Design Example

For this tutorial, we’ll analyze a hypothetical clinical study examining metabolic differences between:

  • Control group: Healthy individuals (n=30)
  • Less Severe Case: Patients with mild condition (n=25)
  • Severe Case: Patients with severe condition (n=20)

The study includes:

  • 3 analytical batches
  • Technical replicates for quality control
  • 150 metabolites measured using LC-MS
  • Normalization weights (e.g., sample volume, protein concentration)

Step 1: Data Quality Check and Import

The first step is to examine your raw data and ensure it meets MetaboStatR’s format requirements.

Understanding the Data Structure

MetaboStatR expects data in a specific format with 8 metadata rows followed by metabolite data:

# Your CSV file should look like this:
# Row 1: Sample names (must be unique)
# Row 2: Subject ID (for paired/longitudinal studies)  
# Row 3: Technical replicate IDs
# Row 4: Group assignments
# Row 5: Batch numbers
# Row 6: Injection sequence
# Row 7: Normalization values
# Row 8: Response variable (for regression)
# Row 9+: Metabolite data

Load and Inspect Raw Data

# Load and perform initial quality check
data_qc <- perform_DataQualityCheck("path/to/your/metabolomics_data.csv")

# This function will:
# - Check data format and structure
# - Identify potential issues
# - Generate summary statistics

# View the structure of imported data
str(data_qc)

Expected Output Structure

The perform_DataQualityCheck() function returns a list containing:

# Example of what data_qc contains:
names(data_qc)
# [1] "raw_data"            # Original data matrix
# [2] "metadata_summary"    # Sample information (first 8 rows)
# [3] "processing_log"      # Metabolite abundance matrix
# [4] "quality_checked_data # The quality checked data

Step 2: Comprehensive Data Preprocessing

The preprocessing pipeline in MetaboStatR follows a carefully designed sequence to ensure optimal data quality.

Understanding the Preprocessing Steps

The 10-step preprocessing pipeline:

  1. Outlier removal - Remove obvious outlier samples
  2. Feature filtering - Remove metabolites with excessive missing values
  3. Missing value imputation - Fill remaining missing values
  4. Drift/batch correction - Correct for analytical drift and batch effects
  5. Uncorrected feature removal - Remove features that couldn’t be corrected
  6. Data normalization - Account for technical variations
  7. Data transformation - Stabilize variance (log10, vsn)
  8. Data scaling - Standardize feature ranges
  9. RSD filtering - Remove features with high technical variability
  10. Low variance filtering - Remove uninformative features

Run Preprocessing with Default Parameters

# Use default preprocessing parameters (recommended for most studies)
processed_data <- perform_PreprocessingPeakData(
  raw_data = data_qc
)

# View preprocessing summary
print(processed_data$preprocessing_summary)

Customizing Preprocessing Parameters

For advanced users, parameters can be customized:

# Example with custom parameters
processed_data <- perform_PreprocessingPeakData(
  raw_data = data_qc,
  
  # Step 1: Outlier removal
  outliers = c("Sample1, Sample2, Sample3"),
  
  # Step 2: Feature filtering  
  filterMissing = 0.2,  # Remove features missing in >20% samples in all group (since "filterMissing_by_group" = TRUE)
  filterMissing_by_group = TRUE,
  
  # Step 3: Missing value imputation
  denMissing = 5,  # 1/5 of minimum value
  
  # Step 4: Drift/batch correction
  driftBatchCorrection = TRUE,
  
  # Step 6: Normalization
  dataNormalize = "Normalization",  # Use values from row 7
  
  # Step 7: Transformation
  dataTransform = "log10",  # or "vsn", "sqrt", "log2", "sqrt", "cbrt"
  
  # Step 8: Scaling
  dataScalePCA = "meanSD", # mean centering + unit variance (for PCA and other analyses)
  dataScalePLS = "mean2SD",  # pareto scaling (exclusive for PLS-related analyses)
  
  # Step 9: RSD filtering
  filterMaxRSD = 30,  # Remove features with RSD > 30%
  filterMaxRSD_by = "EQC", # Perform it specifically on "EQC" QC samples
  
  # Step 10: Variance filtering
  filterMaxVarSD = 10  # Remove low-variance features
)

This should output a plot of six (6) random features of the before- and after the data preprocessing steps. This plot is specific to drift- and batch correction.

Preprocessing Quality Assessment

# Compare before and after preprocessing
plot_BeforeAfter(
  data = processed_data,
  scaled = "PCA",
  group_by = "Sample", # Plot the samples or features/metabolites using "Features"
  n_random_samples = 30L, # Plot 30 random sample
  n_random_features = 30L, # Plot 30 random features/metabolites
  seed = NULL # Keep it random
)

Step 3: Principal Component Analysis (PCA)

PCA provides an overview of data structure, sample clustering, and potential batch effects.

Perform PCA Analysis

# Perform PCA with multiple component comparisons
pca_results <- perform_PCA(
  processed_data,
  # Compare different PC combinations
  pc_combinations = list(
    c(1, 2),  # PC1 vs PC2
    c(1, 3),  # PC1 vs PC3  
    c(2, 3)   # PC2 vs PC3
  )
)

# Display all PCA plots together
display_AllPlots(pca_results)

Step 4: OPLS-DA Analysis

Orthogonal Partial Least Squares Discriminant Analysis identifies metabolites that discriminate between groups.

Perform OPLS-DA

# Perform OPLS-DA with group arrangement (severe → control)
plsda_results <- perform_PLS(
  data = processed_data,
  method = "oplsda",
  arrangeLevels = c("Severe Case", "Less Severe Case", "Control")
)

# View OPLS-DA summary
print(plsda_results$model_summary)

Model Validation and Performance

You can check the OPSL-DA model performances via the output in the console or the output in the “Plots” or “Viewer” pane. Performance metrics include

  • R2X
  • R2Y
  • Q2

are all the things you need to consider.

VIP Scores and Feature Importance

VIP plots are automatically outputted in the “Plots” pane. Features/Metabolites with VIP score > 1 are generally considered important.

Step 5: Fold Change Analysis

Calculate effect sizes and magnitude of metabolite changes between groups.

Calculate Fold Changes

# Calculate fold changes (severe → control arrangement)
fc_results <- perform_FoldChange(
  data = processed_data,
  arrangeLevels = c("Severe Case", "Less Severe Case", "Control")
)

One of the results of this analysis are the fold changes and the log2 of them. It will just be a matter of grabbing them them using the dollar sign ($).

Step 6: Statistical Analysis

Perform comparative analysis with appropriate statistical tests and multiple testing correction. This step will automatically detect how many groups you have. For 2 groups, it will perform t-tests and/or Wilcoxon Sum Rank test; for 3 or more groups, it will perform Analysis of Variance (ANOVA) or Kruskal-Wallis test. The choice depends on the assumption tests.

Comprehensive Statistical Testing

# Perform statistical analysis with Benjamini-Hochberg correction
stats_results <- perform_ComparativeAnalysis(
  data = processed_data,
  adjust_p_method = "BH"  # False Discovery Rate control
)

# View statistical summary
print("Statistical Analysis Summary:")
print(stats_results$summary_statistics)

Step 7: AUROC Analysis

Assess the discriminatory power of individual metabolites using Area Under ROC Curve.

Perform AUROC Analysis

# Calculate AUROC for all metabolites (control → severe arrangement)
auroc_results <- perform_AUROC(
  data_PP = processed_data,     # Preprocessed data
  data_DR = plsda_results,      # OPLS-DA results
  data_FC = fc_results,         # Fold change results
  data_CA = stats_results,      # Statistical results
  arrangeLevels = c("Control", "Less Severe Case", "Severe Case")
)

This function also outputs the top 5 (or as set by you) features/metabolites with the highest AUC values.

Step 8: Regression Analysis

Build predictive models using regularized regression methods. This package primarily provides 2 regularized regression methods, elastic net regression and Least Absolute Shrinkage and Selection Operator (LASSO).

Elastic Net Regression

# Perform Elastic Net regression
regression_results <- perform_Regression(
  data = processed_data,
  method = "enet"  # Elastic Net (combines Ridge and LASSO)
)

LASSO Regression

# Perform LASSO regression for feature selection
lasso_results <- perform_Regression(
  data = processed_data,
  method = "lasso"
)

Step 9: Comprehensive Visualization

Create publication-ready plots to visualize your results.

Volcano Plot

# Create volcano plot combining fold changes and statistical significance
plot_Volcano(
  PPData = processed_data,
  FCData = fc_results,
  CAData = stats_results,
  arrangeLevels = c("Severe Case", "Less Severe Case", "Control"),
  fcUP = 2,           # Upper fold change threshold
  fcDown = 0.5,       # Lower fold change threshold  
  adjpvalue = 0.05    # Adjusted p-value threshold
)

Step 10: Export Results

Export all data frames and visualizations for further analysis or publication.

Export Data Tables to Excel

# Export all analysis results to Excel workbooks
perform_Export2Excel(
  results = list(
    processed_data,
    pca_results,
    plsda_results,
    fc_results,
    stats_results,
    auroc_results,
    regression_results
  ),
  folder_name = "MetaboStatR_Results",
  file_name = c(
    "01_ProcessedData",
    "02_PCA_Analysis", 
    "03_OPLSDA_Analysis",
    "04_FoldChange_Analysis",
    "05_Statistical_Analysis",
    "06_AUROC_Analysis",
    "07_Regression_Analysis"
  )
)

Export Plots to High-Resolution Images

# Export all plots to PNG files
perform_ExportPlots2Image(
  results = list(
    processed_data,
    pca_results,
    plsda_results,
    fc_results,
    stats_results,
    auroc_results,
    regression_results
  ),
  folder_name = "MetaboStatR_Plots",
  file_prefix = c(
    "ProcessedData",
    "PCA",
    "OPLSDA", 
    "FoldChange",
    "Statistics",
    "AUROC",
    "Regression"
  ),
  image_format = "png",
  width = 15,
  height = 12,
  dpi = 300
)

Results Interpretation Guide

Key Findings to Report

  1. Data Quality:
    • Original vs. final dataset dimensions
    • Preprocessing steps applied and their impact
    • Batch effect correction effectiveness
  2. Exploratory Analysis:
    • PCA variance explained by each component
    • Group separation in PCA space
    • Identification of outlier samples
  3. Discriminant Analysis:
    • OPLS-DA model performance (R2X, R2Y, Q2)
    • Cross-validation results
    • Top discriminating metabolites (VIP > 1.0)
  4. Biomarker Discovery:
    • Number of significantly different metabolites
    • Effect sizes (fold changes)
    • Biomarker performance (AUROC values)
  5. Predictive Modeling:
    • Regression model performance
    • Selected metabolite panels
    • Model generalizability

Troubleshooting Common Issues

Low Model Performance

If OPLS-DA Q2 < 0.4 or regression accuracy < 0.7:

  • Check for batch effects in PCA plots
  • Consider additional preprocessing steps
  • Verify group assignments are correct - Check for outlier samples

Few Significant Features

If < 10% of features are significant:

  • Review statistical power and sample sizes
  • Consider less stringent significance thresholds
  • Check data transformation effectiveness
  • Validate with orthogonal analytical methods

Poor Group Separation

If groups don’t separate well in PCA:

  • Ensure biological differences exist between groups
  • Check for confounding variables (age, sex, BMI)
  • Consider stratified analysis
  • Validate sample collection and storage protocols

Advanced Analysis Options

Pathway Analysis

This function is not yet available. To be added.

# Perform pathway enrichment analysis
pathway_results <- perform_PathwayAnalysis(
  significant_features = significant_features,
  fold_changes = fc_results$log2_fold_changes,
  database = "KEGG"  # or "BioCyc", "Reactome"
)

Time-Series Analysis

This function is not yet available. To be added.

# For longitudinal studies with multiple time points
timeseries_results <- perform_TimeSeriesAnalysis(
  data = processed_data,
  time_column = "TimePoint",
  subject_column = "SubjectID"
)

Network Analysis

This function is not yet available. To be added.

# Build metabolite correlation networks
network_results <- perform_NetworkAnalysis(
  data = processed_data,
  significant_features = significant_features,
  correlation_method = "spearman"
)

Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_Philippines.utf8  LC_CTYPE=English_Philippines.utf8   
#> [3] LC_MONETARY=English_Philippines.utf8 LC_NUMERIC=C                        
#> [5] LC_TIME=English_Philippines.utf8    
#> 
#> time zone: Asia/Manila
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] knitr_1.50             ggplot2_3.5.2          dplyr_1.1.4           
#> [4] MetaboStatR_0.0.0.9000
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1          
#>   [3] jsonlite_2.0.0              shape_1.4.6.1              
#>   [5] MultiAssayExperiment_1.32.0 magrittr_2.0.3             
#>   [7] farver_2.1.2                rmarkdown_2.29             
#>   [9] fs_1.6.6                    zlibbioc_1.52.0            
#>  [11] ragg_1.4.0                  vctrs_0.6.5                
#>  [13] MultiDataSet_1.34.0         base64enc_0.1-3            
#>  [15] htmltools_0.5.8.1           S4Arrays_1.6.0             
#>  [17] cellranger_1.1.0            SparseArray_1.6.0          
#>  [19] Formula_1.2-5               pROC_1.18.5                
#>  [21] caret_7.0-1                 sass_0.4.10                
#>  [23] parallelly_1.45.0           bslib_0.9.0                
#>  [25] htmlwidgets_1.6.4           desc_1.4.3                 
#>  [27] plyr_1.8.9                  plotly_4.11.0              
#>  [29] lubridate_1.9.4             cachem_1.1.0               
#>  [31] qqman_0.1.9                 igraph_2.1.4               
#>  [33] lifecycle_1.0.4             iterators_1.0.14           
#>  [35] ropls_1.40.0                pkgconfig_2.0.3            
#>  [37] webshot2_0.1.2              Matrix_1.7-0               
#>  [39] R6_2.6.1                    fastmap_1.2.0              
#>  [41] GenomeInfoDbData_1.2.13     MatrixGenerics_1.18.1      
#>  [43] future_1.67.0               digest_0.6.37              
#>  [45] colorspace_2.1-1            rARPACK_0.11-0             
#>  [47] ps_1.9.1                    patchwork_1.3.1            
#>  [49] S4Vectors_0.44.0            RSpectra_0.16-2            
#>  [51] textshaping_1.0.1           Hmisc_5.2-3                
#>  [53] ellipse_0.5.0               GenomicRanges_1.58.0       
#>  [55] timechange_0.3.0            httr_1.4.7                 
#>  [57] abind_1.4-8                 compiler_4.4.1             
#>  [59] withr_3.0.2                 htmlTable_2.4.3            
#>  [61] backports_1.5.0             BiocParallel_1.40.2        
#>  [63] MASS_7.3-60.2               lava_1.8.1                 
#>  [65] DelayedArray_0.32.0         corpcor_1.6.10             
#>  [67] chromote_0.5.1              ModelMetrics_1.2.2.2       
#>  [69] tools_4.4.1                 foreign_0.8-86             
#>  [71] future.apply_1.20.0         nnet_7.3-19                
#>  [73] glue_1.8.0                  promises_1.3.3             
#>  [75] nlme_3.1-164                grid_4.4.1                 
#>  [77] checkmate_2.3.2             cluster_2.1.6              
#>  [79] reshape2_1.4.4              generics_0.1.4             
#>  [81] recipes_1.3.1               gtable_0.3.6               
#>  [83] class_7.3-22                websocket_1.4.4            
#>  [85] tidyr_1.3.1                 data.table_1.17.8          
#>  [87] XVector_0.46.0              BiocGenerics_0.52.0        
#>  [89] ggrepel_0.9.6               foreach_1.5.2              
#>  [91] pillar_1.11.0               stringr_1.5.1              
#>  [93] limma_3.62.2                later_1.4.2                
#>  [95] splines_4.4.1               lattice_0.22-6             
#>  [97] survival_3.6-4              tidyselect_1.2.1           
#>  [99] mixOmics_6.30.0             gridExtra_2.3              
#> [101] IRanges_2.40.0              SummarizedExperiment_1.36.0
#> [103] stats4_4.4.1                xfun_0.52                  
#> [105] Biobase_2.66.0              statmod_1.5.0              
#> [107] hardhat_1.4.1               timeDate_4041.110          
#> [109] matrixStats_1.5.0           pheatmap_1.0.13            
#> [111] stringi_1.8.7               UCSC.utils_1.2.0           
#> [113] lazyeval_0.2.2              yaml_2.3.10                
#> [115] evaluate_1.0.4              codetools_0.2-20           
#> [117] tibble_3.3.0                cli_3.6.5                  
#> [119] rpart_4.1.23                systemfonts_1.2.3          
#> [121] processx_3.8.6              jquerylib_0.1.4            
#> [123] Rcpp_1.1.0                  GenomeInfoDb_1.42.3        
#> [125] readxl_1.4.5                globals_0.18.0             
#> [127] parallel_4.4.1              pkgdown_2.1.3              
#> [129] gower_1.0.2                 calibrate_1.7.7            
#> [131] listenv_0.9.1               glmnet_4.1-10              
#> [133] viridisLite_0.4.2           ipred_0.9-15               
#> [135] scales_1.4.0                prodlim_2025.04.28         
#> [137] purrr_1.1.0                 crayon_1.5.3               
#> [139] rlang_1.1.6

This completes the comprehensive MetaboStatR workflow. The analysis pipeline provides a robust foundation for metabolomics biomarker discovery and should be adapted based on your specific research questions and study design.

For additional support or advanced applications, please refer to the other vignettes or contact the package maintainer.