This function imports and validates metabolomics data from Excel, CSV, TSV, or plain-text files, performing comprehensive quality checks to ensure data integrity and compatibility with downstream analysis functions. It validates the data structure, checks for required metadata rows, enforces uniqueness constraints, cleans special characters from identifiers, and prepares the data for preprocessing pipelines.

Key validation checks include:

  • Presence of required metadata rows (Sample, SubjectID, Replicate, Group, Batch, Injection, Normalization, Response)

  • Uniqueness of sample names, injection sequences, and feature/metabolite identifiers

  • Proper QC sample annotation (empty values in SubjectID, Replicate, Normalization, Response for QC samples)

  • Numeric validation of feature/metabolite data

  • Character cleaning and standardization of identifiers
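
For illustration only, two of these checks could be sketched in base R along the following lines. This is not the internal implementation of perform_DataQualityCheck(); it assumes a hypothetical data frame `meta` whose rows are the metadata rows described under Details and whose columns are samples.

# Illustrative sketch, not the package's internal code.
# Assumes `meta` has row names "Sample", "SubjectID", ..., "Response"
# and one column per sample.

# Sample names and injection sequence numbers must be unique
stopifnot(
  anyDuplicated(unlist(meta["Sample", ])) == 0,
  anyDuplicated(as.numeric(unlist(meta["Injection", ]))) == 0
)

# QC samples (identified here by "QC" in the Group row) must have empty
# SubjectID, Replicate, Normalization, and Response values
qc_cols <- grepl("QC", unlist(meta["Group", ]), ignore.case = TRUE)
optional_rows <- c("SubjectID", "Replicate", "Normalization", "Response")
qc_block <- as.matrix(meta[optional_rows, qc_cols, drop = FALSE])
stopifnot(all(is.na(qc_block) | qc_block == ""))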

perform_DataQualityCheck(
  file_location = NULL,
  sheet_name = NULL,
  skip_rows = 0,
  separator = ",",
  validate_qc = TRUE,
  allow_missing_optional = TRUE,
  clean_names = TRUE,
  verbose = TRUE
)

Arguments

file_location

Character string specifying the file path. If NULL, an interactive file selection dialog will open. Supported formats: .xlsx, .csv, .tsv, .txt

sheet_name

Character string specifying the Excel worksheet name. Ignored for non-Excel files. If NULL for Excel files, the first sheet is used.

skip_rows

Integer specifying the number of rows to skip when reading the file. Default is 0.

separator

Character string specifying the field separator for delimited files. Common values: "," (comma), "\t" (tab). Default is ",". Ignored for Excel files.

validate_qc

Logical indicating whether to enforce QC validation rules. Default is TRUE.

allow_missing_optional

Logical indicating whether to allow missing values in optional metadata rows (SubjectID, Replicate, Normalization, Response). Default is TRUE.

clean_names

Logical indicating whether to clean special characters from names. Default is TRUE.

verbose

Logical indicating whether to display progress messages. Default is TRUE.

Value

A list containing:

raw_data

Data frame with the original data as loaded from the file

quality_checked_data

Data frame with validated and cleaned data, sorted by injection sequence

metadata_summary

Summary statistics of the metadata

validation_report

Detailed validation results

file_info

Information about the source file

processing_log

Log of all processing steps performed
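
After a successful run, the top-level components of this list can be inspected with base R. The component names below follow the list above; the internal layout of each element (for example, the columns of validation_report) is determined by the package.

result <- perform_DataQualityCheck(file_location = "path/to/metabolomics_data.xlsx")
names(result)                # component names, as documented above
str(result, max.level = 1)   # compact overview of each component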

Details

The input data must follow a specific structure:

  • Row 1: "Sample" - Unique sample identifiers (no spaces recommended)

  • Row 2: "SubjectID" - Numeric subject identifiers (can be non-unique)

  • Row 3: "Replicate" - Replicate identifiers (can be non-unique)

  • Row 4: "Group" - Group assignments including QC samples

  • Row 5: "Batch" - Batch numbers

  • Row 6: "Injection" - Unique injection sequence numbers

  • Row 7: "Normalization" - Concentration markers (e.g., osmolality)

  • Row 8: "Response" - Response variable values

  • Rows 9+: Feature/metabolite data (e.g., m/z@retention_time format)

Missing values should be left blank or encoded as 0. QC samples in the Group row must have empty values in SubjectID, Replicate, Normalization, and Response rows.
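
As a concrete toy illustration of this layout, the sketch below writes a two-sample CSV with the eight metadata rows in the documented order followed by two features, leaving the optional rows blank for the QC sample. All values and identifiers are fabricated for illustration only, and the exact labelling conventions the function expects should be checked against a template supplied with the package.

# Toy input file illustrating the required row order (fabricated values).
# Column names below are placeholders and are not written to the file.
toy <- data.frame(
  row_label = c("Sample", "SubjectID", "Replicate", "Group", "Batch",
                "Injection", "Normalization", "Response",
                "105.02@1.25", "181.07@2.10"),
  sample_1  = c("S01", "1", "1", "Control", "1", "1", "300", "0.8",
                "15300", "9020"),
  sample_2  = c("QC01", "", "", "QC", "1", "2", "", "",
                "14800", "8770")
)
write.table(toy, "toy_metabolomics_data.csv", sep = ",",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

result <- perform_DataQualityCheck(
  file_location = "toy_metabolomics_data.csv"
)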

See also

perform_PreprocessingPeakData for the next step in the analysis pipeline

Author

John Lennon L. Calorio

Examples

if (FALSE) { # \dontrun{
# Basic usage with file selection dialog
result <- perform_DataQualityCheck()

# Specify file location directly
result <- perform_DataQualityCheck(
  file_location = "path/to/metabolomics_data.xlsx",
  sheet_name = "Sheet1"
)

# CSV file with custom separator
result <- perform_DataQualityCheck(
  file_location = "path/to/data.csv",
  separator = ";",
  skip_rows = 1
)

# Access results
clean_data <- result$quality_checked_data
validation_summary <- result$validation_report
} # }