This function imports and validates metabolomics data from Excel, CSV, TSV, or plain-text files, performing comprehensive quality checks to ensure data integrity and compatibility with downstream analysis functions. It validates the data structure, checks for required metadata rows, enforces uniqueness constraints, cleans special characters from identifiers, and prepares the data for preprocessing pipelines.

Key validation checks include:

  • Presence of required metadata rows (Sample, SubjectID, Replicate, Group, Batch, Injection, Normalization, Response)

  • Uniqueness of sample names, injection sequences, and feature/metabolite identifiers

  • Proper QC sample annotation (empty values in SubjectID, Replicate, Normalization, Response for QC samples)

  • Numeric validation of feature/metabolite data

  • Character cleaning and standardization of identifiers
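
For illustration only, two of these checks could be sketched in base R along the following lines. This is not the internal implementation of perform_DataQualityCheck(); it assumes a hypothetical data frame `meta` whose rows are the metadata rows described under Details and whose columns are samples.

# Illustrative sketch, not the package's internal code.
# Assumes `meta` has row names "Sample", "SubjectID", ..., "Response"
# and one column per sample.

# Sample names and injection sequence numbers must be unique
stopifnot(
  anyDuplicated(unlist(meta["Sample", ])) == 0,
  anyDuplicated(as.numeric(unlist(meta["Injection", ]))) == 0
)

# QC samples (identified here by "QC" in the Group row) must have empty
# SubjectID, Replicate, Normalization, and Response values
qc_cols <- grepl("QC", unlist(meta["Group", ]), ignore.case = TRUE)
optional_rows <- c("SubjectID", "Replicate", "Normalization", "Response")
qc_block <- as.matrix(meta[optional_rows, qc_cols, drop = FALSE])
stopifnot(all(is.na(qc_block) | qc_block == ""))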

perform_DataQualityCheck(
  file_location = NULL,
  sheet_name = NULL,
  skip_rows = 0,
  separator = ",",
  validate_qc = TRUE,
  allow_missing_optional = TRUE,
  clean_names = TRUE,
  verbose = TRUE
)

Arguments

file_location

Character string specifying the file path. If NULL, an interactive file selection dialog will open. Supported formats: .xlsx, .csv, .tsv, .txt

sheet_name

Character string specifying the Excel worksheet name. Ignored for non-Excel files. If NULL for Excel files, the first sheet is used.

skip_rows

Integer specifying the number of rows to skip when reading the file. Default is 0.

separator

Character string specifying the field separator for delimited files. Common values: "," (comma), "\t" (tab). Default is ",". Ignored for Excel files.

validate_qc

Logical indicating whether to enforce QC validation rules. Default is TRUE.

allow_missing_optional

Logical indicating whether to allow missing values in optional metadata rows (SubjectID, Replicate, Normalization, Response). Default is TRUE.

clean_names

Logical indicating whether to clean special characters from names. Default is TRUE.

verbose

Logical indicating whether to display progress messages. Default is TRUE.

Value

A list containing:

raw_data

Data frame with the original data as loaded from the file

quality_checked_data

Data frame with validated and cleaned data, sorted by injection sequence

metadata_summary

Summary statistics of the metadata

validation_report

Detailed validation results

file_info

Information about the source file

processing_log

Log of all processing steps performed
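
After a successful run, the top-level components of this list can be inspected with base R. The component names below follow the list above; the internal layout of each element (for example, the columns of validation_report) is determined by the package.

result <- perform_DataQualityCheck(file_location = "path/to/metabolomics_data.xlsx")
names(result)                # component names, as documented above
str(result, max.level = 1)   # compact overview of each component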

Details

The input data must follow a specific structure:

  • Row 1: "Sample" - Unique sample identifiers (no spaces recommended)

  • Row 2: "SubjectID" - Numeric subject identifiers (can be non-unique)

  • Row 3: "Replicate" - Replicate identifiers (can be non-unique)

  • Row 4: "Group" - Group assignments including QC samples

  • Row 5: "Batch" - Batch numbers

  • Row 6: "Injection" - Unique injection sequence numbers

  • Row 7: "Normalization" - Concentration markers (e.g., osmolality)

  • Row 8: "Response" - Response variable values

  • Rows 9+: Feature/metabolite data (e.g., m/z@retention_time format)

Missing values should be left blank or encoded as 0. QC samples in the Group row must have empty values in SubjectID, Replicate, Normalization, and Response rows.
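
As a concrete toy illustration of this layout, the sketch below writes a two-sample CSV with the eight metadata rows in the documented order followed by two features, leaving the optional rows blank for the QC sample. All values and identifiers are fabricated for illustration only, and the exact labelling conventions the function expects should be checked against a template supplied with the package.

# Toy input file illustrating the required row order (fabricated values).
# Column names below are placeholders and are not written to the file.
toy <- data.frame(
  row_label = c("Sample", "SubjectID", "Replicate", "Group", "Batch",
                "Injection", "Normalization", "Response",
                "105.02@1.25", "181.07@2.10"),
  sample_1  = c("S01", "1", "1", "Control", "1", "1", "300", "0.8",
                "15300", "9020"),
  sample_2  = c("QC01", "", "", "QC", "1", "2", "", "",
                "14800", "8770")
)
write.table(toy, "toy_metabolomics_data.csv", sep = ",",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

result <- perform_DataQualityCheck(
  file_location = "toy_metabolomics_data.csv"
)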

See also

perform_PreprocessingPeakData for the next step in the analysis pipeline

Author

John Lennon L. Calorio

Examples

if (FALSE) { # \dontrun{
# Basic usage with file selection dialog
result <- perform_DataQualityCheck()

# Specify file location directly
result <- perform_DataQualityCheck(
  file_location = "path/to/metabolomics_data.xlsx",
  sheet_name = "Sheet1"
)

# CSV file with custom separator
result <- perform_DataQualityCheck(
  file_location = "path/to/data.csv",
  separator = ";",
  skip_rows = 1
)

# Access results
clean_data <- result$quality_checked_data
validation_summary <- result$validation_report
} # }