Runs the complete preprocessing workflow that prepares raw peak data for downstream analysis. The preprocessing steps are applied sequentially, in the order specified by the parameters.

perform_PreprocessingPeakData(
  raw_data,
  outliers = NULL,
  filterMissing = 20,
  filterMissing_by_group = TRUE,
  filterMissing_includeQC = FALSE,
  denMissing = 5,
  driftBatchCorrection = TRUE,
  spline_smooth_param = 0,
  spline_smooth_param_limit = c(-1.5, 1.5),
  log_scale = TRUE,
  min_QC = 5,
  removeUncorrectedFeatures = TRUE,
  dataNormalize = "Normalization",
  refSample = NULL,
  groupSample = NULL,
  reference_method = "mean",
  dataTransform = "vsn",
  dataScalePCA = "meanSD",
  dataScalePLS = "mean2SD",
  filterMaxRSD = 30,
  filterMaxRSD_by = "EQC",
  filterMaxVarSD = 10,
  verbose = TRUE
)

Arguments

raw_data

List. Quality-checked data from the perform_DataQualityCheck function.

outliers

Vector. Biological samples and/or QC samples considered as outliers. Example format: c('Sample1', 'Sample2', 'QC1', 'QC2', ...). Defaults to NULL.

filterMissing

Numeric. Minimum percentage of missing values across all groups required to remove a feature.

filterMissing_by_group

Boolean. Determines whether filterMissing should assess group-specific missingness before feature removal.

filterMissing_includeQC

Boolean. If FALSE (default), QC samples are excluded when implementing filterMissing.
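
The three filterMissing arguments can be illustrated with a small base R sketch. It is not the package's implementation: the helper filter_missing(), the toy table, and the rule "drop a feature only when every group reaches the threshold" are assumptions made for illustration.

```r
# Toy peak table: rows are features, columns are samples (hypothetical layout).
peaks <- matrix(
  c(10, NA, 12, 11, 13, 50,
    NA, NA, NA, NA,  8, 40,
     5,  6,  7,  8,  9, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3"),
                  c("A1", "A2", "A3", "B1", "B2", "QC1"))
)
group <- c("A", "A", "A", "B", "B", "QC")

# Hypothetical helper mirroring filterMissing, filterMissing_by_group
# and filterMissing_includeQC.
filter_missing <- function(x, group, threshold = 20,
                           by_group = TRUE, include_qc = FALSE) {
  keep_cols <- if (include_qc) rep(TRUE, ncol(x)) else group != "QC"
  x_sub <- x[, keep_cols, drop = FALSE]
  g_sub <- group[keep_cols]
  if (by_group) {
    # Percent missing per feature within each group (features x groups).
    pct <- do.call(cbind, lapply(split(seq_along(g_sub), g_sub), function(idx) {
      rowMeans(is.na(x_sub[, idx, drop = FALSE])) * 100
    }))
    # Assumed rule: drop only if every group reaches the threshold.
    drop <- apply(pct >= threshold, 1, all)
  } else {
    # Pooled rule: percent missing over all retained samples.
    drop <- rowMeans(is.na(x_sub)) * 100 >= threshold
  }
  x[!drop, , drop = FALSE]
}

by_group_result <- filter_missing(peaks, group, threshold = 20)
pooled_result <- filter_missing(peaks, group, threshold = 10, by_group = FALSE)
```

In this toy table f1 is missing only in group A, so the group-wise rule keeps it while the pooled rule removes it.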

denMissing

Numeric. Denominator value used in the fraction 1/denMissing to replace missing values.
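
As a rough sketch of fraction-based replacement, the code below fills each missing value with 1/denMissing of the smallest observed value of the same feature. The choice of the per-feature minimum as the base value is an assumption, since this page does not state what the fraction is taken of; impute_fraction_of_min() is a hypothetical helper.

```r
# Hypothetical helper: replace NAs by (row minimum) / den_missing.
impute_fraction_of_min <- function(x, den_missing = 5) {
  for (i in seq_len(nrow(x))) {
    miss <- is.na(x[i, ])
    # Skip rows with nothing missing or nothing observed.
    if (any(miss) && !all(miss)) {
      x[i, miss] <- min(x[i, !miss]) / den_missing
    }
  }
  x
}

# Rows are features, columns are samples (made-up values).
x <- matrix(c(10, NA, 20,
               5, 15, NA), nrow = 2, byrow = TRUE)
imputed <- impute_fraction_of_min(x, den_missing = 5)
```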

driftBatchCorrection

Boolean. If TRUE (default), performs the QC-RSC algorithm to correct signal drift and batch effects.

spline_smooth_param

Numeric. Spline smoothing parameter ranging from 0 to 1.

spline_smooth_param_limit

Vector. A vector of format c(min, max) for spline parameter limits.

log_scale

Boolean. If TRUE (default), performs signal correction fit on log-scaled data.

min_QC

Numeric. Minimum number of QC samples required for signal correction per batch.

removeUncorrectedFeatures

Boolean. If TRUE (default), removes features that QC-RSC could not correct because too few QC samples met the min_QC threshold.
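
The idea behind QC-based spline correction can be sketched for one feature in one batch with stats::smooth.spline(). This is a simplified illustration, not the function's actual code: the data are simulated, the helper correct_drift() is hypothetical, and the smoothing level is picked by leave-one-out cross-validation rather than through spline_smooth_param.

```r
# Simulated single feature measured over 40 injections (all values are made up).
set.seed(1)
run_order <- 1:40
is_qc <- run_order %% 5 == 1                 # injections 1, 6, ..., 36 are QCs
intensity <- 1000 * (1 + 0.02 * run_order) * exp(rnorm(40, sd = 0.02))

# Hypothetical helper: fit a curve through the QCs and remove it from all samples.
correct_drift <- function(y, run_order, is_qc, log_scale = TRUE, min_qc = 5) {
  # Too few QCs to fit a curve: leave the feature uncorrected (all NA).
  if (sum(is_qc) < min_qc) return(rep(NA_real_, length(y)))
  z <- if (log_scale) log(y) else y
  # Cubic smoothing spline through the QCs; smoothing chosen by leave-one-out CV.
  fit <- smooth.spline(run_order[is_qc], z[is_qc], cv = TRUE)
  trend <- predict(fit, run_order)$y
  # Re-centre on the median QC level so the corrected data keep their scale.
  target <- median(z[is_qc])
  if (log_scale) exp(z - trend + target) else y / trend * target
}

rsd <- function(v) sd(v) / mean(v) * 100
corrected <- correct_drift(intensity, run_order, is_qc)
c(before = rsd(intensity[is_qc]), after = rsd(corrected[is_qc]))
```

A feature returned as all NA here corresponds to the case that removeUncorrectedFeatures is meant to handle.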

dataNormalize

String. Data normalization method. Options:

  • "none": No normalization

  • "Normalization": Using the values from "Normalization" row

  • "sum": By sum

  • "median": By median

  • "PQN1": By median of reference spectrum

  • "PQN2": By reference sample supplied in refSample

  • "groupPQN": By group in c("SQC", "EQC", "QC"), as set in groupSample; if both (default), all QCs are treated as QC

  • "quantile": By quantile

Default: "Normalization" if present; otherwise "sum".
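
For orientation, the sum, median, and PQN options can be sketched in base R as follows. The helper names are hypothetical and the package may differ in detail; for example, PQN is often preceded by a total-intensity normalization, which is omitted here.

```r
# Rows are features, columns are samples (made-up values).
x <- cbind(S1 = c(10, 20, 30), S2 = c(20, 40, 60), S3 = c(5, 10, 15))

# Divide every sample by its total or by its median intensity.
normalize_sum <- function(x) sweep(x, 2, colSums(x), "/")
normalize_median <- function(x) sweep(x, 2, apply(x, 2, median), "/")

# Probabilistic quotient normalization. The default reference is the
# feature-wise median spectrum; pass one sample's column (e.g. x[, "S1"])
# to normalize against a reference sample instead.
normalize_pqn <- function(x, reference = apply(x, 1, median)) {
  quotients <- x / reference               # reference recycles down the rows
  dilution <- apply(quotients, 2, median)  # one dilution factor per sample
  sweep(x, 2, dilution, "/")
}

by_sum <- normalize_sum(x)
by_pqn <- normalize_pqn(x)
```

Because S2 and S3 are exact multiples of S1 in this toy table, PQN maps all three samples onto the same profile.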

refSample

String. Reference sample for dataNormalize = "PQN2".

groupSample

String. Used only if dataNormalize = "groupPQN".

reference_method

String. Method for computing the reference from QC samples when dataNormalize = "quantile". Options:

  • "mean": Default

  • "median"

dataTransform

String. Data transformation method applied after dataNormalize. Options:

  • "none": No transformation

  • "log2": log base 2

  • "log10": log base 10

  • "sqrt": Square-root

  • "cbrt": Cube-root

  • "vsn": Variance Stabilizing Normalization
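
The elementwise transformations can be sketched as below. transform_data() is a hypothetical helper; "vsn" is left out because it relies on the Bioconductor vsn package rather than base R, and the sign-preserving cube root is a choice made for this sketch.

```r
# Hypothetical helper covering the base R transformations only.
transform_data <- function(x, method = c("none", "log2", "log10", "sqrt", "cbrt")) {
  method <- match.arg(method)
  switch(method,
    none  = x,
    log2  = log2(x),
    log10 = log10(x),
    sqrt  = sqrt(x),
    cbrt  = sign(x) * abs(x)^(1 / 3)   # defined for negative values as well
  )
}

x <- matrix(c(1, 8, 64, 1000), nrow = 2)
log2_x <- transform_data(x, "log2")
cbrt_x <- transform_data(x, "cbrt")
```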

dataScalePCA

String. Data scaling for PCA analysis. Options:

  • "none": No data scaling

  • "mean": Mean-centering. Subtract the mean (average) of each feature

  • "meanSD": Auto-scaling. Mean-center, then divide by the standard deviation (SD)

  • "mean2SD": Pareto-scaling. Mean-center, then divide by the square root of the SD. Always use this for PLS-type analysis.

dataScalePLS

String. Data scaling for PLS analysis. Same options as dataScalePCA.
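
The scaling options correspond to standard formulas that can be written in a few lines of base R. The sketch assumes features in rows and treats "mean" as mean-centering and "meanSD" as auto-scaling; scale_features() is a hypothetical helper, not the package's code.

```r
# Hypothetical helper: scale each feature (row) of a features x samples matrix.
scale_features <- function(x, method = c("none", "mean", "meanSD", "mean2SD")) {
  method <- match.arg(method)
  if (method == "none") return(x)
  centered <- sweep(x, 1, rowMeans(x), "-")
  s <- apply(x, 1, sd)
  switch(method,
    mean    = centered,                         # mean-centering only
    meanSD  = sweep(centered, 1, s, "/"),       # auto-scaling
    mean2SD = sweep(centered, 1, sqrt(s), "/")  # Pareto-scaling
  )
}

x <- rbind(f1 = c(1, 2, 3), f2 = c(10, 20, 30))
auto <- scale_features(x, "meanSD")
pareto <- scale_features(x, "mean2SD")
```

Auto-scaling gives every feature unit variance, whereas Pareto-scaling only partly shrinks large features, so f2 still spans a wider range than f1 after scaling.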

filterMaxRSD

Numeric. Threshold for relative standard deviation (RSD) filtering, computed on the QC samples selected by filterMaxRSD_by.

filterMaxRSD_by

String. Which QC samples to use for RSD filtering. Options:

  • "SQC": Filter by sample QC

  • "EQC": Filter by extract QC (default)

  • "both": Filter by both SQC and EQC (or QC altogether)
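
A minimal sketch of QC-based RSD filtering follows. It assumes that the RSD is expressed in percent and that features above the threshold are removed; filter_rsd() and the toy values are hypothetical.

```r
# Relative standard deviation in percent.
rsd <- function(v) sd(v) / mean(v) * 100

# Hypothetical helper: keep features whose RSD across QC samples is within max_rsd.
filter_rsd <- function(x, is_qc, max_rsd = 30) {
  qc_rsd <- apply(x[, is_qc, drop = FALSE], 1, rsd)
  x[qc_rsd <= max_rsd, , drop = FALSE]
}

# Rows are features; the first three columns are QC injections (made-up values).
x <- rbind(stable = c(100, 102, 98, 250, 40),
           noisy  = c(100, 20, 180, 90, 110))
is_qc <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
kept <- filter_rsd(x, is_qc, max_rsd = 30)
```

Only the QC columns enter the RSD calculation, so the wide spread of the biological samples in the "stable" row does not count against it.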

filterMaxVarSD

Numeric. Percentile cutoff for low-variability filtering; features in the lowest filterMaxVarSD percent by variability are removed.
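
One way to read this filter is sketched below, assuming variability is measured as the per-feature SD and the cutoff is taken with stats::quantile(). Both choices, and the helper filter_low_variance(), are assumptions rather than documented behavior.

```r
# Hypothetical helper: drop features at or below the pct-th percentile of SD.
filter_low_variance <- function(x, pct = 10) {
  s <- apply(x, 1, sd)
  cutoff <- quantile(s, probs = pct / 100)
  x[s > cutoff, , drop = FALSE]
}

# Ten made-up features; row i is c(-i, 0, i), so its SD equals i.
x <- t(sapply(1:10, function(i) c(-i, 0, i)))
rownames(x) <- paste0("f", 1:10)
kept <- filter_low_variance(x, pct = 10)
```

With ten features and a 10th-percentile cutoff, only the least variable feature (f1) is dropped.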

verbose

Boolean. If TRUE (default), prints detailed progress messages.

Value

A list containing results from all preprocessing steps.

See also

perform_DataQualityCheck for the data quality check

Author

John Lennon L. Calorio