=====
Usage
=====

The **Genome-wide Data Cleaner (GDC)** is a comprehensive pipeline designed for cleaning genotype data and performing advanced genomic analyses. It integrates standard quality control procedures with optional advanced features like genome remapping and ancestry estimation.

Workflow Overview
-----------------

The pipeline processes input data through a structured sequence of stages:

.. figure:: images/workflow_diagram.jpg
   :alt: GDC Genomics QC Workflow Diagram
   :align: center
   :width: 600px

   Overview of the GDC Genomics QC Pipeline stages.

1. **Reference Update (Optional)**: Update genome builds from GRCh37 to GRCh38 using **CrossMap**.
2. **Alignment (Optional)**: Ensure genotypes are aligned using **Genotype Harmonizer**.
3. **Core QC Steps**: Execute standard quality control filters using **PLINK**, **Primus**, and **R**.
4. **Ancestry & PCA**: Perform ancestry prediction and PCA analysis using **Fraposa**.
5. **Per-Ancestry QC**: Refine results with specific QC filters based on predicted ancestry.
6. **Reporting (Optional)**: Generate a final PDF report of all QC steps and changes via **Quarto**.

Running the Pipeline
--------------------

The pipeline is managed through a configuration file and executed via the ``Run.sh`` script.

Configuration File
~~~~~~~~~~~~~~~~~~

Before running the pipeline, ensure your ``config/config.yaml`` file is correctly configured with project-wide paths and tool-specific parameters. A typical configuration looks like this:

.. code-block:: yaml

   # Project-wide paths
   INPUT_FILE: "/projects/standard/gdc/public/Ref/toyData/1kgSynthetic"
   OUT_DIR: "/scratch.global/coffm049/toyPipeline"
   REF: "/projects/standard/gdc/public/Ref"

   # Tool-specific parameters
   relatedness:
     method: "0"
   SEX_CHECK: false
   RFMIX: true
   thin: true

You can find the template and additional parameters in the official configuration file.

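Before submitting a job, it can help to confirm that the project-wide paths are actually present in the configuration file. The following is a minimal sketch of such a pre-flight check, not part of the pipeline itself. The function name ``check_config`` is hypothetical, and it assumes that ``INPUT_FILE``, ``OUT_DIR``, and ``REF`` are all required and appear as unindented top-level keys, as in the example configuration above.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of the GDC pipeline itself.
# Assumption: INPUT_FILE, OUT_DIR, and REF must all appear as top-level keys.

check_config() {
    config="$1"
    missing=0
    for key in INPUT_FILE OUT_DIR REF; do
        # Match an unindented "KEY:" at the start of a line.
        if ! grep -q "^${key}:" "$config"; then
            echo "missing key: ${key}"
            missing=1
        fi
    done
    # Exit status 0 when every key was found, 1 otherwise.
    return "$missing"
}

# Demo on a throwaway file that deliberately omits REF.
demo_config="$(mktemp)"
printf 'INPUT_FILE: "/path/to/prefix"\nOUT_DIR: "/path/to/out"\n' > "$demo_config"
check_config "$demo_config" || echo "config incomplete"
rm -f "$demo_config"
```

In practice the function would be pointed at ``config/config.yaml``. The check only looks for key names; it does not verify that the paths exist or that the values are valid.
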
Command-Line Flags
------------------

You can append the following flags to the run command to customize the execution:

* **--set_working_directory**: Provide the path where output files should be stored.
* **--input_directory**: Provide the path to the directory containing your ``.bim``, ``.bed``, and ``.fam`` data.
* **--input_file_name**: Provide the common prefix shared by your genotype files.
* **--path_to_github_repo**: Provide the local path to the GDCGenomicsQC pipeline repository.
* **--user_x500**: Provide your x500 email (e.g., ``samp213@umn.edu``) to receive updates regarding sbatch submissions.
* **--use_crossmap**: Enter '1' to update the reference genome build to GRCh38.
* **--use_genome_harmonizer**: Enter '1' to update strand alignment.
* **--use_king**: Enter '1' to utilize KING for relatedness estimation.
* **--use_rfmix**: Enter '1' to utilize RFMix for ancestry estimation.
* **--make_report**: Enter '1' to generate an automated report of the QC steps and changes.
* **--custom_qc**: Enter '1' to apply custom settings for marker and sample filtering.

High-Performance Computing (SLURM)
----------------------------------

For large datasets, the script can be executed via ``sbatch`` to leverage cluster resources:

.. code-block:: console

   $ sbatch ./GDCGenomicsQC/Run.sh --set_working_directory $PWD --input_directory ${PWD}/input_data --input_file_name study_stem
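
The optional flags can be combined with the required ones in a single submission. The sketch below assembles such a command and prints it instead of submitting it, so the result can be inspected first. The helper name ``build_gdc_command`` is hypothetical, the working directory and file prefix are placeholders, and it assumes each optional flag takes its value as the next space-separated argument, in the same way the path flags do in the example above.

```shell
#!/bin/sh
# Dry-run sketch: print the sbatch command rather than submit it.
# Flag names come from the list above; the helper itself is hypothetical.
# Assumption: optional flags take their value as the next argument.

build_gdc_command() {
    workdir="$1"   # directory for outputs; input_data/ is assumed to sit inside it
    prefix="$2"    # common prefix of the .bed/.bim/.fam files
    echo sbatch ./GDCGenomicsQC/Run.sh \
        --set_working_directory "$workdir" \
        --input_directory "${workdir}/input_data" \
        --input_file_name "$prefix" \
        --use_crossmap 1 \
        --make_report 1
}

# Print the command for the current directory and a placeholder prefix.
build_gdc_command "$PWD" study_stem
```

Once the printed command looks right, it can be pasted into the terminal to submit the job. Here ``--use_crossmap 1`` requests the GRCh38 build update and ``--make_report 1`` requests the automated QC report.
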