Quality control and filtering of NGS data


Objectives

  • Understand the importance of quality control (QC) in NGS data analysis.
  • Learn about common QC metrics and techniques for NGS data.
  • Explore filtering strategies to remove low-quality or irrelevant data.

Importance of Quality Control (QC)

  • QC is a critical step in NGS data analysis to ensure reliable and accurate results.
  • Proper QC helps identify and mitigate potential issues such as sequencing errors, adapter contamination, and low-quality reads.

Common QC Metrics

  1. Read Quality Scores: Assess the confidence of base calls using Phred scores.
  2. Read Length Distribution: Examine the distribution of read lengths.
  3. GC Content: Evaluate the distribution of GC content in reads.
  4. Adapter Contamination: Detect the presence of adapter sequences in reads.
  5. Duplicated Reads: Measure the proportion of reads that are exact or near-exact copies of one another.
  6. Sequence Complexity: Measure the complexity or repetitiveness of sequences.
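Several of the metrics above can be computed directly from the sequence and quality strings of a read. The following is a minimal sketch in plain Python, not a replacement for a tool such as FastQC. It assumes Phred+33 quality encoding, and the function names (phred_scores, error_probability, read_metrics, duplication_rate) are hypothetical helpers introduced for illustration.

```python
from collections import Counter


def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    The offset of 33 (Phred+33) is an assumption; older data may use 64.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into a base-call error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def read_metrics(seq, quality_string):
    """Return length, mean quality, GC fraction and N count for one read."""
    seq = seq.upper()
    quals = phred_scores(quality_string)
    length = len(seq)
    gc = (seq.count("G") + seq.count("C")) / length if length else 0.0
    return {
        "length": length,
        "mean_quality": sum(quals) / len(quals) if quals else 0.0,
        "gc_fraction": gc,
        "n_count": seq.count("N"),
    }


def duplication_rate(sequences):
    """Fraction of reads that repeat an already-seen sequence (exact matches only)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return 1 - len(counts) / total if total else 0.0
```

A Phred score of 20 corresponds to an error probability of 1 in 100, and a score of 30 to 1 in 1000, which is why thresholds of 20 or 30 are common choices.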

QC Techniques and Tools

  1. FastQC: A popular tool for assessing various QC metrics and generating quality reports.
  2. Trimmomatic: Used for read trimming, adapter removal, and quality filtering.
  3. Cutadapt: Specifically designed for adapter trimming in NGS data.
  4. Biopython: Provides modules like SeqIO and SeqUtils for NGS data manipulation and QC analysis.
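The command-line tools above are often driven from a script. The sketch below shows one way to assemble FastQC and Cutadapt invocations from Python. It assumes both tools are installed and on the PATH. The helper names and the default thresholds are illustrative choices, not recommendations from this lesson. Only the FastQC --outdir option and the Cutadapt -a, -q, -m and -o options are used.

```python
import shutil
import subprocess


def fastqc_command(fastq_path, out_dir):
    """Build a FastQC call; out_dir must already exist before running it."""
    return ["fastqc", fastq_path, "--outdir", out_dir]


def cutadapt_command(fastq_path, trimmed_path, adapter,
                     min_quality=20, min_length=30):
    """Build a Cutadapt call for single-end reads.

    -a  3' adapter sequence to remove
    -q  quality cutoff for trimming low-quality ends
    -m  discard reads shorter than this after trimming
    -o  output file
    """
    return [
        "cutadapt",
        "-a", adapter,
        "-q", str(min_quality),
        "-m", str(min_length),
        "-o", trimmed_path,
        fastq_path,
    ]


def run_if_available(cmd):
    """Run the command only if its executable is found on PATH."""
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]} not found on PATH; skipping")
        return None
    return subprocess.run(cmd, check=True)
```

For example, cutadapt_command('raw_reads.fastq', 'trimmed_reads.fastq', 'AGATCGGAAGAGC') builds a call that trims the common Illumina adapter prefix. The adapter must match the library preparation kit actually used, so treat that sequence as a placeholder.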

Filtering Strategies

  1. Quality Filtering: Remove reads whose quality scores fall below a chosen threshold.
  2. Adapter Trimming: Remove adapter sequences that can affect downstream analysis.
  3. Length Filtering: Discard reads that are too short or too long based on specific requirements.
  4. Duplicated Read Removal: Remove duplicate reads to reduce redundancy.
  5. Ambiguous Base Filtering: Eliminate reads containing too many ambiguous bases (N’s).
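The remaining strategies can be chained into a single pass over the reads. The following plain-Python sketch is a simplified illustration: adapter matching is exact (real trimmers tolerate mismatches and partial adapters), duplicates are detected by identical sequence only, and all thresholds and function names are assumptions made for the example.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of a 3' adapter."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def passes_length(seq, min_len, max_len):
    """Keep reads whose length lies within [min_len, max_len]."""
    return min_len <= len(seq) <= max_len


def passes_n_filter(seq, max_n_fraction):
    """Keep reads whose fraction of ambiguous bases (N) is small enough."""
    if not seq:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction


def filter_reads(reads, adapter, min_len=30, max_len=300, max_n_fraction=0.1):
    """Yield (name, seq, qual) tuples that survive every filter.

    reads is any iterable of (name, seq, qual) tuples.
    """
    seen = set()  # sequences already emitted, for exact-duplicate removal
    for name, seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        if not passes_length(seq, min_len, max_len):
            continue
        if not passes_n_filter(seq, max_n_fraction):
            continue
        if seq in seen:
            continue
        seen.add(seq)
        yield name, seq, qual
```

Trimming runs first so that the length and duplicate checks see the read as it will appear downstream. Keeping every distinct sequence in memory is acceptable for small files but may not scale to a full sequencing run.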

Example: Quality Filtering using Biopython

from Bio import SeqIO

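def quality_filter(input_file, output_file, min_quality):
    """Write only reads whose every base has a Phred score >= min_quality."""
    with open(output_file, 'w') as out_handle:
        # 'fastq' assumes Sanger-style Phred+33 quality encoding
        for record in SeqIO.parse(input_file, 'fastq'):
            quals = record.letter_annotations['phred_quality']
            # Skip zero-length reads, for which min() would raise ValueError
            if quals and min(quals) >= min_quality:
                SeqIO.write(record, out_handle, 'fastq')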

# Usage
input_file = 'raw_reads.fastq'
output_file = 'filtered_reads.fastq'
min_quality = 20

quality_filter(input_file, output_file, min_quality)

  • The code snippet demonstrates a simple quality filtering function using Biopython.
  • The function reads a FASTQ file and writes a read to the new file only if its lowest per-base Phred score is at or above the specified threshold, so a single low-quality base is enough to discard the read.
  • The SeqIO module from Biopython is used for parsing and writing FASTQ records.
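Requiring every base to meet the threshold is strict, because base quality typically drops toward the 3' end of a read. A gentler alternative, sketched below in plain Python, trims the low-quality tail first and then filters on mean quality. This is an illustrative variant rather than the method used in the example above. The function names and default thresholds are assumptions, and quality scores are passed as a list of integers such as the one Biopython exposes under 'phred_quality'.

```python
def trim_low_quality_tail(seq, quals, min_quality):
    """Remove bases from the 3' end until one reaches min_quality."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def mean_quality(quals):
    """Average Phred score of a read (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


def keep_read(seq, quals, min_quality=20, min_mean=25, min_len=4):
    """Trim the tail, then keep the read if it is long and good enough.

    Returns the trimmed (seq, quals) pair, or None if the read is rejected.
    """
    seq, quals = trim_low_quality_tail(seq, quals, min_quality)
    if len(seq) < min_len:
        return None
    if mean_quality(quals) < min_mean:
        return None
    return seq, quals
```

With these defaults, a read with scores [30, 30, 30, 30, 10, 5] is kept as its first four bases, whereas the strict per-base filter would discard it entirely.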

Summary

  • Quality control (QC) is essential for ensuring reliable and accurate NGS data analysis.
  • Common QC metrics include read quality scores, read length distribution, GC content, adapter contamination, duplicated reads, and sequence complexity.
  • QC techniques and tools like FastQC, Trimmomatic, Cutadapt, and Biopython facilitate QC analysis.
  • Filtering strategies involve quality filtering, adapter trimming, length filtering, duplicated read removal, and ambiguous base filtering.