Objective
- Understand the importance of quality control (QC) in NGS data analysis.
- Learn about common QC metrics and techniques for NGS data.
- Explore filtering strategies to remove low-quality or irrelevant data.
Importance of Quality Control (QC)
- QC is a critical step in NGS data analysis to ensure reliable and accurate results.
- Proper QC helps identify and mitigate potential issues such as sequencing errors, adapter contamination, and low-quality reads.
Common QC Metrics
- Read Quality Scores: Assess the confidence of base calls using Phred scores (for example, Q20 corresponds to a 1-in-100 chance that a base call is wrong).
- Read Length Distribution: Examine the distribution of read lengths.
- GC Content: Evaluate the distribution of GC content in reads.
- Adapter Contamination: Detect the presence of adapter sequences in reads.
- Duplicated Reads: Identify duplicated reads and measure how common they are.
- Sequence Complexity: Measure the complexity or repetitiveness of sequences.
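Several of the metrics above can be computed with a few lines of standard-library Python. The following is a minimal sketch, not a replacement for FastQC. It assumes Phred+33 quality encoding and uses the Phred definition Q = -10 * log10(P). The function names, the exact-match definition of a duplicate, and the k-mer ratio used as a complexity measure are illustrative choices, not part of any library.

```python
from collections import Counter


def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score to an error probability: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def duplication_rate(seqs):
    """Fraction of reads that are exact copies of another read in the list."""
    if not seqs:
        return 0.0
    return 1 - len(Counter(seqs)) / len(seqs)


def kmer_complexity(seq, k=3):
    """Distinct k-mers divided by total k-mers.

    Values near 0 suggest a repetitive read; this ratio is only one of
    several possible complexity measures.
    """
    total = len(seq) - k + 1
    if total <= 0:
        return 0.0
    return len({seq[i:i + k] for i in range(total)}) / total
```

Each function works on plain strings, so the same helpers can be applied to sequences and quality strings obtained from any FASTQ parser.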
QC Techniques and Tools
- FastQC: A popular tool for assessing various QC metrics and generating quality reports.
- Trimmomatic: Used for read trimming, adapter removal, and quality filtering.
- Cutadapt: Specifically designed for adapter trimming in NGS data.
- Biopython: Provides modules like SeqIO and SeqUtils for NGS data manipulation and QC analysis.
Filtering Strategies
- Quality Filtering: Remove reads whose quality scores fall below a threshold.
- Adapter Trimming: Remove adapter sequences that can affect downstream analysis.
- Length Filtering: Discard reads that are too short or too long based on specific requirements.
- Duplicated Read Removal: Remove duplicate reads to reduce redundancy.
- Ambiguous Base Filtering: Eliminate reads containing too many ambiguous bases (N’s).
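The non-quality strategies above can be chained into a small pipeline. The sketch below is a simplified illustration under stated assumptions: reads are (name, sequence, quality) tuples, the adapter sits at the 3' end and is matched exactly, and duplicates are defined as identical sequences after trimming. The function names and default thresholds are hypothetical. Dedicated tools such as Cutadapt and Trimmomatic also handle mismatches and partial adapter matches, which this sketch does not.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter.

    Assumes a 3' adapter; everything from the adapter onward is dropped.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def passes_filters(seq, min_len=20, max_len=300, max_n_fraction=0.1):
    """Apply length filtering and ambiguous-base (N) filtering."""
    if len(seq) == 0 or not (min_len <= len(seq) <= max_len):
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction


def filter_reads(reads, adapter, **thresholds):
    """Trim adapters, filter by length and N content, then drop duplicates.

    reads: iterable of (name, seq, qual) tuples.
    Returns the kept reads in their original order.
    """
    seen = set()
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        if not passes_filters(seq, **thresholds):
            continue
        if seq in seen:  # exact duplicate of an earlier kept read
            continue
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


# Usage with made-up reads; the adapter string is only an example value.
adapter = "AGATCGGAAGAGC"
reads = [
    ("r1", "ACGTACGT" + adapter, "I" * 21),  # adapter is trimmed off
    ("r2", "ACGTACGT", "I" * 8),             # duplicate of trimmed r1
    ("r3", "ACG", "III"),                    # too short
    ("r4", "NNNNACGT", "I" * 8),             # too many ambiguous bases
    ("r5", "TTGCATGCA", "I" * 9),
]
kept = filter_reads(reads, adapter, min_len=5, max_len=50, max_n_fraction=0.1)
```

Trimming runs before the length check so that reads consisting mostly of adapter are discarded as too short rather than kept at their untrimmed length.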
Example: Quality Filtering using Biopython
    from Bio import SeqIO

    def quality_filter(input_file, output_file, min_quality):
        """Write only reads whose lowest base quality is at least min_quality."""
        with open(output_file, 'w') as out_handle:
            for record in SeqIO.parse(input_file, 'fastq'):
                qualities = record.letter_annotations['phred_quality']
                # Skip empty reads: min() of an empty list raises ValueError
                if qualities and min(qualities) >= min_quality:
                    SeqIO.write(record, out_handle, 'fastq')

    # Usage
    input_file = 'raw_reads.fastq'
    output_file = 'filtered_reads.fastq'
    min_quality = 20
    quality_filter(input_file, output_file, min_quality)
- The code snippet demonstrates a simple quality filtering function using Biopython.
- The function reads a FASTQ file and writes to a new file only the reads whose lowest base quality score is at or above the specified threshold.
- The SeqIO module from Biopython is used for parsing and writing FASTQ records.
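To make the filtering logic visible without Biopython, the same minimum-quality rule can be sketched with the standard library alone. This version assumes a simple FASTQ layout of exactly four lines per record and Phred+33 encoding. The helper names `read_fastq` and `min_quality_filter` are hypothetical, and the parser is far less robust than SeqIO.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:  # end of file (or a blank line) stops parsing
            break
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\r\n")
        yield header, seq, qual


def min_quality_filter(in_handle, out_handle, min_quality, offset=33):
    """Copy reads whose lowest Phred score is >= min_quality; return the count kept."""
    kept = 0
    for header, seq, qual in read_fastq(in_handle):
        scores = [ord(ch) - offset for ch in qual]
        if scores and min(scores) >= min_quality:
            out_handle.write(f"{header}\n{seq}\n+\n{qual}\n")
            kept += 1
    return kept


# Usage with an in-memory FASTQ; the three reads are made up for illustration.
raw = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"   # lowest score 40
    "@r2\nACGT\n+\nII5I\n"   # lowest score 20
    "@r3\nACGT\n+\nII#I\n"   # lowest score 2
)
filtered = io.StringIO()
n_kept = min_quality_filter(raw, filtered, min_quality=20)
```

Because the rule looks at the single lowest score, one poor base call is enough to discard an otherwise good read, which is worth keeping in mind when choosing the threshold.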
Summary
- Quality control (QC) is essential for ensuring reliable and accurate NGS data analysis.
- Common QC metrics include read quality scores, read length distribution, GC content, adapter contamination, duplicated reads, and sequence complexity.
- QC techniques and tools like FastQC, Trimmomatic, Cutadapt, and Biopython facilitate QC analysis.
- Filtering strategies involve quality filtering, adapter trimming, length filtering, duplicated read removal, and ambiguous base filtering.