About Lesson
- Introduction to common biological data formats supported by Biopython, including FASTA, GenBank, FASTQ, and PDB.
- Structure and features of each data format.
- Reading and writing sequences and other biological data using Biopython’s SeqIO module.
Introduction to Biological Data Formats
- Biological data formats are used to represent and store biological information.
- Various file formats are used in bioinformatics and computational biology.
- Biopython provides support for handling multiple biological data formats.
Common Biological Data Formats:
- FASTA Format:
  - Simple text-based format for representing nucleotide or protein sequences.
  - Each record consists of a header line starting with ‘>’ followed by one or more lines of sequence data.
- GenBank Format:
  - Standard format for representing DNA or RNA sequences along with annotations.
  - Contains sequence data, features, and metadata in a structured manner.
- FASTQ Format:
  - Used to store high-throughput sequencing data, including DNA reads and their quality scores.
  - Each record contains a read identifier, the sequence, and a quality score for every base.
- PDB Format:
  - Protein Data Bank format for representing three-dimensional structures of proteins and other macromolecules.
  - Contains atomic coordinates, atom types, and other structural information.
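To make the FASTA and FASTQ layouts concrete, here is a minimal sketch in plain Python. It is not how Biopython parses these files. The sample records and the helper name `decode_phred33` are made up for illustration, and the quality decoding assumes the common Phred+33 (Sanger-style) encoding.

```python
# Illustrative sample data (hypothetical records, not from a real dataset).
fasta_text = ">seq1 example sequence\nACGTACGT\nTTGA\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

# FASTA: a header line starting with '>' followed by sequence lines.
fasta_lines = fasta_text.splitlines()
fasta_header = fasta_lines[0][1:]          # drop the leading '>'
fasta_sequence = "".join(fasta_lines[1:])  # the sequence may span several lines

# FASTQ: four lines per record (identifier, sequence, '+', quality string).
name_line, read_seq, separator, quality_string = fastq_text.splitlines()


def decode_phred33(quality):
    """Convert a quality string to Phred scores, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]


read_qualities = decode_phred33(quality_string)

print(fasta_header, fasta_sequence)
print(name_line[1:], read_seq, read_qualities)
```

The quality string has one character per base, so `read_qualities` has the same length as `read_seq`.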
Reading and Writing Biological Data with SeqIO
- Biopython’s SeqIO module provides a convenient way to read and write biological data in various formats.
- SeqIO.read() reads a single record from a file and raises an error if the file does not contain exactly one record.
- SeqIO.parse() reads multiple records from a file, returning them one at a time as an iterator.
- SeqIO.write() writes sequences to a file in a specified format.
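The three functions can be tried together in a short sketch. It uses in-memory `StringIO` handles in place of real files so that it runs without any input data. The FASTA content and variable names are hypothetical, and Biopython is assumed to be installed.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA content, held in memory instead of a file.
fasta_text = ">seq1 first example\nACGTACGT\n>seq2 second example\nGGCCTTAA\n"

# SeqIO.parse() returns an iterator of SeqRecord objects.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# SeqIO.write() accepts one or more records and returns how many were written.
output_handle = StringIO()
written_count = SeqIO.write(records[:1], output_handle, "fasta")

# SeqIO.read() expects exactly one record in the input.
single_record = SeqIO.read(StringIO(output_handle.getvalue()), "fasta")

# With more than one record, SeqIO.read() raises ValueError.
try:
    SeqIO.read(StringIO(fasta_text), "fasta")
    read_failed = False
except ValueError:
    read_failed = True

print(len(records), written_count, single_record.id, read_failed)
```

With real data, a filename such as "sequences.fasta" can be passed in place of each `StringIO` handle.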
Reading Sequences from a FASTA File
from Bio import SeqIO

fasta_file = "sequences.fasta"

for record in SeqIO.parse(fasta_file, "fasta"):
    print(f"Header: {record.id}")
    print(f"Sequence: {record.seq}")
    print()
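- The SeqIO.parse() function reads multiple sequences from a FASTA file.
- Each record object represents a single sequence with attributes like id (the identifier, taken from the first word of the header) and seq (sequence data). The full header line is available as description.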
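As a rough illustration of the record attributes, the sketch below reads one FASTA record and one FASTQ record from in-memory strings. The sample data is hypothetical and Biopython is assumed to be installed. The "fastq" format name assumes Sanger-style Phred+33 quality encoding.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical records for illustration.
fasta_text = ">seq1 example gene fragment\nACGTACGT\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

fasta_record = SeqIO.read(StringIO(fasta_text), "fasta")
print(fasta_record.id)           # identifier: header text up to the first space
print(fasta_record.description)  # full header line without the leading '>'
print(len(fasta_record.seq))     # sequence length

# FASTQ records carry per-base qualities in letter_annotations.
fastq_record = SeqIO.read(StringIO(fastq_text), "fastq")
qualities = fastq_record.letter_annotations["phred_quality"]
print(fastq_record.id, qualities)
```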