Biopython Fundamentals
  • Introduction to common biological data formats supported by Biopython, including FASTA, GenBank, FASTQ, and PDB.
  • Structure and features of each data format.
  • Reading and writing sequences and other biological data using Biopython’s SeqIO module.

Introduction to Biological Data Formats

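  • Biological data formats define how sequences, annotations, quality scores, and structures are stored in files.
  • Different formats serve different purposes in bioinformatics and computational biology, from plain sequences to annotated records and three-dimensional structures.
  • Biopython can read and write many of these formats, mainly through the Bio.SeqIO module for sequence files and the Bio.PDB module for structure files.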

Common Biological Data Formats:

  1. FASTA Format:
    • Simple text-based format for representing nucleotide or protein sequences.
    • Each record consists of a header line starting with ‘>’, followed by one or more lines of sequence data.
  2. GenBank Format:
    • Standard format for representing DNA or RNA sequences along with annotations.
    • Contains sequence data, features, and metadata in a structured manner.
  3. FASTQ Format:
    • Used to store high-throughput sequencing data, including DNA reads and their quality scores.
    • Each record typically spans four lines: a header starting with ‘@’, the sequence, a separator line starting with ‘+’, and a quality string with one character per base.
  4. PDB Format:
    • Protein Data Bank format for representing three-dimensional structures of proteins and other macromolecules such as nucleic acids.
    • Contains atomic coordinates, atom types, and other structural information.
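
To make the record layouts of the two simplest formats concrete, here is a minimal sketch in plain Python, without Biopython. The records, the function names, and the Sanger (Phred+33) quality encoding are assumptions chosen for illustration. Real files should be read with SeqIO, which handles the edge cases this sketch ignores.

```python
# Hypothetical miniature records, written inline for illustration only.
fasta_record = ">seq1 example sequence\nATGCGTAC\nGGTTAA\n"
fastq_record = "@read1\nACGT\n+\nII5!\n"


def split_fasta_record(text):
    """Split one FASTA record into its header and its joined sequence."""
    lines = text.strip().splitlines()
    header = lines[0][1:]          # drop the leading '>'
    sequence = "".join(lines[1:])  # the sequence may span several lines
    return header, sequence


def split_fastq_record(text, offset=33):
    """Split one four-line FASTQ record and decode its quality string.

    Assumes the Sanger (Phred+33) encoding: score = ord(char) - 33.
    """
    header, sequence, _separator, quality = text.strip().splitlines()
    scores = [ord(char) - offset for char in quality]
    return header[1:], sequence, scores  # drop the leading '@'


fasta_header, fasta_sequence = split_fasta_record(fasta_record)
read_id, read_sequence, read_scores = split_fastq_record(fastq_record)

print(fasta_header, fasta_sequence)
print(read_id, read_sequence, read_scores)
```

The FASTQ quality string has the same length as the sequence, so each base is paired with exactly one score.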

Reading and Writing Biological Data with SeqIO

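  • Biopython’s SeqIO module provides a uniform interface for reading and writing sequence files; the format is passed as a lowercase string such as "fasta", "genbank", or "fastq".
  • SeqIO.read() reads a file that contains exactly one record and raises a ValueError otherwise.
  • SeqIO.parse() returns an iterator over the records in a file, so large files do not have to be loaded into memory at once.
  • SeqIO.write() writes one or more records to a file in a specified format and returns the number of records written.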
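
The three functions above can be sketched together as follows. This is a minimal example that assumes Biopython is installed. It uses in-memory StringIO handles and made-up records (seq1, seq2) in place of real files, so the variable names and data are illustrative only.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA content, held in memory instead of a file.
fasta_text = ">seq1 first example\nATGCGT\n>seq2 second example\nGGCTAA\n"

# SeqIO.parse() returns an iterator of SeqRecord objects.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
print([record.id for record in records])

# SeqIO.read() expects exactly one record, so two records raise ValueError.
try:
    SeqIO.read(StringIO(fasta_text), "fasta")
    read_failed = False
except ValueError:
    read_failed = True
print(read_failed)

# With a single-record input, SeqIO.read() returns that record directly.
single = SeqIO.read(StringIO(">only one record\nATGC\n"), "fasta")
print(single.id, single.seq)

# SeqIO.write() accepts one record or an iterable and returns the count written.
out_handle = StringIO()
written = SeqIO.write(records, out_handle, "fasta")
print(written)
```

With real data, a filename such as "sequences.fasta" can be passed in place of each handle.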
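Reading Sequences from a FASTA File

from Bio import SeqIO

# Path to a FASTA file that may contain many records
fasta_file = "sequences.fasta"

# SeqIO.parse() yields one SeqRecord per sequence in the file
for record in SeqIO.parse(fasta_file, "fasta"):
    print(f"ID: {record.id}")  # first word of the header line
    print(f"Sequence: {record.seq}")
    print()
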
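  • SeqIO.parse() yields the sequences in the FASTA file one record at a time.
  • Each record is a SeqRecord object representing a single sequence, with attributes such as id (the identifier, taken from the first word of the header line), description (the full header line), and seq (the sequence data).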
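
The record attributes can be inspected directly. The sketch below assumes Biopython is installed and uses made-up in-memory records (gene1, read7) in place of files. It also reads a FASTQ record to show that per-base quality scores are stored in the record's letter_annotations dictionary under the "phred_quality" key.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA record held in memory; a filename works the same way.
fasta_handle = StringIO(">gene1 hypothetical example gene\nATGGCCATTG\nTAATGA\n")
fasta_record = SeqIO.read(fasta_handle, "fasta")

print(fasta_record.id)           # identifier: first word of the header
print(fasta_record.description)  # full header line without the '>'
print(fasta_record.seq)          # sequence lines joined into one Seq object
print(len(fasta_record))         # sequence length

# Hypothetical FASTQ read; the "fastq" format assumes Sanger (Phred+33) qualities.
fastq_handle = StringIO("@read7\nGATTACA\n+\nIIIH?5#\n")
fastq_record = SeqIO.read(fastq_handle, "fastq")
qualities = fastq_record.letter_annotations["phred_quality"]
print(qualities)
```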