Retrieving data from NCBI databases

Objectives

  • Understand the importance of NCBI (National Center for Biotechnology Information) databases in bioinformatics.
  • Learn about the available NCBI databases and their applications.
  • Explore Biopython’s capabilities for accessing and retrieving data from NCBI databases.

Introduction to NCBI Databases

  • The NCBI provides a collection of databases containing a wealth of biological data, including sequences, genes, proteins, and literature references.
  • These databases are widely used by researchers for data retrieval and analysis.

Commonly Used NCBI Databases

  1. GenBank: An annotated collection of publicly available nucleotide sequences (DNA and RNA).
  2. PubMed: A collection of biomedical literature references and abstracts.
  3. RefSeq: A curated database of reference sequences for genes, transcripts, and proteins.
  4. UniGene: A database of transcript clusters that gave a unified view of gene expression (retired by NCBI in 2019).
  5. Protein: A database of protein sequences and related information.

Accessing NCBI Databases with Biopython

  • Biopython provides modules and functions to access and retrieve data from NCBI databases.
  • The main module for NCBI database access in Biopython is Bio.Entrez.
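
Bio.Entrez is a Python wrapper around NCBI's Entrez Programming Utilities (E-utilities), which are plain HTTP endpoints. As a rough sketch of what a search request looks like underneath, the standard-library snippet below builds an esearch URL without sending it. The helper name build_esearch_url is hypothetical and exists only for illustration; in practice Bio.Entrez assembles and sends the request for you.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities endpoints
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20, email="your_email@example.com"):
    """Return the esearch URL for a query (hypothetical helper; sends nothing)."""
    params = {"db": db, "term": term, "retmax": retmax, "email": email}
    # urlencode escapes special characters such as "@" and spaces
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("pubmed", "biopython", retmax=10)
print(url)
```

Seeing the raw URL can help when debugging a query, since the same address can be pasted into a browser to inspect the XML that Entrez.read() later parses.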

Retrieving Data from NCBI Databases

from Bio import Entrez

# Provide your email address for Entrez
Entrez.email = "your_email@example.com"

# Search PubMed; esearch returns matching record IDs, not the records themselves
handle = Entrez.esearch(db="pubmed", term="biopython", retmax=10)
record = Entrez.read(handle)
handle.close()

# Print the retrieved PubMed IDs
pubmed_ids = record["IdList"]
print("PubMed IDs:")
print(pubmed_ids)

  • Set Entrez.email to your own address; NCBI uses it to contact you if your requests cause problems.
  • Use Entrez.esearch() to search a specific NCBI database (e.g., PubMed). It returns the IDs of matching records, not the records themselves.
  • Specify the database (“pubmed”), the search term (e.g., “biopython”), and the maximum number of IDs to return (retmax).
  • Parse the XML response into a Python dictionary-like object with Entrez.read(); the matching IDs are under the "IdList" key.
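
The parsed esearch result holds more than the ID list. As a minimal sketch, assuming the parsed result behaves like a dictionary with "Count" and "IdList" keys whose values arrive as strings, the hypothetical helper below compares the total number of matches with the number of IDs actually returned. The sample dictionary is hand-made for illustration and is not real NCBI output.

```python
def summarize_search(record):
    """Return (total_matches, returned_ids) from a parsed esearch result."""
    total = int(record["Count"])  # Count is delivered as a string
    ids = list(record["IdList"])
    return total, ids


# Hand-made stand-in for the object returned by Entrez.read(handle)
sample = {"Count": "42", "RetMax": "3", "IdList": ["111", "222", "333"]}

total, ids = summarize_search(sample)
print(f"{len(ids)} of {total} matching IDs returned")
if len(ids) < total:
    # retmax caps the IDs per request; retstart sets the offset of the first one
    print("Raise retmax or page through results with retstart to get the rest")
```

Checking the total first is a cheap way to notice that retmax has silently truncated a search.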

Retrieving Full Records from NCBI Databases

from Bio import Entrez, SeqIO

# Provide your email address for Entrez
Entrez.email = "your_email@example.com"

# Fetch the full GenBank record for NC_000913 (E. coli K-12 MG1655 genome)
handle = Entrez.efetch(db="nucleotide", id="NC_000913", rettype="gb", retmode="text")
# SeqIO.read expects exactly one record and returns a SeqRecord
record = SeqIO.read(handle, "gb")
handle.close()

# Print the retrieved GenBank record
print("GenBank Record:")
print(record)

  • Set Entrez.email to your own address, as in the search example.
  • Use Entrez.efetch() to download full records from a specific NCBI database (e.g., GenBank records from the nucleotide database).
  • Specify the database (“nucleotide”), one or more unique identifiers (e.g., “NC_000913”), and the desired output format (rettype and retmode).
  • Parse the returned text into a SeqRecord with SeqIO.read(), which expects exactly one record; importing SeqIO from Bio is required for this step.
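
Entrez.efetch() also accepts several identifiers at once as a comma-separated string. The sketch below splits a longer ID list into batches and parses each response with SeqIO.parse(), which handles multi-record output. The helper names make_batches and fetch_records, the batch size of 100, and the placeholder IDs are all assumptions for illustration; fetch_records needs Biopython and network access and is defined here but not called.

```python
def make_batches(ids, batch_size=100):
    """Split a list of IDs into comma-separated strings of at most batch_size IDs."""
    return [
        ",".join(ids[start:start + batch_size])
        for start in range(0, len(ids), batch_size)
    ]


def fetch_records(ids, email, batch_size=100):
    """Yield SeqRecord objects for nucleotide IDs (requires network access)."""
    # Imported here so make_batches can be tried without Biopython installed
    from Bio import Entrez, SeqIO

    Entrez.email = email
    for id_string in make_batches(ids, batch_size):
        handle = Entrez.efetch(db="nucleotide", id=id_string,
                               rettype="gb", retmode="text")
        try:
            # SeqIO.parse iterates over every record in the response
            yield from SeqIO.parse(handle, "gb")
        finally:
            handle.close()


# Placeholder IDs, not real accessions
batches = make_batches(["A1", "A2", "A3", "A4", "A5"], batch_size=2)
print(batches)
```

Batching keeps each request small, which is gentler on NCBI's servers than one very large request or many single-ID requests.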

Summary

  • NCBI databases are valuable resources for biological data retrieval and analysis.
  • Biopython’s Bio.Entrez module provides functionalities for accessing and retrieving data from NCBI databases.
  • A typical workflow uses Entrez.esearch() to find record IDs, then Entrez.efetch() with SeqIO to download and parse the full records.

