Home > bioinformatics, python > Download genomes from Genbank

Download genomes from Genbank

Problem

For a project, I had to download a bunch of records from the NCBI (National Center for Biotechnology Information) website. A record looks like this: CP002059.1 (almost 5 MB):

LOCUS       CP002059             5354700 bp    DNA ...
DEFINITION  'Nostoc azollae' 0708, complete genome.
ACCESSION   CP002059 ACIR01000000 ACIR01000001-ACIR01000216
VERSION     CP002059.1  GI:298231532
DBLINK      Project: 30807
...
ORIGIN
//

I needed this data in text format.

Solution #1
My first idea was to download the page with wget. However, I was surprised to see that the downloaded file was less than 100 KB instead of 5 MB! When I looked at the source, it turned out that it’s full of AJAX calls. That is, the browser downloads this short HTML and then it is expanded. If you save the page with File -> Save as…, you have the complete HTML but how to automate the download process? How to get the post-AJAX version of a web page?

I will write about this problem and its general solution in another post.

Solution #2
Fortunately, there is a CGI program at NCBI that can return us the required data. For instance, the data of CP002059.1 can be retrieved via the following URL:


http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=CP002059.1&rettype=gb

A (very) short overview of the EFetch CGI is here.

If you use Biopython, you can download this record like this:

from Bio import Entrez

# ref.: http://wilke.openwetware.org/Parsing_Genbank_files_with_Biopython.html

# replace with your real email (optional):
Entrez.email = 'whatever@mail.com'
# accession id works, returns genbank format, looks in the 'nucleotide' database:
handle=Entrez.efetch(db='nucleotide',id='CP002059.1',rettype='gb')
# store locally:
local_file=open('CP002059.1.gb', 'w')
local_file.write(handle.read())
handle.close()
local_file.close()

Solution #3 (in Perl)
Let’s see the same thing in Perl too, using the BioPerl package. Thanks Alix for the Perl code.

#!/usr/bin/perl

use Bio::Perl;
#use Bio::Seq;
#use Bio::Tools::Run::RemoteBlast;
use Bio::DB::GenBank;
#use Data::Dumper;

use strict;

my $gb = new Bio::DB::GenBank;

my $id = 'CP002059.1';

my $seq = $gb->get_Stream_by_acc($id);
while( my $seq_elt =  $seq->next_seq ) {
    write_sequence(">$id.gb", 'genbank', $seq_elt);
}

Update (20110706)
I forgot to mention how to install Biopython:

sudo pip install biopython
About these ads
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 72 other followers

%d bloggers like this: