Posts Tagged ‘perl’

Side-by-side comparison of PHP, Perl, Python and Ruby

May 18, 2012 3 comments
Categories: python

Download genomes from Genbank

April 12, 2011 1 comment


For a project, I had to download a bunch of records from the NCBI (National Center for Biotechnology Information) website. A record looks like this: CP002059.1 (almost 5 MB):

LOCUS       CP002059             5354700 bp    DNA ...
DEFINITION  'Nostoc azollae' 0708, complete genome.
ACCESSION   CP002059 ACIR01000000 ACIR01000001-ACIR01000216
VERSION     CP002059.1  GI:298231532
DBLINK      Project: 30807

I needed this data in text format.

Solution #1
My first idea was to download the page with wget. However, I was surprised to see that the downloaded file was less than 100 KB instead of 5 MB! When I looked at the source, it turned out to be full of AJAX calls. That is, the browser downloads this short HTML and then expands it with further requests. If you save the page with File -> Save as…, you get the complete HTML, but how can the download be automated? How do you get the post-AJAX version of a web page?

I will write about this problem and its general solution in another post.

Solution #2
Fortunately, NCBI has a CGI program that returns the required data directly. For instance, the data of CP002059.1 can be retrieved via the following URL:

A (very) short overview of the EFetch CGI is here.
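To make the idea concrete without any extra package, here is a minimal sketch that builds an EFetch request by hand and streams the answer to a file. The endpoint address and the parameter names (db, id, rettype, retmode) are my assumption, taken from the public E-utilities interface, and are not necessarily the URL that was linked above. The helper names build_efetch_url and download_record are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public NCBI E-utilities EFetch address, not
# necessarily the URL the original post linked to.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide", rettype="gb", retmode="text"):
    """Return an EFetch URL asking for one record in GenBank text format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return EFETCH_BASE + "?" + query


def download_record(accession, path):
    """Stream the record to `path` in chunks (a genome can be several MB)."""
    url = build_efetch_url(accession)
    with urlopen(url) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(64 * 1024)
            if not chunk:
                break
            out.write(chunk)
```

Calling download_record("CP002059.1", "CP002059.1.gb") should then save the record locally; the output file name is only a placeholder.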

If you use Biopython, you can download this record like this:

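from Bio import Entrez

# ref.:

# replace with your real email (optional):
Entrez.email = ''
# accession id works, returns genbank format, looks in the 'nucleotide' database:
handle = Entrez.efetch(db='nucleotide', id='CP002059.1', rettype='gb', retmode='text')
# store locally (the original file name was lost; 'CP002059.1.gb' is a placeholder):
with open('CP002059.1.gb', 'w') as local_file:
    local_file.write(handle.read())
handle.close()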

Solution #3 (in Perl)
Let’s see the same thing in Perl too, using the BioPerl package. Thanks to Alix for the Perl code.


use Bio::Perl;
#use Bio::Seq;
#use Bio::Tools::Run::RemoteBlast;
use Bio::DB::GenBank;
#use Data::Dumper;

use strict;

my $gb = Bio::DB::GenBank->new;

my $id = 'CP002059.1';

my $seq = $gb->get_Stream_by_acc($id);
while( my $seq_elt =  $seq->next_seq ) {
    # write the record to a local file named after the accession id
    write_sequence(">$id", 'genbank', $seq_elt);
}

Update (20110706)
I forgot to mention how to install Biopython:

sudo pip install biopython

chomp() functionality in Python

October 11, 2010 1 comment

In Perl there is a function called chomp() which is very useful when reading a text file line by line. It removes the newline character ('\n') at the end of a line. How do you do the same thing in Python?

Solution #1

To get the same effect, remove the '\n' from each line:

#!/usr/bin/env python

f = open('test.txt', 'r')
for line in f:
    line = line.replace('\n', '')    # remove '\n' only
    # do something with line

This will replace the '\n' with an empty string.

Solution #2

There is a function called rstrip() which, when called without arguments, removes ALL whitespace characters on the right side of a string. This is not quite the same as the previous solution, because trailing spaces and tabs are removed too, not only the '\n'. However, if you don’t need those whitespace characters, you can use this solution too.

#!/usr/bin/env python

f = open('test.txt', 'r')
for line in f:
    line = line.rstrip()    # remove ALL whitespace on the right side, including '\n'
    # do something with line

Update (20111011): As it was pointed out by John C in the comments, “rstrip() also accepts a string of characters…, so line.rstrip('\n') will remove just trailing newline characters.” More info on rstrip here.
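To see how the variants differ, here is a small sketch that runs them on a few sample strings. The chomp() helper is hypothetical (it is not part of Python); it strips at most one trailing newline, which is roughly what Perl's chomp() does with its default settings.

```python
def chomp(line):
    """Remove at most one trailing newline character from the string."""
    return line[:-1] if line.endswith("\n") else line


samples = ["plain\n", "trailing spaces  \n", "two newlines\n\n", "no newline"]

for s in samples:
    # columns: original, chomp-like helper, rstrip('\n'), rstrip()
    print(repr(s), "->", repr(chomp(s)), repr(s.rstrip("\n")), repr(s.rstrip()))
```

The differences show up on the edge cases: rstrip('\n') removes every trailing newline, while the helper removes only one, and rstrip() without arguments also eats the trailing spaces.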

Categories: python