Archive for December, 2013

2013 in review

December 31, 2013 4 comments

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 140,000 times in 2013. If it were an exhibit at the Louvre Museum, it would take about 6 days for that many people to see it.

Click here to see the complete report.

Categories: python

Parallelism in (almost) one line

December 30, 2013

See this post: https://medium.com/p/40e9b2b36148.

TODO: write more about it

reddit discussion

Categories: python

Insert the path of the Python interpreter easily

December 15, 2013

Problem
How many times have you written “#!/usr/bin/env python” in your life? A few hundred times? :) How can you insert this line easily?

Solution #1
I mainly use vim. So far I have done it the following way.

In vim, type the following on the first line in normal mode (the mode vim usually starts in; press ESC to get back to it), then press ENTER.

!!which env<ENTER>

“!!” opens vim’s command line with “:.!” already filled in, which filters the current line through an external command. The current line (which was empty) is replaced with the output of “which env”, so the first line becomes “/usr/bin/env”. All you need to do is add “#!” and “python” manually.

Fine, but I got fed up with this. It’s still too much typing. How can it be made easier?

Solution #2
Write a bash script called “py” with the following content:

echo "#!`which env` python"

Put it somewhere in your PATH and make it executable.

Then, in vim type this on the first line:

!!py<ENTER>

Phew. I should have thought of it years ago.
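
If you would rather keep the helper in Python, the same idea can be sketched as below. This is only a sketch: the function name shebang_line and the fallback to /usr/bin/env are my own assumptions, not part of the bash script above. It relies on shutil.which, so it needs Python 3.3 or later.

```python
import shutil


def shebang_line(interpreter="python"):
    """Build a shebang line such as '#!/usr/bin/env python'."""
    # shutil.which searches PATH for "env", much like `which env` in the shell.
    # Falling back to /usr/bin/env when nothing is found is an assumption.
    env_path = shutil.which("env") or "/usr/bin/env"
    return "#!{0} {1}".format(env_path, interpreter)


if __name__ == "__main__":
    print(shebang_line())
```

Saved as an executable script somewhere in your PATH, it can be called from vim with “!!” in the same way as the bash version.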

Categories: python

XPath to CSS

December 15, 2013

Problem
You have an XPath expression and you want to convert it to a CSS selector.

Solution
Try cssify. It also runs in the browser.

Command line usage:

$ ./cssify.py '//a'
a
$ ./cssify.py '//a[@id="bleh"]'
a#bleh
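
To get a feel for what such a conversion involves, here is a toy sketch that handles only the two shapes used above: //tag and //tag[@attr="value"]. It is not how cssify works internally; the function name simple_xpath_to_css and the regular expression are made up for illustration, and anything more complex is rejected.

```python
import re

# Accepts only  //tag  and  //tag[@attr="value"]  (a tiny subset of XPath).
_SIMPLE_XPATH = re.compile(
    r"""
    ^//(?P<tag>[\w-]+)                  # leading //tag
    (?:\[@(?P<attr>[\w-]+)=             # optional predicate: [@attr=
       (?P<quote>["'])(?P<value>[^"']*)(?P=quote)
    \])?$
    """,
    re.VERBOSE,
)


def simple_xpath_to_css(xpath):
    """Translate a very simple XPath expression to a CSS selector."""
    match = _SIMPLE_XPATH.match(xpath)
    if match is None:
        raise ValueError("unsupported XPath: {0!r}".format(xpath))
    tag, attr, value = match.group("tag", "attr", "value")
    if attr is None:
        return tag                          # //a          -> a
    if attr == "id":
        return "{0}#{1}".format(tag, value)  # //a[@id="x"] -> a#x
    # any other attribute becomes a CSS attribute selector
    return '{0}[{1}="{2}"]'.format(tag, attr, value)


if __name__ == "__main__":
    print(simple_xpath_to_css('//a'))
    print(simple_xpath_to_css('//a[@id="bleh"]'))
```

For real work, use cssify itself; a regular expression like this cannot cope with nested paths, axes or functions.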

Categories: python

web scraping: BS4 supports CSS select

December 15, 2013

BeautifulSoup is an excellent tool for web scraping. Development of BeautifulSoup 3 stopped in 2012; since then its author has concentrated on BeautifulSoup 4.

In this post I want to show how to use CSS selectors. With CSS selectors you can select part of a webpage, which is what we need when we do web scraping. Another possibility is to use XPath. I find CSS selectors easier to use. You can read this post too for a comparison: Why CSS Locators are the way to go vs XPath.

Exercise
Let’s go through a concrete example; that makes it easier to understand.

The page http://developerexcuses.com/ prints a funny line that developers can use as an excuse. Let’s extract this line.

Visit the page, start Firebug, and click on the line (steps 1 and 2 in the figure below):

[Figure: cssselect]

Right-click the orange line (“<a style=...”) and choose “Copy CSS Path”. The CSS path of the selected HTML element is now on the clipboard; in this example it is “html body div.wrapper center a”.

Now let’s write a script that prints this part of the HTML source:

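import requests
import bs4

def main():
    r = requests.get("http://developerexcuses.com/")
    # name the parser explicitly so that bs4 does not have to guess one
    soup = bs4.BeautifulSoup(r.text, "html.parser")
    # select() returns a list of matching elements; take the first one
    print(soup.select("html body div.wrapper center a")[0].text)

if __name__ == "__main__":
    main()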

ChDir: a context manager for switching working directories

December 15, 2013 3 comments

Problem
In your program you want to change the working directory temporarily, do some work there, then switch back to the original directory. Say you want to download some images to /tmp. When done, you want to get back to the original location reliably, even if an exception was raised while you were in the temporary location.

Naïve way
Let’s look at the following example. We have a script, say at /home/jabba/python/fetcher.py. We want to download some images to /tmp, then work with them. After the download we want to create a subfolder “process” in the same directory where the script fetcher.py is located. We want to collect some extra info about the downloaded images and store it in the “process” folder.

import os

def download(li, folder):
    try:
        backup = os.getcwd()
        os.chdir(folder)
        for img in li:
            pass  # download img somehow
        os.chdir(backup)  # skipped if the loop above raises
    except Exception:
        pass  # problem with download, handle it

def main():
    # step 1: download images to /tmp
    li = ["http://...1.jpg", "http://...2.jpg", "http://...3.jpg"]
    download(li, "/tmp")
    # step 2: create a "process" dir. HERE (where the script was launched)
    os.mkdir("process")
    # ...do some extra work...

There is a problem with the download function. If an image cannot be downloaded and an exception is raised, control jumps to the except block and the function returns without executing os.chdir(backup), so we remain in the /tmp folder! In step 2 of main(), the process directory is then created in /tmp and not in the folder where we wanted it.

Well, you can always add a finally block to the exception handler and place os.chdir(backup) there, but it’s easy to forget. Is there an easier solution?
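
For reference, the finally variant could look like the sketch below. The fetch parameter is a hypothetical callable standing in for the real download step, and the error handling only prints the problem; both are assumptions made to keep the example runnable.

```python
import os


def download(li, folder, fetch):
    """Call fetch(img) for every item in li while inside folder."""
    backup = os.getcwd()
    try:
        os.chdir(folder)
        for img in li:
            fetch(img)  # hypothetical download step
    except Exception as exc:
        # problem with download, handle it (here we only report it)
        print("download failed: {0}".format(exc))
    finally:
        # runs after success and after failure alike
        os.chdir(backup)
```

It works, but every function that changes directory has to repeat the same try/finally dance.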

Solution
Yes, there is an easier solution. Use a context manager.

The previous example with a context manager:

import os

def download(li, folder):
    with ChDir(folder):
        for img in li:
            pass  # download img somehow

def main():
    # step 1: download images to /tmp
    li = ["http://...1.jpg", "http://...2.jpg", "http://...3.jpg"]
    download(li, "/tmp")
    # step 2: create a "process" dir. HERE (where the script was launched)
    os.mkdir("process")
    # ...do some extra work...

And now the source code of ChDir:

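import os

class ChDir(object):
    """
    Step into a directory temporarily.
    """
    def __init__(self, path):
        self.old_dir = None
        self.new_dir = path

    def __enter__(self):
        # remember where we are at the moment the with block is entered
        self.old_dir = os.getcwd()
        os.chdir(self.new_dir)
        return self

    def __exit__(self, *args):
        # always switch back; returning None lets exceptions propagate
        os.chdir(self.old_dir)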

Since ChDir is a context manager, you use it in a with block. At the beginning of the block you enter the given folder. When you leave the with block (even if you leave because of an exception), you are put back to the folder where you were before entering the with block.
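
The same behaviour can also be written as a generator with contextlib.contextmanager from the standard library. This is an alternative sketch, not a replacement for the class above; the lowercase name chdir is my own choice.

```python
import os
from contextlib import contextmanager


@contextmanager
def chdir(path):
    """Step into a directory temporarily (generator-based variant)."""
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield  # the body of the with block runs here
    finally:
        # executed even if the with block raised an exception
        os.chdir(old_dir)
```

It is used the same way: “with chdir("/tmp"):” followed by the work to do there. Since Python 3.11 the standard library also ships contextlib.chdir, which does the same job.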

Update
In this discussion thread on reddit, someone suggested using the PyFilesystem library. I think PyFilesystem is a very good solution, but it may be too much for a short script. It’s like shooting a sparrow with a cannon :) For a simple script ChDir is good enough for me. For a serious application, check out PyFilesystem.

painless download with the wget module

December 15, 2013

Problem
You want to download something to your local machine.

Solution
You can use the wget module for this purpose:

import wget

wget.download(url)

It cannot be any simpler. You can find wget on PyPI and install it with pip.

Note that the module downloads with Python’s urllib.urlretrieve; it does not call the wget command-line tool.
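
To show what that means, here is a minimal sketch of a download helper built directly on urlretrieve. It is not the wget module’s actual code: the function name simple_download, the way the default file name is derived, and the “download.bin” fallback are assumptions for illustration.

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2


def simple_download(url, out=None):
    """Save url to a local file and return the name of that file."""
    if out is None:
        # Assumption: use the last path component of the URL as file name.
        out = os.path.basename(url.split("?", 1)[0]) or "download.bin"
    filename, _headers = urlretrieve(url, out)
    return filename
```

Calling simple_download(url) saves the resource to the current directory, while simple_download(url, "/tmp/pic.jpg") saves it to the given path.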

Categories: python

import this

December 9, 2013

The easter egg “import this” is well-known. However, what is “this.s”?

>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
>>>
>>> print this.s
Gur Mra bs Clguba, ol Gvz Crgref

Ornhgvshy vf orggre guna htyl.
Rkcyvpvg vf orggre guna vzcyvpvg.
Fvzcyr vf orggre guna pbzcyrk.
Pbzcyrk vf orggre guna pbzcyvpngrq.
Syng vf orggre guna arfgrq.
Fcnefr vf orggre guna qrafr.
Ernqnovyvgl pbhagf.
Fcrpvny pnfrf nera'g fcrpvny rabhtu gb oernx gur ehyrf.
Nygubhtu cenpgvpnyvgl orngf chevgl.
Reebef fubhyq arire cnff fvyragyl.
Hayrff rkcyvpvgyl fvyraprq.
Va gur snpr bs nzovthvgl, ershfr gur grzcgngvba gb thrff.
Gurer fubhyq or bar-- naq cersrenoyl bayl bar --boivbhf jnl gb qb vg.
Nygubhtu gung jnl znl abg or boivbhf ng svefg hayrff lbh'er Qhgpu.
Abj vf orggre guna arire.
Nygubhtu arire vf bsgra orggre guna *evtug* abj.
Vs gur vzcyrzragngvba vf uneq gb rkcynva, vg'f n onq vqrn.
Vs gur vzcyrzragngvba vf rnfl gb rkcynva, vg znl or n tbbq vqrn.
Anzrfcnprf ner bar ubaxvat terng vqrn -- yrg'f qb zber bs gubfr!
>>>

Well, this.s is the rot13-encoded version of the original text: every letter is shifted 13 places in the alphabet. Here is how to decode it:

# Python 2
>>> print this.s.decode("rot13")

# Python 3
>>> import codecs
>>> print(codecs.decode(this.s, 'rot-13'))
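
If you are curious what the codec does, rot13 is easy to write by hand. The sketch below is for Python 3 (it uses str.maketrans); the names ROT13 and rot13 are my own. Since rotating by 13 twice gives back the original letter, the same function both encodes and decodes.

```python
import string

_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase

# Map every letter to the one 13 places later, wrapping around at the end.
# Characters that are not ASCII letters are left untouched by translate().
ROT13 = str.maketrans(
    _LOWER + _UPPER,
    _LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13],
)


def rot13(text):
    """Encode or decode text with rot13."""
    return text.translate(ROT13)


print(rot13("Gur Mra bs Clguba, ol Gvz Crgref"))
# prints: The Zen of Python, by Tim Peters
```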

Found on reddit.

Categories: fun