Archive for December, 2013

2013 in review

December 31, 2013 4 comments

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 140,000 times in 2013. If it were an exhibit at the Louvre Museum, it would take about 6 days for that many people to see it.

Click here to see the complete report.

Categories: python

Parallelism in (almost) one line

December 30, 2013 Leave a comment

See this post: https://medium.com/p/40e9b2b36148.

TODO: write more about it

reddit discussion

Categories: python

Insert the path of the Python interpreter easily

December 15, 2013 Leave a comment

Problem
How many times have you written “#!/usr/bin/env python” in your life? A few hundred times? :) How can you insert this line easily?

Solution #1
I mainly use vim. So far I have done it the following way.

In vim, type this on the first line in normal mode. This is the mode that vim will usually start in, which you can usually get back to with ESC. At the end press ENTER.

!!which env<ENTER>

“!!” opens the command line with a filter prefilled (“:.!”), so the current line (which was empty) is replaced with the output of “which env“. Thus the first line becomes “/usr/bin/env“. All you need to do is add “#!” and “python” manually.

Fine, but I got fed up with this. It’s still too much typing. Is there an easier way?

Solution #2
Write a bash script called “py” with the following content:

echo "#!`which env` python"

Put it somewhere in your PATH and make it executable.

Then, in vim type this on the first line:

!!py<ENTER>

Phew. I should have thought of it years ago.
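
If you would rather keep the helper in Python, the same idea can be sketched with the standard library. This is only an illustration: the function name shebang_line is made up here, and it assumes Python 3.3+ for shutil.which.

```python
import shutil


def shebang_line(interpreter="python"):
    """Build a shebang line like the one the bash script prints."""
    # shutil.which returns None when "env" is not on the PATH,
    # so fall back to the conventional location in that case.
    env_path = shutil.which("env") or "/usr/bin/env"
    return "#!{} {}".format(env_path, interpreter)


if __name__ == "__main__":
    print(shebang_line())
```

Saved as an executable script somewhere in your PATH, it can be called from vim the same way as the bash version.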

Categories: python

XPath to CSS

December 15, 2013 Leave a comment

Problem
You have an XPath expression and you want to convert it to a CSS selector.

Solution
Try cssify. It also runs in the browser.

Command line usage:

$ ./cssify.py '//a'
a
$ ./cssify.py '//a[@id="bleh"]'
a#bleh
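
To get a feel for what such a conversion involves, here is a toy converter. It is not cssify's code: the function xpath_to_css is hypothetical and only understands steps of the form tag or tag[@id="..."], joined by single slashes after a leading //.

```python
import re

# One step such as  a  or  a[@id="bleh"]
_STEP = re.compile(r'^(?P<tag>[A-Za-z][\w-]*|\*)(?:\[@id="(?P<id>[^"]+)"\])?$')


def xpath_to_css(xpath):
    """Convert a very small XPath subset to a CSS selector."""
    if not xpath.startswith("//"):
        raise ValueError("only expressions starting with // are supported")
    parts = []
    for step in xpath[2:].split("/"):
        match = _STEP.match(step)
        if match is None:
            raise ValueError("unsupported step: {!r}".format(step))
        selector = match.group("tag")
        if match.group("id"):
            selector += "#" + match.group("id")
        parts.append(selector)
    # A single / in XPath means "direct child", which is > in CSS.
    return " > ".join(parts)
```

Anything outside this subset raises ValueError, which is where a real tool like cssify earns its keep.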

Categories: python

web scraping: BS4 supports CSS select

December 15, 2013 Leave a comment

BeautifulSoup is an excellent tool for web scraping. The development of BeautifulSoup 3 stopped in 2012, and its author has concentrated on BeautifulSoup 4 since then.

In this post I want to show how to use CSS selectors. With CSS selectors you can select part of a webpage, which is what we need when we do web scraping. Another possibility is to use XPath. I find CSS selectors easier to use. You can read this post too for a comparison: Why CSS Locators are the way to go vs XPath.

Exercise
Let’s go through a concrete example. That way it will be easier to understand.

The page http://developerexcuses.com/ prints a funny line that developers can use as an excuse. Let’s extract this line.

Visit the page, start Firebug, and click on the line (steps 1 and 2 in the figure below):

[Figure: cssselect]

Right click on the orange line (“<a style=...“) and choose “Copy CSS Path”. Now the CSS path of the selected HTML element is on the clipboard, which is “html body div.wrapper center a” in this example.

Now let’s write a script that prints this part of the HTML source:

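import requests
import bs4

def main():
    r = requests.get("http://developerexcuses.com/")
    # name the parser explicitly so bs4 does not have to guess one
    soup = bs4.BeautifulSoup(r.text, "html.parser")
    # select() returns a list of matches; take the first one
    print(soup.select("html body div.wrapper center a")[0].text)

if __name__ == "__main__":
    main()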
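
To experiment with select() without hitting the network, you can feed BeautifulSoup a string. The HTML below is invented to imitate the structure described above (the real page's markup may differ), and extract_excuse is a name chosen for this sketch.

```python
import bs4

# Made-up markup imitating the structure of the page.
HTML = """
<html><body>
  <div class="wrapper">
    <center><a href="/" style="color: #333;">It worked on my machine</a></center>
  </div>
</body></html>
"""


def extract_excuse(html):
    """Return the text of the first element matching the CSS path."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    matches = soup.select("html body div.wrapper center a")
    # select() returns a list, which is empty when nothing matches.
    return matches[0].text if matches else None
```

Checking for an empty list avoids the IndexError that a bare [0] would raise if the page layout changes.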

ChDir: a context manager for switching working directories

December 15, 2013 1 comment

Problem
In your program you want to change the working directory temporarily, do some job there, then switch back to the original directory. Say you want to download some images to /tmp. When done, you want to get back to the original location correctly, even if an exception was raised in the temporary location.

Naïve way
Let’s see the following example. We have a script, say at /home/jabba/python/fetcher.py . We want to download some images to /tmp, then work with them. After the download we want to create a subfolder “process” in the same directory where the script fetcher.py is located. We want to collect some extra info about the downloaded images and we want to store these pieces of information in the “process” folder.

import os

def download(li, folder):
    try:
        backup = os.getcwd()
        os.chdir(folder)
        for img in li:
            pass  # download img somehow
        os.chdir(backup)  # skipped if the loop above raises
    except Exception:
        pass  # problem with download, handle it

def main():
    # step 1: download images to /tmp
    li = ["http://...1.jpg", "http://...2.jpg", "http://...3.jpg"]
    download(li, "/tmp")
    # step 2: create a "process" dir. HERE (where the script was launched)
    os.mkdir("process")
    # ...do some extra work...

There is a problem with the download function. If an image cannot be downloaded correctly and an exception occurs, we return from the function. However, os.chdir(backup) is not executed and we remain in the /tmp folder! In main() in step 2 the process directory will be created in /tmp and not in the folder where we wanted it to be.

Well, you can always add a finally clause to the try statement and place os.chdir(backup) there, but it’s easy to forget. Is there an easier solution?
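
For comparison, the finally variant could look roughly like this. fetch is a hypothetical placeholder for the real download step; here it always fails, to simulate a broken download.

```python
import os


def fetch(url):
    """Hypothetical placeholder for the real download code."""
    raise IOError("cannot download {}".format(url))


def download(li, folder):
    backup = os.getcwd()
    os.chdir(folder)
    try:
        for img in li:
            fetch(img)
    except IOError as e:
        # problem with download, handle it
        print("download failed: {}".format(e))
    finally:
        # runs whether or not an exception was raised
        os.chdir(backup)
```

It works, but every function that changes directory has to repeat the same try/finally dance.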

Solution
Yes, there is an easier solution. Use a context manager.

The previous example with a context manager:

import os

def download(li, folder):
    with ChDir(folder):
        for img in li:
            pass  # download img somehow

def main():
    # step 1: download images to /tmp
    li = ["http://...1.jpg", "http://...2.jpg", "http://...3.jpg"]
    download(li, "/tmp")
    # step 2: create a "process" dir. HERE (where the script was launched)
    os.mkdir("process")
    # ...do some extra work...

And now the source code of ChDir:

import os

class ChDir(object):
    """
    Step into a directory temporarily.
    """
    def __init__(self, path):
        self.old_dir = os.getcwd()
        self.new_dir = path

    def __enter__(self):
        os.chdir(self.new_dir)

    def __exit__(self, *args):
        os.chdir(self.old_dir)

Since ChDir is a context manager, you use it in a with block. At the beginning of the block you enter the given folder. When you leave the with block (even if you leave because of an exception), you are put back to the folder where you were before entering the with block.
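
The same behaviour can also be written as a generator-based context manager with contextlib.contextmanager. This is an alternative sketch, not a replacement for the class above; the name working_directory is invented for the example. (Python 3.11 and later also ship contextlib.chdir in the standard library.)

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Step into a directory temporarily (generator-based variant)."""
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # executed on normal exit and when the with block raises
        os.chdir(old_dir)
```

It is used exactly like ChDir: "with working_directory(folder):" followed by the work to do there.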

Update
Following this discussion thread @reddit, someone suggested using the PyFilesystem library. I think PyFilesystem is a very good solution but it may be too much for a short script. It’s like shooting a sparrow with a cannon :) For a simple script ChDir is good enough for me. For a serious application, check out PyFilesystem.

painless download with the wget module

December 15, 2013 Leave a comment

Problem
You want to download something to your local machine.

Solution
You can use the wget module for this purpose:

import wget

wget.download(url)

It cannot be any simpler. You can find wget on PyPI and install it with pip.

Note that the module uses urllib.urlretrieve for downloading, not the wget command-line tool.
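
If you would rather avoid the extra dependency, a similar helper can be sketched on top of the standard library. This assumes Python 3, where the function lives at urllib.request.urlretrieve; the helper name download and the way the file name is derived are choices made for this example, not the wget module's actual logic.

```python
import os
import urllib.parse
import urllib.request


def download(url, out_dir="."):
    """Save url into out_dir and return the local file path."""
    # Use the last path component of the URL as the file name.
    name = os.path.basename(urllib.parse.urlparse(url).path) or "download"
    target = os.path.join(out_dir, name)
    urllib.request.urlretrieve(url, target)
    return target
```

Unlike the wget module, this sketch silently overwrites an existing file with the same name, so treat it as a starting point only.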

Categories: python