Archive for October, 2011

Send a post to reddit from Python

October 30, 2011 1 comment

Problem
How can you send a post to reddit.com from a Python script? Motivation: after you submit a post, you have to wait 8 minutes before you can submit the next one. Imagine you have 10 posts to submit. It’d be nice to launch a script at night that submits everything by the next morning.

Submit a post
For now, I only show how to send one post. Batch processing is left as a future project.

The official Reddit API is here. There is a wrapper for it called reddit_api, which greatly simplifies its usage.

Install reddit_api:

sudo pip install reddit

Submit a post:

#!/usr/bin/env python

import reddit

subreddit = '...' # name of the subreddit where to send the post
url = '...'       # what you want to send
title = '...'     # title of your post

# change user_agent if you want:
r = reddit.Reddit(user_agent="my_cool_application")
# your username and password on reddit:
r.login(user="...", password="...")

# the output is a JSON text that contains the link to your post:
print r.submit(subreddit, url, title)
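As a possible starting point for the batch case described in the Problem section, here is a minimal sketch. It assumes the 8-minute waiting period mentioned above and takes the submitting function as a parameter, so it makes no claim about reddit_api beyond the r.submit(subreddit, url, title) call already shown. The names submit_all, delay and sleep are hypothetical.

```python
import time


def submit_all(posts, submit, delay=8 * 60, sleep=time.sleep):
    """Submit every (subreddit, url, title) tuple in posts, one by one.

    submit is any callable taking (subreddit, url, title), for instance
    the r.submit method of a logged-in session (an assumption here).
    delay is the pause in seconds between two submissions.
    """
    results = []
    for index, (subreddit, url, title) in enumerate(posts):
        if index > 0:
            sleep(delay)  # wait before every post except the first one
        results.append(submit(subreddit, url, title))
    return results
```

With the logged-in r object from the script above, the call would look like submit_all(posts, r.submit), where posts is a list of (subreddit, url, title) tuples. Passing sleep as a parameter makes it easy to try the function without actually waiting.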

Submit a comment (update, 20111107)
Let’s see how to add a comment to a post. First, we need the URL of a post.

Example: http://www.reddit.com/r/thewalkingdead/comments/lkycy/that_look_on_the_kids_face/. Here, the last part of the URL (the title) is just garbage; the following URL is equivalent to it: http://www.reddit.com/r/thewalkingdead/comments/lkycy. The unique ID of the post is the last part of this shorter URL: “lkycy”. Thus, this image can be accessed via this URL too: http://www.reddit.com/lkycy.
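The three URL forms above can be reduced to the same ID with a small helper. This is only a sketch: extract_post_id is a hypothetical name, and the short-form rule simply assumes that a single path segment directly after reddit.com is a post ID, which would also match pages that are not posts.

```python
import re


def extract_post_id(url):
    """Return the post ID found in a reddit post URL, or None."""
    # Long form: .../comments/<id>, with or without the trailing title part.
    match = re.search(r'/comments/([^/?#]+)', url)
    if match:
        return match.group(1)
    # Short form: http://www.reddit.com/<id>
    # (assumption: a lone path segment is a post ID)
    match = re.match(r'https?://(?:www\.)?reddit\.com/([^/?#]+)/?$', url)
    if match:
        return match.group(1)
    return None


for url in (
    'http://www.reddit.com/r/thewalkingdead/comments/lkycy/that_look_on_the_kids_face/',
    'http://www.reddit.com/r/thewalkingdead/comments/lkycy',
    'http://www.reddit.com/lkycy',
):
    print(extract_post_id(url))
```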

Now, let’s log in to reddit, fetch the post by its ID and add a comment.

import re
import reddit

def get_reddit_id(url):
    # the ID is the path segment right after /comments/;
    # the title part and the trailing slash are optional
    result = re.search(r'/comments/([^/]+)', url)
    return result.group(1)

def add_comment(r, reddit_url):
    reddit_id = get_reddit_id(reddit_url)
    post = r.get_submission_by_id(reddit_id)
    comment = "first"   # just to make reddit happy ;)
    post.add_comment(comment)
    print '# comment added:', comment

def main():
    r = reddit.Reddit(user_agent="my_cool_application")
    r.login(user="...", password="...")
    reddit_url = '...'   # URL of the post to comment on
    add_comment(r, reddit_url)

Convert a date string to a date object

October 30, 2011

Problem
I have a date string (‘Sat, 29 Oct 2011 18:32:56 GMT’) that I want to convert to a timestamp (‘2011_10_29’). I want to convert ‘Oct’ to ‘10’ using the standard library; I don’t want to create a string array with the names of the months.

Solution

>>> from datetime import datetime
>>> s = 'Sat, 29 Oct 2011 18:32:56 GMT'
>>> s.split()[1:4]
['29', 'Oct', '2011']
>>> year, month, day = s.split()[1:4][::-1]
>>> year, month, day
('2011', 'Oct', '29')
>>> if len(day) == 1:
...     day = '0' + day
...   
>>> date = datetime.strptime("{yyyy} {Mmm} {dd}".format(yyyy=year, Mmm=month, dd=day), "%Y %b %d")
>>> template = "{year}_{month:02}_{day:02}"
>>> template.format(year=date.year, month=date.month, day=date.day)
'2011_10_29'
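A shorter route is to let strptime parse the whole string in one go and let strftime produce the timestamp. This is a sketch under two assumptions: the input always ends with the literal text GMT, and the day and month names are in English (%a and %b depend on the current locale).

```python
from datetime import datetime

s = 'Sat, 29 Oct 2011 18:32:56 GMT'

# %a = abbreviated weekday, %d = day, %b = abbreviated month name,
# %Y = four-digit year; 'GMT' is matched as literal text
date = datetime.strptime(s, '%a, %d %b %Y %H:%M:%S GMT')

# %m gives the zero-padded month number, so no manual padding is needed
stamp = date.strftime('%Y_%m_%d')
print(stamp)
```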

I used this conversion in this source code.

You can find the list of datetime formatting directives (%Y, %b, etc.) at the bottom of this page.

Using __str__(), print all the attributes of an object

October 30, 2011

Problem
I have a class which has some attributes. I use the objects of this class as “beans”, i.e. they simply group some data. When I print such an object, I want to see the variables as “name=value” pairs but I don’t want to code that manually. I want to iterate all the attributes and produce such a string representation of the object.

Solution
Attributes are stored in the object’s dictionary-like __dict__. Furthermore, __dict__ contains only the attributes set on the instance itself, not the methods or the class-level attributes. Read more here. Thus, all we have to do is print the contents of __dict__:

def __str__(self):
    sb = []
    for key in self.__dict__:
        sb.append("{key}='{value}'".format(key=key, value=self.__dict__[key]))

    return ', '.join(sb)

def __repr__(self):
    return self.__str__() 
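Here is a small self-contained variant to try the idea without any external library. Bean and its two attributes are made up for the illustration, vars(self) is just another way to reach self.__dict__, and the keys are sorted so that the output does not depend on the dictionary order.

```python
class Bean(object):
    """A hypothetical bean that simply groups some data."""

    def __init__(self, title, link):
        self.title = title
        self.link = link

    def __str__(self):
        # vars(self) returns the same dictionary as self.__dict__
        pairs = sorted(vars(self).items())
        return ', '.join("{0}='{1}'".format(key, value) for key, value in pairs)

    # reuse the same text for repr(), e.g. when printing a list of beans
    __repr__ = __str__


print(Bean('Hello', 'http://example.com'))
```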

A full example
Read an RSS feed and store item data in beans. Print the beans.

#!/usr/bin/env python

import untangle
from jabbapylib.text.ascii import unicode_to_ascii

XML = 'http://planet.python.org/rss20.xml'

class Item:
    def __init__(self, item):
        self.title = unicode_to_ascii(item.title.cdata)
        self.link = item.link.cdata

    def __str__(self):
        sb = []
        for key in self.__dict__:
            sb.append("{key}='{value}'".format(key=key, value=self.__dict__[key]))

        return ', '.join(sb)

    def __repr__(self):
        return self.__str__()

def main():
    li = []

    o = untangle.parse(XML)
    for item in o.rss.channel.item:
        if item.link.cdata:
            li.append(Item(item))

    for e in li:
        print e

if __name__ == "__main__":
    main()

Sample output:

link='http://techspot.zzzeek.org/2011/10/29/value-agnostic-types-part-ii', title='Michael Bayer: Value Agnostic Types, Part II'
...

To learn more about reading XML with untangle, see my previous post. The unicode_to_ascii call simply converts Unicode text to ASCII characters, which is needed if you want to print the result on the terminal.

Read XML painlessly

October 30, 2011 3 comments

Problem
I had an XML file (an RSS feed) from which I wanted to extract some data. I tried some XML libraries but I didn’t like any of them. Is there a simple, brain-friendly way for this? After all, it’s Python, so everything should be simple.

Solution
Yes, there is a simple library for reading XML called “untangle”, developed by Chris Stefanescu. It’s on PyPI, so installation is very easy:

sudo pip install untangle

For some examples, visit the project page.

Use Case
Let’s see a simple, real-world example. From the RSS feed of Planet Python, let’s extract the post titles and their URLs.

#!/usr/bin/env python

import untangle

#XML = 'examples/planet_python.xml'     # can read a file too
XML = 'http://planet.python.org/rss20.xml'

o = untangle.parse(XML)
for item in o.rss.channel.item:
    title = item.title.cdata
    link = item.link.cdata
    if link:
        print title
        print '   ', link

It couldn’t be any simpler :)

Limitations
According to Chris, untangle doesn’t support documents with namespaces (yet).

Alternatives (update 20111031)
Here are some alternatives (thanks reddit).

lxml and amara are heavyweight solutions and are built upon C libraries so you may not be able to use them everywhere. untangle is a lightweight parser that can be a perfect choice to read a small and simple XML file.
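For comparison, the same extraction can also be sketched with xml.etree.ElementTree from the standard library. The sample feed below is made up for the illustration, and extract_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# a tiny, made-up RSS document with the same layout as the real feed
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>No link here</title>
      <link></link>
    </item>
  </channel>
</rss>"""


def extract_items(xml_text):
    """Return (title, link) pairs for the items that have a link."""
    root = ET.fromstring(xml_text)       # root is the <rss> element
    items = []
    for item in root.findall('./channel/item'):
        title = item.findtext('title')
        link = item.findtext('link')
        if link:                         # skip items with an empty link
            items.append((title, link))
    return items


for title, link in extract_items(SAMPLE_RSS):
    print(title)
    print('    ' + link)
```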

Categories: python

Scraping fuskator.com

October 30, 2011

I made a simple scraper for the site fuskator.com (NSFW). You can find it here, in the “fuskator.com” folder. It relies on my “jabbapylib” library.

Customize “config.py”, then launch “01_download_galleries_on_first_page.py”. It will download all the galleries on the main page.

I’ve only tested it under Linux.

Categories: python

Language detection with Google’s Compact Language Detector

October 27, 2011
Categories: python

Speed up Python with Cython

October 22, 2011 4 comments

This post is based on a conversation with our local Python guru, Yves :)

Problem
You have a script that you would like to speed up. For instance, there is a function that is called lots of times and you suspect it causes a bottleneck.

Solution
With Cython, it is possible to compile a module to C source that you can then compile with GCC. The resulting binary can be imported in your Python script just as if it were a normal module. Since it’s a compiled module, you can expect some speed gains.

Example #01 (pure Python)
Consider the following simple script. It enumerates the numbers up to a given threshold and tests whether each of them is prime. At the end it prints the number of primes found.

Pure Python solution:

#!/usr/bin/env python

from prime import is_prime

UPTO = 10**7 / 4

def main():
    i = 1
    cnt = 0
    while i <= UPTO:
        if is_prime(i):
            cnt += 1

        i += 1

    print cnt


if __name__=="__main__":
    main()

prime.py:

def is_prime(n):
    if n < 2:           # 0 and 1 are not prime
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    i = 3
    maxi = n**0.5 + 1
    while i <= maxi:
        if n % i == 0:
            return False
        i += 2

    return True

According to the Unix time command, the execution time is 29.69 sec. on my laptop.
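The time command measures the whole process, including the start-up of the interpreter. As a rough alternative, the counting loop can be timed inside the script with timeit.default_timer. The sketch below uses a much smaller threshold so that it finishes quickly. count_primes and timed are hypothetical helpers, and is_prime is a variant of the function above that also rejects numbers below 2.

```python
import timeit


def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    i = 3
    while i * i <= n:        # same idea as comparing i with n**0.5
        if n % i == 0:
            return False
        i += 2
    return True


def count_primes(upto):
    """Count the primes in the range 1..upto (inclusive)."""
    return sum(1 for i in range(1, upto + 1) if is_prime(i))


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = timeit.default_timer()
    result = func(*args)
    return result, timeit.default_timer() - start


count, seconds = timed(count_primes, 10000)
print("{0} primes found in {1:.3f} sec.".format(count, seconds))
```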

Example #02 (with Cython, first try)
Now let’s compile prime.py.

cython prime.py

This will produce prime.c. Now compile it:

gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o prime.so prime.c

The output is the binary prime.so.

There is nothing to change in the main file: “from prime import is_prime” now picks up the compiled prime.so in preference to prime.py.

Execution time: 21.11 sec.

Example #03 (with Cython, second try)
This section is an update (20111027), incorporating the remarks of James.

By adding static type declarations, Cython can perform much better.

Controller part:

#!/usr/bin/env python

import pyximport             # add it here, before importing cython code
pyximport.install()

from prime import is_prime   # cython code is imported here

UPTO = 10**7 / 4

def main():
    i = 1
    cnt = 0
    while i <= UPTO:
        if is_prime(i):
            cnt += 1

        i += 1

    print cnt


if __name__=="__main__":
    main()

prime.pyx (notice the .pyx extension):

def is_prime(int n):
    if n < 2:           # 0 and 1 are not prime
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False

    cdef int i = 3
    cdef double maxi = n**0.5 + 1
    while i <= maxi:
        if n % i == 0:
            return False
        i += 2

    return True

The lines “import pyximport; pyximport.install()” in the controller script ensure that the Cython module is built automatically when it is imported, so there is no need to run cython or gcc by hand.

Execution time: 2.15 sec. Lesson learned: use static type declarations in your Cython code whenever possible.

Example (with PyPy)
Just out of curiosity, I tried to launch the script with PyPy too. PyPy is a fast, compliant alternative implementation of the Python language, itself written in RPython, a restricted subset of Python. Since it uses a JIT compiler, PyPy is often faster than the standard Python interpreter (see a presentation here).

Execution time (hang on!): 2.35 sec.

Well, the difference is quite spectacular in the case of this example but it doesn’t mean that PyPy is always faster. In a completely different problem setting the end result can be just the opposite. So always make some tests and then choose the solution which is best for you.

Conclusion
If your program seems to run slowly, first try to polish the code and use better algorithms / data structures. If it’s still slow, you can try to compile some parts of it with Cython. However, bear in mind that you hurt portability. But before transforming your program into a half Python / half C monster, try PyPy too. Maybe you don’t need Cython at all.
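To illustrate the first piece of advice, here is a sketch of a better algorithm for this particular task: counting the primes with the sieve of Eratosthenes instead of testing each number separately. It was not part of the measurements above, so no timing is claimed for it, and count_primes_sieve is a hypothetical name.

```python
def count_primes_sieve(upto):
    """Count the primes in 2..upto with the sieve of Eratosthenes."""
    if upto < 2:
        return 0
    # is_composite[k] becomes 1 once k is known to have a smaller divisor
    is_composite = bytearray(upto + 1)
    for i in range(2, int(upto ** 0.5) + 1):
        if not is_composite[i]:
            # cross out the multiples of the prime i, starting at i*i
            for j in range(i * i, upto + 1, i):
                is_composite[j] = 1
    # there are upto - 1 numbers in 2..upto; subtract the composites
    return (upto - 1) - sum(is_composite[2:])


print(count_primes_sieve(100))
```

The sieve trades memory for speed: it needs one byte per number up to the threshold.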

Categories: python