Archive for March, 2011

Python cheat sheets

March 31, 2011
Categories: python

Princess Python

March 29, 2011

“Princess Python (Zelda DuBois) is a fictional character, a supervillain in the Marvel Comics Universe, most notably as a member of the Circus of Crime. She has no superhuman abilities, but rather relies on her snake charming skills and her 25-foot (7.6 m) pet rock python snake. She has fought several superheroes, ranging from the Avengers to Iron Man. She is also notable as she was the first villainess that Spider-Man has faced. She first appeared in The Amazing Spider-Man #22 (Mar 1965).” (source)

More images here.

Useful Python modules

March 27, 2011
Categories: python

Static HTML filelist generator

March 26, 2011

Problem

On our webserver I had some files in a directory that I wanted to browse online. However, the webserver didn’t generate a list of links to these files when I pointed the browser at this directory.

Solution

Idea: write a script that traverses the current directory recursively and prints every file together with a link to it. I didn’t need any fancy features, so the script produces very simple output.

Download link: here. Source code:

#!/usr/bin/env python

# index_gen.py

import os
import os.path
import sys

class SimpleHtmlFilelistGenerator:
    # start from this directory
    base_dir = None

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def print_html_header(self):
        print """<html>
<body>
<code>
""",

    def print_html_footer(self):
        print """</code>
</body>
</html>
""",

    def processDirectory ( self, args, dirname, filenames ):
        print '<strong>', dirname + '/', '</strong>', '<br>'
        for filename in sorted(filenames):
            rel_path = os.path.join(dirname, filename)
            if rel_path in [sys.argv[0], './index.html']:
                continue   # exclude this generator script and the generated index.html
            if os.path.isfile(rel_path):
                href = "<a href=\"%s\">%s</a>" % (rel_path, filename)
                print '&nbsp;' * 4, href, '<br>'

    def start(self):
        self.print_html_header()
        os.path.walk( self.base_dir, self.processDirectory, None )
        self.print_html_footer()

# class SimpleHtmlFilelistGenerator

if __name__ == "__main__":
    base_dir = '.'
    if len(sys.argv) > 1:
        base_dir = sys.argv[1]
    gen = SimpleHtmlFilelistGenerator(base_dir)
    gen.start()

Usage:

Simply launch it in the directory where you need the filelist. Redirect the output to index.html:

./index_gen.py >index.html

Don’t forget to set the permissions of index.html so that the webserver can read it (chmod 644 index.html).
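
The script above is written for Python 2 (print statement, os.path.walk()). Here is a minimal sketch of the same idea for Python 3; it is not the original script. The function name build_index is made up for this example, and escaping the names with html.escape() and urllib.parse.quote() is an addition that the original does not have.

```python
# Python 3 sketch of the filelist generator (hypothetical helper, not the original script)
import html
import os
from urllib.parse import quote


def build_index(base_dir="."):
    """Return a very simple HTML listing of all files below base_dir."""
    # the generated index.html in base_dir should not list itself
    skip = os.path.normpath(os.path.join(base_dir, "index.html"))
    lines = ["<html>", "<body>", "<code>"]
    for dirname, dirnames, filenames in os.walk(base_dir):
        dirnames.sort()  # os.walk() then visits subdirectories in sorted order
        lines.append("<strong>%s/</strong><br>" % html.escape(dirname))
        for filename in sorted(filenames):
            rel_path = os.path.join(dirname, filename)
            if os.path.normpath(rel_path) == skip:
                continue
            # URL-encode the link target, HTML-escape the visible name
            target = quote(rel_path.replace(os.sep, "/"))
            lines.append('%s<a href="%s">%s</a><br>'
                         % ("&nbsp;" * 4, target, html.escape(filename)))
    lines += ["</code>", "</body>", "</html>"]
    return "\n".join(lines)
```

Calling print(build_index(".")) and redirecting the output to index.html gives the same workflow as above.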

Update (20141202)
This version here works but it’s quite primitive. We made a much better version; check it out here: https://pythonadventures.wordpress.com/2014/12/02/static-html-file-browser-for-dropbox/.

Traversing a directory recursively

March 26, 2011

Problem

You want to traverse a directory recursively.

Solution #1

#!/usr/bin/env python

import os

def processDirectory ( args, dirname, filenames ):
    print dirname
    for filename in filenames:
        print " " * 4 + filename

base_dir = "."
os.path.walk( base_dir, processDirectory, None )

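os.path.walk() works with a callback: processDirectory() will be called for each directory encountered. Note that os.path.walk() exists only in Python 2; it was removed in Python 3 in favor of os.walk().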

Sample output with base_dir = '/etc':

/etc/gimp
    2.0
/etc/gimp/2.0
    ps-menurc
    sessionrc
    unitrc
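
Under Python 3, where only os.walk() is available, the same callback style can be emulated with a few lines. This is just a sketch: walk_with_callback is a hypothetical helper, the names are sorted for reproducible output (os.path.walk() did not sort), and, unlike os.path.walk(), changing the names list inside the callback does not prune the traversal.

```python
import os


def walk_with_callback(base_dir, visit, arg=None):
    """Call visit(arg, dirname, names) for every directory below base_dir.

    `names` holds subdirectories and files together, as with os.path.walk().
    """
    for dirname, dirnames, filenames in os.walk(base_dir):
        visit(arg, dirname, sorted(dirnames + filenames))


def process_directory(arg, dirname, names):
    print(dirname)
    for name in names:
        print(" " * 4 + name)
```

Usage: walk_with_callback(".", process_directory).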

Solution #2, manual method (update 20110509)

#!/usr/bin/env python

import os
import sys

symlinks = 0

def skip_symlink(entry):
    """Symlinks are skipped."""
    global symlinks
    symlinks += 1
    print "# skip symlink {0}".format(entry)


def process_dir(d, depth):
    print d, "[DIR]"


def process_file(f, depth):
    if depth > 0:
        print ' ' * 4, 
    print f


def traverse(directory, depth=0):
    """Traverse directory recursively. Symlinks are skipped."""
    #content = [os.path.abspath(os.path.join(directory, x)) for x in os.listdir(directory)]
    try:
        content = [os.path.join(directory, x) for x in os.listdir(directory)]
    except OSError:
        print >>sys.stderr, "# problem with {0}".format(directory)
        return

    dirs = sorted([x for x in content if os.path.isdir(x)])
    files = sorted([x for x in content if os.path.isfile(x)])

    for d in dirs:
        if os.path.islink(d):
            skip_symlink(d)
            continue
        # else
        process_dir(d, depth)
        traverse(d, depth + 1)
    
    for f in files:
        if os.path.islink(f):
            skip_symlink(f)
            continue
        # else
        process_file(f, depth)


def main():
    """Controller."""
    start_dir = '.'
    traverse(start_dir)
    print "# skipped symlinks: {0}".format(symlinks)

####################

if __name__ == "__main__":
    main()

Solution #3 (update 20130705)

import os
import sys

for root, _, files in os.walk(sys.argv[1]):
    for f in files:
        fname = os.path.join(root, f)
        print fname
        # Remove *.pyc files, compress images, count lines of code
        # calculate folder size, check for repeated files, etc.
        # A lot of nice things can be done here
        # credits: m_tayseer @reddit
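
As an illustration of one of the ideas listed in the comments above (calculating the folder size), here is a small Python 3 sketch built on the same os.walk() loop. The function name folder_size is made up for this example, and skipping symlinks is a choice borrowed from Solution #2.

```python
import os


def folder_size(base_dir):
    """Return the total size in bytes of the regular files below base_dir."""
    total = 0
    for root, _, files in os.walk(base_dir):
        for f in files:
            fname = os.path.join(root, f)
            if os.path.islink(fname):
                continue  # symlinks are skipped, as in Solution #2
            total += os.path.getsize(fname)
    return total
```

Usage: print(folder_size(".")).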

Get the RottenTomatoes rating of a movie

March 26, 2011

Problem

In the previous post we saw how to extract the IMDB rating of a movie. Now let’s see the same thing with the RottenTomatoes website.

Solution

Download link: https://github.com/jabbalaci/Movie-Ratings. Source code:

#!/usr/bin/env python

# RottenTomatoesRating
# Laszlo Szathmary, 2011 (jabba.laci@gmail.com)

from BeautifulSoup import BeautifulSoup
import sys
import re
import urllib
import urlparse

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

class RottenTomatoesRating:
    # title of the movie
    title = None
    # RT URL of the movie
    url = None
    # RT tomatometer rating of the movie
    tomatometer = None
    # RT audience rating of the movie
    audience = None
    # Did we find a result?
    found = False

    # for fetching webpages
    myopener = MyOpener()
    # Should we search and take the first hit?
    search = True

    # constant
    BASE_URL = 'http://www.rottentomatoes.com'
    SEARCH_URL = '%s/search/full_search.php?search=' % BASE_URL

    def __init__(self, title, search=True):
        self.title = title
        self.search = search
        self._process()

    def _search_movie(self):
        movie_url = ""

        url = self.SEARCH_URL + self.title
        page = self.myopener.open(url)
        result = re.search(r'(/m/.*)', page.geturl())
        if result:
            # if we are redirected
            movie_url = result.group(1)
        else:
            # if we get a search list
            soup = BeautifulSoup(page.read())
            ul = soup.find('ul', {'id' : 'movie_results_ul'})
            if ul:
                div = ul.find('div', {'class' : 'media_block_content'})
                if div:
                    movie_url = div.find('a', href=True)['href']

        return urlparse.urljoin( self.BASE_URL, movie_url )

    def _process(self):
        if not self.search:
            movie = '_'.join(self.title.split())

            url = "%s/m/%s" % (self.BASE_URL, movie)
            soup = BeautifulSoup(self.myopener.open(url).read())
            if soup.find('title').contents[0] == "Page Not Found":
                url = self._search_movie()
        else:
            url = self._search_movie()

        try:
            self.url = url
            soup = BeautifulSoup( self.myopener.open(url).read() )
            self.title = soup.find('meta', {'property' : 'og:title'})['content']
            if self.title: self.found = True

            self.tomatometer = soup.find('span', {'id' : 'all-critics-meter'}).contents[0]
            self.audience = soup.find('span', {'class' : 'meter popcorn numeric '}).contents[0]

            if self.tomatometer.isdigit():
                self.tomatometer += "%"
            if self.audience.isdigit():
                self.audience += "%"
        except Exception:
            # the page layout did not match; keep whatever was parsed so far
            pass

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print "Usage: %s 'Movie title'" % (sys.argv[0])
    else:
        rt = RottenTomatoesRating(sys.argv[1])
        if rt.found:
            print rt.url
            print rt.title
            print rt.tomatometer
            print rt.audience

Usage:

The constructor has an optional parameter, search, which is True by default. With search=True, the script first uses the search function of the RT website and then follows the first hit. With search=False, the script tries to access the movie page directly; if that fails, it falls back to the first case, i.e. it tries to find the movie via search.

Which version is better? It depends :) If there are several movies with the same title, then with search=True you will get the latest movie. If search=False, then you will usually get the oldest movie with that title.

For instance, for me “Star Wars” means episode 4, thus with the title “star wars”, search=False will return the relevant hit. But for “up in the air”, I would like to get the movie from 2009, not from 1940, thus in this case search=True would be better.

If you are in doubt, use the default case, i.e. search=True.
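
The two modes differ only in the first URL that is requested. The following Python 3 sketch shows how those URLs are put together; first_url is a hypothetical helper, not part of the script. The URL patterns are the ones used in the 2011 script and may no longer match the live site. Encoding the title with quote_plus() is an addition; the original simply appends the raw title.

```python
from urllib.parse import quote_plus

BASE_URL = "http://www.rottentomatoes.com"
SEARCH_URL = "%s/search/full_search.php?search=" % BASE_URL


def first_url(title, search=True):
    """Return the first URL that would be requested for the given title."""
    if search:
        # search mode: query the site search, then follow the first hit
        return SEARCH_URL + quote_plus(title)
    # direct mode: guess the movie page by joining the words with '_'
    return "%s/m/%s" % (BASE_URL, "_".join(title.split()))
```

For example, first_url("star wars", search=False) builds the direct page URL, while first_url("up in the air") builds the search URL.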

Update (20110329):

You will find the latest version of the script at https://github.com/jabbalaci/Movie-Ratings.

Get the IMDB rating of a movie

March 25, 2011

Problem

You want to get the IMDB rating of a movie. For instance, you have a large collection of movies, and you want to figure out their ratings.

Solution

Here is a script that extracts the rating of a movie from IMDB. The script was inspired by the work of Rag Sagar.

Download link: https://github.com/jabbalaci/Movie-Ratings. Source code:

#!/usr/bin/env python

# ImdbRating

import os
import sys
import re
import urllib
import urlparse

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

class ImdbRating:
    # title of the movie
    title = None
    # IMDB URL of the movie
    url = None
    # IMDB rating of the movie
    rating = None
    # Did we find a result?
    found = False

    # constant
    BASE_URL = 'http://www.imdb.com'

    def __init__(self, title):
        self.title = title
        self._process()

    def _process(self):
        movie = '+'.join(self.title.split())
        br = Browser()
        url = "%s/find?s=tt&q=%s" % (self.BASE_URL, movie)
        br.open(url)

        if re.search(r'/title/tt.*', br.geturl()):
            self.url = "%s://%s%s" % urlparse.urlparse(br.geturl())[:3]
            soup = BeautifulSoup( MyOpener().open(url).read() )
        else:
            link = br.find_link(url_regex = re.compile(r'/title/tt.*'))
            res = br.follow_link(link)
            self.url = urlparse.urljoin(self.BASE_URL, link.url)
            soup = BeautifulSoup(res.read())

        try:
            self.title = soup.find('h1').contents[0].strip()
            self.rating = soup.find('span',attrs='rating-rating').contents[0]
            self.found = True
        except Exception:
            # the page layout did not match; found stays False
            pass

# class ImdbRating

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print "Usage: %s 'Movie title'" % (sys.argv[0])
    else:
        imdb = ImdbRating(sys.argv[1])
        if imdb.found:
            print imdb.url
            print imdb.title
            print imdb.rating
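
Two small steps of the script are easy to try in isolation: building the search URL from the title, and reducing the URL of a redirect to scheme, host and path. Here is a Python 3 sketch of both; search_url and strip_query are hypothetical helper names, and the title ID in the usage note is a placeholder, not a real movie.

```python
from urllib.parse import urlparse

BASE_URL = "http://www.imdb.com"


def search_url(title):
    """Build the title-search URL: the words of the title joined by '+'."""
    return "%s/find?s=tt&q=%s" % (BASE_URL, "+".join(title.split()))


def strip_query(url):
    """Keep only scheme, host and path, dropping query string and fragment."""
    # urlparse() returns (scheme, netloc, path, params, query, fragment)
    return "%s://%s%s" % urlparse(url)[:3]
```

For example, strip_query("http://www.imdb.com/title/tt0000000/?ref=x") keeps only the part up to and including /title/tt0000000/.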

Update (20110329):

You will find the latest version of the script at https://github.com/jabbalaci/Movie-Ratings.

Categories: python