Posts Tagged ‘imdbpy’

Get the IMDb Top 250 list

August 19, 2016

Problem
You want to get the list of the Top 250 movies from IMDb.

Solution
There is a Top 250 list here: http://akas.imdb.com/chart/top. To access IMDb info, I use the excellent imdbpy package. It has a get_top250_movies() function but it returns an empty list :)

During my research I found a post on Stack Overflow suggesting that one should download the official IMDb dump. The Top 250 list is in the file ratings.list.gz. However, this file doesn’t contain the IMDb IDs of the movies, so it’s of no use here :(

There was only one solution left: scraping. Here is the Python code that did the job for me. I didn’t use BeautifulSoup, just plain ol’ regular expressions:

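import re

import requests

top250_url = "http://akas.imdb.com/chart/top"

def get_top250():
    """Return the IMDb IDs found on the chart page (digits only, no "tt" prefix)."""
    r = requests.get(top250_url)
    result = []
    # splitlines() already drops the line endings, so no rstrip() is needed
    for line in r.text.splitlines():
        # each chart entry carries its ID in a data-titleid attribute
        m = re.search(r'data-titleid="tt(\d+?)">', line)
        if m:
            result.append(m.group(1))
    return result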

It returns the IMDb IDs of the Top 250 movies (the digits only, without the “tt” prefix). Then, using the imdbpy package, you can look up all the information about a movie, since you have its ID.
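As an illustration of that last step, here is a minimal sketch of how the IDs could be fed to imdbpy. The helper names imdb_url() and get_titles() are my own and hypothetical; the imdbpy calls used are IMDb(), get_movie() and the movie["title"] lookup. It assumes imdbpy is installed and that the IDs are strings of digits, as returned by the function above.

```python
def imdb_url(movie_id):
    """Build the title-page URL from a numeric ID such as "0111161"."""
    return "http://www.imdb.com/title/tt%s/" % movie_id


def get_titles(movie_ids, limit=5):
    """Look up the titles of the first `limit` IDs via imdbpy (needs network access)."""
    from imdb import IMDb  # imported lazily so imdb_url() works without imdbpy

    ia = IMDb()
    titles = []
    for movie_id in movie_ids[:limit]:
        movie = ia.get_movie(movie_id)  # expects the ID without the "tt" prefix
        titles.append(movie["title"])
    return titles
```

Calling get_titles(get_top250()) would fetch the first five titles. The limit is there because every get_movie() call is a separate request to IMDb, so fetching all 250 takes a while.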

Categories: python

Get the IMDB rating of a movie

March 25, 2011

Problem

You want to get the IMDB rating of a movie. For instance, you have a large collection of movies, and you want to figure out their ratings.

Solution

Here is a script that extracts the rating of a movie from IMDB. The script was inspired by the work of Rag Sagar.

Download link: https://github.com/jabbalaci/Movie-Ratings. Source code:

#!/usr/bin/env python

# ImdbRating
# Note: this is a Python 2 script; it needs mechanize and BeautifulSoup 3.

import os
import sys
import re
import urllib
import urlparse

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

class ImdbRating:
    # title of the movie
    title = None
    # IMDB URL of the movie
    url = None
    # IMDB rating of the movie
    rating = None
    # Did we find a result?
    found = False

    # constant
    BASE_URL = 'http://www.imdb.com'

    def __init__(self, title):
        self.title = title
        self._process()

    def _process(self):
        movie = '+'.join(self.title.split())
        br = Browser()
        url = "%s/find?s=tt&q=%s" % (self.BASE_URL, movie)
        br.open(url)

        if re.search(r'/title/tt.*', br.geturl()):
            self.url = "%s://%s%s" % urlparse.urlparse(br.geturl())[:3]
            soup = BeautifulSoup( MyOpener().open(url).read() )
        else:
            link = br.find_link(url_regex = re.compile(r'/title/tt.*'))
            res = br.follow_link(link)
            self.url = urlparse.urljoin(self.BASE_URL, link.url)
            soup = BeautifulSoup(res.read())

        try:
            self.title = soup.find('h1').contents[0].strip()
            self.rating = soup.find('span', attrs='rating-rating').contents[0]
            self.found = True
        except Exception:
            # page layout not recognized: leave self.found as False
            pass

# class ImdbRating

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print "Usage: %s 'Movie title'" % (sys.argv[0])
    else:
        imdb = ImdbRating(sys.argv[1])
        if imdb.found:
            print imdb.url
            print imdb.title
            print imdb.rating
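One thing to watch for: the script builds the search query by joining the words of the title with “+”, which leaves characters such as “&” unescaped. Below is a minimal sketch of an alternative, assuming Python 3 (the script above is Python 2). build_search_url() is a hypothetical helper of mine, not part of the script.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.imdb.com"


def build_search_url(title):
    """Return the IMDb title-search URL for a movie title."""
    # collapse runs of whitespace, as the script does with split()
    query = " ".join(title.split())
    # urlencode() turns spaces into "+" and escapes characters such as "&"
    return "%s/find?%s" % (BASE_URL, urlencode({"s": "tt", "q": query}))
```

For a title like “Fast & Furious” the ampersand is then sent as part of the query instead of being read as a parameter separator.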

Update (20110329):

You will find the latest version of the script at https://github.com/jabbalaci/Movie-Ratings.

Categories: python