
Get the IMDb Top 250 list

You want to get the list of the Top 250 movies from IMDb.

There is a Top 250 list here: http://akas.imdb.com/chart/top. To access IMDb info, I use the excellent imdbpy package. It has a get_top250_movies() function but it returns an empty list :)

During my research I found this post on Stack Overflow. It suggests downloading the official IMDb dump from here. The Top 250 list is in the file ratings.list.gz. However, this file doesn’t contain the IMDb IDs of the movies, so it’s of no use for this purpose :(

There was only one solution left: let’s do some scraping. Here is the Python code that did the job for me. I didn’t use BeautifulSoup, just plain ol’ regular expressions:

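import re

import requests

top250_url = "http://akas.imdb.com/chart/top"

def get_top250():
    """Return the IMDb IDs (digits only, without the "tt" prefix) found on the chart page."""
    r = requests.get(top250_url)
    html = r.text.split("\n")
    result = []
    for line in html:
        # each chart entry carries its ID in a data-titleid attribute
        m = re.search(r'data-titleid="tt(\d+?)">', line)
        if m:
            _id = m.group(1)
            result.append(_id)  # collect the ID; without this the list stays empty
    return result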
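
To check the regular expression without hitting the network, here is a minimal sketch that runs the same pattern over a made-up HTML snippet. The sample markup and the helper name extract_ids are assumptions for illustration only; the real page may look different. Unlike the function above, this variant scans the whole text at once and skips repeated IDs.

```python
import re

# Same pattern as in get_top250(); compiled once for reuse.
TITLE_ID = re.compile(r'data-titleid="tt(\d+?)">')

def extract_ids(html):
    """Return the title IDs found in html, in page order, without repeats."""
    seen = set()
    ids = []
    for m in TITLE_ID.finditer(html):
        _id = m.group(1)
        if _id not in seen:
            seen.add(_id)
            ids.append(_id)
    return ids

# Hypothetical sample markup, not copied from the real IMDb page.
sample = '''
<td><div data-titleid="tt0111161"></div></td>
<td><div data-titleid="tt0068646"></div></td>
<td><div data-titleid="tt0111161"></div></td>
'''

print(extract_ids(sample))  # ['0111161', '0068646']
```

The get_top250() function above applies the same pattern to the live page, line by line.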

get_top250() returns the IMDb IDs of the Top 250 movies. Once you have the movie IDs, you can use the imdbpy package to look up all the information about each movie.
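
The lookup step could look roughly like the sketch below. The helper lookup_titles and the FakeClient stand-in are hypothetical names I made up so the example runs offline; the only thing assumed about the real client is that it has a get_movie(movie_id) method returning an object you can call .get('title') on, which is how imdbpy's IMDb() objects behave as far as I know.

```python
def lookup_titles(movie_ids, client):
    """Map each IMDb ID to its title using an imdbpy-style client.

    client is assumed to expose get_movie(movie_id) and to return a
    mapping-like object with a 'title' key.
    """
    titles = {}
    for movie_id in movie_ids:
        movie = client.get_movie(movie_id)
        titles[movie_id] = movie.get('title')
    return titles


class FakeClient:
    """Offline stand-in for the real client (illustration only)."""
    _data = {'0111161': 'The Shawshank Redemption'}

    def get_movie(self, movie_id):
        return {'title': self._data.get(movie_id)}


print(lookup_titles(['0111161'], FakeClient()))
# {'0111161': 'The Shawshank Redemption'}
```

With the real package you would create the client with from imdb import IMDb and ia = IMDb(), then call lookup_titles(get_top250(), ia). Expect one network request per movie, so fetching all 250 takes a while.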

