From IMDb you want to get the list of the Top 250 movies.
During my research I found this post on SO. It suggests that one should download the official IMDb dump from here. The Top 250 list is in the file
ratings.list.gz. However, this file doesn’t contain the IMDb IDs of the movies, so it’s good for nothing :(
There was only one solution left: let’s do some scraping. Here is the Python code that did the job for me. I didn’t use BeautifulSoup, just plain ol’ regular expressions:
    import re

    import requests

    top250_url = "http://akas.imdb.com/chart/top"

    def get_top250():
        r = requests.get(top250_url)
        html = r.text.split("\n")
        result = []    # numeric IMDb IDs, in page order
        for line in html:
            line = line.rstrip("\n")
            # capture the digits that follow "tt" in the title ID attribute
            m = re.search(r'data-titleid="tt(\d+?)">', line)
            if m:
                _id = m.group(1)
                result.append(_id)
        #
        return result
It returns the IMDb IDs of the Top 250 movies. Then, using the imdbpy package, you can look up all the information about a movie, since you have its ID.
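The regular expression can be tried offline before hitting the site. The sketch below runs the same pattern over a hand-written HTML fragment. The fragment and the helper name extract_ids are made up for illustration, and the real page markup may differ or may have changed since.

```python
import re

# Same pattern as in get_top250(): capture the digits after "tt".
TITLE_ID = re.compile(r'data-titleid="tt(\d+?)">')

# Hypothetical fragment, shaped only to match the pattern above;
# it is not copied from the real IMDb page.
SAMPLE_HTML = """\
<td class="titleColumn">
<div class="wlb_ribbon" data-titleid="tt0111161"></div>
<div class="wlb_ribbon" data-titleid="tt0068646"></div>
<div class="other">no id here</div>
"""


def extract_ids(html):
    """Return the numeric IDs found in html, at most one per line."""
    ids = []
    for line in html.split("\n"):
        m = TITLE_ID.search(line)
        if m:
            ids.append(m.group(1))
    return ids


print(extract_ids(SAMPLE_HTML))
```

The IDs come back as strings without the tt prefix, so leading zeros are preserved.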
- IMDB -> JSON, if you want to work with the dump files
See the Jellyfish project: “Jellyfish is a python library for doing approximate and phonetic matching of strings”.
Jellyfish implements the following algorithms: Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance.
See the project page for more info.
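To give a feel for what the first algorithm on that list computes, here is a minimal pure-Python sketch of Levenshtein distance. It is not Jellyfish's implementation, and the function name levenshtein is chosen here just for illustration; for real work, use the library.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and b[:j]; for the empty prefix that is simply j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("jellyfish", "smellyfish"))
```

It keeps only two rows of the usual dynamic-programming table, so memory grows with the length of b rather than with the product of both lengths.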
lxml doesn’t want to compile on Ubuntu 16.04.
$ sudo apt install libxml2-dev libxslt1-dev python-dev zlib1g-dev
I was getting the error “/usr/bin/ld: cannot find -lz”. It turned out that the package zlib1g-dev was the cure…
Note that this is for Python 2. For Python 3 you might need to install the package
I’ve updated my Digital Ocean Flask notes on GitHub. Now it includes information about installing a Flask webapp on a Digital Ocean Ubuntu 16.04 box using Systemd.