Get the IMDb Top 250 list
Problem
You want to get the list of the Top 250 movies from IMDb.
Solution
There is a Top 250 list here: http://akas.imdb.com/chart/top. To access IMDb info, I use the excellent imdbpy package. It has a get_top250_movies() function, but it returns an empty list :)
During my research I found this post on SO. It suggests that one should download the official IMDb dump from here. The Top 250 list is in the file ratings.list.gz. However, this file doesn’t contain the IMDb IDs of the movies, so it’s good for nothing :(
There was only one solution left: let’s do some scraping. Here is the Python code that did the job for me. I didn’t use BeautifulSoup, just plain ol’ regular expressions:
import re

import requests

top250_url = "http://akas.imdb.com/chart/top"


def get_top250():
    # download the chart page and process it line by line
    r = requests.get(top250_url)
    html = r.text.split("\n")
    result = []
    for line in html:
        line = line.rstrip("\n")
        # the numeric part of the IMDb ID follows the "tt" prefix
        m = re.search(r'data-titleid="tt(\d+?)">', line)
        if m:
            _id = m.group(1)
            result.append(_id)
    return result
It returns the IMDb IDs of the Top 250 movies. Then, since you have the movie IDs, you can use the imdbpy package to retrieve all the information about each movie.
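As a variation on the scraper above, the ID extraction can be separated from the download, so it can be tried on a saved page without touching the network. This is only a sketch: the helper name extract_title_ids and the sample HTML are made up for illustration, the pattern is slightly looser than the original (it doesn't require the closing ">"), and it assumes the page still marks each movie with a data-titleid attribute, which may no longer be the case.

```python
import re

# Same idea as in the scraper: capture the digits after the "tt" prefix.
TITLE_ID_RE = re.compile(r'data-titleid="tt(\d+)"')


def extract_title_ids(html):
    """Return the IMDb IDs found in html, in page order, without duplicates."""
    seen = set()
    ids = []
    for title_id in TITLE_ID_RE.findall(html):
        if title_id not in seen:
            seen.add(title_id)
            ids.append(title_id)
    return ids


# Hypothetical snippet that only imitates the markup the regex expects.
sample = (
    '<div class="wlb_ribbon" data-titleid="tt0111161"></div>\n'
    '<div class="seen-widget" data-titleid="tt0111161"></div>\n'
    '<div class="wlb_ribbon" data-titleid="tt0068646"></div>\n'
)
print(extract_title_ids(sample))  # ['0111161', '0068646']
```

Dropping duplicates is a precaution in case the same attribute appears more than once per movie. Each ID can then be handed to imdbpy to look up the movie.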
Links
- IMDB -> JSON, if you want to work with the dump files
string distances
See the Jellyfish project: “Jellyfish is a python library for doing approximate and phonetic matching of strings”.
Jellyfish implements the following algorithms: Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance.
See the project page for more info.
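To give an idea of what two of these measures compute, here is a minimal pure-Python sketch of the Levenshtein and Hamming distances. It does not use Jellyfish and is not how the library implements them; the function names levenshtein and hamming are my own, and the Hamming version follows the textbook definition that only compares strings of equal length.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


print(levenshtein("jellyfish", "smellyfish"))  # 2
print(hamming("karolin", "kathrin"))           # 3
```

For real work the library is the better choice; this only shows the idea behind the numbers it returns.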
compile lxml on Ubuntu 16.04
Problem
lxml doesn’t want to compile on Ubuntu 16.04.
Solution
$ sudo apt install libxml2-dev libxslt1-dev python-dev zlib1g-dev
I was getting the error “/usr/bin/ld: cannot find -lz”. It turned out that the package zlib1g-dev was the cure…
Note that this is for Python 2. For Python 3 you might need to install the package python3-dev.
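Once the build goes through, a quick way to confirm that the compiled module really loads is to import it from Python. The snippet below is just a sketch; the helper name lxml_version is made up, and it only reports whether lxml can be imported, not why a build failed.

```python
def lxml_version():
    """Return lxml's version tuple, or None if lxml cannot be imported."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LXML_VERSION


version = lxml_version()
if version is None:
    print("lxml is not importable; the build probably failed")
else:
    print("lxml is available, version " + ".".join(map(str, version)))
```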
installing a Flask webapp on a Digital Ocean Ubuntu 16.04 box using Systemd
I’ve updated my Digital Ocean Flask notes on GitHub. They now include information about installing a Flask webapp on a Digital Ocean Ubuntu 16.04 box using Systemd.