Archive for November, 2013

Python packages you should know about

November 13, 2013

The site http://pythonwheels.com/ shows the top 360 most-downloaded packages on PyPI.

The site marks whether each package is distributed in the so-called wheel format, but at the moment we are not interested in that :)

So, if you are bored and you want to discover some cool packages, just start with this list.

Python video channel

November 11, 2013

http://www.youtube.com/user/sentdex/videos

Some topics:

  • Pygame basics
  • Image recognition
  • Machine learning
  • etc.
Categories: python

You can compare two dictionaries

November 8, 2013

In Python you can compare two dictionaries. Proof:

>>> a
{'a': 1, 'c': 3}
>>> b
{'a': 1, 'c': 3}
>>> a == b
True
>>> b['c'] = 4
>>> a
{'a': 1, 'c': 3}
>>> b
{'a': 1, 'c': 4}
>>> a == b
False

(Note that comparison works between two strings and between two lists too.)
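
A small sketch of what == checks for the built-in containers (the variable names are only for illustration). Dictionaries compare by content, lists also by order, and the comparison goes into nested containers as well:

```python
# Dictionaries are equal when they hold the same key/value pairs;
# the order in which the keys were inserted does not matter.
a = {'a': 1, 'c': 3}
b = {'c': 3, 'a': 1}
dicts_equal = (a == b)

# Lists are compared element by element, so the order matters.
lists_equal = ([1, 2, 3] == [1, 2, 3])
lists_reordered = ([1, 2, 3] == [3, 2, 1])

# The comparison is recursive: nested containers are compared by value too.
nested_equal = ({'k': [1, {'x': 2}]} == {'k': [1, {'x': 2}]})

print(dicts_equal, lists_equal, lists_reordered, nested_equal)
```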

Thanks to Eszter S. for the tip.

Extracting relevant images from XXX galleries using text clustering

November 8, 2013

Warning! This post includes some links to NSFW (not suitable for work) galleries. You had better study this post at home :)


Problem
On the web you can find lots of free XXX galleries. There are also sites that collect these galleries and update their lists daily. When you visit such a gallery, you get either (1) images, or (2) links to images through thumbnails. But! Besides these relevant images, there is always some noise: banners, other thumbnails, links to other galleries, etc.

How can we write a universal scraper that takes the URL of a gallery and extracts just the relevant images without any noise? How can we separate real content from noise?

Example
Let’s see a soft gallery: http://biertijd.xxx/index.php?itemid=44329 (NSFW!). Extracting all the images we get the following list:

  "urls": [
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
    "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg", 
    "http://biertijd.com/nucleus/plugins/rating/4.gif", 
    "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=0da28fe49a3d6d2fa7e17d15b9a05d28", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
    "http://s4.histats.com/stats/0.gif?37757&1"
  ]

As you can see, the relevant images conform to this pattern: “http://media01.biertijd.com/galleries/metart/131107_night/{01..20}.jpg”. Altogether we have 35 images, of which only 20 are relevant. How do we find just these 20?

Solution
The good news is that the relevant images usually follow a pattern, so their URLs don’t differ much. As seen above, in this example only the numbering of the images was different.

Relevant images can be separated from the others using text clustering. I found a great solution here by Rajesh M. Rajesh uses this method for clustering article titles. We will use it to cluster URLs, which are also just strings.

I put my solution in a class. Here it is:

#!/usr/bin/env python

# based on:
# http://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/

from helper import lev_dist as distance
from pprint import pprint

DISTANCE = 10

class Cluster(object):
    """
    Clustering a list of (sorted!) strings.

    I use it for clustering URLs. After extracting all the links (or images)
    from a web page, I use this class to group together similar URLs. It also
    identifies the largest cluster.
    """
    def __init__(self):
        self.clusters = {'clusters': {}}

    def clustering(self, elems):
        """
        Clusterize the input elements.

        Input: list of words (e.g. list of URLs). It MUST be sorted!

        Process: build a dictionary where keys are cluster IDs (int) and
                 values are lists (elements in the given cluster)
        """
        clusters = {}
        cid = 0

        for i, line in enumerate(elems):
            if i == 0:
                clusters[cid] = []
                clusters[cid].append(line)
            else:
                last = clusters[cid][-1]
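                if distance(last, line) <= DISTANCE:
                    # similar enough: keep it in the current cluster
                    clusters[cid].append(line)
                else:
                    # too different: open a new cluster
                    cid += 1
                    clusters[cid] = [line]

        # NOTE: the lines between the comparison above and the one in
        # get_largest() were lost from the post; they are reconstructed
        # from the description and the sample output ("<=" is a guess).
        largest = self.get_largest(clusters)
        clusters['number_of_clusters'] = len(clusters)
        clusters['largest'] = largest
        self.clusters = {'clusters': clusters}

    def get_largest(self, clusters):
        """
        Return the elements of the cluster that has the most members.
        """
        maxi_k = None
        maxi_v = 0
        for k, v in clusters.items():
            if len(v) > maxi_v: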
                    maxi_v = len(v)
                    maxi_k = k
        #
        return clusters[maxi_k]

    def show(self):
        pprint(self.clusters)

def get_clusters(elems):
    elems = sorted(elems)
    cl = Cluster()
    cl.clustering(elems)
    return cl.clusters['clusters']

#############################################################################

if __name__ == "__main__":
    import sys
    template = "https://jabbalaci.herokuapp.com/all_images?url={url}?&clusters=1"
    if len(sys.argv) == 1:
        print("Usage: {0} URL".format(sys.argv[0]))
        sys.exit(1)
    # else
    url = template.format(url=sys.argv[1])
    import requests
    r = requests.get(url)
    li = sorted(r.json()['urls'])

    cl = Cluster()
    cl.clustering(li)
    cl.show()

The extracted URLs are sorted first. Then they are put in clusters. The idea is simple. Put the first element in the current cluster, which is the first cluster. If the next element is similar to the last element of the current cluster, put it into the same cluster. If it’s different, create a new cluster (it becomes the current cluster) and add the element to it. And so on.

To tell how similar two strings are, we use the Levenshtein distance. You can find an implementation here.
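
For readers who don’t have the linked helper at hand, here is a minimal sketch of the whole idea in one piece. The function names lev_dist and cluster_sorted, the example.com URLs and the “<=” threshold test are my assumptions for illustration; this is not the linked implementation, just the textbook dynamic-programming version of the Levenshtein distance combined with the greedy clustering described above:

```python
def lev_dist(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cluster_sorted(elems, max_dist=10):
    """Greedy clustering: each element is compared only with the
    last element of the current cluster."""
    clusters = []
    for elem in sorted(elems):
        if clusters and lev_dist(clusters[-1][-1], elem) <= max_dist:
            clusters[-1].append(elem)
        else:
            clusters.append([elem])
    return clusters


# Made-up URLs: three numbered images and one unrelated counter.
urls = [
    "http://example.com/gallery/01.jpg",
    "http://example.com/gallery/02.jpg",
    "http://example.com/gallery/03.jpg",
    "http://stats.example.org/counter.gif?id=12345&v=1",
]
groups = cluster_sorted(urls)
largest = max(groups, key=len)
print(largest)
```

Since every element is compared with its predecessor only, the sorting step is essential: it is what places similar URLs next to each other.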

Demo
This method is implemented as a web service. It has two versions: you can cluster links, or you can cluster images. Which one to use? It depends on the gallery. If it includes the relevant images, then extract the images. If it contains thumbnails that point to images, then extract links.

Don’t forget to switch on the “text clustering” option. The output lists the clusters and, to make your life easier, also indicates the largest cluster. In most cases, this is the cluster that contains the relevant images!

Sample output:

...
"clusters": {
    "0": [
      "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=d079746fd366f6f3509532688d595fcb"
    ], 
    "1": [
      "http://biertijd.com/nucleus/plugins/rating/4.gif"
    ], 
    "2": [
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png"
    ], 
    "3": [
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg"
    ], 
    "4": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "5": [
      "http://s4.histats.com/stats/0.gif?37757&1"
    ], 
    "largest": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "number_of_clusters": 6
  }, 
...
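
If you want to process such a response in your own script, something like the following sketch could work. The JSON below is a shortened, made-up response in the same shape as the sample output; only the key names (“clusters”, “largest”, “number_of_clusters”) are taken from it:

```python
import json

# A shortened, made-up response in the same shape as the sample output.
raw = '''
{
  "clusters": {
    "0": ["http://example.com/banner.gif"],
    "1": ["http://example.com/g/01.jpg", "http://example.com/g/02.jpg"],
    "largest": ["http://example.com/g/01.jpg", "http://example.com/g/02.jpg"],
    "number_of_clusters": 2
  }
}
'''
data = json.loads(raw)
clusters = data["clusters"]

# Numbered clusters are stored under string keys ("0", "1", ...);
# "largest" and "number_of_clusters" are extra bookkeeping entries.
numbered = {int(k): v for k, v in clusters.items() if k.isdigit()}
relevant = clusters["largest"]

print(len(numbered), relevant)
```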

Demo for the lazy pigs
I made a page that extracts relevant links/images from a gallery and presents them in a cleaned gallery. It’s available here: https://jabbalaci.herokuapp.com/gallery.

Usage: insert the gallery’s URL then click on the first button. If you click on an image and it’s just a thumbnail, then click on the second button.

It extracts the largest cluster and it gives good results in most cases.

Feedback is welcome.

funny Python snippet

November 5, 2013

Found here.

>>> {}['no lock']
Categories: fun, python

Heroku: strange client IP addresses

November 3, 2013

Problem
In Flask, you can get the client’s IP address with request.remote_addr. If you print this value on Heroku, you will get strange IP addresses that have nothing to do with the client’s IP.

Why?
It’s because your app on Heroku is behind proxies, so it sees the proxies’ IP addresses, not the real client’s IP.

Fortunately there is a fix for this problem here: http://flask.pocoo.org/docs/deploying/others/#proxy-setups. You just need to insert these two lines in the production code (in newer Werkzeug releases ProxyFix is imported from werkzeug.middleware.proxy_fix instead):

from werkzeug.contrib.fixers import ProxyFix
app.wsgi_app = ProxyFix(app.wsgi_app)
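
To see roughly what such a fixer does, here is a stripped-down sketch in plain WSGI, without Flask or Werkzeug. The class ForwardedForFix and the demo app are hypothetical names of mine, and this is not Werkzeug’s implementation: the real ProxyFix is more careful about which header entries it trusts. The idea is simply that the proxy passes the client’s address along in the X-Forwarded-For header, and the middleware copies it back into REMOTE_ADDR:

```python
class ForwardedForFix(object):
    """Hypothetical, minimal WSGI middleware: replace REMOTE_ADDR with
    the first address listed in the X-Forwarded-For header, if any."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        forwarded = environ.get('HTTP_X_FORWARDED_FOR', '')
        addresses = [a.strip() for a in forwarded.split(',') if a.strip()]
        if addresses:
            # keep the proxy's address around, then expose the client's
            environ['original.REMOTE_ADDR'] = environ.get('REMOTE_ADDR')
            environ['REMOTE_ADDR'] = addresses[0]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Tiny WSGI app that answers with the address it sees."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [environ['REMOTE_ADDR'].encode('ascii')]


wrapped = ForwardedForFix(demo_app)
environ = {'REMOTE_ADDR': '10.0.0.7',  # the proxy's address
           'HTTP_X_FORWARDED_FOR': '203.0.113.5, 10.0.0.7'}
body = wrapped(environ, lambda status, headers: None)
print(body)
```

Keep in mind that a client can put anything into this header, so a fix like this only makes sense when the app really runs behind a proxy you trust.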
Categories: python

Determine the dimensions of an image on the web without downloading it entirely

November 3, 2013

Problem
You have a list of image URLs and you want to do something with them. You need their dimensions (width, height) BUT you don’t want to download them completely.

Solution
I found a nice working solution here (see the bottom of the linked page).

I copy the code here for future reference:

#!/usr/bin/env python

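try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen          # Python 2

from PIL import ImageFile  # works with both PIL and Pillow

def getsizes(uri):
    # get file size *and* image size (None if not known)
    file = urlopen(uri)
    try:
        size = file.headers.get("content-length")
        if size:
            size = int(size)
        p = ImageFile.Parser()
        while True:
            # read the image in small chunks and stop as soon as
            # the parser has seen enough to know the dimensions
            data = file.read(1024)
            if not data:
                break
            p.feed(data)
            if p.image:
                return size, p.image.size
        return size, None
    finally:
        # close the connection on every exit path
        file.close()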

##########

if __name__ == "__main__":
    url = "https://upload.wikimedia.org/wikipedia/commons/1/12/Baobob_tree.jpg"
    print(getsizes(url))

Sample output:

(1866490, (1164, 1738))

Where the first value is the size of the file in bytes, and the second is a tuple with width and height of the image in pixels.
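
Why is it enough to read only the beginning of the file? Because image formats store the dimensions near the start. The following sketch shows this for PNG only, using nothing but the standard library. It is a different and much narrower technique than the parser used above (which handles many formats), and the function name png_dimensions and the hand-built header are my own assumptions for illustration:

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_dimensions(first_bytes):
    """Return (width, height) from the first 24 bytes of a PNG file,
    or None if the data does not look like a PNG header."""
    if len(first_bytes) < 24 or first_bytes[:8] != PNG_SIGNATURE:
        return None
    # the first chunk of a PNG file must be IHDR
    if first_bytes[12:16] != b'IHDR':
        return None
    # width and height are big-endian unsigned 32-bit integers
    width, height = struct.unpack('>II', first_bytes[16:24])
    return width, height


# Hand-built header of a hypothetical 1164x1738 PNG (not a real download):
# signature, chunk length (13), chunk type, then width and height.
header = (PNG_SIGNATURE + struct.pack('>I', 13) + b'IHDR'
          + struct.pack('>II', 1164, 1738))
print(png_dimensions(header))
```

For a PNG, 24 bytes are enough. Other formats (JPEG, for instance) keep the dimensions further inside the file, which is why the code above keeps feeding 1024-byte chunks until the parser is satisfied.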

Update (20140406)
I had to figure out the dimensions of some image files on my local filesystem. Here is the slightly modified version of the code above:

import os

from PIL import ImageFile  # works with both PIL and Pillow

def getsizes(fname):
    # get file size *and* image size (None if not known)
    size = os.path.getsize(fname)
    p = ImageFile.Parser()
    # binary mode: the parser expects raw bytes
    with open(fname, 'rb') as file:
        while True:
            data = file.read(1024)
            if not data:
                break
            p.feed(data)
            if p.image:
                return size, p.image.size
    return size, None

Usage:

size = getsizes(fname)[1]
if size:
    pass  # process it