November | 2013 | Python Adventures

Python packages you should know about

November 13, 2013 Jabba Laci

The site http://pythonwheels.com/ shows the top 360 most-downloaded packages on PyPI.

This site marks whether each package is distributed in the so-called wheel format, but at the moment we are not interested in that :)

So, if you are bored and you want to discover some cool packages, just start with this list.

Categories: python Tags: packages, top pypi packages, wheel, wheels

Python video channel

November 11, 2013 Jabba Laci

http://www.youtube.com/user/sentdex/videos

Some topics:

Pygame basics
Image recognition
Machine learning
etc.

Categories: python Tags: video tutorial

You can compare two dictionaries

November 8, 2013 Jabba Laci

In Python you can compare two dictionaries. Proof:

>>> a
{'a': 1, 'c': 3}
>>> b
{'a': 1, 'c': 3}
>>> a == b
True
>>> b['c'] = 4
>>> a
{'a': 1, 'c': 3}
>>> b
{'a': 1, 'c': 4}
>>> a == b
False

(Note that comparison works between two strings and between two lists too.)
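A few more cases, as a small sketch (the variable names are only illustrative): two dictionaries are equal when they hold the same key/value pairs, regardless of the order in which the pairs were inserted, and this holds for nested values too. Lists, on the other hand, are order-sensitive.

```python
a = {'a': 1, 'c': 3}
b = {'c': 3, 'a': 1}      # same key/value pairs, inserted in a different order

# dictionaries are equal when their key/value pairs are equal
same_pairs = (a == b)

# the comparison also descends into nested lists and dictionaries
nested_equal = ({'x': [1, 2], 'y': {'z': 0}} == {'y': {'z': 0}, 'x': [1, 2]})

# lists, unlike dictionaries, care about the order of their elements
lists_equal = ([1, 2] == [2, 1])

# changing a single value makes the dictionaries differ
b['c'] = 4
after_change = (a == b)
```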

Thanks to Eszter S. for the tip.

Categories: python Tags: comparing dictionaries, dict, dictionary

Extracting relevant images from XXX galleries using text clustering

November 8, 2013 Jabba Laci

Warning! This post includes some links to NSFW (not suitable for work) galleries. You had better study this post at home :)

Problem
On the web you can find lots of free XXX galleries. There are also sites that collect these galleries and update their lists daily. When you visit such a gallery, you get either (1) images, or (2) links to images through thumbnails. But! Besides these relevant images, there is always some noise: banners, other thumbnails, links to other galleries, etc.

How can we write a universal scraper that takes the URL of a gallery and extracts just the relevant images without any noise? How can we separate real content from noise?

Example
Let’s see a soft gallery: http://biertijd.xxx/index.php?itemid=44329 (NSFW!). Extracting all the images we get the following list:

  "urls": [
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
    "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg", 
    "http://biertijd.com/nucleus/plugins/rating/4.gif", 
    "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=0da28fe49a3d6d2fa7e17d15b9a05d28", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
    "http://s4.histats.com/stats/0.gif?37757&1"
  ]

As you can see, the relevant images conform to this pattern: "http://media01.biertijd.com/galleries/metart/131107_night/{01..20}.jpg". Altogether we have 35 images, of which only 20 are relevant. How can we find just these 20?

Solution
The good news is that the relevant images usually follow a pattern and thus they don’t differ much. As seen above, in this example only the numbering of the images was different.

Relevant images can be separated from the others using text clustering. I found a great solution here by Rajesh M. Rajesh uses this method for clustering article titles. We will use it to cluster URLs, which are also just strings.

I put my solution in a class. Here it is:

#!/usr/bin/env python

# based on:
# http://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/

from helper import lev_dist as distance
from pprint import pprint

DISTANCE = 10

class Cluster(object):
    """
    Clustering a list of (sorted!) strings.

    I use it for clustering URLs. After extracting all the links (or images)
    from a web page, I use this class to group together similar URLs. It also
    identifies the largest cluster.
    """
    def __init__(self):
        self.clusters = {'clusters': {}}

    def clustering(self, elems):
        """
        Clusterize the input elements.

        Input: list of words (e.g. list of URLs). It MUST be sorted!

        Process: build a dictionary where keys are cluster IDs (int) and
                 values are lists (elements in the given cluster)
        """
        clusters = {}
        cid = 0

        for i, line in enumerate(elems):
            if i == 0:
                clusters[cid] = []
                clusters[cid].append(line)
            else:
                last = clusters[cid][-1]
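                if distance(last, line) <= DISTANCE:
                    # similar to the previous element: same cluster
                    clusters[cid].append(line)
                else:
                    # too different: open a new cluster
                    cid += 1
                    clusters[cid] = [line]

        # (reconstructed: this part of the listing was lost; it follows
        # the description and the sample output of this post)
        number = len(clusters)
        largest = self.get_largest(clusters)
        clusters['number_of_clusters'] = number
        clusters['largest'] = largest
        self.clusters = {'clusters': clusters}

    def get_largest(self, clusters):
        """
        Return the cluster that has the most elements.
        """
        if not clusters:
            return []
        maxi_k, maxi_v = None, -1
        for k, v in clusters.items():
            if len(v) > maxi_v:
                maxi_v = len(v)
                maxi_k = k
        #
        return clusters[maxi_k]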

    def show(self):
        pprint(self.clusters)

def get_clusters(elems):
    elems = sorted(elems)
    cl = Cluster()
    cl.clustering(elems)
    return cl.clusters['clusters']

#############################################################################

if __name__ == "__main__":
    import sys
    template = "https://jabbalaci.herokuapp.com/all_images?url={url}?&clusters=1"
    if len(sys.argv) == 1:
        print "Usage: {0} URL".format(sys.argv[0])
        sys.exit(1)
    # else
    url = template.format(url=sys.argv[1])
    import requests
    r = requests.get(url)
    li = sorted(r.json()['urls'])

    cl = Cluster()
    cl.clustering(li)
    cl.show()

The extracted URLs are sorted first. Then, they are put in clusters. The idea is simple. Put the first element in the current cluster, which is the first cluster. If the next element is similar to the last element of the current cluster, put it into the same cluster. If it’s different, create a new cluster (it will be the current cluster) and add the element to it. And so on.

To tell how similar two strings are, we use the Levenshtein distance. You can find an implementation here.
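Since the helper module is not shown in this post, here is a minimal sketch of how the pieces could fit together. It is an assumption, not the original helper: lev_dist below is a textbook dynamic-programming implementation, cluster_sorted is a hypothetical compact version of the class above, and the example.com URLs are made up.

```python
def lev_dist(a, b):
    """Levenshtein distance: the minimal number of single-character
    insertions, deletions and substitutions that turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))    # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]


def cluster_sorted(elems, max_distance=10):
    """Group sorted strings: an element joins the current cluster if it is
    close enough to the cluster's last element, else it opens a new one."""
    clusters = []
    for elem in sorted(elems):
        if clusters and lev_dist(clusters[-1][-1], elem) <= max_distance:
            clusters[-1].append(elem)
        else:
            clusters.append([elem])
    return clusters


# made-up URLs: three gallery images plus some noise
urls = [
    "http://example.com/skins/default/images/player_logo.png",
    "http://example.com/gallery/02.jpg",
    "http://stats.example.net/0.gif?1",
    "http://example.com/gallery/01.jpg",
    "http://example.com/skins/default/images/player_top1.png",
    "http://example.com/gallery/03.jpg",
]
clusters = cluster_sorted(urls)
largest = max(clusters, key=len)    # the gallery images in this example
```

Sorting matters here: each element is only compared with the last element of the current cluster, so similar URLs have to be neighbours in the input.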

Demo
This method is implemented as a web service. It has two versions: you can cluster links, or you can cluster images. Which one to use? It depends on the gallery. If it includes the relevant images, then extract the images. If it contains thumbnails that point to images, then extract links.

Don’t forget to switch on the “text clustering” option. In the output you will get the clusters, and to make your life easier, the largest cluster is also indicated. In most cases, this is the cluster that contains the relevant images!

Sample output:

...
"clusters": {
    "0": [
      "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=d079746fd366f6f3509532688d595fcb"
    ], 
    "1": [
      "http://biertijd.com/nucleus/plugins/rating/4.gif"
    ], 
    "2": [
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png"
    ], 
    "3": [
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg"
    ], 
    "4": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "5": [
      "http://s4.histats.com/stats/0.gif?37757&1"
    ], 
    "largest": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "number_of_clusters": 6
  }, 
...

Demo for the lazy pigs
I made a page that extracts relevant links/images from a gallery and presents them in a cleaned gallery. It’s available here: https://jabbalaci.herokuapp.com/gallery .

Usage: insert the gallery’s URL then click on the first button. If you click on an image and it’s just a thumbnail, then click on the second button.

It extracts the largest cluster and it gives good results in most cases.

Feedback is welcome.

Links

this post appeared in Python Weekly #114 (Nov. 2013)
discussion @reddit

Categories: python, scraping Tags: nsfw, relevant images, scraper, text clustering, xxx, xxx gallery

funny Python snippet

November 5, 2013 Jabba Laci

Found here.

>>> {}['no lock']

Categories: fun, python

Heroku: strange client IP addresses

November 3, 2013 Jabba Laci

Problem
In Flask, you can get the client’s IP address with request.remote_addr. If you try to print this value on Heroku, you will get strange IP addresses that have nothing to do with the client’s IP.

Why?
It’s because your app on Heroku is behind proxies, so it sees the proxies’ IP addresses, not the real client’s IP.

Fortunately there is a fix for this problem here: http://flask.pocoo.org/docs/deploying/others/#proxy-setups. You just need to insert these two lines in the production code:

from werkzeug.contrib.fixers import ProxyFix
# (in Werkzeug 0.15 and later: from werkzeug.middleware.proxy_fix import ProxyFix)
app.wsgi_app = ProxyFix(app.wsgi_app)
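To see what such a fix does, here is a simplified sketch of the idea. It is not Werkzeug's implementation: ForwardedForFix and show_ip are hypothetical names, and the sketch assumes exactly one trusted proxy that appends the client's address to the X-Forwarded-For header.

```python
class ForwardedForFix(object):
    """WSGI middleware: replace REMOTE_ADDR with the address reported
    by the proxy in the X-Forwarded-For header (if there is one)."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
        addresses = [part.strip() for part in forwarded.split(",") if part.strip()]
        if addresses:
            # assumption: one trusted proxy, so the last entry is the
            # address that the proxy saw, i.e. the client
            environ["REMOTE_ADDR"] = addresses[-1]
        return self.app(environ, start_response)


def show_ip(environ, start_response):
    """A tiny WSGI app that answers with the client's IP as it sees it."""
    body = environ.get("REMOTE_ADDR", "unknown").encode("ascii")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


wrapped = ForwardedForFix(show_ip)
```

Without the wrapper, show_ip would report the proxy's address; with it, it reports the forwarded one. Only trust this header when the app really is behind a proxy, because clients can send it themselves.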

Categories: python Tags: client IP, heroku, proxy, remote_addr

Determine the dimensions of an image on the web without downloading it entirely

November 3, 2013 Jabba Laci

Problem
You have a list of image URLs and you want to do something with them. You need their dimensions (width, height) BUT you don’t want to download them completely.

Solution
I found a nice working solution here (see the bottom of the linked page).

I copy the code here for future reference:

#!/usr/bin/env python

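import urllib
from PIL import ImageFile    # plain "import ImageFile" only works with the old PIL

def getsizes(uri):
    # get file size *and* image size (None if not known)
    file = urllib.urlopen(uri)
    try:
        size = file.headers.get("content-length")
        if size:
            size = int(size)
        p = ImageFile.Parser()
        while True:
            # read small chunks and stop as soon as the parser
            # has seen enough to know the dimensions
            data = file.read(1024)
            if not data:
                break
            p.feed(data)
            if p.image:
                return size, p.image.size
        return size, None
    finally:
        file.close()    # close the connection on every exit path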

##########

if __name__ == "__main__":
    url = "https://upload.wikimedia.org/wikipedia/commons/1/12/Baobob_tree.jpg"
    print getsizes(url)

Sample output:

(1866490, (1164, 1738))

The first value is the size of the file in bytes, and the second is a tuple with the width and height of the image in pixels.
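Why is a partial download enough? Image formats typically store the dimensions near the beginning of the file. As an illustration only (this is not what ImageFile.Parser does internally, and png_size is a hypothetical helper that handles PNG files only), the width and height of a PNG can be read from its first 24 bytes:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"    # the 8 bytes every PNG file starts with

def png_size(header):
    """Return (width, height) from the first 24 bytes of a PNG file,
    or None if the data doesn't look like the start of a PNG."""
    if len(header) < 24:
        return None
    # layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    # ("IHDR"), then width and height as big-endian 32-bit integers
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height

# build a fake PNG header by hand to try it out
fake_header = (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
               + struct.pack(">II", 1164, 1738))
dimensions = png_size(fake_header)
```

Other formats (JPEG, GIF, etc.) use different layouts, which is why it is convenient to leave the parsing to the imaging library.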

Update (20140406)
I had to figure out the dimensions of some image files on my local filesystem. Here is the slightly modified version of the code above:

import os
from PIL import ImageFile    # plain "import ImageFile" only works with the old PIL

def getsizes(fname):
    # get file size *and* image size (None if not known)
    size = os.path.getsize(fname)
    p = ImageFile.Parser()
    with open(fname, 'rb') as file:    # images are binary data
        while True:
            data = file.read(1024)
            if not data:
                break
            p.feed(data)
            if p.image:
                return size, p.image.size
    return size, None

Usage:

size = getsizes(fname)[1]
if size:
    width, height = size
    # process it

Categories: python Tags: image dimension, image header, image size

Flask: cannot fetch a URL on localhost

November 2, 2013 Jabba Laci

Problem
I had a simple Flask application that included a web service, i.e. calling an address returns some value (a JSON result for instance). I wanted to reuse this service inside the app. by simply calling it (via the HTTP protocol) and getting the return value. However, this call never finished. The browser was loading and I got no result.

What happened?
I posted the problem here and it turned out that “the development server is single threaded, so when you call a url served by that application from within the application itself, you create a deadlock situation.” Hmm…

My first idea was to replace the dev. server with a more serious one. With gunicorn I could make it work:

gunicorn -w 4 -b 127.0.0.1:5000 hello:app

However, I deploy the app. on Heroku, where you have just 1 worker for free, so it behaves just like the dev. server!

Solution
I had to rewrite the code to eliminate this extra call. (Or, I could have kept this call if I had had at least 2 worker threads.)

Example
Here is some simplified code that demonstrates the problem:

#!/usr/bin/env python

# hello.py

from flask import Flask
from flask import url_for
import requests

app = Flask(__name__)

@app.route('/')
def hello_world():
    return "Hello, World!"

@app.route('/get')
def get():
    url = url_for("hello_world", _external=True)    # full URL
    print '!!!', url    # debug info
    r = requests.get(url)    # it hangs at this point
    return "from get: " + r.text

if __name__ == "__main__":
    app.run(debug=True)
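The same effect can be reproduced without Flask. The sketch below is an illustration under assumptions, not the Flask development server: it uses Python 3's standard http.server, and Handler and fetch_get are hypothetical names. The handler of /get calls / on the very same server. With the single-threaded HTTPServer the inner call gets no answer (here it gives up after a 1 second timeout instead of hanging forever); with ThreadingHTTPServer it succeeds.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/get":
            body = b"from get: " + self.call_myself()
        else:
            body = b"Hello, World!"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass    # the caller has already given up waiting

    def call_myself(self):
        # fetch "/" from this very server over HTTP
        conn = http.client.HTTPConnection("127.0.0.1", self.server.server_port,
                                          timeout=1)
        try:
            conn.request("GET", "/")
            return conn.getresponse().read()
        except OSError:    # includes the timeout
            return b"(no answer: the only worker is busy with this request)"
        finally:
            conn.close()

    def log_message(self, *args):
        pass    # keep the demo quiet


def fetch_get(server_class):
    """Start a server of the given class on a free port and fetch /get."""
    server = server_class(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=10)
    try:
        conn.request("GET", "/get")
        return conn.getresponse().read().decode()
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

Calling fetch_get(HTTPServer) and fetch_get(ThreadingHTTPServer) shows the difference: a server that handles one request at a time cannot answer a request that it sends to itself.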

Categories: flask Tags: development server, gunicorn, threads, web service, workers

Heroku: development and production settings

November 2, 2013 Jabba Laci

Problem
You have a project that you develop on your local machine and you deploy it on Heroku for instance. The two environments require different settings. For example, you test your app. with SQLite but in production you use PostgreSQL. How can the application configure itself to its environment?

Solution
I show you how to do it with Flask.

In your project folder:

$ heroku config:set HEROKU=1

It will create an environment variable at Heroku. These environment variables are persistent – they will remain in place across deploys and app restarts – so unless you need to change values, you only need to set them once.

Then create a config.py file in your project folder:

import os

class Config(object):
    DEBUG = False
    TESTING = False
    DATABASE_URI = 'sqlite:///:memory:'

class ProductionConfig(Config):
    """
    Heroku
    """
    REDIS_URI = os.environ.get('REDISTOGO_URL')

class DevelopmentConfig(Config):
    """
    localhost
    """
    DEBUG = True
    REDIS_URI = 'redis://localhost:6379'

class TestingConfig(Config):
    TESTING = True

Of course, you will have to customize it with your own settings.

Then, in your main file:

...
app = Flask(__name__)

if 'HEROKU' in os.environ:
    # production on Heroku
    app.config.from_object('config.ProductionConfig')
else:
    # development on localhost
    app.config.from_object('config.DevelopmentConfig')
...

Now, if you want to access the configuration from different files of the project, use this:

from flask import current_app as app
...
app.config['MY_SETTINGS']
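To make the selection logic concrete, here is a Flask-free sketch of the same idea. load_config is a hypothetical helper and the classes are trimmed-down copies of the ones above; like app.config.from_object(), it only picks up the UPPERCASE attributes of the chosen class.

```python
class Config(object):
    DEBUG = False
    TESTING = False

class ProductionConfig(Config):
    pass

class DevelopmentConfig(Config):
    DEBUG = True

def load_config(environ):
    """Pick a config class from the environment and return its settings."""
    cls = ProductionConfig if 'HEROKU' in environ else DevelopmentConfig
    # keep only the UPPERCASE attributes (inherited ones included)
    return dict((key, getattr(cls, key)) for key in dir(cls) if key.isupper())

on_heroku = load_config({'HEROKU': '1'})    # as if "heroku config:set HEROKU=1" was run
on_localhost = load_config({})              # the variable is not set locally
```

In a real application you would pass os.environ instead of a hand-made dictionary.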

Redis
Let’s see how to use Redis for instance. Apply the same idea with other databases too. Opening and closing can go in the before_request and teardown_request functions:

from flask import g
import redis

@app.before_request
def before_request():
    g.redis = redis.from_url(app.config['REDIS_URI'])

@app.teardown_request
def teardown_request(exception):
    pass    # g.redis doesn't need to be closed explicitly

If you need to access Redis from other files, just import g and use g.redis.

Categories: flask, python Tags: config, deployment, development, heroku, production, redis
