Archive for the ‘scraping’ Category

web scraping: BS4 supports CSS select

December 15, 2013

BeautifulSoup is an excellent tool for web scraping. Development of BeautifulSoup 3 stopped in 2012; since then its author has concentrated on BeautifulSoup 4.

In this post I want to show how to use CSS selectors. With a CSS selector you can pick out part of a webpage, which is exactly what we need when we do web scraping. Another possibility is XPath, but I find CSS selectors easier to use. For a comparison, you can also read this post: Why CSS Locators are the way to go vs XPath.

Exercise
Let’s go through a concrete example; that way it will be easier to understand.

The page http://developerexcuses.com/ prints a funny line that developers can use as an excuse. Let’s extract this line.

Visit the page, start Firebug, and click on the line (steps 1 and 2 on the figure below):

(figure: the excuse line selected in Firebug, with its CSS path shown)

Right-click on the orange line (“<a style=...“) and choose “Copy CSS Path”. The CSS path of the selected HTML element is now on the clipboard; in this example it is “html body div.wrapper center a”.

Now let’s write a script that prints this part of the HTML source:

import requests
import bs4

def main():
    r = requests.get("http://developerexcuses.com/")
    # specifying the parser explicitly avoids bs4's "no parser" warning
    soup = bs4.BeautifulSoup(r.text, "html.parser")
    print(soup.select("html body div.wrapper center a")[0].text)

if __name__ == "__main__":
    main()

Extracting relevant images from XXX galleries using text clustering

November 8, 2013

Warning! This post includes some links to NSFW (not safe for work) galleries. You had better study this post at home :)


Problem
On the web you can find lots of free XXX galleries, and there are sites that collect these galleries and update their lists daily. When you visit such a gallery, you get either (1) images or (2) thumbnails that link to images. But! Besides these relevant images, there is always some noise: banners, other thumbnails, links to other galleries, etc.

How can we write a universal scraper that takes the URL of a gallery and extracts just the relevant images, without any noise? How can we separate real content from noise?

Example
Let’s see a soft gallery: http://biertijd.xxx/index.php?itemid=44329 (NSFW!). Extracting all the images we get the following list:

  "urls": [
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
    "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg", 
    "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg", 
    "http://biertijd.com/nucleus/plugins/rating/4.gif", 
    "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=0da28fe49a3d6d2fa7e17d15b9a05d28", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
    "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
    "http://s4.histats.com/stats/0.gif?37757&1"
  ]

As you can see, the relevant images conform to this pattern: “http://media01.biertijd.com/galleries/metart/131107_night/{01..20}.jpg“. Altogether we have 35 images, of which only 20 are relevant. How do we find just those 20?

Solution
The good news is that the relevant images usually follow a pattern, so they don’t differ much from each other. As seen above, in this example only the numbering of the images was different.
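A quick way to make this pattern visible (a side illustration of mine, not part of the solution below) is to normalize every digit run in a URL to a placeholder and count the resulting templates; the numbered gallery images all collapse to the same template. The URLs here are shortened stand-ins for the real list:

```python
import re
from collections import Counter

urls = [
    "http://media01.example.com/gal/01.jpg",
    "http://media01.example.com/gal/02.jpg",
    "http://media01.example.com/gal/03.jpg",
    "http://example.com/skins/mplayer_logo.png",
    "http://example.com/skins/mplayer_top1.png",
]

# Replace every run of digits with '#' so 01.jpg, 02.jpg, ...
# all collapse to the same template string.
templates = Counter(re.sub(r"\d+", "#", u) for u in urls)
template, count = templates.most_common(1)[0]
print(template, count)  # the numbered images dominate the count
```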

Relevant images can be separated from the rest using text clustering. I found a great solution by Rajesh M. here. Rajesh uses the method for clustering article titles; we will use it to cluster URLs, which are also just strings.

I put my solution in a class. Here it is:

#!/usr/bin/env python

# based on:
# http://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/

from helper import lev_dist as distance
from pprint import pprint

DISTANCE = 10

class Cluster(object):
    """
    Clustering a list of (sorted!) strings.

    I use it for clustering URLs. After extracting all the links (or images)
    from a web page, I use this class to group together similar URLs. It also
    identifies the largest cluster.
    """
    def __init__(self):
        self.clusters = {'clusters': {}}

    def clustering(self, elems):
        """
        Clusterize the input elements.

        Input: list of words (e.g. list of URLs). It MUST be sorted!

        Process: build a dictionary where keys are cluster IDs (int) and
                 values are lists (elements in the given cluster)
        """
        clusters = {}
        cid = 0

        for i, line in enumerate(elems):
            if i == 0:
                clusters[cid] = []
                clusters[cid].append(line)
            else:
                last = clusters[cid][-1]
                if distance(last, line) <= DISTANCE:
                    clusters[cid].append(line)
                else:
                    cid += 1
                    clusters[cid] = [line]
        #
        self.clusters['clusters'] = clusters

    def get_largest_cluster(self):
        """Return the cluster with the most elements."""
        maxi_k = None
        maxi_v = 0
        for k, v in self.clusters['clusters'].items():
            if len(v) > maxi_v:
                maxi_v = len(v)
                maxi_k = k
        #
        return self.clusters['clusters'][maxi_k]

    def show(self):
        pprint(self.clusters)

def get_clusters(elems):
    elems = sorted(elems)
    cl = Cluster()
    cl.clustering(elems)
    return cl.clusters['clusters']

#############################################################################

if __name__ == "__main__":
    import sys
    template = "https://jabbalaci.herokuapp.com/all_images?url={url}&clusters=1"
    if len(sys.argv) == 1:
        print("Usage: {0} URL".format(sys.argv[0]))
        sys.exit(1)
    # else
    url = template.format(url=sys.argv[1])
    import requests
    r = requests.get(url)
    li = sorted(r.json()['urls'])

    cl = Cluster()
    cl.clustering(li)
    cl.show()

The extracted URLs are sorted first, then put into clusters. The idea is simple: put the first element in the current cluster, which is the first cluster. If the next element is similar to the last one added, put it in the current cluster too. If it’s different, open a new cluster (which becomes the current cluster) and add the element there. And so on.
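The loop above can be sketched in isolation. In this demo (mine, not the post's code) a crude difflib-based measure stands in for the Levenshtein distance, and the URLs are made-up stand-ins:

```python
import difflib

DISTANCE = 10  # same threshold as in the post

def distance(a, b):
    """Crude edit-distance stand-in (demo only): the number of
    characters left unmatched by difflib's matching blocks."""
    matched = sum(m.size
                  for m in difflib.SequenceMatcher(None, a, b).get_matching_blocks())
    return max(len(a), len(b)) - matched

def clustering(elems):
    """elems MUST be sorted; returns {cluster_id: [elements]}."""
    clusters = {}
    cid = 0
    for i, line in enumerate(elems):
        if i == 0:
            clusters[cid] = [line]
        else:
            last = clusters[cid][-1]      # compare with the last element added
            if distance(last, line) <= DISTANCE:
                clusters[cid].append(line)
            else:
                cid += 1                  # too different: open a new cluster
                clusters[cid] = [line]
    return clusters

urls = sorted([
    "http://media01.example.com/gal/01.jpg",
    "http://media01.example.com/gal/02.jpg",
    "http://media01.example.com/gal/03.jpg",
    "http://example.com/banner.png",
])
clusters = clustering(urls)
largest = max(clusters.values(), key=len)
print(largest)  # the three numbered .jpg files end up together
```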

To tell how similar two strings are, we use the Levenshtein distance. You can find an implementation here.
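Since the helper module is only linked, here is a standard dynamic-programming implementation of the distance (a sketch of mine; the linked helper may differ in details):

```python
def lev_dist(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(lev_dist("kitten", "sitting"))  # 3
```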

Demo
This method is also available as a web service. It has two modes: you can cluster links, or you can cluster images. Which one should you use? It depends on the gallery: if it includes the relevant images directly, extract images; if it contains thumbnails that point to the images, extract links.

Don’t forget to switch on the “text clustering” option. The output contains the clusters and, to make your life easier, the largest cluster is also indicated. In most cases this is the cluster with the relevant images!

Sample output:

...
"clusters": {
    "0": [
      "http://biertijd.com/action.php?action=plugin&name=Captcha&type=captcha&key=d079746fd366f6f3509532688d595fcb"
    ], 
    "1": [
      "http://biertijd.com/nucleus/plugins/rating/4.gif"
    ], 
    "2": [
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_BGright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomBG.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomfill.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomleft.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_bottomright.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_logo.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titleend.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_titlestart.png", 
      "http://biertijd.com/skins/biertijd06/mplayer/images/mplayer_top1.png"
    ], 
    "3": [
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/1.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/../banners/2.jpg"
    ], 
    "4": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "5": [
      "http://s4.histats.com/stats/0.gif?37757&1"
    ], 
    "largest": [
      "http://media01.biertijd.com/galleries/metart/131107_night/01.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/02.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/03.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/04.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/05.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/06.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/07.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/08.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/09.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/10.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/11.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/12.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/13.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/14.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/15.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/16.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/17.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/18.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/19.jpg", 
      "http://media01.biertijd.com/galleries/metart/131107_night/20.jpg"
    ], 
    "number_of_clusters": 6
  }, 
...

Demo for the lazy pigs
I made a page that extracts relevant links/images from a gallery and presents them in a cleaned gallery. It’s available here: https://jabbalaci.herokuapp.com/gallery .

Usage: insert the gallery’s URL, then click the first button. If you click on an image and it turns out to be just a thumbnail, click the second button.

It extracts the largest cluster, which gives good results in most cases.

Feedback is welcome.

Scrape an HTML table

May 14, 2013

Problem
You want to extract an HTML table and get it in .csv, .json, etc. format.

Solution
I found a nice solution in this SO thread. The script is here: tablescrape.py.
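The linked script is not reproduced here; as a rough stdlib-only sketch of the same idea (the `TableParser` name and the sample table are mine), an `html.parser.HTMLParser` subclass can collect cell text row by row and dump it as CSV:

```python
import csv
import io
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of <td>/<th> cells, one list per table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")          # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)   # a complete row is done
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip() # accumulate text into the open cell

html = """<table>
  <tr><th>name</th><th>year</th></tr>
  <tr><td>BeautifulSoup 4</td><td>2012</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)

out = io.StringIO()
csv.writer(out).writerows(parser.rows)
print(out.getvalue())
```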

Categories: python, scraping