Posts Tagged ‘url’

extract all links from a file

June 17, 2014

Problem
You want to extract all links (URLs) from a text file.

Solution

import re

def extract_urls(fname):
    with open(fname) as f:
        return re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', f.read())
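
A quick way to try the pattern without a file is to run it on a string. The sketch below is only an illustration: `URL_PATTERN`, `extract_urls_from_text` and the sample text are names made up for this example, while the regular expression is the same one as in the solution.

```python
import re

# Same pattern as in the solution, compiled once and written as a raw string.
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def extract_urls_from_text(text):
    """Return every substring of text that matches URL_PATTERN."""
    return URL_PATTERN.findall(text)

# Hypothetical sample text; the URLs are placeholders.
sample = "Docs: https://example.com/docs and http://example.org/a?b=1 (both made up)"
print(extract_urls_from_text(sample))
# ['https://example.com/docs', 'http://example.org/a?b=1']
```

Note that the pattern also accepts characters such as `.` and `,`, so punctuation that directly follows a URL in running text ends up in the match. You may want to strip it afterwards.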

Categories: python

Get URL info (file size, Content-Type, etc.)

October 18, 2010

Problem

You have a URL and you want to get some info about it. For instance, you want to figure out the content type (text/html, image/jpeg, etc.) of the URL, or the file size without actually downloading the given page.

Solution

Let’s see an example with an image. Consider the URL http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg .

#!/usr/bin/env python

import urllib

def get_url_info(url):
    d = urllib.urlopen(url)
    return d.info()

url = 'http://'+'www'+'.geos.ed.ac.uk'+'/homes/s0094539/remarkable_forest.preview.jpg'
print get_url_info(url)

Output:

Date: Mon, 18 Oct 2010 18:58:07 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_fastcgi/2.4.6
X-Powered-By: Zope (www.zope.org), Python (www.python.org)
Last-Modified: Thu, 08 Nov 2007 09:56:19 GMT
Content-Length: 103984
Accept-Ranges: bytes
Connection: close
Content-Type: image/jpeg

That is, the size of the image is 103,984 bytes and its content type is indeed image/jpeg.

In the code, d.info() returns a dictionary-like object, so extracting a specific field is very easy:

#!/usr/bin/env python

import urllib

def get_content_type(url):
    d = urllib.urlopen(url)
    return d.info()['Content-Type']

url = 'http://'+'www'+'.geos.ed.ac.uk'+'/homes/s0094539/remarkable_forest.preview.jpg'
print get_content_type(url)    # image/jpeg
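
The scripts above are written for Python 2. For Python 3, a rough equivalent could look like the sketch below. It assumes `urllib.request` and sends a HEAD request, so only the headers are asked for. The helper `get_content_length` and the `timeout` parameter are additions for illustration, not part of the original scripts. The demonstration at the end builds the headers by hand, so it runs without a network connection.

```python
import urllib.request
from email.message import Message

def get_url_info(url, timeout=10):
    """Send a HEAD request and return the response headers.

    The result is an http.client.HTTPMessage, a subclass of
    email.message.Message, so header lookups are case-insensitive.
    """
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers

def get_content_length(headers):
    """Return Content-Length as an int, or None if missing or not a number."""
    try:
        return int(headers.get('Content-Length'))
    except (TypeError, ValueError):
        return None

# Offline demonstration with a hand-built header object (no network needed).
demo = Message()
demo['Content-Type'] = 'image/jpeg'
demo['Content-Length'] = '103984'
print(demo['content-type'])        # image/jpeg (name matched case-insensitively)
print(get_content_length(demo))    # 103984
```

With a real URL you would call `get_content_length(get_url_info(url))`. Keep in mind that a server is free to omit Content-Length, in which case the helper returns None.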

This post is based on this thread.

Update (20121202)

With requests:

>>> import requests
>>> from pprint import pprint
>>> url = 'http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg'
>>> r = requests.head(url)
>>> pprint(r.headers)
{'accept-ranges': 'none',
 'connection': 'close',
 'content-length': '103984',
 'content-type': 'image/jpeg',
 'date': 'Sun, 02 Dec 2012 21:05:57 GMT',
 'etag': 'ts94515779.19',
 'last-modified': 'Thu, 08 Nov 2007 09:56:19 GMT',
 'server': 'Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_fastcgi/2.4.6',
 'x-powered-by': 'Zope (www.zope.org), Python (www.python.org)'}

check if URL exists

October 17, 2010

Problem

You want to check if a URL exists without actually downloading the given file.

Solution

Update (20120124): My previous solution did not work correctly. Here is the revised version.

import httplib
import urlparse

def get_server_status_code(url):
    """
    Download just the header of a URL and
    return the server's status code.
    """
    # http://stackoverflow.com/questions/1140661
    host, path = urlparse.urlparse(url)[1:3]    # elems [1] and [2]
    try:
        conn = httplib.HTTPConnection(host)
        conn.request('HEAD', path)
        return conn.getresponse().status
    except StandardError:
        return None

def check_url(url):
    """
    Check if a URL exists without downloading the whole file.
    We only check the URL header.
    """
    # see also http://stackoverflow.com/questions/2924422
    good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
    return get_server_status_code(url) in good_codes

Tests:

assert check_url('http://www.google.com')    # exists
assert not check_url('http://simile.mit.edu/crowbar/nothing_here.html')    # doesn't exist

We only get the header of a given URL and we check the response code of the web server.
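
The code above is Python 2 (httplib, urlparse, StandardError). A possible Python 3 port is sketched below, assuming `http.client` and `urllib.parse` as the replacements. The helper `split_url`, the `GOOD_CODES` constant and the `timeout` parameter are names introduced for this sketch. Unlike the original, it also picks an HTTPS connection for https URLs and keeps the query string, which is an extension on my part.

```python
import http.client
from urllib.parse import urlsplit

# Status codes treated as "the URL exists", same choice as in check_url above.
GOOD_CODES = {http.client.OK, http.client.FOUND, http.client.MOVED_PERMANENTLY}

def split_url(url):
    """Return (scheme, host, path) where path includes the query string."""
    parts = urlsplit(url)
    path = parts.path or '/'          # an empty path means the site root
    if parts.query:
        path += '?' + parts.query
    return parts.scheme, parts.netloc, path

def get_server_status_code(url, timeout=10):
    """Send a HEAD request and return the status code, or None on failure."""
    scheme, host, path = split_url(url)
    if scheme == 'https':
        conn_class = http.client.HTTPSConnection
    else:
        conn_class = http.client.HTTPConnection
    try:
        conn = conn_class(host, timeout=timeout)
        try:
            conn.request('HEAD', path)
            return conn.getresponse().status
        finally:
            conn.close()
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections, timeouts and bad responses.
        return None

def check_url(url):
    """Check if a URL exists without downloading the whole file."""
    return get_server_status_code(url) in GOOD_CODES
```

Nothing is sent over the network until `check_url` or `get_server_status_code` is called, so `split_url` can be tried on its own first.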

Update (20121202)

With requests:

>>> import requests
>>>
>>> url = 'http://hup.hu'
>>> r = requests.head(url)
>>> r.status_code
200    # requests.codes.OK
>>> url = 'http://www.google.com'
>>> r = requests.head(url)
>>> r.status_code
302    # requests.codes.FOUND
>>> url = 'http://simile.mit.edu/crowbar/nothing_here.html'
>>> r = requests.head(url)
>>> r.status_code
404    # requests.codes.NOT_FOUND

Categories: python