Get URL info (file size, Content-Type, etc.)

Problem

You have a URL and want some information about it: for instance, its content type (text/html, image/jpeg, etc.) or its file size, without actually downloading the resource itself.

Solution

Let’s see an example with an image. Consider the URL http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg .

#!/usr/bin/env python

import urllib

def get_url_info(url):
    # Python 2 API: urlopen() issues a GET; info() returns the response headers.
    d = urllib.urlopen(url)
    return d.info()

url = 'http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg'
print get_url_info(url)

Output:

Date: Mon, 18 Oct 2010 18:58:07 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_fastcgi/2.4.6
X-Powered-By: Zope (www.zope.org), Python (www.python.org)
Last-Modified: Thu, 08 Nov 2007 09:56:19 GMT
Content-Length: 103984
Accept-Ranges: bytes
Connection: close
Content-Type: image/jpeg

That is, the size of the image is 103,984 bytes and its content type is indeed image/jpeg.
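The listing above uses Python 2's urllib.urlopen, which sends a GET request. As a rough Python 3 sketch of the same idea, the standard library can send a HEAD request instead, which asks the server for the headers only. This assumes the server answers HEAD requests; the timeout parameter and the conversion to a plain dict are additions for illustration and are not part of the original post.

```python
import urllib.request


def get_url_info(url, timeout=10):
    """Return the response headers for `url` as a plain dict."""
    # HEAD asks the server for the headers only, so no body is transferred.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # response.headers is an email.message.Message subclass.
        # Copying it into a dict keeps only the last value of any
        # header that appears more than once.
        return dict(response.headers.items())
```

Calling get_url_info(url) with the image URL from the example should give a mapping with keys such as 'Content-Type' and 'Content-Length', provided the server still responds the same way.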

In the Python 2 code above, d.info() returns a dictionary-like object (a mimetools.Message instance), so extracting a specific field is easy:

#!/usr/bin/env python

import urllib

def get_content_type(url):
    d = urllib.urlopen(url)
    # info() supports dictionary-style lookup by header name.
    return d.info()['Content-Type']

url = 'http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg'
print get_content_type(url)    # image/jpeg
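Header values arrive as strings, and not every response carries every header (Content-Length, for example, is often absent for dynamically generated pages). A minimal Python 3 sketch of a more defensive lookup follows. The helper name get_content_length is hypothetical, and email.message.Message is used here only as a stand-in for the headers object, since Python 3's HTTP response headers are a subclass of it.

```python
from email.message import Message


def get_content_length(headers):
    """Return Content-Length as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")  # None when the header is missing
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


# Stand-in for the headers object; values copied from the output above.
headers = Message()
headers["Content-Type"] = "image/jpeg"
headers["Content-Length"] = "103984"

print(get_content_length(headers))          # 103984
print(headers.get("X-Missing", "unknown"))  # unknown
```

Returning None instead of raising lets the caller decide how to treat a response whose size is not advertised.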

This post is based on this thread.

Update (20121202)

With the requests library, a HEAD request returns only the headers:

>>> import requests
>>> from pprint import pprint
>>> url = 'http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg'
>>> r = requests.head(url)
>>> pprint(r.headers)
{'accept-ranges': 'none',
 'connection': 'close',
 'content-length': '103984',
 'content-type': 'image/jpeg',
 'date': 'Sun, 02 Dec 2012 21:05:57 GMT',
 'etag': 'ts94515779.19',
 'last-modified': 'Thu, 08 Nov 2007 09:56:19 GMT',
 'server': 'Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_fastcgi/2.4.6',
 'x-powered-by': 'Zope (www.zope.org), Python (www.python.org)'}
  1. October 10, 2014 at 22:47

    So how do I extract the 'content-length' value from pprint(r.headers)?

    I tried pprint(r.headers)['content-length'], but it does not work; I get an error.

    • October 10, 2014 at 23:59

      print(r.headers['content-length'])
