Prettify HTML with BeautifulSoup

Home > python > Prettify HTML with BeautifulSoup

Prettify HTML with BeautifulSoup

April 3, 2011 Jabba Laci Leave a comment Go to comments

With the Python library BeautifulSoup (BS), you can extract information from HTML pages very easily. However, there is one thing you should keep in mind: HTML pages are usually malformed. BS tries to correct an HTML page, but it means that BS’s internal representation of the HTML page can be slightly different from the original source. Thus, when you want to localize a part of an HTML page, you should work with the internal representation.

The following script takes an HTML and prints it in a corrected form, i.e. it shows how BS stores the given page. You can also use it to prettify the source:

#!/usr/bin/env python

# prettify.py
# Usage: prettify <URL>

import sys
import urllib
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)
    return soup.prettify()
# process(url)

def main():
    if len(sys.argv) == 1:
        print "Jabba's HTML Prettifier v0.1"
        print "Usage: %s <URL>" % sys.argv[0]
        sys.exit(-1)
    # else, if at least one parameter was passed
    print process(sys.argv[1])
# main()

if __name__ == "__main__":
    main()

You can find the latest version of the script at https://github.com/jabbalaci/Bash-Utils.

Categories: python Tags: beautifulsoup, beautify, html, prettify

Comments (0) Trackbacks (0) Leave a comment Trackback

No comments yet.

No trackbacks yet.

Python Adventures

Prettify HTML with BeautifulSoup

Leave a comment Cancel reply

Blog Stats

Random Post

Recent Posts

Archives

Meta

Python Adventures

Prettify HTML with BeautifulSoup

Share this:

Related

Leave a comment Cancel reply

Blog Stats

Random Post

Recent Posts

Archives

Meta