With the Python library BeautifulSoup (BS), you can extract information from HTML pages very easily. However, keep one thing in mind: HTML pages are often malformed. BS tries to repair a malformed page while parsing it, which means that BS’s internal representation of the page can differ slightly from the original source. Thus, when you want to locate a part of an HTML page, you should work with the internal representation rather than with the raw source.
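Here is a minimal sketch of that effect. It assumes the newer `bs4` package (BeautifulSoup 4) under Python 3 rather than the BeautifulSoup 3 import used in this post. The helper name `as_parsed` and the sample markup are made up for illustration. The exact repaired output depends on the parser you pass in, so none is shown here.

```python
def as_parsed(html, parser="html.parser"):
    """Return the markup as BeautifulSoup stores it after parsing."""
    # bs4 is a third-party package (pip install beautifulsoup4); it is
    # imported here so the rest of the module loads without it.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, parser)
    return str(soup)


# Hypothetical malformed input: neither <p> nor <b> is ever closed.
BROKEN = "<p>first paragraph<p>second <b>paragraph"
```

Calling `as_parsed(BROKEN)` returns markup in which every tag is closed, so it no longer matches the original string character for character. Anything you search for should be expressed against that repaired tree.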
The following script takes the URL of an HTML page and prints the page in its corrected form, i.e. it shows how BS stores the given page. You can also use it to prettify the source:
    #!/usr/bin/env python

    # prettify.py
    # Usage: prettify <URL>

    import sys
    import urllib
    from BeautifulSoup import BeautifulSoup

    # Identify as a regular browser; some sites reject the default agent.
    class MyOpener(urllib.FancyURLopener):
        version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

    def process(url):
        myopener = MyOpener()
        #page = urllib.urlopen(url)
        page = myopener.open(url)

        text = page.read()
        page.close()

        # Parsing repairs the markup; prettify() prints the repaired tree.
        soup = BeautifulSoup(text)
        return soup.prettify()

    # process(url)

    def main():
        if len(sys.argv) == 1:
            print "Jabba's HTML Prettifier v0.1"
            print "Usage: %s <URL>" % sys.argv[0]
            sys.exit(-1)
        # else, if at least one parameter was passed
        print process(sys.argv[1])

    # main()

    if __name__ == "__main__":
        main()
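The script above targets Python 2 and BeautifulSoup 3. As a rough sketch of how the same idea might look under Python 3, assuming the `bs4` package is installed, `urllib.request.Request` with a `User-Agent` header can stand in for the `FancyURLopener` subclass. The function names `build_request` and `prettify_url` are illustrative and are not taken from the original script.

```python
import urllib.request

# Browser-like identification string, modeled on the one in the script above.
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) "
              "Gecko/20110303 Firefox/3.6.15")


def build_request(url):
    """Create a request that carries the custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def prettify_url(url):
    """Download url and return the page as BeautifulSoup re-serializes it."""
    # Third-party dependency (pip install beautifulsoup4), imported lazily.
    from bs4 import BeautifulSoup
    with urllib.request.urlopen(build_request(url)) as page:
        text = page.read()
    return BeautifulSoup(text, "html.parser").prettify()
```

Passing the parser name explicitly (`"html.parser"` here) keeps the result reproducible across machines, since different parsers may repair the same broken markup in different ways.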
You can find the latest version of the script at https://github.com/jabbalaci/Bash-Utils.