BeautifulSoup is an excellent tool for web scraping. The development of BeautifulSoup 3 stopped in 2012, its author concentrates on BeautifulSoup 4 since then.
In this post I want to show how to use CSS selectors. With CSS selectors you can select part of a webpage, which is what we need when we do web scraping. Another possibility is to use XPath. I find CSS selectors easier to use. You can read this post too for a comparison: Why CSS Locators are the way to go vs XPath.
Let’s go through a concrete example, that way it will be easier to understand.
The page http://developerexcuses.com/ prints a funny line that developers can use as an excuse. Let’s extract this line.
Visit the page, start Firebug, and click on the line (steps 1 and 2 on the figure below):
Right click on the orange line (“
<a style=...“) and choose “Copy CSS Path”. Now the CSS path of the selected HTML element is on the clipboard, which is “
html body div.wrapper center a” in this example.
Now let’s write a script that prints this part of the HTML source:
import requests import bs4 def main(): r = requests.get("http://developerexcuses.com/") soup = bs4.BeautifulSoup(r.text) print soup.select("html body div.wrapper center a").text if __name__ == "__main__": main()
While parsing an HTML page with BeautifulSoup, I got a similar error message:
File ".../BeautifulSoup.py", line 1915, in _detectEncoding '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data) TypeError: expected string or buffer
In the code I had this:
text = get_page(url) soup = BeautifulSoup(text)
text = get_page(url) text = str(text) # here is the trick soup = BeautifulSoup(text)
Tip from here.
(20131215) This post is out-of-date. BeautifulSoup 4 has built-in support for CSS selectors. Check out this post.
A few days ago I started to explore lxml (it’s been on my list for a long time) and I really like its CSS selector. As I used BeautifulSoup a lot in the past, I wondered if it were possible to add this functionality to BS. I made a quick search on Google and here is what I found: https://code.google.com/p/soupselect/.
“A single function, select(soup, selector), that can be used to select items from a BeautifulSoup instance using CSS selector syntax. Currently supports type selectors, class selectors, id selectors, attribute selectors and the descendant combinator.“
Just what I needed :) You can also patch BS and integrate this new functionality:
>>> from BeautifulSoup import BeautifulSoup as Soup >>> import soupselect; soupselect.monkeypatch() >>> import urllib >>> soup = Soup(urllib.urlopen('http://slashdot.org/')) >>> soup.findSelect('div.title h3') [</pre> <h3>...
With the Python library BeautifulSoup (BS), you can extract information from HTML pages very easily. However, there is one thing you should keep in mind: HTML pages are usually malformed. BS tries to correct an HTML page, but it means that BS’s internal representation of the HTML page can be slightly different from the original source. Thus, when you want to localize a part of an HTML page, you should work with the internal representation.
The following script takes an HTML and prints it in a corrected form, i.e. it shows how BS stores the given page. You can also use it to prettify the source:
#!/usr/bin/env python # prettify.py # Usage: prettify <URL> import sys import urllib from BeautifulSoup import BeautifulSoup class MyOpener(urllib.FancyURLopener): version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:18.104.22.168) Gecko/20110303 Firefox/3.6.15' def process(url): myopener = MyOpener() #page = urllib.urlopen(url) page = myopener.open(url) text = page.read() page.close() soup = BeautifulSoup(text) return soup.prettify() # process(url) def main(): if len(sys.argv) == 1: print "Jabba's HTML Prettifier v0.1" print "Usage: %s <URL>" % sys.argv sys.exit(-1) # else, if at least one parameter was passed print process(sys.argv) # main() if __name__ == "__main__": main()
You can find the latest version of the script at https://github.com/jabbalaci/Bash-Utils.