You want to extract all the links from a web page. You need the links as absolute URLs since you want to process them further.
Unix commands have a very nice philosophy: “do one thing and do it well”. Keeping that in mind, here is my link extractor:
#!/usr/bin/env python
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):  # send a browser-like User-Agent
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:220.127.116.11) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)
    text = page.read()
    soup = BeautifulSoup(text)
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])  # make the link absolute
        print tag['href']

def main():
    if len(sys.argv) == 1:
        print "Jabba's Link Extractor v0.1"
        print "Usage: %s URL [URL]..." % sys.argv[0]
        sys.exit(1)
    # else, if at least one parameter was passed
    for url in sys.argv[1:]:
        process(url)

if __name__ == "__main__":
    main()
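The script above targets Python 2 and BeautifulSoup 3. As a rough illustration of the same idea (collect every a tag that has an href and resolve it against the page URL), here is a minimal Python 3 sketch that uses only the standard library. It swaps BeautifulSoup for html.parser, so it is not the script's actual method, and the names LinkCollector and extract_links, as well as the example.com URLs, are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip <a> tags without an href (value is None for a bare attribute).
            if name == "href" and value is not None:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in the html string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content and base URL, for illustration only.
sample = '<a href="/r/Python">a</a> <a href="page2.html">b</a> <a name="top">c</a>'
for link in extract_links(sample, "http://www.example.com/dir/index.html"):
    print(link)
```

The relative links come out as absolute URLs, and the anchor without an href is ignored, which mirrors what findAll('a', href=True) combined with urljoin does in the script.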
You can find the up-to-date version of the script here.
The script prints the links to the standard output. The output can be refined with grep, for instance.
The HTML parsing is done with the BeautifulSoup (BS) library. If you get an error, e.g. BeautifulSoup cannot parse a tricky page, download the latest version of BS and put BeautifulSoup.py in the same directory where get_links.py is located. I had a problem with the version that came with Ubuntu 10.10, but upgrading to the latest version of BeautifulSoup solved it.
Update (20110414): To update BS, first remove the package python-beautifulsoup with Synaptic, then install the latest version from PyPI: sudo pip install beautifulsoup.
Basic usage: get all links on a given page.
Basic usage: get all links from an HTML file. Yes, it also works on local files.
Number of links.
./get_links.py http://www.reddit.com/r/Python | wc -l
Filter result and keep only those links that you are interested in.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg
Remove duplicate links.
./get_links.py http://www.beach-hotties.com/ | sort | uniq
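If you would rather do the filtering inside Python than in the shell, the grep -i / sort / uniq combination can be approximated in a few lines. This is only a sketch in Python 3: the function name filter_links and the example.com links are hypothetical, and the sort order may differ from the shell's sort, which depends on your locale.

```python
def filter_links(links, keyword):
    """Keep the unique links that contain keyword (case-insensitive), sorted."""
    keyword = keyword.lower()
    # The set removes duplicates (like uniq); sorted() orders the result (like sort).
    return sorted({link for link in links if keyword in link.lower()})


# Hypothetical extracted links, for illustration only.
links = [
    "http://www.example.com/b.JPG",
    "http://www.example.com/a.jpg",
    "http://www.example.com/index.html",
    "http://www.example.com/a.jpg",
]
for link in filter_links(links, "jpg"):
    print(link)
```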
Note: if the URL contains the special character “&”, then put the URL between quotes.
Open (some) extracted links in your web browser. Here I use the script “open_in_tabs.py” that I introduced in this post. You can also download “open_in_tabs.py”.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg | sort | uniq | ./open_in_tabs.py
You might be interested in another script called “get_images.py” that extracts all image links from a webpage. Available here.