
Extract all links from a web page

Problem

You want to extract all the links from a web page, and you need them as absolute URLs so that you can process them further.

Solution

Unix commands have a very nice philosophy: “do one thing and do it well”. Keeping that in mind, here is my link extractor:

#!/usr/bin/env python

# get_links.py

import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)

    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])   # convert to an absolute URL
        print tag['href']
# process(url)

def main():
    if len(sys.argv) == 1:
        print "Jabba's Link Extractor v0.1"
        print "Usage: %s URL [URL]..." % sys.argv[0]
        sys.exit(-1)
    # else, if at least one parameter was passed
    for url in sys.argv[1:]:
        process(url)
# main()

if __name__ == "__main__":
    main()

You can find the up-to-date version of the script here.

The script prints the links to the standard output. The output can be refined with grep, for instance; see the examples below.

Troubleshooting

The HTML parsing is done with the BeautifulSoup (BS) library. If you get an error, e.g. BeautifulSoup cannot parse a tricky page, download the latest version of BS and put BeautifulSoup.py in the same directory as get_links.py. I had a problem with the version that shipped with Ubuntu 10.10, but upgrading to the latest version of BeautifulSoup solved it.
Update (20110414): To update BS, first remove the package python-beautifulsoup with Synaptic, then install the latest version from PyPI: sudo pip install beautifulsoup.

Examples

Basic usage: get all links on a given page.

./get_links.py http://www.reddit.com/r/Python

Basic usage: get all links from an HTML file. Yes, it also works on local files.

./get_links.py index.html

Count the number of links.

./get_links.py http://www.reddit.com/r/Python | wc -l

Filter the result and keep only those links that you are interested in.

./get_links.py http://www.beach-hotties.com/ | grep -i jpg

Eliminate duplicates.

./get_links.py http://www.beach-hotties.com/ | sort | uniq

Note: if the URL contains the special character “&”, put the URL between quotes.

./get_links.py "http://www.google.ca/search?hl=en&source=hp&q=python&aq=f&aqi=g10&aql=&oq="

Open (some) extracted links in your web browser. Here I use the script “open_in_tabs.py” that I introduced in this post; you can also download “open_in_tabs.py” here.

./get_links.py http://www.beach-hotties.com/ | grep -i jpg | sort | uniq | ./open_in_tabs.py

Update (20110507)

You might be interested in another script called “get_images.py” that extracts all image links from a webpage. Available here.
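If you are curious how it works: the idea is the same as in get_links.py, just with <img> tags instead of <a> tags. Here is a minimal sketch (not the actual get_images.py; it reuses the MyOpener class defined above):

import urlparse
from BeautifulSoup import BeautifulSoup

def process_images(url):
    page = MyOpener().open(url)               # fetch the page with the spoofed User-Agent
    soup = BeautifulSoup(page.read())
    page.close()
    for tag in soup.findAll('img', src=True):
        print urlparse.urljoin(url, tag['src'])   # print the image link as an absolute URL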

Comments
  1. Charlie
    March 15, 2011 at 20:29

    Hi,

    Thanks for the script, is it possible to filter links containing a phrase?

    e.g. returning:

    https://intranet.londonmet.ac.uk/module-catalogue/timeslots.cfm?campus=c&module=yd1005

    http://www.londonmet.ac.uk/module-catalogue/YD1006

    I only want http://www.londonmet.ac.uk/module-catalogue/ links returned

    • March 17, 2011 at 10:28

      To filter the result, you will have to use grep. I suggest that you look at the documentation of this tool (man grep); you will need regular expressions too.
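      Something like this should do it in your case (untested; replace URL with the page you are scraping):

      ./get_links.py URL | grep '^http://www.londonmet.ac.uk/module-catalogue/'

      The ^ anchors the pattern to the beginning of the line, so only links starting with that prefix are kept.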

  2. Ron
    June 7, 2011 at 18:23

    How about downloading all the links from a webpage?

    • June 7, 2011 at 18:29

      You can save the links in a file (down.txt). Then you can fetch them with wget like this: wget -i down.txt. If the links are images, it works fine.
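      For example, to grab all the JPG images linked from a page (www.example.com is just a placeholder):

      ./get_links.py http://www.example.com/ | grep -i jpg | sort | uniq > down.txt
      wget -i down.txt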

  3. Serg
    August 11, 2011 at 07:59

    Hi, thanks for the script.
    Is it possible to extend the script to go through a proxy?

    • August 11, 2011 at 08:06

      I didn’t need proxy support and I don’t know much about that; I’m afraid you’ll have to look into it yourself :)
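      That said, urllib’s openers accept a proxies mapping, so something like this might be a starting point (untested; the proxy address is made up):

      myopener = MyOpener(proxies={'http': 'http://my.proxy.example.com:3128'})
      page = myopener.open(url)

      urllib also honours the http_proxy environment variable by default, so exporting that before running the script may be enough.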

  4. October 3, 2011 at 16:11

    My, quite useful and pretty simple.

  5. drkkrobot
    December 3, 2011 at 18:49

    This was very useful for extracting all the share links off sites (Filesonic, Rapidshare, etc.), saving me the hassle of copying a few hundred links to download individually.

  6. esem
    February 9, 2013 at 10:54

    Hi,
    Great script. One question: how would I adapt this to work with a website that requires a username and password (basic HTTP authentication)?

  7. Rhinoman21
    March 22, 2013 at 01:19

    In line 12 are you telling the code which type of browser, OS, etc. you are using? In other words, is this code compatible with Safari 5.1.8?

    Edit: Sorry about my previous post, I’m just wondering how I can make this code compatible with Safari 5.1.8.

    • March 22, 2013 at 08:25

      In line 12 I lie about the type of browser. This script will tell the server that “I’m a Mozilla browser”, and the server will say “OK”. If you don’t do this, the script will identify itself as a Python script, and some servers are unfortunately configured to block such clients :( This script has nothing to do with Safari 5.1.8. See this page too: http://whatsmyuseragent.com/ .

  8. Cole
    May 30, 2013 at 04:03

    FYI – I was having trouble getting this to work in FreeBSD. The solution was modifying the import line for BeautifulSoup to –

    “from bs4 import BeautifulSoup”

    credits – http://stackoverflow.com/questions/5663980/importerror-no-module-named-beautifulsoup

  9. james
    June 23, 2013 at 20:27

    Hi, thanks for this great script. Does anyone know how to extract links from multiple URLs, let’s say 100 URLs at once? I tried ./get_links.py << urls.txt but it didn’t work.

    • June 24, 2013 at 13:59

      something like this:

      with open("urls.txt") as f:
          for line in f:
              url = line.rstrip("\n")
              process(url)    # call the "process" function above
      