Extract all links from a web page
Problem
You want to extract all the links from a web page. You need the links as absolute URLs, since you want to process the extracted links further.
Solution
Unix commands have a very nice philosophy: “do one thing and do it well”. Keeping that in mind, here is my link extractor:
#!/usr/bin/env python

# get_links.py

import re
import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)

    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        print tag['href']
# process(url)

def main():
    if len(sys.argv) == 1:
        print "Jabba's Link Extractor v0.1"
        print "Usage: %s URL [URL]..." % sys.argv[0]
        sys.exit(-1)
    # else, if at least one parameter was passed
    for url in sys.argv[1:]:
        process(url)
# main()

if __name__ == "__main__":
    main()
You can find the up-to-date version of the script here.
The script will print the links to the standard output. The output can be refined with grep, for instance.
Troubleshooting
The HTML parsing is done with the BeautifulSoup (BS) library. If you get an error, i.e. BeautifulSoup cannot parse a tricky page, download the latest version of BS and put BeautifulSoup.py in the same directory where get_links.py is located. I had a problem with the version that came with Ubuntu 10.10, but I could solve it by upgrading to the latest version of BeautifulSoup.
Update (20110414): To update BS, first remove the package python-beautifulsoup with Synaptic, then install the latest version from PyPI: sudo pip install beautifulsoup.
Examples
Basic usage: get all links on a given page.
./get_links.py http://www.reddit.com/r/Python
Basic usage: get all links from an HTML file. Yes, it also works on local files.
./get_links.py index.html
Number of links.
./get_links.py http://www.reddit.com/r/Python | wc -l
Filter the result and keep only those links that you are interested in.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg
Eliminate duplicates.
./get_links.py http://www.beach-hotties.com/ | sort | uniq
Note: if the URL contains the special character “&”, then put the URL between quotes.
./get_links.py "http://www.google.ca/search?hl=en&source=hp&q=python&aq=f&aqi=g10&aql=&oq="
Open (some) extracted links in your web browser. Here I use the script “open_in_tabs.py” that I introduced in this post. You can also download “open_in_tabs.py” here.
./get_links.py http://www.beach-hotties.com/ | grep -i jpg | sort | uniq | ./open_in_tabs.py
Update (20110507)
You might be interested in another script called “get_images.py” that extracts all image links from a webpage. Available here.
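(For illustration only, this is not the actual get_images.py, just a minimal sketch of how such an extractor could look, reusing the same approach as get_links.py:)

#!/usr/bin/env python
# Minimal sketch (not the real get_images.py): collect <img> sources
# and print them as absolute URLs, using the same tools as get_links.py.

import sys
import urllib
import urlparse
from BeautifulSoup import BeautifulSoup

def process_images(url):
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page.read())
    page.close()
    for tag in soup.findAll('img', src=True):
        # make the src attribute absolute, just like the links above
        print urlparse.urljoin(url, tag['src'])

if __name__ == "__main__":
    for url in sys.argv[1:]:
        process_images(url)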
Hi,
Thanks for the script. Is it possible to filter links containing a phrase? I.e. returning:
https://intranet.londonmet.ac.uk/module-catalogue/timeslots.cfm?campus=c&module=yd1005
http://www.londonmet.ac.uk/module-catalogue/YD1006
I only want the http://www.londonmet.ac.uk/module-catalogue/ links returned.
To filter the result, you will have to use grep. I suggest that you look at the documentation of this tool (man grep). You will need regular expressions too.
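For instance, something along these lines should keep only the links you want (the page URL here is just a placeholder):
./get_links.py http://www.example.com/ | grep '^http://www.londonmet.ac.uk/module-catalogue/'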
How about downloading all links from a webpage?
You can save the links in a file (down.txt). Then you can fetch them with wget like this: wget -i down.txt.
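(As a sketch, one way to create down.txt in the first place; the page URL is just a placeholder:)
./get_links.py http://www.example.com/ | sort | uniq > down.txt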
If the links are images, it works fine.

How about searching for a certain keyword in the links and tweeting it out or sending it to Slack?
You simply filter on your keyword and use the Twitter and Slack APIs.
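(A rough, untested sketch of that idea: read the extracted links from standard input, keep those containing a keyword, and post the matches to a Slack incoming webhook with the requests module. The script name, the webhook URL and the keyword below are placeholders.)

#!/usr/bin/env python
# send_to_slack.py -- untested sketch: filter links on a keyword and post them
# to a Slack incoming webhook. The webhook URL and the keyword are placeholders.

import sys
import requests

KEYWORD = 'python'
WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

def main():
    for line in sys.stdin:
        link = line.strip()
        if KEYWORD in link.lower():
            # Slack incoming webhooks accept a simple JSON payload
            requests.post(WEBHOOK_URL, json={'text': link})

if __name__ == "__main__":
    main()

It could go at the end of a pipeline, e.g. ./get_links.py http://www.reddit.com/r/Python | ./send_to_slack.py (tweeting would work similarly through the Twitter API).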
Hi, thanks for the script.
Is it possible to upgrade the script to go through a proxy?
I didn’t need proxy support and I don’t know much about that. I’m afraid you’ll have to look into that yourself :)
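(A possible starting point, untested: urllib honors the http_proxy environment variable, and its openers also accept an explicit proxies dictionary. The proxy address below is a placeholder.)

# Untested sketch: route the request through an HTTP proxy.
# The proxy address is a placeholder.
import urllib

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (compatible)'

# Option 1: pass a proxies dictionary to the opener explicitly
myopener = MyOpener(proxies={'http': 'http://proxy.example.com:8080'})
page = myopener.open('http://www.example.com/')
print page.read()
page.close()

# Option 2: set the http_proxy environment variable before running get_links.py,
# e.g.  export http_proxy=http://proxy.example.com:8080
# urllib picks it up automatically when no proxies dictionary is given.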
My, quite useful and pretty simple.
This was very useful for extracting all the share links off sites (FileSonic, RapidShare, etc.), saving me the hassle of copying a few hundred links to download individually.
Hi,
Great script. One question I have: how would I adapt this to work with a website that needs a username and password using basic HTTP authentication?
Combine it with the requests module. See this page for authentication.
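(An untested sketch of that combination; the URL, user name and password are placeholders.)

#!/usr/bin/env python
# Untested sketch: fetch a page behind basic HTTP authentication with requests,
# then extract the links as in get_links.py. URL, user name and password are placeholders.

import urlparse
import requests
from BeautifulSoup import BeautifulSoup

def process_with_auth(url, user, password):
    response = requests.get(url, auth=(user, password))  # basic HTTP auth
    soup = BeautifulSoup(response.text)
    for tag in soup.findAll('a', href=True):
        print urlparse.urljoin(url, tag['href'])

if __name__ == "__main__":
    process_with_auth('http://www.example.com/protected/', 'myuser', 'mypassword')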
In line 12 are you telling the code which type of browser, OS, etc. you are using? In other words, is this code compatible with Safari 5.1.8?
Edit: Sorry about my previous post, I’m just wondering how I can make this code compatible with Safari 5.1.8.
In line 12 I lie about the type of browser. This script will tell the server that “I’m a Mozilla browser”, and the server will say “OK”. If you don’t do this, the script will identify itself as a Python script, and some servers are unfortunately configured to block such clients :( This script has nothing to do with Safari 5.1.8. See this page too: http://whatsmyuseragent.com/.
FYI – I was having trouble getting this to work on FreeBSD. The solution was modifying the import line for BeautifulSoup to “from bs4 import BeautifulSoup”.
Credits: http://stackoverflow.com/questions/5663980/importerror-no-module-named-beautifulsoup
Hi, thanks for this great script. Does anyone know how to extract links from multiple URLs, let’s say 100 URLs at once? I tried ./get_links.py << urls.txt but it didn’t work.
The script already accepts several URLs as command-line arguments, so you can do something like this:
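(A sketch, assuming urls.txt contains one URL per line:)
cat urls.txt | xargs ./get_links.py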
Sorry for the noob question but where do I input the URL that I would like to scrape?
Just pass it as a command-line argument. In the “Examples” section above you can see some concrete examples.
This is just great. The best link extractor I have seen so far. So useful. It handles relative URLs really well. Thanks, Jabba.