The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.
Here’s an excerpt:
The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 140,000 times in 2013. If it were an exhibit at the Louvre Museum, it would take about 6 days for that many people to see it.
How many times have you written “#!/usr/bin/env python” in your life? A few hundred times? :) How can you insert this line easily?
I mainly use vim. So far I have done it the following way.
In vim, type “!!which env” on the first line in normal mode, then press ENTER. Normal mode is the mode that vim usually starts in, and you can usually get back to it with ESC.
“!!” brings you to command mode and the current line (which was empty) is replaced with the output of “which env”. Thus the first line becomes “/usr/bin/env”. All you need to do is add “#!” in front of it and “ python” after it.
Fine, but I got fed up with this. It’s still too much typing. How can it be made easier?
Write a bash script called “py” with the following content:
echo "#!`which env` python"
Put it somewhere in your PATH and make it executable.
Then, in vim, type “!!py” on the first line in normal mode and press ENTER.
Phew. I should have thought of it years ago.
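If you would rather keep the helper in Python instead of bash, here is a minimal sketch of the same idea. The function name `shebang_line` and the fallback path are my own assumptions, and `shutil.which` needs Python 3.3 or newer.

```python
import shutil


def shebang_line(interpreter="python"):
    """Build a shebang line such as '#!/usr/bin/env python'."""
    # Look up "env" on the PATH, like `which env` does in the bash version.
    # Assumption: fall back to /usr/bin/env when the lookup fails.
    env_path = shutil.which("env") or "/usr/bin/env"
    return "#!{} {}".format(env_path, interpreter)


if __name__ == "__main__":
    print(shebang_line())
```

Saved as an executable script on your PATH, it can be called from vim with “!!” just like the bash version.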
You have an XPath expression and you want to convert it to a CSS selector.
Command line usage:
    $ ./cssify.py '//a'
    a
    $ ./cssify.py '//a[@id="bleh"]'
    a#bleh
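To give a feeling for what such a conversion involves, here is a toy sketch that handles only the two shapes shown above: a bare tag and a tag with an id test. It is not the code of cssify.py; the function name `toy_xpath_to_css` and the regular expression are assumptions made up for this illustration.

```python
import re

# Matches only '//tag' and '//tag[@id="value"]'. Anything else is rejected.
_SIMPLE_XPATH = re.compile(
    r'^//(?P<tag>[A-Za-z][\w-]*)(?:\[@id="(?P<value>[\w-]+)"\])?$'
)


def toy_xpath_to_css(xpath):
    """Convert a very small subset of XPath to a CSS selector."""
    match = _SIMPLE_XPATH.match(xpath)
    if match is None:
        raise ValueError("unsupported XPath expression: %r" % xpath)
    tag = match.group("tag")
    value = match.group("value")
    if value is None:
        return tag                 # '//a' becomes 'a'
    return tag + "#" + value       # '//a[@id="bleh"]' becomes 'a#bleh'


if __name__ == "__main__":
    print(toy_xpath_to_css('//a'))
    print(toy_xpath_to_css('//a[@id="bleh"]'))
```

A real converter has to cope with nested paths, classes, positions and much more, so treat this only as an illustration of the mapping.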
BeautifulSoup is an excellent tool for web scraping. Development of BeautifulSoup 3 stopped in 2012, and its author has concentrated on BeautifulSoup 4 since then.
In this post I want to show how to use CSS selectors. With CSS selectors you can select part of a webpage, which is what we need when we do web scraping. Another possibility is to use XPath. I find CSS selectors easier to use. You can read this post too for a comparison: Why CSS Locators are the way to go vs XPath.
Let’s go through a concrete example. That way it will be easier to understand.
The page http://developerexcuses.com/ prints a funny line that developers can use as an excuse. Let’s extract this line.
Visit the page, start Firebug, and click on the line (steps 1 and 2 on the figure below):
Right click on the orange line (“<a style=...”) and choose “Copy CSS Path”. Now the CSS path of the selected HTML element is on the clipboard, which is “html body div.wrapper center a” in this example.
Now let’s write a script that prints this part of the HTML source:
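    import requests
    import bs4

    def main():
        r = requests.get("http://developerexcuses.com/")
        # name the parser explicitly so bs4 does not have to guess
        soup = bs4.BeautifulSoup(r.text, "html.parser")
        # select() returns a list of matching tags, so take the first one
        print(soup.select("html body div.wrapper center a")[0].text)

    if __name__ == "__main__":
        main()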
In your program you want to change the working directory temporarily, do some job there, then switch back to the original directory. Say you want to download some images to /tmp. When done, you want to get back to the original location correctly, even if an exception was raised at the temporary location.
Let’s see the following example. We have a script, say at /home/jabba/python/fetcher.py. We want to download some images to /tmp, then work with them. After the download we want to create a subfolder “process” in the same directory where the script fetcher.py is located. We want to collect some extra info about the downloaded images and we want to store these pieces of information in the “process” folder.
    import os

    def download(li, folder):
        try:
            backup = os.getcwd()
            os.chdir(folder)
            for img in li:
                pass  # download img somehow
            os.chdir(backup)
        except:
            pass  # problem with download, handle it

    def main():
        # step 1: download images to /tmp
        li = ["http://...1.jpg", "http://...2.jpg", "http://...3.jpg"]
        download(li, "/tmp")
        # step 2: create a "process" dir. HERE (where the script was launched)
        os.mkdir("process")
        # ...do some extra work...
There is a problem with the download method. If an image cannot be downloaded correctly and an exception occurs, we return from the method. However, os.chdir(backup) is not executed and we remain in the /tmp folder! In main() in step 2 the process directory will be created in /tmp and not in the folder where we wanted it to be.
Well, you can always add a finally block to the exception handler and place os.chdir(backup) there, but it’s easy to forget. Is there an easier solution?
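For reference, the finally variant could look something like the sketch below. The helper name `run_in_dir` and its `job` callback are hypothetical, introduced only to keep the example self-contained.

```python
import os


def run_in_dir(folder, job):
    """Call job() inside folder, then always return to the old directory."""
    backup = os.getcwd()
    os.chdir(folder)
    try:
        return job()
    finally:
        # Runs on normal exit and when job() raises an exception.
        os.chdir(backup)
```

It works, but every function that changes the directory has to repeat the same try/finally dance.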
Yes, there is an easier solution. Use a context manager.
The previous example with a context manager:
    import os

    def download(li, folder):
        with ChDir(folder):
            for img in li:
                pass  # download img somehow

    def main():
        # step 1: download images to /tmp
        li = ["http://...1.jpg", "http://...2.jpg", "http://...3.jpg"]
        download(li, "/tmp")
        # step 2: create a "process" dir. HERE (where the script was launched)
        os.mkdir("process")
        # ...do some extra work...
And now the source code of ChDir:

    import os

    class ChDir(object):
        """
        Step into a directory temporarily.
        """
        def __init__(self, path):
            self.old_dir = os.getcwd()
            self.new_dir = path

        def __enter__(self):
            os.chdir(self.new_dir)

        def __exit__(self, *args):
            os.chdir(self.old_dir)
ChDir is a context manager, so you use it in a with block. At the beginning of the block you enter the given folder. When you leave the with block (even if you leave because of an exception), you are put back in the folder where you were before entering the with block.
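The same behaviour can also be sketched with the standard library’s contextlib.contextmanager decorator instead of a class. The function name `working_directory` is my own choice for this illustration.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Step into a directory temporarily (generator-based variant)."""
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Executed when the with block ends, even after an exception.
        os.chdir(old_dir)
```

Usage is the same as with the class: “with working_directory("/tmp"):” followed by the work you want to do there.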
Following this discussion thread @reddit, someone suggested using the PyFilesystem library. I think PyFilesystem is a very good solution but it may be too much for a short script. It’s like shooting a sparrow with a cannon :) For a simple script ChDir is good enough for me. For a serious application, check out PyFilesystem.
You want to download something to your local machine.
You can use the wget module for this purpose:
    import wget

    wget.download(url)
It cannot be any simpler. You can find wget on PyPI. Installation via pip.
Note that the module uses urllib.urlretrieve for downloading, not the wget command-line program.
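If you don’t want an extra dependency, a rough sketch of the same idea with the standard library alone could look like this. It is not the code of the wget module; the function name `download` and the fallback file name are assumptions. In Python 3 the function lives in urllib.request, while in Python 2 it was urllib.urlretrieve.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def download(url, out_dir="."):
    """Save url into out_dir and return the path of the saved file."""
    # Use the last component of the URL path as the local file name.
    # Assumption: fall back to "download.bin" when the path has no name.
    filename = posixpath.basename(urlsplit(url).path) or "download.bin"
    target = os.path.join(out_dir, filename)
    urlretrieve(url, target)
    return target
```

The sketch has no progress bar and does not avoid overwriting existing files, so the wget module is still the more convenient choice.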