Extract text from a PDF file
Problem
You have a PDF file and you want to extract text from it.
Solution
You can use the PyPDF2 module for this purpose.
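import PyPDF2

def main():
    # Older PyPDF2 API; PyPDF2 3.0+ renames these to PdfReader,
    # len(reader.pages), reader.pages[0] and extract_text().
    with open('book.pdf', 'rb') as book:   # closed automatically
        pdfReader = PyPDF2.PdfFileReader(book)
        pages = pdfReader.numPages         # total number of pages
        page = pdfReader.getPage(0)        # 1st page
        text = page.extractText()
        print(text)

if __name__ == "__main__":
    main()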
Note that indexing starts at 0. So if you open your PDF with Adobe Reader, for instance, and you locate page 20, in the source code you must use getPage(19).
Links
- PyPDF2 on GitHub
Exercise
Write a program that extracts all pages of a PDF and saves the content of the pages to separate files, e.g. page0.txt, page1.txt, etc.
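One possible way to approach the exercise, as a rough sketch rather than a reference solution: it assumes the same older PyPDF2 API used above (PdfFileReader, numPages, getPage, extractText), and the helper names read_page_texts and save_pages are made up for this illustration.

```python
import os


def read_page_texts(pdf_path):
    """Return the text of every page, using the older PyPDF2 API from this post."""
    import PyPDF2  # imported here so save_pages() can be used without PyPDF2

    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


def save_pages(texts, out_dir='.'):
    """Write each text to page0.txt, page1.txt, ... and return the file paths."""
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(out_dir, 'page{}.txt'.format(i))
        with open(path, 'w', encoding='utf-8') as out:
            out.write(text)
        paths.append(path)
    return paths


# Usage (needs PyPDF2 and a book.pdf in the current directory):
# save_pages(read_page_texts('book.pdf'))
```

Splitting the reading and the writing into two functions keeps the file-naming part easy to try out on plain strings.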
pdfmanip
Today I wrote a simple PDF manipulation CLI tool. You can find it here: https://github.com/jabbalaci/pdfmanip .
Python tutorials of Full Circle Magazine in a single PDF
On my other blog, I wrote a post on how to extract the Python tutorials from Full Circle Magazine and join them in a single PDF.
For the lazy pigs, here is the PDF (6 MB). Get it while it’s hot :)
Create a temporary file with a unique name
Problem
I wanted to download an HTML file with Python, store it in a temporary file, then convert this file to PDF by calling an external program.
Solution #1
#!/usr/bin/env python

import os
import tempfile

# delete=False keeps the file on disk after it is closed
temp = tempfile.NamedTemporaryFile(prefix='report_', suffix='.html',
                                   dir='/tmp', delete=False)
html_file = temp.name
(dirName, fileName) = os.path.split(html_file)
fileBaseName = os.path.splitext(fileName)[0]
pdf_file = os.path.join(dirName, fileBaseName + '.pdf')

print(html_file)  # /tmp/report_kWKEp5.html
print(pdf_file)   # /tmp/report_kWKEp5.pdf

# calling of HTML to PDF converter is omitted
See the Python documentation of tempfile.NamedTemporaryFile.
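To show the "store it in a temporary file" step as well, here is a minimal sketch that writes some HTML into the temporary file before deriving the PDF name. The function name write_temp_html and the sample HTML string are invented for this example, and tempfile.gettempdir() is used in place of the hard-coded /tmp; the converter call is still left out.

```python
import os
import tempfile


def write_temp_html(html, prefix='report_'):
    """Save html to a uniquely named temp file; return (html_path, pdf_path)."""
    # delete=False keeps the file on disk after the with-block closes it
    with tempfile.NamedTemporaryFile(mode='w', prefix=prefix, suffix='.html',
                                     dir=tempfile.gettempdir(), delete=False,
                                     encoding='utf-8') as temp:
        temp.write(html)
    html_path = temp.name
    pdf_path = os.path.splitext(html_path)[0] + '.pdf'
    return html_path, pdf_path


html_path, pdf_path = write_temp_html('<html><body>Hello</body></html>')
print(html_path)
print(pdf_path)
# ... run the converter of your choice on html_path here ...
os.remove(html_path)  # clean up once the file is no longer needed
```

Since delete=False is used, removing the file afterwards is up to the caller.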
Solution #2 (update 20110303)
I had a problem with the previous solution. It works well from the command line, but when I tried to call that script from crontab, it stopped at the line “tempfile.NamedTemporaryFile”. No exception, nothing… So I had to use a different approach:
from time import time

temp = "report.%.7f.html" % time()
print(temp)  # report.1299188541.3830960.html
The function time() returns the time as a floating point number. Two threads or processes calling it at nearly the same moment could end up with the same name, so it may not be suitable in a multithreaded environment, but that was not an issue for me. This version works fine when called from crontab.
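If name collisions are a concern, one possible variation (not what I used, just a sketch) is to add the process ID and a short random token to the timestamp. The function name unique_report_name and the exact name layout are assumptions made for this example.

```python
import os
import uuid
from time import time


def unique_report_name(prefix='report', suffix='.html'):
    """Build a file name from the time, the process ID and a random token."""
    token = uuid.uuid4().hex[:8]  # 8 random hex digits
    return '{}.{:.7f}.{}.{}{}'.format(prefix, time(), os.getpid(), token, suffix)


print(unique_report_name())
```

This only builds a name; unlike tempfile, it does not create the file, so it is still possible (though unlikely) for another program to take the name first.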
Learn more
- tempfile – Create temporary filesystem resources (post by Doug Hellmann with lots of examples)
- Python doc on tempfile
Update (20150712): if you need a temp. file name in the current directory:
>>> import tempfile
>>> tempfile.NamedTemporaryFile(dir='.').name
'/home/jabba/tmpKrBzoY'
Update (20150910): if you need a temp. directory:
import tempfile
import shutil

dirpath = tempfile.mkdtemp()  # the temp dir. is created
# ... do stuff with dirpath
shutil.rmtree(dirpath)
This tip is from here.
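On Python 3.2 and later, tempfile.TemporaryDirectory can do the cleanup for you when used as a context manager. A small sketch; the file name scratch.txt and the variable existed_inside are only there for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirpath:
    # the temp dir. exists inside the with-block
    scratch = os.path.join(dirpath, 'scratch.txt')
    with open(scratch, 'w') as f:
        f.write('temporary data')
    existed_inside = os.path.isdir(dirpath)

# the directory and its contents are removed when the block exits
print(existed_inside, os.path.exists(dirpath))  # True False
```

This way the directory gets removed even if an exception is raised inside the block, which the mkdtemp() version would need a try/finally for.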