Posts Tagged ‘html’

remove tags from HTML

July 13, 2016

Problem
You have an HTML string and you want to remove all the tags from it.

Solution
Install the package “bleach” via pip. Then:

>>> import bleach
>>> html = "Her <h1>name</h1> was <i>Jane</i>."
>>> cleaned = bleach.clean(html, tags=[], attributes={}, strip=True)
>>> html
'Her <h1>name</h1> was <i>Jane</i>.'
>>> cleaned
'Her name was Jane.'

Tip from here.
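
If installing bleach is not an option, the same idea can be sketched with the standard library alone. This is a minimal sketch, not a sanitizer: the names TagStripper and strip_tags are my own (hypothetical), and unlike bleach the result is plain text with entities decoded, so it should not be inserted back into an HTML page without escaping.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: collects only the text between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of a tag.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string, with all tags dropped."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.chunks)


print(strip_tags("Her <h1>name</h1> was <i>Jane</i>."))  # Her name was Jane.
```

Note that the text inside script and style elements is kept as well, so this is only suitable for simple snippets like the one above.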

Categories: python

get the title of a web page

September 8, 2015

Problem
You need the title of a web page.

Solution

from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)

I found the solution here.
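
When BeautifulSoup is not available, a rough equivalent can be written with the standard library. The sketch below is an assumption-laden alternative, not the method from the tip above: TitleParser and get_title are hypothetical names, and the function returns None when the page has no title instead of raising an exception.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: remembers the text inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside the title element.
        if self.in_title:
            self.title_parts.append(data)


def get_title(html):
    """Return the stripped page title, or None if there is no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    return title or None


page = "<html><head><title>Example Domain</title></head><body></body></html>"
print(get_title(page))  # Example Domain
```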

Categories: python

Jinja2 example for generating a local file using a template

February 25, 2014

Here I want to show you how to generate an HTML file (a local file) using a template with the Jinja2 template engine.

Python source (proba.py)

#!/usr/bin/env python

import os
from jinja2 import Environment, FileSystemLoader

PATH = os.path.dirname(os.path.abspath(__file__))
TEMPLATE_ENVIRONMENT = Environment(
    autoescape=False,
    loader=FileSystemLoader(os.path.join(PATH, 'templates')),
    trim_blocks=False)


def render_template(template_filename, context):
    return TEMPLATE_ENVIRONMENT.get_template(template_filename).render(context)


def create_index_html():
    fname = "output.html"
    urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']
    context = {
        'urls': urls
    }
    #
    with open(fname, 'w') as f:
        html = render_template('index.html', context)
        f.write(html)


def main():
    create_index_html()

########################################

if __name__ == "__main__":
    main()

Jinja2 template (templates/index.html)

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>Proba</title>
</head>
<body>
<center>
    <h1>Proba</h1>
    <p>{{ urls|length }} links</p>
</center>
<ol align="left">
{% for url in urls -%}
<li><a href="{{ url }}">{{ url }}</a></li>
{% endfor -%}
</ol>
</body>
</html>

Resulting output
If you execute proba.py, it writes the following to output.html:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>Proba</title>
</head>
<body>
<center>
    <h1>Proba</h1>
    <p>3 links</p>
</center>
<ol align="left">
<li><a href="http://example.com/1">http://example.com/1</a></li>
<li><a href="http://example.com/2">http://example.com/2</a></li>
<li><a href="http://example.com/3">http://example.com/3</a></li>
</ol>
</body>
</html>

You can find all these files here (GitHub link).

Categories: python

Prettify HTML with BeautifulSoup

April 3, 2011

With the Python library BeautifulSoup (BS), you can extract information from HTML pages very easily. However, there is one thing you should keep in mind: HTML pages are often malformed. BS tries to repair such a page, which means that BS’s internal representation can differ slightly from the original source. Thus, when you want to locate a part of an HTML page, you should work with the internal representation.

The following script takes a URL, downloads the page, and prints it in corrected form, i.e. it shows how BS stores the given page. You can also use it to prettify the source:

#!/usr/bin/env python3

# prettify.py
# Usage: prettify <URL>

import sys
import urllib.request
from bs4 import BeautifulSoup

# browser-like user agent (the default Python one is blocked by some sites)
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    with urllib.request.urlopen(request) as page:
        text = page.read()

    # the parser is named explicitly so that the result is the same on every machine
    soup = BeautifulSoup(text, 'html.parser')
    return soup.prettify()
# process(url)

def main():
    if len(sys.argv) == 1:
        print("Jabba's HTML Prettifier v0.1")
        print("Usage: %s <URL>" % sys.argv[0])
        sys.exit(1)
    # else, if at least one parameter was passed
    print(process(sys.argv[1]))
# main()

if __name__ == "__main__":
    main()

You can find the latest version of the script at https://github.com/jabbalaci/Bash-Utils.

Categories: python

Create a temporary file with unique name

February 19, 2011

Problem

I wanted to download an HTML file with Python, store it in a temporary file, then convert this file to PDF by calling an external program.

Solution #1

#!/usr/bin/env python

import os
import tempfile

temp = tempfile.NamedTemporaryFile(prefix='report_', suffix='.html', dir='/tmp', delete=False)

html_file = temp.name
(dirName, fileName) = os.path.split(html_file)
fileBaseName = os.path.splitext(fileName)[0]
pdf_file = os.path.join(dirName, fileBaseName + '.pdf')

print(html_file)   # /tmp/report_kWKEp5.html
print(pdf_file)    # /tmp/report_kWKEp5.pdf
# calling of HTML to PDF converter is omitted

See the documentation of tempfile.NamedTemporaryFile here.
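
The same steps can be wrapped in a small function that also writes some content and closes the file before the converter would run. This is only a sketch under my own assumptions: make_report_paths is a hypothetical name, the placeholder HTML stands in for the downloaded page, and dir=None lets tempfile pick the system's temporary directory instead of hard-coding /tmp.

```python
import os
import tempfile


def make_report_paths(directory=None):
    """Create a uniquely named HTML file and derive the matching PDF name."""
    # delete=False keeps the file on disk after the with-block closes it.
    with tempfile.NamedTemporaryFile(mode="w", prefix="report_", suffix=".html",
                                     dir=directory, delete=False) as temp:
        temp.write("<html><body>placeholder</body></html>")
        html_file = temp.name
    # Swap the extension; the directory and the unique part stay the same.
    pdf_file = os.path.splitext(html_file)[0] + ".pdf"
    return html_file, pdf_file


html_file, pdf_file = make_report_paths()
print(html_file)
print(pdf_file)
# calling of the HTML to PDF converter would go here
os.remove(html_file)  # with delete=False, cleaning up is our job
```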

Solution #2 (update 20110303)

I had a problem with the previous solution. It works well from the command line, but when I called the script from crontab, it stopped at the line “tempfile.NamedTemporaryFile”. No exception, nothing… So I had to use a different approach:

from time import time

temp = "report.%.7f.html" % time()
print(temp)    # report.1299188541.3830960.html

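The function time() returns the current time in seconds since the epoch as a floating-point number. Two threads or processes that call it at almost the same moment could end up with the same name, so this is not safe for concurrent use, but that was not a concern in my case. This version works fine when called from crontab.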
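
If name clashes between threads or processes are a concern, a random UUID can replace the timestamp. This is a sketch of a different technique than the one above; unique_report_name is a hypothetical name, and like the time-based version it only builds a name and does not create the file.

```python
import uuid


def unique_report_name(prefix="report.", suffix=".html"):
    """Build a file name around a random 32-character hex string."""
    # uuid4() is random, so two calls are practically guaranteed to differ,
    # even when they happen at the same instant in different threads.
    return "%s%s%s" % (prefix, uuid.uuid4().hex, suffix)


print(unique_report_name())
```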

Learn more

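Update (20150712): if you need a temporary file name in the current directory (the file itself is deleted as soon as the object is garbage-collected, so only the name is left):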

>>> import tempfile
>>> tempfile.NamedTemporaryFile(dir='.').name
'/home/jabba/tmpKrBzoY'

Update (20150910): if you need a temporary directory:

import tempfile
import shutil

dirpath = tempfile.mkdtemp()    # the temp dir. is created
# ... do stuff with dirpath
shutil.rmtree(dirpath)

This tip is from here.
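
On Python 3.2 or later, tempfile.TemporaryDirectory does the cleanup for you, even if an exception is raised in between. A minimal sketch, where the prefix and the file name report.html are just illustrative choices of mine:

```python
import os
import tempfile

with tempfile.TemporaryDirectory(prefix="report_") as dirpath:
    # the temp dir. exists for the duration of the with-block
    html_file = os.path.join(dirpath, "report.html")
    with open(html_file, "w") as f:
        f.write("<html></html>")
    existed_inside = os.path.exists(html_file)

print(existed_inside)            # True
print(os.path.exists(dirpath))   # False, the dir. and its content are gone
```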

Categories: python