Reading (writing) unicode text from (to) files
Problem
You want to write some special characters to a file (e.g. f.write("voilá")), but you immediately get a Unicode error.
Solution
Instead of messing with the encode, decode methods, use the codecs module.
import codecs

# read
with codecs.open(fname, "r", "utf-8") as f:
    text = f.read()

# write
with codecs.open(tmp, "w", "utf-8") as to:
    to.write(text)
As you can see, its usage is very similar to the well-known open function.
This tip is from here.
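For a complete round trip, here is a minimal sketch. It assumes Python 2.6+ or Python 3 and swaps in io.open for codecs.open, since io.open takes an encoding argument in the same way and behaves the same on both versions. The helper name roundtrip and the temporary file are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file and read it back (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # only the path is needed; the file is reopened below
    try:
        # text mode with an explicit encoding: unicode in, unicode out
        with io.open(path, "w", encoding=encoding) as out:
            out.write(text)
        with io.open(path, "r", encoding=encoding) as inp:
            return inp.read()
    finally:
        os.remove(path)


original = u"voil\xe1"
print(roundtrip(original) == original)
```

The point is that the encoding is named once, when the file is opened, so no encode or decode calls are scattered through the rest of the code.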
Unicode box-drawing characters
I wanted to make a logo for my project PrimCom. Since it runs in the command line, I wanted to draw the logo with characters.
You can find a list of box-drawing characters here. I designed the logo on paper, translated the box characters to normal characters (this way it was easier to type them in), then translated the characters back with a script.
Python source:
#!/usr/bin/env python
# encoding: utf-8

from __future__ import (absolute_import, division,
                        print_function, unicode_literals)

import sys

# maps each easy-to-type letter to a box-drawing character
chars = {
    'a': '┌',
    'b': '┐',
    'c': '┘',
    'd': '└',
    'e': '─',
    'f': '│',
    'g': '┴',
    'h': '├',
    'i': '┬',
    'j': '┤',
    'k': '╷',
    'l': '┼',
}

logo = """
aeeb     aeeb
fabf     faec
fdcheiieejf aeeieeb
faejaljkkff fabfkkf
ff fffffffdejdcffff
dc dcdggggeegeegggc
"""

def main():
    # replace the known letters, pass everything else (spaces, newlines) through
    for c in logo:
        if c in chars:
            sys.stdout.write(chars[c])
        else:
            sys.stdout.write(c)

if __name__ == "__main__":
    main()
Output:
┌─────┐           ┌─────┐
│ ┌─┐ │           │ ┌───┘
│ └─┘ ├───┬─┬─────┤ │   ┌─────┬─────┐
│ ┌───┤ ┌─┼─┤ ╷ ╷ │ │   │ ┌─┐ │ ╷ ╷ │
│ │   │ │ │ │ │ │ │ └───┤ └─┘ │ │ │ │
└─┘   └─┘ └─┴─┴─┴─┴─────┴─────┴─┴─┴─┘
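The same letter-to-box translation can also be written as a small reusable function. This is only a sketch: the name translate and the reduced six-letter table are my own choices for illustration, not part of the original script.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# a reduced table (corners and straight lines only), enough to draw a frame
BOX = {
    'a': '┌', 'b': '┐', 'c': '┘',
    'd': '└', 'e': '─', 'f': '│',
}


def translate(text, table=BOX):
    """Replace every character found in table; keep all others as they are."""
    return "".join(table.get(ch, ch) for ch in text)


def main():
    # draws a small empty frame, three lines high
    print(translate("aeeeb\nf   f\ndeeec"))


if __name__ == "__main__":
    main()
```

Using table.get(ch, ch) does the lookup and the fall-through in one step, so spaces and newlines need no special case.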
Update (20140322)
If you want rainbow colors, check out the colout project.
monkeypatching the string type
Problem
“A monkey patch is a way to extend or modify the run-time code of dynamic languages without altering the original source code.” (via Wikipedia) That is, we have the standard library, and we want to add new features to it. For instance, in the stdlib a string cannot tell whether it is a palindrome or not, but we would like to extend the string type to support this feature:
>>> s = "racecar"
>>> print(s.is_palindrome())    # Warning! It won't work.
True
Is it possible in Python?
Solution
As pointed out in this thread, built-in types are implemented in C and you cannot modify them at runtime. As far as I know, Ruby allows this, but it doesn’t work in Python.
However, there is a workaround if you really want to do something like this. You can make a subclass of the built-in type and then you can extend it as you want. Example:
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)


class MyStr(unicode):
    """
    "monkeypatching" the unicode class

    It's not real monkeypatching, just a workaround.
    """
    def is_palindrome(self):
        return self == self[::-1]


def main():
    s = MyStr("radar")
    print(s.is_palindrome())

####################

if __name__ == "__main__":
    main()
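Here is a sketch of the same idea that subclasses str instead of unicode, so it also runs on Python 3, where the unicode name is gone. It also shows what happens when you try to patch the built-in type directly. The helper try_patch_builtin is a made-up name for illustration.

```python
class MyStr(str):
    """A str subclass with one extra method (a workaround, not a real patch)."""

    def is_palindrome(self):
        return self == self[::-1]


def try_patch_builtin():
    """Try to attach a method to the built-in str; return the error, if any."""
    try:
        str.is_palindrome = lambda self: self == self[::-1]
    except TypeError as exc:
        return exc  # built-in types refuse new attributes
    return None


print(MyStr("radar").is_palindrome())
print(type(try_patch_builtin()).__name__)
```

One thing to keep in mind: the inherited string methods return plain str objects, so MyStr("Radar").lower() is no longer a MyStr and has to be wrapped again before calling is_palindrome on it.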
Writing non-ASCII text to file
Problem
You download the source of an HTML page in a string and you want to save it in a file. However, you get a UnicodeDecodeError :(
Solution
foo = u'Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()
Here is how to read it back:
f = open('test', 'r')
print f.read().decode('utf8')
This tip is from here.
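The encode-on-write, decode-on-read pattern above can be sketched so that it behaves the same on Python 2 and 3, assuming the file is opened in binary mode. The helper names save_text and load_text and the temporary file are assumptions made for this example.

```python
# -*- coding: utf-8 -*-
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Encode unicode text to bytes and write the bytes to path."""
    with open(path, "wb") as f:
        f.write(text.encode(encoding))


def load_text(path, encoding="utf-8"):
    """Read the raw bytes from path and decode them back to unicode."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


fd, path = tempfile.mkstemp()
os.close(fd)  # only the path is needed
try:
    sample = u"\u0394, \u0419, \u3042, \u8449"  # a few of the characters above
    save_text(path, sample)
    print(load_text(path) == sample)
finally:
    os.remove(path)
```

Binary mode makes it explicit that the file holds bytes, and the with blocks close the file even if the write fails.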
‘ascii’ codec can’t encode character: ordinal not in range(128)
Problem
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 1: ordinal not in range(128)
Solution
def encode(text):
    """
    For printing unicode characters to the console.
    """
    return text.encode('utf-8')
Or:
import sys

reload(sys)    # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding("latin-1")

a = u'\xe1'
print str(a)    # no exception
This tip is from here.
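To see where the error comes from, here is a small sketch that triggers it on purpose and then avoids it with an explicit encoding. The function names are made up for illustration.

```python
def ascii_error(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


def encode_utf8(text):
    """Encode explicitly as UTF-8, which can represent every character."""
    return text.encode("utf-8")


a = u"\xe1"
print(ascii_error(a))          # the 'ordinal not in range(128)' message
print(encode_utf8(a) == b"\xc3\xa1")
```

If the output really must be ASCII, text.encode("ascii", "replace") substitutes a question mark for each character it cannot represent instead of raising.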
unicode to ascii
Problem
I had the following unicode string: “Kellemes Ünnepeket!” that I wanted to simplify to this: “Kellemes Unnepeket!”, that is, reduce “Ü” to “U”. Furthermore, most of the strings were normal ascii, only some of them were in unicode.
Solution
import unicodedata

title = ...    # get the string somehow
try:
    # if the title is a unicode string, normalize it
    title = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
except TypeError:
    # if it was not a unicode string => OK, do nothing
    pass
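The same normalization wrapped in a function, as a sketch. The name to_ascii is hypothetical, and unlike the snippet above it decodes the result back to text, so it returns a string rather than bytes on Python 3.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters and dropping the non-ASCII parts."""
    # NFKD splits "Ü" into a plain "U" followed by a combining diaeresis
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' silently drops everything that is not ASCII (the combining mark)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"Kellemes \xdcnnepeket!"))  # prints: Kellemes Unnepeket!
```

Note that this only works for characters that have a decomposition. Letters such as “ø” have none, so they are dropped entirely rather than replaced with “o”.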
Credits
I used the following resources: