sanitizing tweets

February 12, 2018

Problem
You have the text of a tweet and you want to get rid of the bullshit (smileys, emojis, etc.)

Solution
See https://github.com/s/preprocessor. It’s customizable: you can select what to remove (URLs, smileys, etc.).
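
If you only need a rough cleanup and don't want an extra dependency, here is a minimal sketch using nothing but the standard re module. It is not the API of the library above; the patterns and the function name clean_tweet are my own assumptions, and the emoji ranges are far from complete.

```python
import re

# All patterns below are illustrative assumptions, not an exhaustive definition.
URL_RE = re.compile(r"https?://\S+")                        # http(s) links
MENTION_RE = re.compile(r"@\w+")                            # @user mentions
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # common emoji blocks
SMILEY_RE = re.compile(r"[:;=]-?[)(DPp]")                   # text smileys like :) ;-) =D

def clean_tweet(text):
    """Replace URLs, mentions, emojis and smileys with a space, then tidy whitespace."""
    for pattern in (URL_RE, MENTION_RE, EMOJI_RE, SMILEY_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

print(clean_tweet("Nice talk by @alice see https://example.com/a?b=1 :)"))
```

URLs are removed first so that the smiley pattern never sees the punctuation inside a link.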

What are the built-in functions?

January 19, 2018

Problem
How do you figure out the built-in functions in Python? Of course, you can look them up in the documentation, but the exercise here is to list them in the Python shell.

Solution

In [1]: import builtins

In [2]: dir(builtins)
Out[2]: 
['ArithmeticError',
 'AssertionError',
 'AttributeError',
 'BaseException',
 'BlockingIOError',
 'BrokenPipeError',
 'BufferError',
 'BytesWarning',
 'ChildProcessError',
 'ConnectionAbortedError',
 'ConnectionError',
 'ConnectionRefusedError',
 'ConnectionResetError',
 'DeprecationWarning',
 'EOFError',
 'Ellipsis',
 'EnvironmentError',
 'Exception',
 'False',
 'FileExistsError',
 'FileNotFoundError',
 'FloatingPointError',
 'FutureWarning',
 'GeneratorExit',
 'IOError',
 'ImportError',
 'ImportWarning',
 'IndentationError',
 'IndexError',
 'InterruptedError',
 'IsADirectoryError',
 'KeyError',
 'KeyboardInterrupt',
 'LookupError',
 'MemoryError',
 'ModuleNotFoundError',
 'NameError',
 'None',
 'NotADirectoryError',
 'NotImplemented',
 'NotImplementedError',
 'OSError',
 'OverflowError',
 'PendingDeprecationWarning',
 'PermissionError',
 'ProcessLookupError',
 'RecursionError',
 'ReferenceError',
 'ResourceWarning',
 'RuntimeError',
 'RuntimeWarning',
 'StopAsyncIteration',
 'StopIteration',
 'SyntaxError',
 'SyntaxWarning',
 'SystemError',
 'SystemExit',
 'TabError',
 'TimeoutError',
 'True',
 'TypeError',
 'UnboundLocalError',
 'UnicodeDecodeError',
 'UnicodeEncodeError',
 'UnicodeError',
 'UnicodeTranslateError',
 'UnicodeWarning',
 'UserWarning',
 'ValueError',
 'Warning',
 'ZeroDivisionError',
 '__IPYTHON__',
 '__build_class__',
 '__debug__',
 '__doc__',
 '__import__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'abs',
 'all',
 'any',
 'ascii',
 'bin',
 'bool',
 'bytearray',
 'bytes',
 'callable',
 'chr',
 'classmethod',
 'compile',
 'complex',
 'copyright',
 'credits',
 'delattr',
 'dict',
 'dir',
 'display',
 'divmod',
 'enumerate',
 'eval',
 'exec',
 'filter',
 'float',
 'format',
 'frozenset',
 'get_ipython',
 'getattr',
 'globals',
 'hasattr',
 'hash',
 'help',
 'hex',
 'id',
 'input',
 'int',
 'isinstance',
 'issubclass',
 'iter',
 'len',
 'license',
 'list',
 'locals',
 'map',
 'max',
 'memoryview',
 'min',
 'next',
 'object',
 'oct',
 'open',
 'ord',
 'pow',
 'print',
 'property',
 'range',
 'repr',
 'reversed',
 'round',
 'set',
 'setattr',
 'slice',
 'sorted',
 'staticmethod',
 'str',
 'sum',
 'super',
 'tuple',
 'type',
 'vars',
 'zip']
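
The list above also contains exceptions, constants and dunder names, and (since it was produced in IPython) a few names injected by IPython such as get_ipython. A minimal sketch to keep only the lowercase callables follows; the helper name builtin_callables is my own, and the "starts with a lowercase letter" rule is just a heuristic.

```python
import builtins

def builtin_callables():
    """Return the names of public, lowercase callables found in the builtins module."""
    names = []
    for name in dir(builtins):
        if name.startswith("_"):
            continue                      # skip __import__, __doc__, etc.
        obj = getattr(builtins, name)
        # exception classes and True/False/None start with an uppercase letter
        if callable(obj) and name[0].islower():
            names.append(name)
    return names

print(builtin_callables())
```

Note that classes like int and dict stay in the result, since they are callable too.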
Categories: python

BASE64 as URL parameter

January 1, 2018

Problem
In a REST API, I wanted to pass a URL as a BASE64-encoded string, e.g. “http://host/api/v2/url/aHR0cHM6...”. It worked well for a while, but then I got an error for one URL. As it turned out, a BASE64 string can contain the “/” sign, and that caused the problem.

Solution
Replace the “+” and “/” signs with “-” and “_”, respectively. Fortunately, Python’s base64 module has functions for that: urlsafe_b64encode() and urlsafe_b64decode().

Here are my modified, URL-safe functions:

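import base64

def base64_to_str(b64):
    # URL-safe BASE64 text -> original string
    return base64.urlsafe_b64decode(b64.encode()).decode()

def str_to_base64(s):
    # string -> URL-safe BASE64 text ("-" and "_" instead of "+" and "/")
    data = base64.urlsafe_b64encode(s.encode())
    return data.decode()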
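
To see the difference between the standard and the URL-safe alphabet, here is a small self-contained sketch. The input "???>>>" is an artificial string, chosen only because its standard encoding happens to contain both “/” and “+”.

```python
import base64

def str_to_base64(s):
    return base64.urlsafe_b64encode(s.encode()).decode()

def base64_to_str(b64):
    return base64.urlsafe_b64decode(b64.encode()).decode()

text = "???>>>"                                   # artificial sample input
standard = base64.b64encode(text.encode()).decode()
url_safe = str_to_base64(text)

print(standard)                 # Pz8/Pj4+
print(url_safe)                 # Pz8_Pj4-
print(base64_to_str(url_safe))  # ???>>>
```

The padding character “=” can still appear at the end of a URL-safe string; only “+” and “/” are swapped out.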

You can also quote and unquote a URL instead of using BASE64:

>>> url = "https://www.youtube.com/watch?v=V6w24Lg3zTI"
>>>
>>> import urllib.parse
>>>
>>> new = urllib.parse.quote(url)
>>>
>>> new
'https%3A//www.youtube.com/watch%3Fv%3DV6w24Lg3zTI'    # notice the "/" signs!
>>>
>>> urllib.parse.quote(url, safe='')
'https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DV6w24Lg3zTI'    # no "/" signs!
>>>
>>> new = urllib.parse.quote(url, safe='')
>>>
>>> urllib.parse.unquote(new)
'https://www.youtube.com/watch?v=V6w24Lg3zTI'
Categories: python

convert a file to UTF-8-encoded text

December 16, 2017

I wrote a simple script that takes an input file, changes its character encoding to UTF-8, and prints the result to the screen.

It’s actually a wrapper around the Unix commands “file” and “iconv”. The goal was to make its usage as simple as possible. The script is here: to_utf8.py.

Usage:

$ to_utf8.py input.txt

The program tries to detect the encoding of the input file.
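
The idea of wrapping “file” and “iconv” can be sketched roughly like this. This is not the code of to_utf8.py; the function names are hypothetical, and I assume a “file” command that supports the -b and --mime-encoding options and an “iconv” that accepts the encoding name reported by it.

```python
import subprocess

def detect_encoding(path):
    """Ask the Unix `file` command for the character encoding of `path`."""
    result = subprocess.run(
        ["file", "-b", "--mime-encoding", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()    # e.g. "iso-8859-1" or "utf-8"

def iconv_command(path, encoding):
    """Build the iconv call that converts `path` from `encoding` to UTF-8."""
    return ["iconv", "-f", encoding, "-t", "UTF-8", path]

def print_as_utf8(path):
    """Detect the encoding, then let iconv write the converted text to stdout."""
    subprocess.run(iconv_command(path, detect_encoding(path)), check=True)
```

Calling print_as_utf8("input.txt") would then do the same job as the usage line above. If “file” reports something like “binary”, iconv will most likely refuse it, so a real script needs some error handling here.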

Categories: bash, python

work in a temp. dir. and delete it when done

December 11, 2017

Problem
You want to work in a temp. directory, and delete it completely when you are done. You also need the name of this temp. folder.

Solution
You can write “with tempfile.TemporaryDirectory() as dirpath:”. The variable dirpath holds the name of the temp. dir., and the context manager removes the directory automatically when you leave the with block. Nice and clean.

import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as dirpath:
    fp = Path(dirpath, "data.txt")
    # create fp, process it, etc.

# when you get here, dirpath is removed recursively

More info in the docs.
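
Here is a small runnable sketch that shows the cleanup actually happening. The function name and the file name data.txt are just for illustration.

```python
import tempfile
from pathlib import Path

def count_lines_via_temp_file(lines):
    """Write lines to a file in a temp. dir., read them back, report what exists."""
    with tempfile.TemporaryDirectory() as dirpath:
        fp = Path(dirpath, "data.txt")
        fp.write_text("\n".join(lines), encoding="utf-8")
        n = len(fp.read_text(encoding="utf-8").splitlines())
        existed_inside = fp.exists()          # True: we are still in the with block
    exists_after = Path(dirpath).exists()     # False: the directory is gone
    return n, existed_inside, exists_after

print(count_lines_via_temp_file(["a", "b"]))  # (2, True, False)
```

The name in dirpath is still available after the block; it just doesn't point to an existing directory anymore.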

Categories: python

extract e-mails from a file

October 10, 2017

Problem
You have a text file and you want to extract all the e-mail addresses from it. For research purposes, of course.

Solution

#!/usr/bin/env python3

import re
import sys

def extract_emails_from(fname):
    # errors='replace' keeps the script alive on lines with undecodable bytes
    with open(fname, errors='replace') as f:
        for line in f:
            match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
            for e in match:
                if '?' not in e:
                    print(e)

def main():
    fname = sys.argv[1]
    extract_emails_from(fname)

##############################################################################

if __name__ == "__main__":
    if len(sys.argv) == 1:
        print("Error: provide a text file!", file=sys.stderr)
        sys.exit(1)
    # else
    main()

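I had character encoding problems with some lines where the original program died with an exception. Using “open(fname, errors='replace')” replaces undecodable bytes with the Unicode replacement character (U+FFFD, often displayed as a “?”), so the script keeps running. The extra check before printing an e-mail to the screen is just a safeguard: the regex itself matches neither “?” nor U+FFFD.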

The core of the script is the regex to find e-mails. That tip is from here.
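
To get a feeling for what the regex matches, here is a small sketch on a sample string. The addresses are made up. Since the pattern also accepts dots and hyphens, a sentence-ending period gets glued to the match, so the helper (find_emails, my own name) strips such characters from both ends.

```python
import re

EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')

def find_emails(text):
    """Return the e-mail-like strings in text, without leading/trailing '.' or '-'."""
    return [m.strip('.-') for m in EMAIL_RE.findall(text)]

sample = "Write to alice@example.com or bob.smith@mail.example.org."
print(EMAIL_RE.findall(sample))   # the second match ends with a "."
print(find_emails(sample))        # the trailing "." is stripped
```

This is still a very permissive pattern: it doesn't validate addresses, it only finds candidates.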

Categories: python

visiting Finland

August 31, 2017

Last week I gave a two-day intensive introductory Python course at the University of Jyväskylä, Finland. It went well :)

The course was for students who had already learnt programming but had never used Python before. We covered the following topics: introduction, strings, lists, loops, tuple data type, list comprehension, control structures, functions, set, dictionary, file handling. We also solved a lot of exercises. The total length of the course was 2 x 5 hours.

Categories: python