Posts Tagged ‘text’

automatic text summarization

November 2, 2015

See sumy (https://github.com/miso-belica/sumy), a Python library for automatic text summarization. Its README also lists alternative projects.

Categories: python

Is a file binary?

June 17, 2014

Problem
I want to process all text files in a folder recursively (actually, I want to extract all URLs from them). However, their extensions are not necessarily .txt. How can I separate text files from binary files?

Solution
I found a solution in a Stack Overflow thread (http://stackoverflow.com/questions/898669). Here is my slightly modified version:

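def is_binary(fname):
    """
    Return True if the given file looks binary (contains a null byte).

    found at http://stackoverflow.com/questions/898669
    """
    CHUNKSIZE = 1024
    # Open in binary mode so the bytes are read as-is, without decoding.
    with open(fname, 'rb') as f:
        while True:
            chunk = f.read(CHUNKSIZE)
            # Compare against a bytes literal: in Python 3, '\0' in chunk
            # raises a TypeError because chunk is bytes, not str.
            if b'\0' in chunk: # found null byte
                return True
            if len(chunk) < CHUNKSIZE:
                break # done (end of file reached)

    return False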

If the function finds a null byte ('\0'), the file is considered binary. Note that it will also classify UTF-16-encoded text files as “binary”, because UTF-16 stores every ASCII character as two bytes, one of which is zero.
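To connect the check back to the original problem, here is a minimal sketch of how the recursive walk might look. The names looks_binary and text_files are hypothetical helpers, not part of the Stack Overflow answer. As an added assumption, files that start with a UTF-16 byte order mark are treated as text, which works around the UTF-16 caveat only for files that actually carry a BOM.

```python
import codecs
import os

CHUNKSIZE = 1024
# Assumption: a leading UTF-16 byte order mark means "this is text".
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def looks_binary(fname):
    """Null-byte check, except that files starting with a UTF-16 BOM count as text."""
    with open(fname, 'rb') as f:
        chunk = f.read(CHUNKSIZE)
        if chunk.startswith(UTF16_BOMS):
            return False
        while chunk:  # an empty read means end of file
            if b'\0' in chunk:
                return True
            chunk = f.read(CHUNKSIZE)
    return False


def text_files(root):
    """Yield the path of every file under root that does not look binary."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not looks_binary(path):
                yield path
```

Something like "for path in text_files('.'): ..." would then visit only the text files, whatever their extensions are. UTF-16 files without a BOM are still reported as binary.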

Categories: python