See https://github.com/miso-belica/sumy . In the README there is a list of alternative projects.
I want to process all text files in a folder recursively (specifically, I want to extract all URLs from them). However, their extensions are not necessarily .txt. How can text files be separated from binary files?
In this thread I found a solution. Here is my slightly modified version:
    def is_binary(fname):
        """
        Return true if the given filename is binary.
        found at http://stackoverflow.com/questions/898669
        """
        CHUNKSIZE = 1024
        with open(fname, 'rb') as f:
            while True:
                chunk = f.read(CHUNKSIZE)
                # the file is opened in binary mode, so compare bytes with bytes
                if b'\0' in chunk:  # found null byte
                    return True
                if len(chunk) < CHUNKSIZE:
                    break  # done
        return False
If it finds a null byte ('\0'), the file is considered to be binary. Note that it will also classify UTF-16-encoded text files as "binary", because UTF-16 stores every ASCII character with a zero byte next to it.
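To tie this back to the original goal (walking a folder and pulling URLs out of every text file), here is a minimal sketch. The function name find_urls and the pattern URL_RE are my own assumptions for illustration, and the regular expression is deliberately simple, so it will miss or over-match some real-world URLs. The is_binary helper is repeated so that the snippet runs on its own under Python 3.

```python
import os
import re

# Hypothetical, deliberately simple URL pattern (an assumption, not a full URL grammar).
URL_RE = re.compile(rb"https?://[^\s<>\"']+")


def is_binary(fname, chunksize=1024):
    """Null-byte heuristic: a file containing b'\\0' is treated as binary."""
    with open(fname, 'rb') as f:
        while True:
            chunk = f.read(chunksize)
            if b'\0' in chunk:
                return True
            if len(chunk) < chunksize:
                break
    return False


def find_urls(root):
    """Yield (path, url) pairs for every non-binary file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_binary(path):
                continue  # skip anything that looks binary
            with open(path, 'rb') as f:
                for match in URL_RE.finditer(f.read()):
                    # decode leniently; stray non-ASCII bytes become replacement characters
                    yield path, match.group().decode('ascii', errors='replace')
```

Usage would be something like `for path, url in find_urls('.'): print(path, url)`. Each file is read in binary mode so that an unknown text encoding cannot raise a decoding error halfway through the walk.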
Here is a mini cheat sheet for reading and writing a text file.
Read a text file line by line and write each line to another file (copy):
    # "with" closes both files automatically, even if an error occurs
    with open('./in.txt', 'r') as f1, open('./out.txt', 'w') as to:
        for line in f1:
            to.write(line)
Other ways to read from an open file object f:

    text = f.read()           # read the entire file
    line = f.readline()       # read one line at a time
    lineList = f.readlines()  # read the entire file as a list of lines
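The three read calls consume the file from the current position, so they interact when used on the same file object. A small sketch of that behavior, using a throwaway temporary file (the file contents and the variable names tmp and path are made up for the example):

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first\nsecond\nthird\n')
    path = tmp.name

with open(path, 'r') as f:
    text = f.read()           # whole file as one string
    f.seek(0)                 # rewind, otherwise the next read returns nothing
    line = f.readline()       # first line only, trailing newline included
    lineList = f.readlines()  # the lines that are left, as a list

os.remove(path)  # clean up the temporary file
```

Without the seek(0) call, readline() would return an empty string, because read() has already moved the position to the end of the file.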