### Archive

Posts Tagged ‘text generator’

## Generating pseudo random text using Markov chains

The following entry is based on this post: Generating pseudo random text with Markov chains using Python (by Shabda Raaj).

Problem
I’ve been interested for a long time in generating “random” texts using a given corpus. A naive way is to take words randomly and drop them together but it would result in an unreadable text. The words in the generated text should come in an order that gives the impression that the text is more or less legit :)

Solution
We will use Markov chains to solve this problem. In short, a Markov chain is a stochastic process with the Markov property. By this property the changes of state of the system depend only on the current state of the system, and not additionally on the state of the system at previous steps.

The algorithm for generating pseudo random text is the following:

1. Take two consecutive words from the corpus. We will build a chain of words and the last two words of the chain represent the current state of the Markov chain.
2. Look up in the corpus all the occurrences of the last two words (current state). If they appear more than once, select one of them randomly and add the word that follows them to the end of the chain. Now the current state is updated: it consists of the 2nd word of the former tail of the chain and the new word.
3. Repeat the previous step until you reach the desired length of the generated text.

When reading and splitting up a corpus to words, don’t remove commas, punctuations, etc. This way you can get a more realistic text.

Example
Let’s see this text:

```A is the father of B.
C is the father of A.
```

From this we can build the following dictionary:

```{('A', 'is'): ['the'],
('B.', 'C'): ['is'],
('C', 'is'): ['the'],
('father', 'of'): ['B.', 'A.'],
('is', 'the'): ['father', 'father'],
('of', 'B.'): ['C'],
('the', 'father'): ['of', 'of']}
```

The key is a tuple of two consecutive words. The value is a list of words that follow the two words in the key in the corpus. The value is a multiset, i.e. duplications are allowed. This way, if a word appears several times after the key, it will be selected with a higher probability.

Let’s start the generated sentence with “`A is`“. “`A is`” is followed by “`the`” (“`A is the`“). “`is the`” is followed by “`father`” (“`A is the father`“). “`the father`” is followed by “`of`” (“`A is the father of`“). At “`father of`” we have a choice: let’s pick “`A`” for instance. The end result is: “`A is the father of A.`“.

Python code
This is a basic version of the algorithm. Since the input corpus can be a UTF-8 file, I wrote it in Python 3 to suffer less with Unicode.

```#!/usr/bin/env python3
# encoding: utf-8

import sys
from pprint import pprint
from random import choice

EOS = ['.', '?', '!']

def build_dict(words):
"""
Build a dictionary from the words.

(word1, word2) => [w1, w2, ...]  # key: tuple; value: list
"""
d = {}
for i, word in enumerate(words):
try:
first, second, third = words[i], words[i+1], words[i+2]
except IndexError:
break
key = (first, second)
if key not in d:
d[key] = []
#
d[key].append(third)

return d

def generate_sentence(d):
li = [key for key in d.keys() if key[0][0].isupper()]
key = choice(li)

li = []
first, second = key
li.append(first)
li.append(second)
while True:
try:
third = choice(d[key])
except KeyError:
break
li.append(third)
if third[-1] in EOS:
break
# else
key = (second, third)
first, second = key

return ' '.join(li)

def main():
fname = sys.argv[1]
with open(fname, "rt", encoding="utf-8") as f:

words = text.split()
d = build_dict(words)
pprint(d)
print()
sent = generate_sentence(d)
print(sent)
if sent in text:
print('# existing sentence :(')

####################

if __name__ == "__main__":
if len(sys.argv) == 1:
print("Error: provide an input corpus file.")
sys.exit(1)
# else
main()
```

Tips
Try to choose a long corpus to work with.

In our version the current state consists of two words. If you decide to put more words (3 for instance) in the current state, then the text will look less random, but also, it will look less gibberish (see also this gist).