14th November 2020
Let's build a random word generator in Python to generate novel pronounceable words. This kind of tool has a lot of uses — we could use it to generate potential domain names, or ideas for business names, or names for characters in a fantasy novel, or unique, non-dictionary words to use in passphrases.
We'll use a Markov chain to generate new words based on a corpus of sample words — this way we can generate words with whatever 'flavour' we like, just by tailoring the input data.
You can find the code for this project on GitHub.
We actually don't need very much code at all — this is the full listing for our Markov generator:
import random

class WordGenerator:

    def __init__(self):
        # Maps each pair of characters to a list of every character
        # that has followed that pair in the sample words.
        self.table = {}
        # Every starting pair of characters seen in the sample words.
        self.start = []

    def add_word(self, word):
        size = len(word)
        if size < 3:
            return
        self.start.append((word[0], word[1]))
        # Record each overlapping triple of characters: the first two
        # form the key, the third is appended to that key's list.
        for i in range(0, size - 2):
            c1, c2, c3 = word[i], word[i + 1], word[i + 2]
            self.table.setdefault((c1, c2), []).append(c3)
        # Mark the final pair of characters as a possible end-point.
        c1, c2 = word[size - 2], word[size - 1]
        self.table.setdefault((c1, c2), []).append(None)

    def get_word(self, max_len=12):
        c1, c2 = random.choice(self.start)
        word = [c1, c2]
        while len(word) < max_len:
            # Choose the next character from the list associated with
            # the current pair; None means we can stop here.
            next_char = random.choice(self.table[(c1, c2)])
            if next_char is None:
                break
            word.append(next_char)
            c1, c2 = c2, next_char
        return ''.join(word)
So how does it work? We initialize a generator instance and feed in sample words one at a time:
generator = WordGenerator()
generator.add_word("hotel")
generator.add_word("host")
The generator uses these sample words to populate a lookup table, associating each pair of characters in the input with a list of all the characters which have followed that pair.
For example, after adding the single word "hotel" to the generator, its lookup table looks like this:
(h, o) --> [t]
(o, t) --> [e]
(t, e) --> [l]
(e, l) --> [None]
Each key is a tuple of two characters; each value is a list of characters. We use the special value None to record when a pair of characters marked the end of a word.
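You can verify this in the REPL by adding a word and printing the table directly:

>>> generator = WordGenerator()
>>> generator.add_word("hotel")
>>> generator.table
{('h', 'o'): ['t'], ('o', 't'): ['e'], ('t', 'e'): ['l'], ('e', 'l'): [None]}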
If we add the words "host", "home", and "hostel" to the generator, its table will expand to look like the following:
(h, o) --> [t, s, m, s]
(o, t) --> [e]
(t, e) --> [l, l]
(e, l) --> [None, None]
(o, s) --> [t, t]
(s, t) --> [None, e]
(o, m) --> [e]
(m, e) --> [None]
(You might have noticed that our algorithm can add duplicate entries to a list — this is deliberate. The number of times a character appears in a list reflects its relative frequency in the input data and determines its probability of being selected for output.)
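Here's a quick way to see that weighting in action. In the table above the list for (h, o) contains 's' twice, so sampling from it should return 's' roughly half the time:

import random
from collections import Counter

# 's' appears twice in the list, so random.choice picks it with
# probability 2/4; 't' and 'm' each get 1/4.
followers = ['t', 's', 'm', 's']
counts = Counter(random.choice(followers) for _ in range(10_000))
print(counts)  # e.g. Counter({'s': 5012, 't': 2521, 'm': 2467})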
The more words you add, the higher the quality of your output will be. You can safely add thousands of words to a generator — while I was testing this code I added a list of more than 450,000 English words to a test generator and my laptop chewed through the list in less than three seconds.
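Loading a big word list is straightforward. Here's a minimal sketch, assuming a plain-text file with one word per line (words.txt is a stand-in filename):

generator = WordGenerator()

# Feed every line of the file into the generator as a sample word.
with open("words.txt") as f:
    for line in f:
        generator.add_word(line.strip().lower())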
To generate an output word we pick a starting pair of characters at random, then select the next character by choosing randomly from the list of values associated with that pair. We continue appending characters in this way until we reach the maximum length limit or we select a None value from a list, indicating that we've reached a natural end-point for the word.
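Once a generator has been loaded with samples, producing candidates is just a loop:

for _ in range(10):
    print(generator.get_word(max_len=10))

Note that max_len is a hard cap: a word that hits it is cut off mid-stream rather than ending at a natural stopping point.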
So what kind of output do we get? It's mostly junk of course but with the occasional jewel. I tried feeding in a list of roughly 6,500 English nouns as input and got the following output:
gostion
ovessy
bandal
shoption
repatity
whiver
boation
phydry
snodrocash
choment
scry
totent
furger
sner
pookeratry
hummist
I think I've seen emails in my spam folder about investing in the Snodrocash blockchain.
Feeding in a list of more than 25,000 personal names gave me the following output:
Yonal
Gurtov
Denris
Suser
Frena
Ari
Aghudia
Artxoang
Zsum
Neerislaa
Zerguona
Janciery
Nia
Myrall
Valdorina
Walim
That's the lineup for a best-selling fantasy novel right there. (Okay, maybe a bottom-shelf also-ran with a dodgy cover illustration, but still.)
You might be concerned that the size of our generator instance is unbounded — if we keep appending characters every time we add a word then the generator will keep getting bigger and bigger without limit.
Fortunately we can fix this issue (at the cost of adding some extra complexity) by recording character frequencies instead of simply storing the characters themselves.
I'm not going to list the code for this here but you can find it in the ImprovedGenerator class in the repository.
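The repository version is the one to use, but the core idea is easy to sketch: map each pair of characters to a count of its followers instead of a list with duplicates, and let random.choices do the weighted selection. Here's a minimal illustration of that idea (not the ImprovedGenerator code itself, which differs in detail):

import random
from collections import Counter

class FrequencyWordGenerator:
    # Sketch of the frequency-counting idea: each table entry maps a
    # pair of characters to a Counter of followers, so the table stays
    # bounded no matter how many times the same words are added.

    def __init__(self):
        self.table = {}         # (c1, c2) -> Counter of next chars (None = end)
        self.start = Counter()  # starting pair -> count

    def add_word(self, word):
        size = len(word)
        if size < 3:
            return
        self.start[(word[0], word[1])] += 1
        for i in range(size - 2):
            key = (word[i], word[i + 1])
            self.table.setdefault(key, Counter())[word[i + 2]] += 1
        key = (word[size - 2], word[size - 1])
        self.table.setdefault(key, Counter())[None] += 1

    def get_word(self, max_len=12):
        # random.choices accepts weights, so the counts give us the same
        # relative frequencies as the list-based version.
        pairs = list(self.start)
        c1, c2 = random.choices(pairs, weights=self.start.values())[0]
        word = [c1, c2]
        while len(word) < max_len:
            counts = self.table[(c1, c2)]
            next_char = random.choices(list(counts), weights=counts.values())[0]
            if next_char is None:
                break
            word.append(next_char)
            c1, c2 = c2, next_char
        return ''.join(word)

Because each entry stores a fixed set of counts rather than an ever-growing list, the table's size is bounded by the number of distinct character pairs and their followers, however many words we feed in.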
You can find the code for this project on GitHub, along with a bunch of sample word lists.