14th November 2020
Let's build a random word generator in Python to generate novel pronounceable words. This kind of tool has a lot of uses — we could use it to generate potential domain names, or ideas for business names, or names for characters in a fantasy novel, or unique, non-dictionary words to use in passphrases.
We'll use a Markov chain to generate new words based on a corpus of sample words — this way we can generate words with whatever 'flavour' we like, just by tailoring the input data.
You can find the code for this project on GitHub.
We actually don't need very much code at all — this is the full listing for our Markov generator:
import random

class WordGenerator:

    def __init__(self):
        # Maps each pair of characters to a list of every character
        # that has followed that pair in the sample words.
        self.table = {}
        # Every starting pair of characters seen in the sample words.
        self.start = []

    def add_word(self, word):
        size = len(word)
        if size < 3:
            return
        self.start.append((word[0], word[1]))
        # Record each overlapping triple of characters: the first two
        # form the key, the third is appended to that key's list.
        for i in range(0, size - 2):
            c1, c2, c3 = word[i], word[i + 1], word[i + 2]
            self.table.setdefault((c1, c2), []).append(c3)
        # Mark the final pair of characters as a possible end-point.
        c1, c2 = word[size - 2], word[size - 1]
        self.table.setdefault((c1, c2), []).append(None)

    def get_word(self, max_len=12):
        c1, c2 = random.choice(self.start)
        word = [c1, c2]
        while len(word) < max_len:
            # Choose the next character from the list associated with
            # the current pair; None means we can stop here.
            next_char = random.choice(self.table[(c1, c2)])
            if next_char is None:
                break
            word.append(next_char)
            c1, c2 = c2, next_char
        return ''.join(word)
So how does it work? We initialize a generator instance and feed in sample words one at a time:
generator = WordGenerator()
generator.add_word("hotel")
generator.add_word("host")
The generator uses these sample words to populate a lookup table, associating each pair of characters in the input with a list of all the characters which have followed that pair.
For example, after adding the single word "hotel" to the generator, its lookup table looks like this:
(h, o) --> [t]
(o, t) --> [e]
(t, e) --> [l]
(e, l) --> [None]
Each key is a tuple of two characters; each value is a list of characters. We use the special value None to record when a pair of characters marked the end of a word.
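You can verify this in the REPL by adding a word and printing the table directly:

>>> generator = WordGenerator()
>>> generator.add_word("hotel")
>>> generator.table
{('h', 'o'): ['t'], ('o', 't'): ['e'], ('t', 'e'): ['l'], ('e', 'l'): [None]}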
If we add the words "host", "home", and "hostel" to the generator, its table will expand to look like the following:
(h, o) --> [t, s, m, s]
(o, t) --> [e]
(t, e) --> [l, l]
(e, l) --> [None, None]
(o, s) --> [t, t]
(s, t) --> [None, e]
(o, m) --> [e]
(m, e) --> [None]
(You might have noticed that our algorithm can add duplicate entries to a list — this is deliberate. The number of times a character appears in a list reflects its relative frequency in the input data and determines its probability of being selected for output.)
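Here's a quick way to see that weighting in action. In the table above the list for (h, o) contains 's' twice, so sampling from it should return 's' roughly half the time:

import random
from collections import Counter

# 's' appears twice in the list, so random.choice picks it with
# probability 2/4; 't' and 'm' each get 1/4.
followers = ['t', 's', 'm', 's']
counts = Counter(random.choice(followers) for _ in range(10_000))
print(counts)  # e.g. Counter({'s': 5012, 't': 2521, 'm': 2467})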
The more words you add, the higher the quality of your output will be. You can safely add thousands of words to a generator — while I was testing this code I added a list of more than 450,000 English words to a test generator and my laptop chewed through the list in less than three seconds.
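Loading a big word list is straightforward. Here's a minimal sketch, assuming a plain-text file with one word per line (words.txt is a stand-in filename):

generator = WordGenerator()

# Feed every line of the file into the generator as a sample word.
with open("words.txt") as f:
    for line in f:
        generator.add_word(line.strip().lower())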
To generate an output word we pick a starting pair of characters at random, then select the next character by choosing randomly from the list of values associated with that pair. We continue appending characters in this way until we reach the maximum length limit or we select a None value from a list, indicating that we've reached a natural end-point for the word.
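Once a generator has been loaded with samples, producing candidates is just a loop:

for _ in range(10):
    print(generator.get_word(max_len=10))

Note that max_len is a hard cap: a word that hits it is cut off mid-stream rather than ending at a natural stopping point.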
So what kind of output do we get? It's mostly junk of course but with the occasional jewel. I tried feeding in a list of roughly 6,500 English nouns as input and got the following output:
gostion
ovessy
bandal
shoption
repatity
whiver
boation
phydry
snodrocash
choment
scry
totent
furger
sner
pookeratry
hummist
I think I've seen emails in my spam folder about investing in the Snodrocash blockchain.
Feeding in a list of more than 25,000 personal names gave me the following output:
Yonal
Gurtov
Denris
Suser
Frena
Ari
Aghudia
Artxoang
Zsum
Neerislaa
Zerguona
Janciery
Nia
Myrall
Valdorina
Walim
That's the lineup for a best-selling fantasy novel right there. (Okay, maybe a bottom-shelf also-ran with a dodgy cover illustration, but still.)
You might be concerned that the size of our generator instance is unbounded — if we keep appending characters every time we add a word then the generator will keep getting bigger and bigger without limit.
Fortunately we can fix this issue (at the cost of adding some extra complexity) by recording character frequencies instead of simply storing the characters themselves.
I'm not going to list the code for this here but you can find it in the ImprovedGenerator class in the repository.
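The repository version is the one to use, but the core idea is easy to sketch: map each pair of characters to a count of its followers instead of a list with duplicates, and let random.choices do the weighted selection. Here's a minimal illustration of that idea (not the ImprovedGenerator code itself, which differs in detail):

import random
from collections import Counter

class FrequencyWordGenerator:
    # Sketch of the frequency-counting idea: each table entry maps a
    # pair of characters to a Counter of followers, so the table stays
    # bounded no matter how many times the same words are added.

    def __init__(self):
        self.table = {}         # (c1, c2) -> Counter of next chars (None = end)
        self.start = Counter()  # starting pair -> count

    def add_word(self, word):
        size = len(word)
        if size < 3:
            return
        self.start[(word[0], word[1])] += 1
        for i in range(size - 2):
            key = (word[i], word[i + 1])
            self.table.setdefault(key, Counter())[word[i + 2]] += 1
        key = (word[size - 2], word[size - 1])
        self.table.setdefault(key, Counter())[None] += 1

    def get_word(self, max_len=12):
        # random.choices accepts weights, so the counts give us the same
        # relative frequencies as the list-based version.
        pairs = list(self.start)
        c1, c2 = random.choices(pairs, weights=self.start.values())[0]
        word = [c1, c2]
        while len(word) < max_len:
            counts = self.table[(c1, c2)]
            next_char = random.choices(list(counts), weights=counts.values())[0]
            if next_char is None:
                break
            word.append(next_char)
            c1, c2 = c2, next_char
        return ''.join(word)

Because each entry stores a fixed set of counts rather than an ever-growing list, the table's size is bounded by the number of distinct character pairs and their followers, however many words we feed in.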
You can find the code for this project on GitHub, along with a bunch of sample word lists.