Random Words
Random Letters
You may think it easy to create random words ... just pick letters randomly and put them together, and voila! a random word.
Well, here are 10 words made that way:
muyfd ighgd xhlng oyecn vjnsl ssjrx gxald tukxj rvfoq yxzxq
It turns out that the words are not only nonsense, but quite hard to pronounce!
(Try saying "tldkl" or "oewkx")
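For readers who want to reproduce this, here is a minimal Python sketch of the "pick letters randomly" idea. The function name `random_word` and the default length of 5 are choices made for this illustration, not something taken from the original page.

```python
import random
import string

def random_word(length=5, rng=random):
    """Build a word by picking each letter uniformly from a-z."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

# Print ten random 5-letter "words" (different on every run).
print(" ".join(random_word() for _ in range(10)))
```

Every letter has the same 1-in-26 chance here, which is why combinations like "xq" turn up just as often as "th".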
You see, making a real word this way is very unlikely ... you would have to try lots of random combinations before getting lucky.
Why? Well, English has around 200,000 words (228,000 in the Oxford English Dictionary including many words no longer used) ... but how many different words can be made with just 5 letters?
26 × 26 × 26 × 26 × 26 = 11,881,376 possible 5 letter words!
And that is just the 5 letter words ...
Let us guess that there are 40,000 words in English that have 5 letters. So the probability of making a real word just randomly would be:
40,000 / 11,881,376 = 0.003, or about 0.3% chance
So real words are rare. And we can see that putting random letters together is very unlikely to produce a real word.
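The arithmetic above can be checked with a few lines of Python. Note that the 40,000 figure is the guess made in the text, not a measured count, and the variable names are just for this sketch.

```python
ALPHABET_SIZE = 26
WORD_LENGTH = 5
REAL_WORDS_GUESS = 40_000  # the guess from the text, not a measured count

# Number of different 5-letter strings: 26 x 26 x 26 x 26 x 26
possible = ALPHABET_SIZE ** WORD_LENGTH

# Chance that one random string happens to be a real word
probability = REAL_WORDS_GUESS / possible

print(f"{possible:,} possible words")
print(f"chance of a real word: {probability:.1%}")
```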
Vowels
We can improve our success by insisting that a word have at least one vowel, since nearly every word in English has one (except fly, by and a few others). Like this:
fnevz ewxko ljgew aglgo jpfoq dcytu uwkcj dzioy wekdx xuybk
This is a great improvement. More words can be pronounced.
But there are still lots of strange words like "zspsu" and "xuybk"
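One simple way to insist on a vowel is to keep drawing random words until one contains a vowel. This is only a sketch under assumptions: the page does not say how its words were made, `random_word_with_vowel` is a name invented here, and "y" is not counted as a vowel.

```python
import random
import string

VOWELS = set("aeiou")  # assumption: "y" is not treated as a vowel

def random_word_with_vowel(length=5, rng=random):
    """Draw uniform random words, retrying until one contains a vowel."""
    while True:
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if VOWELS & set(word):  # at least one letter is a vowel
            return word

print(" ".join(random_word_with_vowel() for _ in range(10)))
```

Most random 5-letter strings already contain a vowel, so the loop rarely needs more than a couple of tries.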
Letter Frequency
So, our next improvement is to use fewer of the letters like j, x, z and q and more of the letters like e, t and s.
In fact the frequency of letters in the English Language is well known. Here is how many times you would expect to see a letter in every 1,000 letters:
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
82 | 15 | 28 | 42 | 127 | 22 | 20 | 61 | 70 | 2 | 8 | 40 | 24 | 67 | 75 | 19 | 1 | 60 | 63 | 90 | 27 | 10 | 24 | 2 | 20 | 1 |
Can you see that "e" is common, but "z" is rare?
- "e" is likely to occur 127 times in every 1,000, or as a ratio 127/1000 = 0.127 (= 12.7%)
- "z" is likely to occur only 1 time in every 1,000, or as a ratio 1/1000 = 0.001 (= 0.1%)
So, by selecting letters based on that frequency (a bit like rolling a 1,000-sided die that has 82 a's, 15 b's ... and only one z), we can get output like this:
dmswo dpuoh eewis ebdni laarm syucs idvos lhina igahh soyie
Still no real words, but some are close. And most of them can be pronounced. (Great names if you are writing a science fiction novel!)
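The "1,000-sided die" can be sketched in Python with `random.choices`, which picks items according to weights. The weights below are copied from the table above; the names `PER_THOUSAND` and `weighted_word` are made up for this example.

```python
import random
import string

# Occurrences per 1,000 letters, in a-z order, copied from the table above.
PER_THOUSAND = [82, 15, 28, 42, 127, 22, 20, 61, 70, 2, 8, 40, 24,
                67, 75, 19, 1, 60, 63, 90, 27, 10, 24, 2, 20, 1]

def weighted_word(length=5, rng=random):
    """Pick each letter with probability proportional to its frequency."""
    letters = rng.choices(string.ascii_lowercase, weights=PER_THOUSAND, k=length)
    return "".join(letters)

print(" ".join(weighted_word() for _ in range(10)))
```

Each letter is still chosen independently of its neighbours, so odd pairs such as "hh" can still appear.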
Try For Yourself!
You can try all three methods here ... see if you can get lucky and find a real word:
But we can do better ...
2-Letter Frequencies
We can take the idea of Letter Frequency one step further by asking:
"What is the frequency of letters that follow another letter?"
For example, if we already have a "t", the next letter is very likely to be an "h" (making "th").
To illustrate this, I built up a Table of Two-Letter Frequencies (from Alice's Adventures in Wonderland).
Here is the line for "t":
Freq | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
t | 238 | 41 | 727 | 11 | 3197 | 459 | 275 | 18 | 12 | 990 | 149 | 153 | 333 | 125 | 65 | 54 |
So, "h" occurred 3197 times after a "t" ("th") ... but "b" never followed a "t"
OK, let us start with a "t", and let us say we choose an "h" to make "th", then next we would use the "h"-row to choose another letter (maybe an "e" to make "the"), and so on ... well, here is a sample:
soacthake d imon binofowat oaten d heng wa
The results are remarkable ... nonsense, but almost like some strange language.
In fact we are not just making random words now, we are making random sentences!
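Here is a rough Python sketch of the two-letter idea: count which character follows which, then walk from row to row. It is an illustration under assumptions, not the code behind the page. The short `SAMPLE` string stands in for the full text of Alice's Adventures in Wonderland, spaces are counted like any other character, and the function names are invented here.

```python
import random
from collections import Counter, defaultdict

def build_pair_table(text):
    """Count how often each character follows each other character."""
    table = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        table[first][second] += 1
    return table

def generate_from_pairs(table, start, length, rng=random):
    """Grow a string by choosing each next character from the last one's row."""
    out = start
    for _ in range(length - 1):
        row = table.get(out[-1])
        if not row:  # nothing ever followed this character: stop early
            break
        followers = list(row)
        counts = list(row.values())
        out += rng.choices(followers, weights=counts, k=1)[0]
    return out

# Placeholder text; a whole book gives far more interesting tables.
SAMPLE = "the cat sat on the mat and then the hat"
table = build_pair_table(SAMPLE)
print(table["t"].most_common(3))  # the most frequent followers of "t"
print(generate_from_pairs(table, "t", 30))
```

Because a space is just another character in the table, the output breaks itself into "words" without any extra work.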
Higher Letter Frequencies
Why stop there? We can make tables of three letter frequencies or more ...
3 Letter Frequencies
How do 3 Letter Frequencies work?
Well, say I already have two letters (like "ei") ... we then:
- look through the sample text for every time "ei" appears,
- randomly choose one of those
- look for the letter following "ei" (possibly "t").
- then add the "t" to make "eit"
- and start again using "it" (... always the last two letters)
Here is a sample:
to wondere started into the book about hear!
Now, that looks good! By sampling from a real source we can get good results.
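The steps listed above (find every occurrence, pick one at random, take the letter that follows, then slide along) can be sketched in Python like this. The `SOURCE` sentence is a made-up placeholder rather than text from Alice, and `next_letter` and `generate_ngram` are names chosen for this example.

```python
import random

def next_letter(source, context, rng=random):
    """Pick one random occurrence of `context` and return the letter after it."""
    positions = []
    start = source.find(context)
    while start != -1:
        follow = start + len(context)
        if follow < len(source):  # skip an occurrence at the very end
            positions.append(follow)
        start = source.find(context, start + 1)
    if not positions:
        return None
    return source[rng.choice(positions)]

def generate_ngram(source, seed, length, rng=random):
    """Extend `seed`, always using the last len(seed) letters as context."""
    out = seed
    while len(out) < length:
        letter = next_letter(source, out[-len(seed):], rng)
        if letter is None:  # context never continues in the source
            break
        out += letter
    return out

# Placeholder source text; swap in any longer text you like.
SOURCE = "either their neighbour or the heir received eight reindeer"
print(generate_ngram(SOURCE, "ei", 40))
```

The length of the seed sets the context size: a 2-letter seed gives 3 Letter Frequencies, a 3-letter seed gives 4 Letter Frequencies, and so on. Longer contexts copy longer runs of the source, which is why the output starts to look like real sentences.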
4 Letter Frequencies
Using the same method, I used groups of 3 letters to decide on the 4th letter and got:
happen next. First, she look down mind
5 Letter Frequencies
And with 5 Letter frequencies:
but to take out of time as she had not like to do
Try For Yourself!
Yes, I wrote something for you to play with. It has the first 6 paragraphs from Alice's Adventures in Wonderland, but you can put your own source text in there.
Find something from Shakespeare, or a political speech and see what it comes up with ... you could even combine quotes from different authors to see what their children might write.
And Beyond
What if we were to take an entire encyclopedia, and choose not just sequences of letters, but of word fragments? And they don't have to be in order, just near each other.
Would it Generate a good response using Pretrained data, by Transforming it?