mctext – Using Markov Chains to Generate Text

mctext is a new project of mine that generates text using Markov chains. This little utility reads a sample text file, preferably a large one, and generates new text based on the word sequences found in that sample.

How does it work?

mctext reads the given file and treats it as a list of words. It first picks two adjacent words at random and puts them in the output string. From there, text generation continues as a Markov chain: mctext takes the last two words of the output string and finds all the words that follow that pair in the sample file. It then chooses one of those words at random and appends it to the output. This process repeats until enough new text has been generated.
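
The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not mctext's actual code: the function names build_table and generate are made up for this example, and two details are assumptions of mine, namely that a follower seen more often is more likely to be picked, and that generation simply stops if the current pair has no follower in the sample.

```python
import random
from collections import defaultdict


def build_table(words):
    """Map each pair of adjacent words to the list of words seen right after it."""
    table = defaultdict(list)
    for first, second, following in zip(words, words[1:], words[2:]):
        # Duplicates are kept, so frequent followers are picked more often
        # (an assumption of this sketch).
        table[(first, second)].append(following)
    return table


def generate(words, count, rng=random):
    """Return up to `count` words produced by a two-word Markov chain."""
    if len(words) < 3 or count < 2:
        # Too little input, or too few words requested, to chain anything.
        return list(words[:max(count, 0)])
    table = build_table(words)
    # Seed the output with a random pair of adjacent words from the sample.
    start = rng.randrange(len(words) - 1)
    output = [words[start], words[start + 1]]
    while len(output) < count:
        followers = table.get((output[-2], output[-1]))
        if not followers:
            # The pair has no follower in the sample (it only occurs at the
            # very end), so this sketch stops early.
            break
        output.append(rng.choice(followers))
    return output


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat ran to the mat".split()
    print(" ".join(generate(sample, 10)))
```

Every run of three consecutive words in the result also appears somewhere in the sample, which is why the output reads as locally plausible even when the whole sentence makes no sense.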

For example, here is the program's output when given 500 posts from TechCrunch as sample text:

$ ./mctext -w 100 tc.txt
declined to name a specific position on the internet is now 
extended through the birth of high velocity P2P file sharing
and broadcasting short experiences, thoughts and fantasies.
By that we can look forward to seeing everyone. Loic Le Meur,
a well of useful contextual information that would be complete
without a second's hesitation. DonorsChoose Doing Well, But
Fred Wilson and the

(tc.txt was the file holding the text of the 500 posts)

Compiling and Using mctext

If you want to try it yourself, download the source package from here. Compiling is pretty straightforward (./configure && make). The program depends on the Boost libraries. Some Linux distributions package the additional Boost libraries separately from the core ones; if that is the case for yours, you will need to install the program-options library.

Invoking the program is simple. Just pass it the sample text file as an argument and use the -w NUM flag to specify how many words you want it to generate. mctext can also take the sample text from stdin. See mctext --help for more information.

mctext is a new project, and the current implementation is a proof of concept. As such, there is still a lot to improve and look forward to. For the next version, I’m planning to allow changing the number of words considered at each step from the command line, as well as improving sentence recognition. If you find a bug or have a suggestion for a new feature, I will be glad to hear about it.

Update: I’ve released a new version of mctext.

