I had this grand, exciting plan. A normal Markov generator trains on a corpus of text, and sees what words are likely to come after other words. So for example, given “Happy families are all alike,” it would say, “Happy families” —> “are”, “families are” —> “all”, “are all” —> “alike”. You can then start with any random seed and populate a whole chunk of however much text you please by going through your correspondence and picking at random: “hmm, what words might possibly come after ‘are all’?”
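The core of that scheme fits in a few lines. Here's a minimal sketch of a standard order-2 Markov generator (function and variable names are mine, not from the actual project):

```python
import random
from collections import defaultdict

def train(words, order=2):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model[key].append(words[i + order])
    return model

def generate(model, seed, length=20):
    """Walk the model from a seed key, picking each next word at random."""
    out = list(seed)
    key = tuple(seed)
    for _ in range(length):
        choices = model.get(key)
        if not choices:
            break  # dead end: this key never appeared mid-corpus
        nxt = random.choice(choices)
        out.append(nxt)
        key = key[1:] + (nxt,)
    return " ".join(out)

words = "happy families are all alike every unhappy family is unhappy".split()
model = train(words)
# model[("happy", "families")] holds ["are"], and so on down the corpus
```

Because duplicated keys accumulate multiple successors, picking uniformly at random from the list automatically respects word frequency.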
I was going to take it one step further and incorporate part-of-speech tagging. The above example sentence would be part-of-speech tagged, “adjective/plural noun/verb non-3rd person singular present/determiner/adverb.” Then I could make a correspondence: adjective + plural noun may be followed by a non-3rd person singular present verb, and so on. And when I had a whole paragraph or two worth of “adjective noun verb” etc. I could begin to fill that in from my word correspondence: “What word might come after ‘families are’ that is also a noun?”
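The tag-level correspondence works exactly the same way, just over tags instead of words. A sketch, using hand-supplied Penn Treebank-style tags as a stand-in for nltk's `pos_tag` output (the tagging itself is assumed, not shown):

```python
from collections import defaultdict

# Hand-tagged tokens standing in for nltk.pos_tag(...) output.
# JJ = adjective, NNS = plural noun, VBP = non-3rd-person singular
# present verb, DT = determiner, RB = adverb.
tagged = [("Happy", "JJ"), ("families", "NNS"), ("are", "VBP"),
          ("all", "DT"), ("alike", "RB")]

def train_pos(tagged_tokens, order=2):
    """Map each pair of consecutive tags to the tags seen after it."""
    model = defaultdict(list)
    tags = [t for _, t in tagged_tokens]
    for i in range(len(tags) - order):
        model[tuple(tags[i:i + order])].append(tags[i + order])
    return model

pos_model = train_pos(tagged)
```

With this in hand you can generate a skeleton of tags first, then try to fill each slot from the word correspondence.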
It was brilliant! I was excited! And, as you’ve probably guessed, things didn’t turn out the way I had planned. I didn’t quite consider that one of my correspondences was much more general than the other, so if I generated my parts-of-speech beforehand and then went looking for a word that can follow words a + b and is also a (fill in part of speech here), I wouldn’t necessarily find anything! And then I’m reduced to filling in a random (part of speech), which leaves me no better off than a normal Markov generator, and perhaps even worse (see my last two posts for comparison).
So, my whole project idea is a bit shot in the foot. There’s one more thing I can try, which is generating the parts-of-speech and actual-words versions of my text pretty much concurrently. Given Word A and Word B, what’s a part-of-speech that might follow part-of-speech(Word A) and part-of-speech(Word B)? Now, what’s a word that might follow Word A and Word B that is a (part of speech)? And, y’know, even as I type this I’m not positive that it makes any sense whatsoever, and will be any better than a standard-issue Markov generator. (I’ll need to think this through more carefully when I haven’t had a beer.)
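For what it's worth, that concurrent scheme can at least be sketched. This is my reading of the idea, not a tested design: at each step, pick a tag that can follow the current tag pair, then prefer a successor word carrying that tag, falling back to a plain Markov step when no word matches (the toy corpus and tags below are illustrative):

```python
import random
from collections import defaultdict

# Toy tagged corpus; the tags are illustrative stand-ins for real tagger output.
tagged = [("happy", "JJ"), ("families", "NNS"), ("are", "VBP"),
          ("all", "DT"), ("alike", "RB"),
          ("unhappy", "JJ"), ("families", "NNS"), ("are", "VBP"),
          ("each", "DT"), ("different", "JJ")]

word_model = defaultdict(list)  # (word_a, word_b) -> [(next_word, next_tag), ...]
tag_model = defaultdict(list)   # (tag_a, tag_b) -> [next_tag, ...]
for i in range(len(tagged) - 2):
    (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
    word_model[(w1, w2)].append((w3, t3))
    tag_model[(t1, t2)].append(t3)

def next_word(w1, t1, w2, t2):
    """Pick a tag that can follow the tag pair, then a word that follows the
    word pair AND carries that tag; fall back to any following word."""
    word_choices = word_model.get((w1, w2), [])
    if not word_choices:
        return None
    tag_choices = tag_model.get((t1, t2), [])
    if tag_choices:
        tag = random.choice(tag_choices)
        matching = [wt for wt in word_choices if wt[1] == tag]
        if matching:
            return random.choice(matching)
    return random.choice(word_choices)  # degrade to a plain Markov step
```

The fallback branch is exactly the failure mode described above: whenever the tag constraint can't be satisfied, this is just a standard-issue Markov generator with extra steps.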
This whole debacle raises the question: when is it time to give up? When do I dump this project and say, “It’s been fun, but that thing I wanted to do is no longer feasible, I’m going to move on to something else”? If my grand plan has been foiled, do I leave the project altogether? Do I struggle to implement some halfway execution of my original plan that doesn’t quite make sense? Do I do something else vaguely useful with the project, but in another direction entirely? I’m a little curious to, say, look at n-grams of increasing n’s—that is, look at bigger and bigger groups of words at a time—and see where the trade-off is between intelligibility and just reproducing the original text. Another interesting path to take would be implementing a custom data structure that could weight common words, rather than having the word in question show up multiple times for a given key according to its frequency. But none of that is the stuff I came into this project excited about doing.
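That second path, weighting rather than repetition, is a small change in practice. A sketch of what I mean, using `collections.Counter` for the counts and `random.choices` (which accepts weights) for sampling; the toy corpus is illustrative:

```python
import random
from collections import Counter, defaultdict

# Instead of storing "cat" twice under the key "the", store a count of 2
# and sample with weights -- same distribution, less memory on big corpora.
model = defaultdict(Counter)
words = "the cat sat on the mat the cat ran".split()
for i in range(len(words) - 1):
    model[words[i]][words[i + 1]] += 1

def sample(key):
    """Draw a successor word with probability proportional to its count."""
    successors, counts = zip(*model[key].items())
    return random.choices(successors, weights=counts, k=1)[0]
```

(This uses an order-1 key for brevity; the same structure works with tuple keys of any n.)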
Was this all a waste of time? No, I don’t think so. I learned about serializers and thought a bit about program optimization for the first time (even though I figured out that all that was bloating my run-time was nltk’s incredible but super-slow part-of-speech tagger). I got yet more practice writing in Python, made my own classes, played in REPLs, made my first generator function, and learned the hard way that new instances of objects, even if they seem identical to other instances of that same object, are NOT IN FACT IDENTICAL. (I may or may not have spent my entire morning learning that one.) Plus, I learned how smoothly things can run if you map out your program beforehand (and, conversely, how borked they can get if you don’t).
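(For anyone who hasn't hit that last gotcha yet, the distinction is equality versus identity, and it bites hardest when you expect two freshly built but equal-looking objects to be the same thing:)

```python
a = [1, 2, 3]
b = [1, 2, 3]
assert a == b        # equal values...
assert a is not b    # ...but two distinct objects
assert a is not [1, 2, 3]  # a new literal is yet another object
```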
I’m still a little bummed that I won’t have something shiny to show for this at the end. Well, I’ll have my simple Markov gen, but nothing all that original, and I’ve sunk a week into this project. But enumerating the things I’ve actually learned from this hot mess of a project has persuaded me somewhat, so I feel a little better. (Plus, a Markov generator is still a pretty damn good party trick!)