Ladies and Gentlemen,
This week, I would like to introduce the amazing blog-writing computer machine. This machine is based on a recurrent neural network (RNN), a machine learning model that looks at one input at a time and remembers what it has seen before when it looks at the next input.
To use this on text, one can build the RNN to predict the next word based on the previous words. That is, if it sees “A microcosm of sorrow is me” it might predict that the sentence is over and needs a period, or perhaps an exclamation point. This is known as a language model. To generate language, we first train a language model as above on existing text, then we use it generatively: when the model predicts the next word, it reads that word back in as input and predicts what would come after it.
Say we start with “The.”
The network would see it and decide “cat” is the most likely next word. Then we have “The cat.” The network then looks at “cat,” remembering that it saw “The” earlier, and predicts “sat,” giving us “The cat sat.” This is the deterministic version, which always selects the most likely word and therefore will always give the same result. Soon I will explain how we can generate a variety of blog entries.
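The loop described above can be sketched in a few lines of Python. This is only an illustration, not the AutoBlog's code: instead of an RNN it uses a stand-in model that counts which word follows which in a tiny made-up corpus, and the names `train_bigram_counts` and `generate_greedy` are invented for this example. The generation loop has the same shape either way: pick the most likely next word, append it, feed it back in.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each word follows each other word (a stand-in for the RNN)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate_greedy(counts, start, max_words=10):
    """Deterministic generation: always append the single most likely next word."""
    words = [start]
    for _ in range(max_words - 1):
        followers = counts.get(words[-1])
        if not followers:
            break  # the model has never seen anything follow this word
        next_word = followers.most_common(1)[0][0]
        words.append(next_word)
        if next_word == ".":
            break  # treat a period as the end of the sentence
    return words


# Made-up toy corpus, already lowercased and split into tokens.
corpus = "the cat sat on the mat . the cat sat on the sofa . the cat slept .".split()
counts = train_bigram_counts(corpus)
print(" ".join(generate_greedy(counts, "the")))
```

Running it twice gives the same text both times, which is the point of the deterministic version. The stand-in also goes around in circles ("the cat sat on the cat sat on ...") because it only looks at the single previous word, whereas an RNN carries a memory of everything it has seen so far.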
My AutoBlog has a vocabulary of 10,000 words. Words not in its vocabulary it calls <unk>. It also ignores capitalization. Its entire understanding of English is based exclusively on the 177 entries from my new blog, so please keep that in mind when you read it. Next week I’ll add the 299 posts from my old blog and see how it improves.
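For the curious, here is a minimal sketch of how a capped vocabulary with an <unk> token might be built. It is an assumption about the general technique, not my actual preprocessing: the tokenizer regex and the function names `tokenize`, `build_vocab`, and `replace_rare` are made up for illustration.

```python
import re
from collections import Counter

UNK = "<unk>"


def tokenize(text):
    """Lowercase the text, then split it into words and single punctuation marks."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


def build_vocab(tokens, max_size=10_000):
    """Keep the most frequent tokens, reserving one slot for the <unk> token."""
    counts = Counter(tokens)
    return {UNK} | {word for word, _ in counts.most_common(max_size - 1)}


def replace_rare(tokens, vocab):
    """Swap every out-of-vocabulary token for <unk>."""
    return [t if t in vocab else UNK for t in tokens]


tokens = tokenize("The cat sat. The cat slept.")
vocab = build_vocab(tokens, max_size=4)  # tiny cap so the effect is visible
print(replace_rare(tokens, vocab))
```

With a cap of 4 on this toy input, only "the", "cat", and "." survive alongside <unk>, so "sat" and "slept" both come out as <unk>. With a cap of 10,000 and only 177 entries to learn from, it is the rarer words that get this treatment.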
This is the deterministic version of my autoblog. I have cleaned up the capitalization and removed unnecessary spaces. There’s also a bug in my preprocessing that caused some ‘ to show up as “, which I’ve corrected post-hoc. I have also abbreviated it prematurely, as honestly this AutoBlog is not good enough for me to ask people to read 1,000 words of it yet.
“I suppose this has gone on my computer.” I didn’t have a very good, but i just like this time. “We were very glad, Diane.” Said the Caretaker, but I tried to get it into the air, but I really have it up the <unk>. I tried to get the conversation. I took up to this time and pulled up in the chair so i could have to put my own phone to get a picture of my mind. In the meantime, I was not to keep the paper on my computer, and i couldn’t bring it up. I was so glad to look at the door, I realized that had very good. I had to admit that the Cleaners came up a <unk> at the door. He also gave me the whole time , and I said we should make a particularly argument. I had a lot of relief with two of the time. It was so bad that I had not tasted <unk> it . “I started to have this way to get out of you for my blog. “I was not going to my face. In an case I was not to tell my stories. I felt so <unk> that I was trying to look at it. I wasn’t going to make it “S voice , but I’m going to get the lock out of it in the house . I stopped behind him in the <unk>, but it’s just there. I had a lot of it . “Oh, I know I know the most of the most of the most time I have ever tasted. ” “You have never been free,” said Henry.
This next entry is generated randomly by sampling from the distribution of possible words. Each candidate for the next word has a probability of occurring, and it is selected with exactly that probability. You should notice right away the greater variety in this entry. There are a few instances where markup made it into the training data, and I’ve applied that markup as it would appear in the blog. A few words are italicized because of markup that the autoblog applied to them. Also, I apparently swear in my blog because at least once my Twitter censor was activated. I was originally using this model on Twitter data and showing it off at work, so naturally I would want to have a censor. Don’t ask what got censored. I don’t know.
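To make the difference from the deterministic version concrete, here is a small Python sketch of the sampling step. The probabilities are invented for the example (a real model would produce one probability per vocabulary word at every step), and `sample_next_word` is a hypothetical helper, not part of the AutoBlog.

```python
import random


def sample_next_word(probs, rng):
    """Pick a next word at random, weighted by the probability the model gave it."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Made-up next-word distribution after seeing "the".
probs = {"cat": 0.6, "dog": 0.3, "carrots": 0.1}

# The deterministic version always takes the most likely word.
greedy_choice = max(probs, key=probs.get)

# The random version draws from the distribution, so unlikely words sometimes win.
rng = random.Random(0)  # fixed seed only so the example is repeatable
draws = [sample_next_word(probs, rng) for _ in range(1000)]
print(greedy_choice, {w: draws.count(w) for w in probs})
```

Over many draws, "cat" still shows up most often, but "carrots" gets its turn roughly one time in ten. Chain enough of those unlikely picks together and you get the kind of entry that follows.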
Carrots, and deadline recurring touch audition the first times generate code very sat, we had a cheerier of disgusting ringbearer on the danger, and I was screams about the ramping which machine copper on the receipts part of each room. I dragging gazebo, and knit harmless quittings into my enigmatic plant, but parent educated arrival everybody has a () resides to better was . Receding I’m despair <unk> my ornamentation and Pelor made my [censored] to powder” Resume it is never miles, “slowly paunch opener. Got my repurposed and autocomplete for her main yard.” Let me using you in the second line, customers?” I called. “It given mayonnaise you laughing trouble recordings a you 500 services.” Hyland was designed to endure technological in Mike’s, but surprised that oratory checked it was mean if it was usual or who was AC or but largest relented was casual, but that assured a bets to retains returning . If you Kohen’s clustered missing condition, address?” Salem a 0.01% spaces necessary from the dislodge working like the freshly time relationships his eezzal 1.1 a brother’s 1280. Selection, Henry’s globe is a essentially informal poem not a portends, yet. They just have crunched right watching the 20, playing eezzal particularly Brad that the 1278 numbered saturated violence on Amazon sort. the Anti-cleaners mamas that continue treat the emotionally would solids for turn to tell their being link.
Let’s start with the successes. “‘You have never been free,’ said Henry,” does not occur anywhere in the training data. I was so impressed by that line that I checked to be sure. There’s a lot of good placement of open and close quotes in the deterministic version. 177 blog entries is really a very small training set. With more training data we can expect to see more improvement. One question I have is whether we can use training data from other sources to help inform my autoblog without diluting the style, or at least while keeping the dilution to a minimum.
Now we can address the elephant in the room. My language model doesn’t generate much sense. This places it in the same echelon as the “Sunspring” script. Google has managed to make sensible translations from one language to another using RNNs, but coming up with an idea and communicating it in a way people will understand is not something that computers can do yet. Really, I think it won’t be so hard. It’s just a simple matter of <unk>.
<end of entry>