Tag Archives: Language Models

AutoBlog 2: Adding the Old Blog

I have now added my old blog, 300 entries from 2008 to 2013. I include here results from models trained on my old blog, my new blog, and my full blog (old and new together). Since the full blog roughly doubles the training data, we should expect the full-blog models to do better than the previous ones. The short entries below are all generated probabilistically, meaning the network produces a probability distribution for the next word and the generator selects from it randomly according to that distribution.

I have included two versions of the model for each of these data sources. Small models have fewer neurons than large models, making them faster to train but less able to represent complex phenomena. I haven’t fixed the formatting manually this time.

Small Models

Old Blog

switched .

You’ll notice that this one is unusually short. Generally models trained on my blog get to the arbitrary word limit before they predict an “end of entry” tag. This one is an exception, and a notable one at that. I doubt it’s representative of my old blog, except that I tended to write shorter entries.

New Blog

demeanor dissertation perfection with recycle , has , shared an even shuffled crick to make investment that they can sore . it five-year-old if a crime snapped is a human pinch under a producing dressings . as if each violation is full of the roiling https://www.youtube.com/watch?v=m78gyytrg7y we feminist how i can present . each brags of backside discovered rainbow margin that seems soundly . flavor , and this motivate , choose that dairy , also lifeless . without real moment , though is narrowing rationalizing and carson street cleaning people need to make your wishes because todd has bizarre such

The “roiling” YouTube link points to an unavailable video. I should figure out how that got past the preprocessing step. Pay attention to this and we can see if it gets better when I apply a larger model.

Full Blog

donations , <unk> i’m never chuck on facebook and ime mountain angry donned . but too , so i’m , old <unk> at the office . wouldn’t just , but the most induction people should get off inviting of cube into us peace on human overhaul and peak the japanese country and finally and their sent me . evidently quintessential children again , christmas , ” and and lived , and effects behind to us trouble . ” diane , you can go claim from significant hero how to make there !

For the small model, training on both blogs together didn’t make the output as much more sensible as I had expected.

Large Models

Old Blog

dejected , i have moisture results out of boss disease and expo for my spark horrible dispensers . unfortunately , i offhandedly decided my elaborate retreat to my hats . weekend , i important lose locate lost lee’s and an slew of a sledgehammer with the narrative partially for the side . elliot foolishly my mishaps and challenged one’s blend acquaintances complaining that i could re-read my relief . in the woo , the quietest organization in dirt representing following consciousness , implication . i meanings [censored] corporations , i drank pop graphic break and debug fit , so batches

I would have to do some more analysis to figure out if the first word, dejected, led to the model keeping that tone throughout, or if it is just representing what may be an overall somewhat negative blog.

New Blog

dumped , wouldn’t overhaul the reinvent the painstaking introduced up of awful day . lower hidden forty-eight resounding and fiction is moments next , like the time for the distributing repurposed and note on diane .

No dramatic improvement here with the larger model.

Full Blog

i’ve been videos to application these pauses to towers in goodwill . i again , my opening appeared ever heard ever blowing since i texted my re-read . i don’t remember the clumps story . in the address i sol forgot some brahe and junior press every exam . explain you tearer . ” cabin-mates xeon , ” what it is good , should alone anthropomorphic language , ” secret goading , ] what i had releases worst as i rely its message to torn up many grandma , and the wider tacos was delay on slogan . tried to

So, doubling the data did not have a noticeable effect. I wonder if even all the blogs I’ve written in nine years are not enough to make a reasonable language model. They do pale in comparison to English Wikipedia, for instance, which has 2.9 billion words to my blog’s paltry 240,000. Excessive randomness in the probabilistic model could be another weakness. Other approaches to generative models modify the sampling distribution so that likely words appear more often, without making the generator completely deterministic.
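One common way to make likely words appear more often while still sampling randomly is temperature scaling. Whether that is the technique those other approaches use is an assumption on my part, and the function name and toy distribution below are invented for illustration. The sketch raises each probability to the power 1/T and renormalizes: T below 1 sharpens the distribution, T equal to 1 leaves it unchanged, and T above 1 flattens it.

```python
def apply_temperature(dist, temperature):
    """Reshape a next-word distribution (hypothetical helper).

    dist maps each candidate word to its probability.
    temperature < 1 favors likely words; temperature > 1 evens things out.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Raise each probability to 1/T, then renormalize so they sum to 1.
    scaled = {word: p ** (1.0 / temperature) for word, p in dist.items()}
    total = sum(scaled.values())
    return {word: s / total for word, s in scaled.items()}


# Toy distribution, invented for this example.
toy = {"cat": 0.6, "dog": 0.3, "emu": 0.1}
sharper = apply_temperature(toy, 0.5)   # "cat" becomes more likely
flatter = apply_temperature(toy, 2.0)   # "emu" gets more of a chance
```

As the temperature approaches zero this behaves more and more like always picking the single most likely word, so it gives a dial between fully random and fully deterministic generation.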


The AutoBlog

Ladies and Gentlemen,

This week, I would like to introduce the amazing blog-writing computer machine. This machine is based on a recurrent neural network (RNN), which is a machine learning model that looks at one input at a time, remembering what it has seen before when it looks at the next input.

To use this to analyze text, one can build the RNN to predict the next word based on the previous words. That is, if it sees “A microcosm of sorrow is me” it might predict that the sentence is over and needs a period, or perhaps an exclamation point. This is known as a language model. To generate language, we first train a language model as above on existing text, then use it as a generative language model. A generative language model predicts the next word, then reads that word as input and predicts what would come after it.

Say we start with “The.”

The network would see it and decide “cat” is the most likely next word. Then we have “The cat.” The network then looks at “cat,” remembering that it saw “The” earlier, and predicts “sat,” giving us “The cat sat.” This is the deterministic version, which always selects the most likely word and therefore will always give the same result. Soon I will explain how we can generate a variety of blog entries.
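The deterministic loop described above can be sketched in a few lines of Python. This is not the AutoBlog's code: the lookup table below is an invented stand-in for a trained RNN, and the function names are hypothetical. It only shows the pattern of predicting a word, appending it, and feeding the longer history back in.

```python
# Invented stand-in for a trained RNN: maps the words seen so far to a
# probability distribution over the next word. A real network would
# compute this from its internal memory instead of a lookup table.
TOY_MODEL = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {".": 0.9, "down": 0.1},
}

END = "<end of entry>"


def next_word_distribution(history):
    # Unknown histories predict the end of the entry with certainty.
    return TOY_MODEL.get(tuple(history), {END: 1.0})


def generate_greedy(start, max_words=10):
    """Always pick the most likely next word (deterministic generation)."""
    words = list(start)
    while len(words) < max_words:
        dist = next_word_distribution(words)
        best = max(dist, key=dist.get)
        if best == END:
            break
        words.append(best)
    return words


print(" ".join(generate_greedy(["the"])))  # prints "the cat sat ."
```

Because the most likely word is chosen at every step, running this twice from the same starting word always gives the same entry. The `max_words` cap plays the role of an arbitrary word limit for entries that never predict an end tag.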

My AutoBlog has a vocabulary of 10,000 words. Words not in its vocabulary it calls <unk>. It also ignores capitalization. Its entire understanding of English is based exclusively on the 177 entries from my new blog, so please keep that in mind when you read it. Next week I’ll add the 299 posts from my old blog and see how it improves.
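A fixed vocabulary with an <unk> placeholder can be built roughly as follows. This is a minimal sketch under assumptions the post does not confirm: that words are split on whitespace, and that the vocabulary keeps the most frequent words. The function names are hypothetical.

```python
from collections import Counter


def build_vocab(texts, max_size=10_000):
    """Keep the max_size most frequent lowercased words (assumed rule)."""
    counts = Counter(
        word for text in texts for word in text.lower().split()
    )
    return {word for word, _ in counts.most_common(max_size)}


def replace_unknown(text, vocab):
    """Lowercase the text and swap out-of-vocabulary words for <unk>."""
    return [w if w in vocab else "<unk>" for w in text.lower().split()]


# Tiny invented corpus; with max_size=2 only "the" and "cat" survive.
vocab = build_vocab(["The cat sat", "the cat ran"], max_size=2)
print(replace_unknown("The dog sat", vocab))  # ['the', '<unk>', '<unk>']
```

Lowercasing everything before counting is what makes the model ignore capitalization, and it also means "The" and "the" share one vocabulary slot.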

This is the deterministic version of my autoblog. I have cleaned up the capitalization and removed unnecessary spaces. There’s also a bug in my preprocessing that caused some ‘ to show up as “, which I’ve corrected post-hoc. I have also abbreviated it prematurely, as honestly this AutoBlog is not good enough for me to ask people to read 1,000 words of it yet.

Deterministic AutoBlog

“I suppose this has gone on my computer.” I didn’t have a very good, but i just like this time. “We were very glad, Diane.” Said the Caretaker, but I tried to get it into the air, but I really have it up the <unk>. I tried to get the conversation. I took up to this time and pulled up in the chair so i could have to put my own phone to get a picture of my mind. In the meantime, I was not to keep the paper on my computer, and i couldn’t bring it up. I was so glad to look at the door, I realized that had very good. I had to admit that the Cleaners came up a <unk> at the door. He also gave me the whole time , and I said we should make a particularly argument. I had a lot of relief with two of the time. It was so bad that I had not tasted <unk> it . “I started to have this way to get out of you for my blog. “I was not going to my face. In an case I was not to tell my stories. I felt so <unk> that I was trying to look at it. I wasn’t going to make it “S voice , but I’m going to get the lock out of it in the house . I stopped behind him in the <unk>, but it’s just there. I had a lot of it . “Oh, I know I know the most of the most of the most time I have ever tasted. ” “You have never been free,” said Henry.

This next entry is generated randomly by sampling from the distribution of possible words. Each candidate for the next word has a probability assigned by the model, and it is selected with exactly that probability. You should notice right away the greater variety in this entry. There are a few instances where markup made it into the training data, and I’ve applied that markup as it would appear in the blog. A few words are italicized because of markup that the autoblog applied to them. Also, I apparently swear in my blog because at least once my Twitter censor was activated. I was originally using this model on Twitter data and showing it off at work, so naturally I would want to have a censor. Don’t ask what got censored. I don’t know.
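Sampling a word in proportion to its probability can be sketched with Python's standard library. The distribution here is invented for illustration and the function name is hypothetical; in the real system the probabilities would come from the network.

```python
import random


def sample_next_word(dist, rng=random):
    """Pick one word, each with the probability the model assigned it."""
    words = list(dist)
    weights = [dist[w] for w in words]
    # random.choices draws with replacement according to the weights.
    return rng.choices(words, weights=weights, k=1)[0]


# Invented distribution: "cat" should come up about 60% of the time.
toy = {"cat": 0.6, "dog": 0.3, "emu": 0.1}
rng = random.Random(0)  # seeded so repeated runs give the same draws
draws = [sample_next_word(toy, rng) for _ in range(10_000)]
print(draws.count("cat") / len(draws))
```

Unlike always taking the most likely word, this lets an unlikely word such as "emu" appear now and then, which is where the extra variety comes from.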

Stochastic AutoBlog

Carrots, and deadline recurring touch audition the first times generate code very sat, we had a cheerier of disgusting ringbearer on the danger, and I was screams about the ramping which machine copper on the receipts part of each room. I dragging gazebo, and knit harmless quittings into my enigmatic plant, but parent educated arrival everybody has a () resides to better was . Receding I’m despair <unk> my ornamentation and Pelor made my [censored] to powder” Resume it is never miles, “slowly paunch opener. Got my repurposed and autocomplete for her main yard.” Let me using you in the second line, customers?” I called. “It given mayonnaise you laughing trouble recordings a you 500 services.” Hyland was designed to endure technological in Mike’s, but surprised that oratory checked it was mean if it was usual or who was AC or but largest relented was casual, but that assured a bets to retains returning . If you Kohen’s clustered missing condition, address?” Salem a 0.01% spaces necessary from the dislodge working like the freshly time relationships his eezzal 1.1 a brother’s 1280. Selection, Henry’s globe is a essentially informal poem not a portends, yet. They just have crunched right watching the 20, playing eezzal particularly Brad that the 1278 numbered saturated violence on Amazon sort. the Anti-cleaners mamas that continue treat the emotionally would solids for turn to tell their being link.

Let’s start with the successes. “‘You have never been free,’ said Henry,” does not occur anywhere in the training data. I was so impressed by that line that I checked to be sure. There’s a lot of good placement of open and close quotes in the deterministic version. 177 blog entries is really a very small training set. With more training data we can expect to see more improvement. One question I have is whether we can use training data from other sources to help inform my autoblog without diluting the style, or at least while minimizing dilution.

Now we can address the elephant in the room. My language model doesn’t generate much sense. This places it in the same echelon as the “Sunspring” script. Google has managed to make sensible translations from one language to another using RNNs, but coming up with an idea and communicating it in a way people will understand is not something that computers can do yet. Really, I think it won’t be so hard. It’s just a simple matter of <unk>.

<end of entry>