Using Python and the NLTK to Find Haikus in the Public Twitter Stream

So after sitting around mining the public Twitter stream and detecting natural language with Python, I decided to have a little fun with all that data by detecting haikus.

The Natural Language Toolkit (NLTK) in Python, basically the Swiss army knife of natural language processing, allows for more than just natural language detection. The NLTK offers quite a few corpora, including the Carnegie Mellon University (CMU) Pronouncing Dictionary. This corpus contains quite a few features, but the one that piqued my interest was the syllable count for over 125,000 (English) words. With the ability to get the number of syllables for almost every English word, why not see if we can pluck some haikus from the public Twitter stream!
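To make the syllable count concrete: the dictionary does not store counts directly. Each entry maps a word to one or more pronunciations written as ARPAbet phonemes, and every vowel phoneme ends in a stress digit (0, 1 or 2), so counting those digits gives the number of syllables. Below is a minimal sketch of that idea. The helper names (count_syllables, syllables_in_word, load_cmudict) and the SAMPLE excerpt are made up for illustration; the only actual NLTK call is nltk.corpus.cmudict.dict(), which needs the cmudict corpus downloaded first.

```python
def count_syllables(pronunciation):
    """Count syllables in one CMU pronunciation (a list of ARPAbet phonemes)."""
    # Vowel phonemes carry a trailing stress digit, e.g. "AY1" or "UW0".
    return sum(1 for phoneme in pronunciation if phoneme[-1].isdigit())


def syllables_in_word(word, pronouncing_dict):
    """Return the syllable count for a word, or None if the word is unknown."""
    pronunciations = pronouncing_dict.get(word.lower())
    if not pronunciations:
        return None
    # Some words have several pronunciations; this sketch just uses the first.
    return count_syllables(pronunciations[0])


def load_cmudict():
    """Load the real dictionary through NLTK (run nltk.download('cmudict') once)."""
    from nltk.corpus import cmudict
    return cmudict.dict()


# A tiny hand-written excerpt in the same shape as cmudict.dict(),
# so the example runs without downloading the corpus.
SAMPLE = {
    "haiku": [["HH", "AY1", "K", "UW0"]],
    "python": [["P", "AY1", "TH", "AA0", "N"]],
    "stream": [["S", "T", "R", "IY1", "M"]],
}

print(syllables_in_word("haiku", SAMPLE))   # 2
print(syllables_in_word("stream", SAMPLE))  # 1
```

Swapping SAMPLE for load_cmudict() should give the same behaviour over the full dictionary. Words that are missing from it (slang, hashtags, typos) come back as None.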

We’re going to feed Python a Tweet as a string and try to figure out whether it is a haiku, doing our best to split it up into haiku form.

Building upon natural language detection with the NLTK, we should first filter out all the Tweets that are probably not English (to speed things up a little bit).
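As a rough sketch of such a filter (not necessarily the approach from the earlier post), one option is to check what fraction of a Tweet's words appear in an English word list. The function names and the 0.5 threshold are assumptions for illustration; nltk.corpus.words is a real NLTK corpus and needs nltk.download('words') before use.

```python
import re


def probably_english(text, vocabulary, threshold=0.5):
    """Guess whether text is English from the share of words found in vocabulary.

    The threshold is an arbitrary starting point, not a tuned value.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return False
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens) >= threshold


def load_english_vocabulary():
    """Build a lowercase word set from NLTK's 'words' corpus."""
    from nltk.corpus import words
    return set(word.lower() for word in words.words())


# Small stand-in vocabulary so the example runs without the corpus.
vocabulary = {"the", "cat", "sat", "on", "mat"}
print(probably_english("The cat sat on the mat", vocabulary))  # True
print(probably_english("el gato se sento", vocabulary))        # False
```

A cheap check like this only needs to throw out the obvious non-English Tweets. Anything it lets through by mistake should still fail the dictionary lookup in the haiku step.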

Once we have that out of the way, we can dig into the haiku detection.

So what we have now is a function, is_haiku, that returns a list of the three haiku lines if the given string is a haiku, or False if it’s (probably) not a haiku. I keep saying probably because this script isn’t perfect, but it works most of the time.
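A function with that contract could be sketched as follows. This is an illustration under stated assumptions rather than the original script: it walks the words left to right, adds up syllables from a CMU-style pronouncing dictionary, and requires the 5, 7 and 5 syllable boundaries to fall exactly between words. The pronouncing_dict parameter and the SAMPLE excerpt are hypothetical stand-ins for the dictionary returned by nltk.corpus.cmudict.dict().

```python
import re

HAIKU_PATTERN = (5, 7, 5)


def count_syllables(pronunciation):
    """Vowel phonemes in CMU pronunciations end in a stress digit."""
    return sum(1 for phoneme in pronunciation if phoneme[-1].isdigit())


def is_haiku(text, pronouncing_dict):
    """Return the three haiku lines as a list, or False if text does not fit 5-7-5."""
    lines, current_words, current_count = [], [], 0
    for word in text.split():
        if len(lines) == len(HAIKU_PATTERN):
            return False  # words left over after the third line
        key = re.sub(r"[^a-z']", "", word.lower())
        pronunciations = pronouncing_dict.get(key)
        if not pronunciations:
            return False  # unknown word (or bare punctuation): give up
        current_count += count_syllables(pronunciations[0])
        current_words.append(word)
        target = HAIKU_PATTERN[len(lines)]
        if current_count > target:
            return False  # a word straddles a line boundary
        if current_count == target:
            lines.append(" ".join(current_words))
            current_words, current_count = [], 0
    return lines if len(lines) == len(HAIKU_PATTERN) else False


# Hand-written excerpt in the shape of cmudict.dict(), for illustration only.
SAMPLE = {
    "an": [["AE1", "N"]], "old": [["OW1", "L", "D"]],
    "silent": [["S", "AY1", "L", "AH0", "N", "T"]],
    "pond": [["P", "AA1", "N", "D"]], "a": [["AH0"]],
    "frog": [["F", "R", "AA1", "G"]], "jumps": [["JH", "AH1", "M", "P", "S"]],
    "into": [["IH0", "N", "T", "UW1"]], "the": [["DH", "AH0"]],
    "splash": [["S", "P", "L", "AE1", "SH"]],
    "silence": [["S", "AY1", "L", "AH0", "N", "S"]],
    "again": [["AH0", "G", "EH1", "N"]],
}

tweet = "An old silent pond a frog jumps into the pond splash! Silence again."
print(is_haiku(tweet, SAMPLE))
```

Being strict about unknown words and about word boundaries means some genuine haikus get rejected, which is one reason a detector like this only works most of the time.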

After all that hacky code, it’s just a matter of hooking it up to the public Twitter stream. Borrowing from the public Twitter stream mining code, we can pipe every Tweet into the is_haiku function and if it returns a list, add it to our database.
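The hook-up itself can be sketched without committing to a particular Twitter client. The version below assumes the Tweets arrive as an iterable of strings (from whatever streaming library is in use) and that the detector follows the contract above, returning a list of three lines or False. The store_haikus name, the table layout and the use of SQLite are all assumptions for illustration.

```python
import sqlite3


def store_haikus(tweets, detect, connection):
    """Run every Tweet through detect and save the ones that come back as haikus.

    tweets: any iterable of Tweet texts.
    detect: callable returning a list of three lines, or False.
    connection: an open sqlite3 connection.
    Returns the number of haikus saved.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS haikus (line1 TEXT, line2 TEXT, line3 TEXT)"
    )
    saved = 0
    for text in tweets:
        lines = detect(text)
        if isinstance(lines, list) and len(lines) == 3:
            connection.execute("INSERT INTO haikus VALUES (?, ?, ?)", lines)
            saved += 1
    connection.commit()
    return saved


# Stand-in detector and Tweets so the example runs on its own.
def fake_detect(text):
    parts = text.split(" / ")
    return parts if len(parts) == 3 else False


connection = sqlite3.connect(":memory:")
count = store_haikus(["one / two / three", "not a haiku"], fake_detect, connection)
print(count)  # 1
```

Passing the detector in as an argument keeps the storage loop independent of how the haiku check or the stream client is implemented.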

So running this for a while, we actually pick up some pretty entertaining Tweets. I have been running this script for a little while on a micro EC2 instance and created a basic site that shows them in haiku form, as well as a Twitter account that retweets every haiku that it finds.

So it can be pretty interesting. What this exercise underlines is how public your Tweets are. There might be some robot out there mining all that stuff. In fact, every Tweet is archived by the Library of Congress, so be mindful of what you post.

I have posted the full script as a Gist that puts it all together. If you have any improvements or comments, feel free to contribute!

20 Comments

  1. Is it possible to narrow it down to a single Twitter profile?

    1. Yep, if instead of the public stream you use a particular user’s stream, this should work the same. You can use Tweepy [1] and the Twitter streaming API [2] to do that.

      [1] https://github.com/tweepy/tweepy
      [2] https://dev.twitter.com/docs/streaming-apis/streams/user

    2. Christopher Ing March 21, 2013 at 11:36 am

      Haha, that would be a great blog post. Unintentional celebrity haikus.

  2. [...] and a Python tool that can determine the number of syllables in most English words, Brandon at h6o6 has found a way to collect the haikus hidden in plain sight on Twitter. You probably won’t find the poems [...]

  9. *piqued my interest

    1. Not sure why it took me nearly a year to realize that this was a correction for the post. Thank you so much! (Gosh, I feel so foolish!)

  10. [...] Using the Natural Language Tool Kit (NLTK) and Python, Haikus can be found in the public Twitter stream. Source code and implications of this are explored.  [...]

  11. This is so awesome! I was inspired (by this and Pentametron [1]) to make an automatic limerick generator with NLTK [2]. The code is on github [3]. Thanks for the great post!

    [1] http://pentametron.com
    [2] http://limerick.dfm.io
    [3] https://github.com/dfm/twitterbot

    1. Awesome, the poems are actually pretty good for random rhymes, haha. Thanks for posting the code!

  13. [...] formed tweets from the public stream.” There’s a techy article explaining how it works here but, in short, the program analyses the 400 million tweets posted each day to determine whether [...]

  15. Diego Pignattini May 19, 2014 at 11:02 am

    What’s the haiku module you import on the third script?
