GenderAnalyzer thoughts

First, thanks to everyone who is testing GenderAnalyzer; we have had incredible feedback. We received emails from many people who are fascinated and a few who think it sucks =) GenderAnalyzer is still generating a lot of traffic and people are blogging about it.

Our learnings

Determining the gender of an author is not easy. Besides the classification itself, there is a chain of technical events that must all work in order to get a reliable result. As many of you have noticed, the accuracy has dropped to 53%, which is far lower than we expected based on our tests. There may be several reasons for this low accuracy, and I will mention some of them here.

  • Our training data of 2000 blogs is collected automatically from Blogspot. Running internal tests (10-fold cross-validation) on this data gives us an accuracy of 75%. This effectively means: “Given that the corpus is a perfect representation of real-world data, the classifier gives any real-world data the correct label with a probability of 75%.” So our training data is probably not very representative; as a matter of fact, it’s very stereotypical (see for yourself here). Using data from all kinds of sources should give us a better model.
  • When someone tests a blog, we do not crawl through the posts on the blog to get a good amount of text. We only hit the given URL and use the text (and HTML) that appears there as test data. So a page with mostly images or frames will give bad test data. Does anyone know a nice library that, given a URL, crawls blog posts? Via RSS perhaps?
  • We try to convert the test data to UTF-8, which is the format of the training data - it could be that we are missing some encodings.
  • And of course, it could be that the difference between male and female writing is simply not significant.
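
On the RSS and encoding points above, here is a rough sketch of what pulling test text from a feed could look like. It is not what GenderAnalyzer runs today; it only uses the Python standard library, it assumes the blog exposes a plain RSS 2.0 feed (Atom feeds use namespaced entry elements and would need different tag names), and the names to_text, post_texts and FALLBACK_ENCODINGS are made up for illustration.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical fallback order, chosen for illustration only.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")

TAG_RE = re.compile(r"<[^>]+>")


def to_text(raw, declared=None):
    """Decode raw bytes: try the declared charset first, then the fallbacks."""
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    # latin-1 maps every byte to a character, so this last resort never raises.
    return raw.decode("latin-1")


def strip_html(fragment):
    """Crude tag removal; a real HTML parser would be more robust."""
    without_tags = TAG_RE.sub(" ", fragment)
    return " ".join(html.unescape(without_tags).split())


def post_texts(rss_bytes):
    """Return the plain text (title plus description) of each RSS <item>."""
    # Passing bytes lets the XML parser honour the feed's own encoding declaration.
    root = ET.fromstring(rss_bytes)
    texts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("description", default="")
        texts.append(strip_html(title + " " + body))
    return texts
```

Joining the strings returned by post_texts would give the classifier several posts' worth of text instead of a single page, and to_text covers the case where a page has to be decoded by hand before it is re-encoded as UTF-8.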

What’s next?

We are currently collecting a new set of training data that is much more representative. We will switch to this classifier during the next week and start a new poll for it. It’s going to be very exciting!
