GenderAnalyzer thoughts – uClassify blog

First, thanks to everyone who is testing GenderAnalyzer, we have had incredible feedback. We received emails from many people that are facinated and a few that thinks it sucks =) GenderAnalyzer is still generating a lot of traffic and people are blogging about it.

Our learnings

Determining the gender of an author is not easy, besides the classification there is a chain of technical events that must work in order to get a reliable result. As many of you have noticed the accuracy has dropped to 53% which is far lower than expected based on our tests. There may be several reasons for this low accuracy and I will mention some of them here.

Our trainingdata of 2000 blogs is automatically collected from blogspot. Runing internal tests (10 fold cross validation) on this data gives us an accurcy of 75% this effectivly means “Given that the corpus is a perfect representation of real world data, the classifier is able to give any real world data the correct label by a chance of 75%”. So our trainingdata is probably not very representative, as a matter of fact it’s very stereotypical (see for yourself here). Using data from all kind of sources should give us a better model.
When someone is testing a blog we are not crawling through posts on the blog to get a good amount of text. We are only hitting the given url and using the text (and html) that appear there as test data. So a page with mostly images or frames will give bad test data. Does anyone know a nice library that – given an url crawls blog posts? Via RSS perhaps?
We are trying to encode test data to utf-8 which is the format of the training data – it could be that we are missing some encodings.
And of course – the difference between male and female writing is not significant?

What’s next?

We are currently collecting a new set of training data that is much more representative. We will switch to this classifier during the next week and start a new poll for it. It’s going to be very exciting!