First, thanks to everyone who is testing GenderAnalyzer; we have had incredible feedback. We have received emails from many people who are fascinated and a few who think it sucks =) GenderAnalyzer is still generating a lot of traffic, and people are blogging about it.
Our learnings
Determining the gender of an author is not easy; besides the classification itself, there is a chain of technical steps that must all work in order to get a reliable result. As many of you have noticed, the accuracy has dropped to 53%, which is far lower than we expected based on our tests. There may be several reasons for this low accuracy, and I will mention some of them here.
- Our training data of 2000 blogs was automatically collected from Blogspot. Running internal tests (10-fold cross validation) on this data gives us an accuracy of 75%, which effectively means: "given that the corpus is a perfect representation of real-world data, the classifier will label any real-world sample correctly 75% of the time" (see the first sketch after this list). So our training data is probably not very representative; as a matter of fact it's very stereotypical (see for yourself here). Using data from all kinds of sources should give us a better model.
- When someone tests a blog, we do not crawl through the posts on that blog to get a good amount of text. We only hit the given URL and use the text (and HTML) that appears there as test data, so a page with mostly images or frames gives us bad test data. Does anyone know a nice library that, given a URL, crawls blog posts? Via RSS perhaps? (There is a rough sketch of the RSS idea after this list.)
- We try to convert the test data to UTF-8, which is the format of the training data; it could be that we are missing some encodings (see the last sketch after this list).
- And of course, maybe the difference between male and female writing simply isn't significant?
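To make the cross-validation point concrete, here is a minimal sketch of 10-fold cross validation on a text corpus, assuming a simple bag-of-words model in scikit-learn. The variable names and the choice of Naive Bayes are placeholders for illustration, not our actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def estimate_accuracy(texts, labels):
    """Estimate classifier accuracy with 10-fold cross validation.

    texts  -- list of blog texts (placeholder name)
    labels -- list of "male"/"female" labels (placeholder name)
    """
    # Bag-of-words features feeding a simple Naive Bayes classifier.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    # Split the corpus into 10 folds, train on 9, test on 1, and average.
    scores = cross_val_score(model, texts, labels, cv=10, scoring="accuracy")
    return scores.mean()
```

The important caveat is the same as in the bullet above: this number only tells you how well the model generalizes to data that looks like the training corpus.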
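On the crawling question: we haven't picked a library yet, but as a rough sketch of the RSS idea, something like feedparser could collect post text from a blog's feed instead of only the front page. The function name and post limit here are just illustrative.

```python
import feedparser

def fetch_post_text(feed_url, max_posts=20):
    """Collect text from a blog's RSS/Atom feed instead of just the front page."""
    feed = feedparser.parse(feed_url)
    texts = []
    for entry in feed.entries[:max_posts]:
        # Prefer the full post content if the feed provides it,
        # otherwise fall back to the summary.
        if "content" in entry:
            texts.append(entry.content[0].value)
        elif "summary" in entry:
            texts.append(entry.summary)
    return "\n".join(texts)
```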
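And on the encoding issue, a hedged sketch of how fetched pages could be normalized to one charset, e.g. with requests guessing the encoding from the raw bytes when the HTTP headers don't declare one. Again, this is not our actual code, just an illustration of the idea.

```python
import requests

def fetch_text(url):
    """Fetch a page and return its text as a Unicode string,
    guessing the charset when the server does not declare one."""
    resp = requests.get(url, timeout=10)
    content_type = resp.headers.get("Content-Type", "").lower()
    if "charset" not in content_type:
        # apparent_encoding guesses the encoding from the raw bytes.
        resp.encoding = resp.apparent_encoding
    # resp.text is already decoded; it can be encoded to UTF-8 downstream.
    return resp.text
```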
What’s next?
We are currently collecting a new set of training data that is much more representative. We will switch to the new classifier during the next week and start a new poll for it. It's going to be very exciting!