GenderAnalyzer thoughts

First, thanks to everyone who is testing GenderAnalyzer; we have had incredible feedback. We received emails from many people who are fascinated and a few who think it sucks =) GenderAnalyzer is still generating a lot of traffic and people are blogging about it.

Our learnings

Determining the gender of an author is not easy. Besides the classification itself, there is a chain of technical steps that must all work in order to get a reliable result. As many of you have noticed, the accuracy has dropped to 53%, which is far lower than we expected based on our tests. There may be several reasons for this low accuracy, and I will mention some of them here.

  • Our training data of 2000 blogs is collected automatically from blogspot. Running internal tests (10-fold cross-validation) on this data gives us an accuracy of 75%. This effectively means: “Given that the corpus is a perfect representation of real-world data, the classifier gives any real-world text the correct label with a probability of 75%”. So our training data is probably not very representative; as a matter of fact, it’s very stereotypical (see for yourself here). Using data from all kinds of sources should give us a better model.
  • When someone tests a blog, we do not crawl through the posts on the blog to get a good amount of text. We only fetch the given URL and use the text (and HTML) that appears there as test data. So a page with mostly images or frames will give bad test data. Does anyone know a nice library that, given a URL, crawls blog posts? Via RSS perhaps?
  • We try to convert the test data to UTF-8, which is the encoding of the training data. It could be that we are missing some encodings.
  • And of course, maybe the difference between male and female writing is simply not significant?
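
As a rough illustration of the second and third points above, here is a minimal Python sketch (standard library only) of turning the raw bytes of a fetched page into test text. It is not the code GenderAnalyzer runs. The names VisibleTextExtractor, decode_to_text and extract_test_text, the list of candidate encodings and the 50-word threshold are all assumptions made for the example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join all chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def decode_to_text(raw_bytes, candidate_encodings=("utf-8", "cp1252")):
    """Try each candidate encoding strictly, in order (the list is an assumption)."""
    for encoding in candidate_encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail,
    # although the result may be wrong for pages in other encodings.
    return raw_bytes.decode("latin-1")


def extract_test_text(raw_bytes, min_words=50):
    """Return (visible_text, enough_text) for one fetched page.

    min_words is a hypothetical threshold below which the page is
    considered too thin to classify (mostly images, frames and so on).
    """
    parser = VisibleTextExtractor()
    parser.feed(decode_to_text(raw_bytes))
    parser.close()
    text = parser.text()
    return text, len(text.split()) >= min_words
```

Calling extract_test_text(raw_bytes) on the bytes of a fetched page returns the visible text together with a flag that says whether there was enough of it. A page with mostly images or frames would then come back flagged as too short instead of being classified on almost nothing.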

What’s next?

We are currently collecting a new set of training data that is much more representative. We will switch to this classifier during the next week and start a new poll for it. It’s going to be very exciting!

GenderAnalyzer showdown + server upgrade

Today GenderAnalyzer was featured on BoingBoing, and as a result our server could not handle all the requests. We have now upgraded the server, and it should be able to serve all requests.

While the server was unable to respond to all requests, the accuracy in the poll dropped from 63% to 55% (since an error message makes people vote that it’s not guessing right). However, the accuracy is now slowly recovering!

Sorry for any inconvenience this might have caused.

uClassify beta!

Today we are very pleased to announce the beta release of a new web service that allows everyone to access text classifiers for free. In short, by using a web API (like the one Google Maps offers, for example), everyone can create and train their own classifiers.

Two sites using the API already exist. Be inspired and come up with your own classifiers:

  • One analyzes the personality of a blog author.
  • One figures out whether a text is written by a man or a woman.

During beta we will test the server for usability, stability, scalability and performance.

All comments and feedback are much appreciated!

Best regards,

Jon Kågström, Roger Karlsson and Emil Kågström.