text – uClassify blog

Artificial Intelligence to determine an author's age

We have just released ageanalyzer.com, a site that reads a blog and guesses the age of the author!

Background

Our writing style reflects us in many ways; for example, text written in anger probably differs from text written in joy. Reading a text gives us an intuitive clue about the author, and we start forming a picture in our heads. Sometimes it’s easy to pinpoint how we got this picture, and at other times it’s harder.

We wanted to know if we could give computers the same intuition. In this particular project we are interested in finding out whether a computer can tell the age of an author given only a text.

For this experiment we collected 7000 blogs that had age information in the profile and split them into six age groups: 13-17, 18-25, 26-35, 36-50, 51-65 and 65+. We then created a classifier on uClassify and fed it the training data. Voilà!

Expected results

After running tests on the training data (10-fold cross-validation) it was clear that our classifier was able to find differences between the six age groups. We expect the proportion of correctly classified blogs to be around 30%, compared with a baseline of about 17% (one in six), which is what we would expect if the classifier were guessing at random.
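To make the numbers above concrete, here is a minimal sketch of where the 17% baseline comes from and how a 10-fold split of 7000 blogs divides into training and test portions. The function names are hypothetical and the fold assignment (every tenth item) is just one simple choice; this is not the code uClassify runs.

```python
def random_baseline(n_classes: int) -> float:
    """Expected accuracy when guessing uniformly among n_classes."""
    return 1.0 / n_classes


def k_fold_splits(n_items: int, k: int):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


baseline = random_baseline(6)
print(f"baseline: {baseline:.1%}")  # baseline: 16.7%

folds = list(k_fold_splits(7000, 10))
train, test = folds[0]
print(len(folds), len(train), len(test))  # 10 6300 700
```

In each fold the classifier would be trained on the 6300 blogs and tested on the 700 it has not seen; the reported accuracy is the average over the ten folds.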

We have added a poll to the site to help us see how well (or poorly) it works!

Try AgeAnalyzer out here!

Gender Text Analysis

Do males and females express themselves differently in text? The answer is yes, according to research carried out at the University of Texas. The article “Effects of Age and Gender on Blogging” [1] finds that author gender can be determined with an accuracy of 80% by looking at a text. This is achieved with a classifier trained on 37478 blogs written by males and females at blogger.com.

Gender stereotypes in the blogosphere

The research also shows the most discriminating terms for males and females (ranked by information gain).

Male favorite words

– linux
– microsoft
– gaming
– server
– software
– gb
– programming
– google
– data
– graphics
– india
– nations
– democracy
– users
– economic

Female favorite words

– shopping
– mom
– cried
– freaked
– pink
– cute
– gosh
– kisses
– yummy
– mommy
– boyfriend
– skirt
– adorable
– husband
– hubby

They conclude “Male bloggers of all ages write more about politics, technology and money than do their female cohorts. Female bloggers discuss their personal lives – and use more personal writing style – much more than males do.”
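The information-gain ranking mentioned above can be sketched as follows. Information gain measures how much knowing whether a word appears in a document reduces the uncertainty (entropy) about the author's class. The toy documents below are made up for illustration, and treating each document as a set of words (present or absent) is an assumption; the article may use a different feature representation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, term):
    """docs is a list of (set_of_words, label) pairs."""
    all_labels = [label for _, label in docs]
    with_term = [label for words, label in docs if term in words]
    without_term = [label for words, label in docs if term not in words]
    n = len(docs)
    # Entropy that remains once we know whether the term is present.
    remaining = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(all_labels) - remaining


# Hypothetical toy data, not taken from the study.
docs = [
    ({"linux", "server"}, "male"),
    ({"linux", "data"}, "male"),
    ({"shopping", "cute"}, "female"),
    ({"cute", "data"}, "female"),
]

ranked = sorted({"linux", "data", "cute"},
                key=lambda t: information_gain(docs, t), reverse=True)
print(ranked[-1])  # data
```

In this toy set "linux" and "cute" each separate the two classes perfectly (gain of 1 bit), while "data" appears equally often in both and so tells us nothing (gain of 0).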

Try it on your blog

GenderAnalyzer.com uses the same approach as described in the article. They have collected 2000 blogs from blogger.com written by men and women. They also have a poll that lets us see how well it’s working; as we speak, it has an accuracy of 70%.

Trying this blog in the analyzer gives us the correct answer:

Results
We think http://blog.uclassify.com is written by a man.

[1] J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.

What is a text classifier?

A text classifier places documents into their relevant classes (categories), for example placing spam in the spam folder or web pages about Artificial Intelligence into the AI category. There are different types of text classifiers; the one I will be addressing here is a machine learning one!

Training

To make the classifier understand where documents should go, you must first train it. To train it, you manually set up two or more classes (e.g. spam and legitimate) and describe each class by showing it typical documents. In the case of a spam classifier you would train the classifier on spam and legitimate documents. You are basically saying to Mrs. Classifier, “Hey, look at this bunch of documents, they are all spam!” after which you show her the legitimate documents: “And these are legitimate!”

By doing so the classifier learns characteristics for each class. This is called supervised training. The training documents are often referred to as the training corpus.
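As a rough illustration of what "learning characteristics for each class" can mean, the sketch below counts how often each word occurs in each class of a tiny, made-up training corpus. Word counting is an assumed stand-in chosen for simplicity; the post does not say which characteristics uClassify actually learns.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus: (document text, class label) pairs.
training_corpus = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "legitimate"),
    ("lunch meeting tomorrow", "legitimate"),
]


def train(corpus):
    """Count how often each word appears in each class."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    for text, label in corpus:
        word_counts[label].update(text.lower().split())
    return word_counts


model = train(training_corpus)
print(model["spam"].most_common(1))  # [('money', 2)]
```

Because every document arrives with its label attached, the classifier never has to guess the classes itself; that is the "supervised" part.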

Classifying

Once a classifier has been trained, it can be used to find out to which of the predefined classes a previously unseen document most likely belongs. You ask Mrs. Classifier something like “To which of the classes (I have trained you on) is this document most likely to belong?” She would then kindly answer something like “I am 96% certain that it should go into the spam folder.”

It’s not necessary to stop training a classifier when you start classifying. Training and classifying can take place at the same time.
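The sketch below shows one way a classifier can answer with a percentage and keep learning between questions. It uses a toy multinomial naive Bayes model with add-one smoothing, which is an assumption made for illustration and not necessarily the algorithm behind uClassify; the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Toy multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.doc_counts = Counter()              # class -> training documents
        self.vocabulary = set()

    def train(self, text, label):
        """Add one labelled document; can be called at any time."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return {class: probability}, normalised to sum to 1."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocabulary)))
            log_scores[label] = score
        # Turn log scores into probabilities in a numerically stable way.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


clf = NaiveBayesClassifier()
clf.train("win money now", "spam")
clf.train("meeting at noon", "legitimate")
first = clf.classify("money offer")

# Training continues after the first classification.
clf.train("cheap money offer", "spam")
second = clf.classify("money offer")

best = max(second, key=second.get)
print(f"I am {second[best]:.0%} certain that it should go into the {best} folder.")
```

Here the classifier is asked about the same text twice; the extra spam example trained in between makes it more certain the second time.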

Using our XML API you can communicate with “Mrs. Classifier”!