Jon – Page 9 – uClassify blog

Everybody can classify

Creating your own classifiers has never been easier: we have developed a Click’n’Classify Graphical User Interface (GUI). This means that you can manually create and train your classifiers without knowing any programming at all. This is a very good way to test an idea. If the classifier works well, build your web site around it or use it for whatever purpose you like.

The GUI allows you to do everything that you can do via our Application Programming Interface (API). Also, just as phpMyAdmin shows its SQL queries, the uClassify GUI shows the XML queries so you can easily understand and use the API from your site.

Features

Create and remove classifiers
Add and remove classes
Train and untrain classes
See basic information about your classifiers

Screenshot – Create a classifier

This screenshot shows what it looks like when you are about to create a classifier. Just log in and try it yourself!

Creating a classifier is easy

Screenshot – Training a classifier

Just copy and paste the texts you want to use as training data.

Training a classifier is easy

Happy classifying!

More memory or smaller memories?

In order to build a classification server that can handle thousands of classifiers and process huge amounts of data, we were sure that we would eventually have to do some major optimizations. To avoid doing any work prematurely, we postponed all optimizations that require design changes or reduce code readability until we were absolutely sure where to improve.

When we ran our first test in May it was obvious what our first bottleneck would be: the memory consumption of classifiers. It was really bad. Raw classifier data expanded by a factor of about 5 when loaded into primary memory, so a tiny classifier of 1 MB would take 5 MB as soon as it was fetched into memory. It was really easy to pinpoint the memory thieves.

Couple in crime – STL strings and maps

We were using STL maps to hold frequency distributions for tokens (features). Each token was mapped to its frequency in a map<string, unsigned int>. This is a very convenient and straightforward way to do it, but the memory overhead is not very attractive.

VS2005 STL string memory overhead

The actual sizes of types vary between platforms and STL implementations (these numbers are from the STL that comes with VS2005 on 32 bit Windows XP).

Each string takes at least 32 bytes:
size_type _Mysize = 4 bytes (string size)
size_type _Myres = 4 bytes (reserve size)
_Elem _Buf = 16 bytes (internal buffer for strings shorter than 16 bytes)
_Elem* _Ptr = 4 bytes (pointer to strings that don’t fit in the buffer)
this* = 4 bytes (this pointer)

The best-case overhead for STL strings is 16 bytes, when the internal buffer is filled exactly. The worst case is an empty string or a string longer than 15 bytes, which gives an overhead of 32 bytes. String overhead therefore varies from 16 to 32 bytes.

VS2005 STL map memory overhead

Each entry in a map consists of an STL pair: the key and the value (first and second). A pair only has the memory overhead of the this pointer (4 bytes), plus whatever it inherits from the types it’s composed of. However, the map is a red-black tree and consists of linked nodes. Each pair is stored in a node, and nodes have quite a heavy memory overhead:

_Genptr _Left = 4 bytes (points to the left subtree)
_Genptr _Parent = 4 bytes (pointer to parent)
_Genptr _Right = 4 bytes (points to the right subtree)
char _Color = 1 byte (the color of the node)
char _Isnil = 1 byte (true if node is head)
this* = 4 bytes (this pointer)

So there is an 18-byte overhead per node and 4 bytes per pair, which sums to 22 bytes.

Strings in maps

Now inserting a string shorter than 16 bytes into a map<string, unsigned int> will consume 32+22+4=58 bytes. It could be even more if memory alignment kicks in for any of the allocations. In most cases this is perfectly fine and not even worth optimizing. In our case a memory overhead factor of 5 was not acceptable. Our language classifier takes about 14 MB on disk and should not take much more when loaded into memory, yet it blew up to about 65 MB. As it consists of 43 languages with probably around 30000 unique words per class (language), it gets really bloated.
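As a rough sanity check, the per-entry figures above can be multiplied out. This is only a back-of-the-envelope sketch in Python: the byte counts are the ones quoted in this post for VS2005 on 32-bit Windows, the 30000 tokens per class is the approximate figure mentioned above, and the function names are made up for illustration.

```python
# Per-entry sizes as quoted in the post (VS2005 STL, 32-bit Windows XP).
STRING_BYTES = 32         # std::string holding a token shorter than 16 bytes
NODE_AND_PAIR_BYTES = 22  # 18 bytes per tree node + 4 bytes per pair
VALUE_BYTES = 4           # unsigned int frequency


def map_entry_bytes():
    """Bytes consumed by one map<string, unsigned int> entry."""
    return STRING_BYTES + NODE_AND_PAIR_BYTES + VALUE_BYTES


def estimated_map_bytes(classes, tokens_per_class):
    """Estimated total for a classifier, ignoring alignment and allocator overhead."""
    return classes * tokens_per_class * map_entry_bytes()


per_entry = map_entry_bytes()
total = estimated_map_bytes(classes=43, tokens_per_class=30000)
print(per_entry, "bytes per entry,", total, "bytes in total")
```

With these assumptions the estimate lands at roughly 75 million bytes, which is in the same ballpark as the 65 MB we observed.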

One solution

We needed to maintain the search and insertion speed of maps (time complexity O(log n)) but get rid of the overhead. Insertions are needed when classifiers are trained.

Maintaining search speed

Since we had already limited features to a maximum length of 32 bytes, we could use that information to create what we call memory lanes. A memory lane consists only of tokens of the same size, each followed by its frequency. In that manner we created 32 lanes: lane 1 with all tokens of size 1, lane 2 with all tokens of size 2, and so on. Tokens in a memory lane are sorted, so we can use binary search.

Memory lane 1 could look like this (tokens of size 1 followed by the frequency)
a0031i0018y0003
…
and memory lane 3 like this
can0011far0004the0019zoo0001

By doing so we get rid of all the overhead while keeping search at O(log n).
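The lookup can be sketched in a few lines. Note that this is an illustration in Python, not the server code: it assumes each frequency is stored as 4 ASCII digits exactly as in the example lanes above (the real storage format is not described here), and the names lane_lookup, lanes and frequency are hypothetical.

```python
FREQ_WIDTH = 4  # assumption: frequency stored as 4 ASCII digits, as in the illustration


def lane_lookup(lane: bytes, token: bytes) -> int:
    """Binary search a lane of fixed-width records; return the frequency, or 0 if absent."""
    record = len(token) + FREQ_WIDTH
    lo, hi = 0, len(lane) // record
    while lo < hi:
        mid = (lo + hi) // 2
        start = mid * record
        key = lane[start:start + len(token)]
        if key == token:
            return int(lane[start + len(token):start + record])
        if key < token:
            lo = mid + 1
        else:
            hi = mid
    return 0


# The two example lanes from the text, keyed by token length.
lanes = {
    1: b"a0031i0018y0003",
    3: b"can0011far0004the0019zoo0001",
}


def frequency(token: bytes) -> int:
    """Pick the lane by token length, then search inside it."""
    return lane_lookup(lanes.get(len(token), b""), token)


print(frequency(b"the"), frequency(b"dog"))
```

Because every record in a lane has the same width, the position of record number i is simply i times the record size, so no per-token pointers or length fields are needed.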

Maintaining insertion speed (almost)

Maps allow fast insertions in O(log n), so we kept an intermediate map for each memory lane. When a classifier is trained, new tokens go into the map, while the frequency of tokens that already exist in the memory lane is increased in place. When the training session is over, the intermediate maps are merged into their respective memory lanes. This can be done in O(n) and is the major penalty. Note that explicit sorting is never required, since maps are ordered. Another penalty occurs when both the map and the memory lane contain tokens: at that point two lookups may be needed (first in the memory lane and, if the token isn’t found there, in the map).

This solution reduced memory consumption by a factor of 4-5, at the cost of having to merge new training data into the memory lanes every now and then. This is perfectly fine for us, as training usually decreases over time (once the training data is good enough) while classification increases.
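The merge step is an ordinary linear merge of two sorted sequences. Here is a minimal Python sketch of the idea, again assuming the 4-digit ASCII frequency format from the illustration; merge_into_lane and encode are hypothetical names, and a plain dict stands in for the intermediate map (which is why it is sorted here, whereas an ordered map would not need that).

```python
FREQ_WIDTH = 4  # assumption: 4 ASCII digits per frequency, so counts above 9999 are not handled


def encode(token: bytes, count: int) -> bytes:
    """Build one fixed-width record: the token followed by a zero-padded count."""
    return token + str(count).zfill(FREQ_WIDTH).encode("ascii")


def merge_into_lane(lane: bytes, pending: dict, token_len: int) -> bytes:
    """Merge pending {token: count} into a sorted lane in a single linear pass."""
    record = token_len + FREQ_WIDTH
    new_items = sorted(pending.items())  # an ordered map would already be in key order
    out = []
    i = 0
    for start in range(0, len(lane), record):
        chunk = lane[start:start + record]
        key = chunk[:token_len]
        # Emit every pending token that sorts before the current lane token.
        while i < len(new_items) and new_items[i][0] < key:
            out.append(encode(*new_items[i]))
            i += 1
        if i < len(new_items) and new_items[i][0] == key:
            # Token exists in both places: add the counts together.
            out.append(encode(key, int(chunk[token_len:]) + new_items[i][1]))
            i += 1
        else:
            out.append(chunk)
    # Whatever is left sorts after the last lane token.
    out.extend(encode(token, count) for token, count in new_items[i:])
    return b"".join(out)


merged = merge_into_lane(b"can0011far0004the0019zoo0001", {b"dog": 2, b"ant": 5}, 3)
print(merged)
```

The sketch also handles a token that is present in both the lane and the map by summing the counts, although in the scheme described above only tokens missing from the lane end up in the map.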

A similar optimization for Java is described on the LingPipe blog.

Click’n’classify

As we suspected, most users who sign up find the threshold to get started too high, since it requires some programming to create and train classifiers. Therefore we have decided to add more GUI features that allow users to make all the API calls without any programming! Once the classifiers are set up, developers can start building their web applications around them via the API.

Copy’n’code

All GUI driven API calls will display the generated XML so that users can easily see what’s going on and copy the XML directly into their code (much like phpMyAdmin does with SQL queries).

We expect this to take a couple of weeks.

Donate your spam!

We are evaluating our next move and are running preliminary tests on spam comments (spaments?). We only have a few corpora to test on, and it looks good on those (I’ll get back with exact performance later).

We want your blog comments for a good cause

Following our own guidelines we are looking for more data to test on. If you have a WordPress installation you can help us out by:

Log into phpMyAdmin
Select your WordPress database
Click on the table ‘wp_comments’
Click on ‘Export’
Select the XML format
Check ‘Save to file’ and click ‘Run’
Attach the exported XML to an e-mail to contact AT uclassify DOT com

We will not publish any comments without asking you for permission first. Also you will be credited with your name and blog when we return with the classifier results for your comments.

Thank you!

Developing the development

Since we released the beta version a couple of weeks ago we have seen a few websites pop up that build on the uClassify technology. This is very encouraging for us! Right now we are trying to reach out to more users who want to use our classifier API.

We have spent a lot of time on the development of our service: making it parallel, robust, low on memory, fast and so on. This is what we are really good at. The remaining part, which is just as important, is to reach out to users, advertise ourselves and be seen in the right places. That is not our sharpest skill.

Besides writing this blog and posting the uClassify link on a couple of sites we haven’t done much to show our muscles, yet! We thought that we could perhaps use our own API ourselves, which is probably an easier way to create some buzz. We have a couple of ideas to get us seen (feel free to use these ideas yourself):

Build an Anti Spam Comment Plugin for WordPress?

We are quite confident that we could do really well, as the classifier engine has shown really good results in Cactus Spam Filter. This would compete with, or be a good complement to, Akismet, Defensio and similar services. Is there anyone who needs another blog spam comment filter?

Build a Spam Blog Filter?

This seems to be a problem for many blog communities, so building a splog (spam blog) filter could give us some good attention. What would be really nice is if somebody could provide us with dynamic training data on splogs and blogs. Then we could automate the training process and find the undetected spam! Does anyone want to donate their spam?

Implement a JSON API for uClassify?

Building a JSON API would not only broaden our API (it is XML only right now), it would also let users use our classification service via Yahoo! Pipes. Yahoo! Pipes lets you combine different RSS feeds into one and use external web services (via JSON), which is madly cool.

Language Detection – talar du svenska?

We already have a language detection classifier (not published yet) that only needs training data refinement (removal of noise such as English words in the Filipino class). It supports 40 languages. This would be fairly simple and could give us some buzz.

Ideas, anyone?

Do you have any ideas? Let us know, or use the uClassify API to create your own classifier (spam filter, language detection or whatever comes to your mind).

Classifier performance – part II

In the first part I explained some guidelines to keep in mind when selecting a test corpus. In this part I will give a brief introduction to running tests on your corpuses. Given a corpus of labeled documents, how can it be used to determine a classifier’s performance? There are several ways. One of the simplest is to divide the corpus into two parts and use one part for training and the other for testing. How the corpus is divided affects the measured performance, so the result is very likely to be biased. There is also a great loss of data: the half used for testing is never used for training, and vice versa. We can do a lot better.

Leave one out cross validation (LOOCV)

A well established technique is to train on all documents except one, which is left out and used for testing. This procedure is repeated so that every document has been used for testing once, and the performance is then averaged over all the runs. An advantage of this method is that it uses almost the full corpus as training data (no data is wasted). The downside is that it’s expensive, as it must be repeated as many times as there are documents. k fold cross validation reduces this cost by dividing the corpus into k piles and testing on one pile at a time.
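To make the procedure concrete, here is a minimal Python sketch of leave one out cross validation. The word-counting classifier (train and predict) is a toy invented for this example so that the code runs on its own; it is not how uClassify classifies, and the six-document corpus is made up.

```python
from collections import Counter


def train(docs):
    """Toy classifier (illustration only): count the words seen in each class."""
    model = {}
    for text, label in docs:
        model.setdefault(label, Counter()).update(text.split())
    return model


def predict(model, text):
    """Pick the class whose training words overlap most with the text."""
    words = text.split()
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


def leave_one_out_accuracy(corpus):
    """Train on all documents but one, test on the one left out, repeat for each."""
    correct = 0
    for i, (text, label) in enumerate(corpus):
        training = corpus[:i] + corpus[i + 1:]
        model = train(training)
        correct += predict(model, text) == label
    return correct / len(corpus)


corpus = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("cheap offer buy pills", "spam"),
    ("meeting notes for monday", "legit"),
    ("lunch meeting on monday", "legit"),
    ("notes from the lunch", "legit"),
]
print(leave_one_out_accuracy(corpus))
```

Note that the classifier is retrained from scratch for every document, which is exactly why this method gets expensive on large corpuses.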

k fold cross validation

Perhaps the most common way to run tests is k fold cross validation. The corpus is divided into k parts; k-1 parts are used for training and 1 part for testing. This is repeated k times so that every part of the corpus is used once for testing and k-1 times for training. 10 fold cross validation is a common choice. In that case, start by training the classifier on parts 2 to 10 and testing it on part 1, then train it on parts 1 and 3 to 10 and test it on part 2, and so on. The performance is measured for every rotation and, when the tests have completed, averaged. k fold cross validation gives a more robust performance measure because every part is used as both training and test data.
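The rotation can be sketched as follows in Python. The function names are made up for this example, and train_fn and predict_fn stand in for whatever classifier you are testing. One assumption to be aware of: the folds are dealt out round-robin (document 0 to fold 0, document 1 to fold 1, and so on), so in practice you would normally shuffle the corpus first.

```python
def k_fold_splits(n_docs, k):
    """Yield (train_indices, test_indices) for each of the k rotations."""
    if not 2 <= k <= n_docs:
        raise ValueError("k must be between 2 and the number of documents")
    # Round-robin assignment of document indices to k folds.
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(corpus, k, train_fn, predict_fn):
    """Average accuracy over k rotations; corpus is a list of (text, label) pairs."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(corpus), k):
        model = train_fn([corpus[j] for j in train_idx])
        hits = sum(predict_fn(model, corpus[j][0]) == corpus[j][1] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


for train_idx, test_idx in k_fold_splits(10, 5):
    print("train on", train_idx, "test on", test_idx)
```

Setting k equal to the number of documents turns this into leave one out cross validation.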

Remember from part one that it can be useful to vary the size of the corpus, scaling it from small to large, and to use unbalanced data.

Summary

Don’t use test methods just because they are simple; the results will probably fool you.
Use an established method, such as k fold cross validation or leave one out.
Always remember to specify what method you have used together with the results.

In the next part I’ll show how performance actually can be measured! Happy classifying until then!

Classifier performance – Part I

There are several different classifiers, to name a few: Naive Bayesian Classifiers, Support Vector Machines, k Nearest Neighbor and Neural Networks. A crucial cornerstone when choosing a classifier is the performance, that is, how well it classifies the data. There are several methods for measuring how well a classifier performs. In three parts I will try to give an idea of how to avoid common pitfalls.

Part I: Choosing test corpus
Part II: Running tests
Part III: Measuring the performance

What test corpus should I use? Use many!

This is perhaps the hardest part of determining the performance of a classifier. Every subset of data is a model that is likely to be biased, so you should always question which data (corpus) the tests are carried out on. For example, a classifier that reports high performance on a specific corpus is likely to perform differently on real world data, and often worse: to avoid looking bad next to other classifiers, it may have been tuned for the test corpus, and this bias can degrade performance on other corpuses. Using many relevant corpuses helps to keep a classifier from becoming too specialized to one specific corpus.

In “The fallacy of corpus anti-spam evaluation” Vipul explains why many anti-spam researchers’ results may not say much, since the test corpus is static. While researchers are busy measuring their performance on a corpus from 2005 (TREC2005), spammers today have had three years to figure out how to fool their spam filters… I completely agree, it’s almost like:

– Look everyone, I’ve spent years inventing a highly accurate long bow to shoot down our nemesis Spam! It works really well back in my training yard!

– Did you not hear? Spam evolved, they now come in F-117 Nighthawk Stealth Fighters cloaked by deceiving words, even if you could see it – an arrow couldn’t scratch it.

– Oh, dear...

Small or large test sets? Both!

The size of the test data is also vital. Using too small a test set says more about how the classifier will perform during the training phase than after it. Using too large and rich a training set may invite overfitting (so much data that seemingly nonsensical tests will fit your model). It is best to measure performance at different sizes, scaling the training set from a few documents to the full set. Unfortunately this is often disregarded.

Hint: A benefit of measuring the performance as a function of a scaling corpus is that you can predict how much training data is needed to reach a certain level of performance. Just project the performance curve beyond the size of the test corpus.

Balanced or unbalanced training sets? Both!

The test corpus can be heavily unbalanced, meaning that one of the classes is overrepresented and another is underrepresented in number of documents. As Amund Tveit points out in his blog:

“Quite frequently classification problems have to deal with unbalanced data sets, e.g. let us say you were to classify documents about soccer and casting (fishing), and your training data set contained about 99.99% soccer and 0.01% about casting, a baseline classifier for a similar dataset could be to say – “the article is about soccer”. This would most likely be a very strong baseline, and probably hard to beat for most heavy machinery classifiers.”

In many cases it’s desirable to run tests on unbalanced test sets. For example, imagine that you get 2000 spam e-mails and 10 legitimate ones every day. You decide to install a machine learning spam filter. The spam filter requires training, so each day you train it on 2000 spam and 10 legitimate e-mails. This creates unbalanced training data for your classifier, and it’s extremely important that your spam filter doesn’t mark your legitimate e-mails as spam (it could be a matter of life, death, peace or war, literally).
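The numbers from this example show why overall accuracy alone can be misleading on unbalanced data. Below is a small Python sketch with made-up helper names: a "classifier" that always answers with the most common training class looks excellent on accuracy, yet it never lets a single legitimate e-mail through.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Baseline 'classifier': always answer with the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions, truth):
    """Fraction of documents that got the right label."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def recall(predictions, truth, label):
    """Fraction of the documents of one class that were recognized as that class."""
    relevant = [p for p, t in zip(predictions, truth) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


# One day of mail from the example above: 2000 spam and 10 legitimate e-mails.
labels = ["spam"] * 2000 + ["legit"] * 10
guess = majority_baseline(labels)
predictions = [guess] * len(labels)

print("accuracy:", accuracy(predictions, labels))
print("legit recall:", recall(predictions, labels, "legit"))
```

Measuring each class separately, as the recall function does here, is one simple way to expose this problem.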

Summary

Understand that corpuses are biased and therefore also the test results.
Use up-to-date corpuses if the classification domain is dynamic.
Make sure the test data is as representative as possible for the domain. E.g. don’t trust test results from a spam corpus to apply on sentiment classification.
Prefer running tests on many different corpuses.
Run tests on the corpuses as they scale in size.
Make sure that the classifier is robust on unbalanced training data, especially when correct classifications can be a matter of life and death.

Gender Text Analysis

Do males and females express themselves differently in text? The answer is yes, according to research carried out at the University of Texas. The article “Effects of Age and Gender on Blogging” [1] finds that author gender can be determined with an accuracy of 80% by looking at a text. This is achieved with a classifier trained on 37478 blogs written by males and females at blogger.com.

Gender stereotypes in the blogosphere

The research also shows the most discriminating terms for males and females (using information gain).

Male favorite words

– linux
– microsoft
– gaming
– server
– software
– gb
– programming
– google
– data
– graphics
– india
– nations
– democracy
– users
– economic

Female favorite words

– shopping
– mom
– cried
– freaked
– pink
– cute
– gosh
– kisses
– yummy
– mommy
– boyfriend
– skirt
– adorable
– husband
– hubby

They conclude “Male bloggers of all ages write more about politics, technology and money than do their female cohorts. Female bloggers discuss their personal lives – and use more personal writing style – much more than males do.”

Try it on your blog

GenderAnalyzer.com uses the same approach as described in the article. They have collected 2000 blogs from blogger.com written by men and women. They also have a poll which lets us see how well it’s working; at the time of writing it has an accuracy of 70%.

Trying this blog in the analyzer gives us the correct answer

Results
We think http://blog.uclassify.com is written by a man.

[1] J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006), Effects of Age and Gender on Blogging, in Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, March 2006.

What is a text classifier?

A text classifier places documents into their relevant classes (categories), for example by placing spam in the spam folder or web pages about Artificial Intelligence into the AI category. There are different types of text classifiers; the one I will be addressing here is a machine learning one!

Training

To make the classifier understand where documents should go you must first train it. By training you manually set up two or more classes (e.g. spam and legitimate) and describe each class by showing typical documents. In the case of a spam classifier you would train the classifier on spam and legitimate documents. Basically saying to Mrs. Classifier “Hey look at this bunch of documents, they are all spam!” after which you show her the legitimate documents “and these are legitimate!”

By doing so the classifier learns characteristics for each class. This is called supervised training. The training documents are often referred to as the training corpus.

Classifying

Once a classifier has been trained it can be used to find out to which of the predefined classes a previously unseen document most likely belongs. You ask Mrs. Classifier something like “To which of the classes (I have trained you on) is this document most likely to belong?” She would then kindly answer something like “I am 96% certain that it should go into the spam folder.”

It’s not necessary to stop training a classifier when you start classifying. Training and classifying can take place at the same time.
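For the curious, the train and classify cycle can be sketched with a tiny Naive Bayesian classifier in Python. This is a generic textbook-style illustration with add-one smoothing, written for this post; it is not the uClassify engine, and the class name TinyNaiveBayes and the example texts are made up.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustration only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> count
        self.doc_counts = Counter()              # class -> number of training documents

    def train(self, text, label):
        """Show the classifier one typical document for a class."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def classify(self, text):
        """Return {class: probability}; needs at least one trained document."""
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


mrs_classifier = TinyNaiveBayes()
mrs_classifier.train("buy cheap pills now", "spam")
mrs_classifier.train("cheap pills online", "spam")
mrs_classifier.train("meeting notes monday", "legitimate")
mrs_classifier.train("lunch on monday", "legitimate")
print(mrs_classifier.classify("cheap pills"))

# Training can continue after classification has started.
mrs_classifier.train("monday lunch menu", "legitimate")
print(mrs_classifier.classify("lunch menu"))
```

Since the model is just a set of counters, new training documents can be added at any time between classifications.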

Using our XML API you can communicate with “Mrs. Classifier”!

Bug fix and a single chat bot

Some of you may have experienced problems registering since it has been impossible to click on the password textbox (only tab worked). This bug has now been fixed.

Thanks to Marcus Endicott, who reported it. He has an interesting blog on Artificial Intelligence and Natural Language Processing (which I believe is one of the hardest domains of AI). There is also a demo of a chat bot on his page:

– You: what is your name?
– VagaBot: My name is Ralf.
– You: Are you having a good time online?
– VagaBot: Single men will not travel as fast as a pair of women or a mixed couple but should make good time.