classification – uClassify blog

Buzz & Development

Yesterday we were mentioned on ReadWriteWeb which generated a lot of visits and more importantly – classifiers. 30 new classifiers were created within a time period of 10 hours, even though many are just created out of curiosity to quickly test the system – some will hopefully mature and have web applications built around it.

What’s going on techwise

As you have noticed we are continuously improving our system by carefully adding new features. The following tasks are planned for the GUI

We are soon installing a new more flexible menu system.

Users will be able to create profiles with descriptions and links. Also classifiers should be able to have a link to the web site it’s implemented.

Better information about training – right now there is no feedback on how much training has been done or is required. We want to give users an idea of how the training data performs.

What’s going on commercialwise

Everything is free on uClassify and that is how it will stay.

Our commercial idea is to offer companies the possibility to buy their own classification servers. For large databases with texts that needs to be classified it’s intractable to send every text for a roundtrip to uclassify.com. Instead companies could be interested in doing this efficiently locally. A products page with server information will appear soon.

What’s your mood?

Today, 2 months after our launch, our users have created over 200 classifiers. Most are unpublished and under construction. PRfekt, the team behind the popular Typealyzer, recently published a new classifier that determines the mood of a text – whether a text is happy or upset. You can try it for yourself here!

So lets test some snippets!

Jamis is (justly) upset and writes:

“Is anyone else annoyed by the “just speak your choice” automation in so many telephone menus? I feel like an idiot mumbling “YES!” or “CHECK BALANCE!” into my phone. Maybe it’s the misanthrope in me coming to the front, but I’d much rather push buttons than talk to a pretend person.”

The mood classifier says 98.1% upset.

Spam is no fun either, or as Ed-Anger notes:

“I’m madder than a rooster in an empty hen house at Internet spammers and I won’t take it anymore. Those creeps clutter up my e-mail with their junk, everything from penis enlargement pills to some lady telling me she’ll give me a million dollars if I’ll help her get her money out of Africa. “Rush me 10 grand quick as possible and we’ll get the whole thing started,” she says.”

The mood classifier says 97.0% upset.

Now over to some happy blogs, amour-amour has a confesion:

“I love my iphone in a way I never thought possible!! When my fiance got his and spent 23 hours gazing at it lovingly, uploading (or is it downloading??) apps and buying accessories for it I put it down to him just being a technology geek.”

The mood classifier says 79.8% happy.

Finally Nitwik Nastik comments a Rickey Gervais:

“This is a hilarious stand-up routine by British Comedian Ricky Gervais on Bible and Creationism. It’s really funny how he ridicules the creationist stories from the book of Genesis (the book of genesis can be found here)and point out to it’s obvious logical blunders. Sometimes it may be difficult to understand his accent and often he will make some funny comments under his breath, so try to listen carefully.”

The mood classifier says 69.7% happy.

The author recommends at least two hundred words (more text than my samples) which seems reasonable!

More memory or smaller memories?

In order to build a classification server that can handle thousands of classifiers and process huge amounts of data we were sure that we eventually would have to do some major optimizations. To avoid doing any work prematurely we waited with all optimizations that actually require design changes or invoke less code readability until we were absolutely sure where to improve.

When we ran our first test in May it was obvious what would be our first bottleneck – the memory consumption of classifiers. It was really bad, raw classifier data that expanded by a factor of about 5 into the primary memory – a tiny classifier of 1Mb would take 5Mb as soon it’s fetched into memory. It was really easy to pinpoint the memory theives.

Couple in crime – STL strings and maps

We were using STL maps to hold frequency distributions for tokens (features). All tokens were mapped to their frequency, map<string, unsigned int> accordingly. This is a very convenient and straightforward way to do it. But the memory overhead is not very attractive.

VS2005 STL string memory overhead

The actual sizes of types vary between platforms and STL implementations (these numbers are from the STL that comes with VS2005 on 32 bit Windows XP).

Each string takes at least 32 bytes
size_type _Mysize = 4 bytes (string size)
size_type _Myres = 4 bytes (reserve size)
_Elem _Buf = 16 bytes (internal buffer for strings shorter than 16 bytes)

_Elem* _Ptr = 4 bytes (pointer to strings that don’t fit in the buffer)
this* = 4 bytes (this pointer)

Best case overhead for STL strings is 16 bytes if the internal buffer is filled exactly. Worst case is for empty or strings longer than 15 bytes which gives the overhead of 32 bytes. Therefore string overhead varies from 16 to 32 bytes.

VS2005 STL map memory overhead

Each entry in a map consists of a STL pair – the key and value (first and second). A pair only has the memory overhead of the this pointer (4 bytes) (and that inherited from the types it’s composed of). However the map is a colored tree and consists of linked nodes. Each pair is stored in a node and nodes have quite heavy memory overhead:

_Genptr _Left = 4 bytes (points to the left subtree)
_Genptr _Parent = 4 bytes (pointer to parent)
_Genptr _Right = 4 bytes (points to the right subtree)
char _Color = 1 byte (the color of the node)
char _Isnil = 1 byte (true if node is head)

this* = 4 bytes (this pointer)

So there is a 18 byte overhead per node and 4 bytes per pair, which sums up to 22 bytes.

Strings in maps

Now inserting a string shorter than 16 bytes into a map<string, unsigned int> will consume 32+22+4=58 bytes. It could even be more if memory alignment kicks in for any of the allocations. In most cases this is perfectly fine and is not even worth considering optimizing. In our case it was not plausible to have a memory overhead factor of 5. Our language classifier takes about 14Mb on disk and should not take much more when loaded into memory – it blew up to about 65Mb. As it consists of 43 languages with probably around 30000 unique words per class (language) it gets really bloated.

One solution

We needed to maintain the search and insertion speed of maps (time complexity O(log n)) but get rid of the overhead. Insertions are needed when classifiers are trained.

Maintaining search speed

Since we already had limited features to the maximum length of 32 bytes we could use that information to create what we call memory lanes. A memory lane only consists of tokens of the same size followed by the frequency. In that manner we created 32 lanes, lane 1 with all tokens of size 1, lane 2 with all tokens of size 2 and so on. Tokens in memory lanes are sorted so we can use binary search.

Memory lane 1 could look like this (tokens of size 1 followed by the frequency)
a0031i0018y0003
…
and memory lane 3 like this
can0011far0004the0019zoo0001

By doing so we get rid of all overhead and maintaining search at O(log n).

Maintaining insertion speed (almost)

Maps allow fast insertions in O(log n) so we kept an intermediate map for each memory lane. When a classifier is trained, new tokens they go into the map and the frequency of those that already exist in the memory lane is increased. When the training session is over the intermediate maps are merged to their respective memory lane. This can be done in O(n) and is the major penalty. Note that explicit sorting is never required since maps are ordered. Another penalty occur when both the map and memory lane are filled with tokens – at this point two lookups can happen (first in the memory lane and if it doesn’t exist a search through the map is required).

This solution reduced memory consumption by a factor of 4-5 at the penalty of having to merge new training data into memory lanes every now and then. This is perfectly fine for us as training often reduce with time (training data get good enough) and classification hence increase.

A similar optimization for Java is described on the LingPipe blog.