Update: Thursday May 14 (read-only mode)

Exciting times! I’ve decided to push the next update out on Thursday May the 14th (2015). Normally you won’t notice updates but this one is huge.

I’m migrating servers from the old ‘Classic’ Amazon EC2 to their new cloudy thing. This requires a DNS update, which takes time to propagate across the internet before the move is completely done.

Read Only Mode during the transition

Since this also involves a database migration step, I will set uClassify to ‘read-only’ until it’s done. This means that all read calls (classify etc.) should continue to work during the transition, while write calls (creating and training classifiers) won’t go through. You won’t be able to register as a new user during this time either. DNS updates usually take about 48 hours.

What will be new

First, I’ve done extensive testing to make sure the API will behave exactly the same. If I haven’t missed anything, your app will continue to work without any changes.

The major ‘visible’ changes are:

– A new responsive Bootstrap UI (the vanilla theme; somehow cosmetics always end up last on my priority list 🙂)

– To make the site more secure, it will be served entirely over SSL (don’t worry, all the API links without https:// will still work).

– It will be possible to sign in via Twitter, Facebook and Google.

– You can train classifiers by uploading files.

This is the first of a few major updates for uClassify. It doesn’t introduce much new cool fancy stuff, but it’s a very important update that paves the way for the things I actually want to add, such as a JSON API.

If you have any questions please don’t hesitate to contact me. (contact AT uclassify DOT com)

Sentiment analysis with keyword extraction

Lately we have been getting a lot of requests to our sentiment classifier, many from social media analytics companies. In fact, our sentiment analysis is now the most popular classifier at uClassify!

I just wanted to share something that could be useful for you. By using our latest API call, ‘classifyKeywords’, you can see which keywords are the strongest triggers for the positive and negative classes. This could reveal additional valuable information for your clients.

For example, if you use the keyword analysis on a long product review, you could use the keywords to extract the sentences where the product is mentioned in a positive or negative way. Why not highlight it in green or red? Highlighting sentences will give a very good overview for human reviewers.
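
As a rough illustration, here is a minimal Python sketch of that idea. It assumes you have already pulled the keyword lists for the positive and negative classes out of a classifyKeywords response; the function name, markup and example keywords below are placeholders of my own, not part of the API:

import re

def highlight_sentences(text, positive_keywords, negative_keywords):
    # Wrap sentences containing class-triggering keywords in colored spans.
    out = []
    for sentence in re.split(r'(?<=[.!?])\s+', text):
        words = set(w.lower() for w in re.findall(r'\w+', sentence))
        if words & set(positive_keywords):
            out.append('<span style="background: lightgreen">%s</span>' % sentence)
        elif words & set(negative_keywords):
            out.append('<span style="background: salmon">%s</span>' % sentence)
        else:
            out.append(sentence)
    return ' '.join(out)

print(highlight_sentences(
    'The camera is excellent. The battery died after a day.',
    positive_keywords=['excellent', 'great'],
    negative_keywords=['died', 'broken']))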

Here is how an XML request looks (just swap ‘classify’ for ‘classifyKeywords’):

<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/RequestSchema" version="1.01">
  <texts>
    <textBase64 id="tweet1">bm93IHNvbWV0aW1lcyBpIHdvbmRlciB3aGF0</textBase64>
  </texts>
  <readCalls readApiKey="YOUR_READ_API_KEY_HERE">
    <classifyKeywords id="ClassifyKeywords" username="uClassify" classifierName="Sentiment" textId="tweet1"/>
  </readCalls>
</uclassify>

You can find more info about ‘classifyKeywords’ here.

The sentiment classifier is described in more detail here.


Keywords API

With the keywords API you can extract relevant, discriminating words from texts, which opens up a lot of possibilities for developers. Keywords can be used for tag clouds and to answer questions such as why a text was classified into a particular class. Compared to ordinary tag clouds they bring an extra angle, as they are not the overall keywords of the text but only those for a certain genre. For example, you can find out which parts of a text make it read as male using the gender classifier, while at the same time running it through the mood classifier to find keywords that indicate happy parts.

After testing the keywords API for a while, I’ve now made it public in the XML API. It works exactly like the classify API, but you also get back a list of keywords for each class.

In short this is how a call can look:

<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/RequestSchema" version="1.01">
  <texts>
    <textBase64 id="UnknownText1">bm93IHNvbWV0aW1lcyBpIHdvbmRlciB3aGF0</textBase64>
  </texts>
  <readCalls readApiKey="YOUR_READ_API_KEY_HERE">
    <classifyKeywords id="ClassifyKeywords" classifierName="MySpamClassifier" textId="UnknownText1"/>
  </readCalls>
</uclassify>

Example response:

<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">
  <status success="true" statusCode="2000"/>
  <readCalls>
    <classifyKeywords id="ClassifyKeywords">
      <classification textCoverage="0.96">
        <class className="Legitimate" p="0.12"/>
        <class className="Spam" p="0.88"/>
      </classification>
      <keywords>
        <class className="Legitimate">uclassify jon computer urlai</class>
        <class className="Spam">viagra cheap pills</class>
      </keywords>
    </classifyKeywords>
  </readCalls>
</uclassify>

More info is available in the XML API documentation.
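
To give a feel for how this can be consumed from code, here is a small Python sketch that posts the request above and pulls the per-class keywords out of the response. Note that posting the document straight to http://api.uclassify.com is my assumption based on these examples; check the XML API documentation for the authoritative endpoint:

import urllib.request
import xml.etree.ElementTree as ET

NS = '{http://api.uclassify.com/1/ResponseSchema}'

request_xml = b"""<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/RequestSchema" version="1.01">
  <texts>
    <textBase64 id="UnknownText1">bm93IHNvbWV0aW1lcyBpIHdvbmRlciB3aGF0</textBase64>
  </texts>
  <readCalls readApiKey="YOUR_READ_API_KEY_HERE">
    <classifyKeywords id="ClassifyKeywords" classifierName="MySpamClassifier" textId="UnknownText1"/>
  </readCalls>
</uclassify>"""

# Assumption: the XML document is sent as a plain HTTP POST body.
response = urllib.request.urlopen('http://api.uclassify.com', request_xml)
root = ET.fromstring(response.read())

# <class> elements under <classification> carry a 'p' attribute;
# those under <keywords> carry the keyword list as text instead.
for cls in root.iter(NS + 'class'):
    if 'p' not in cls.attrib:
        print(cls.get('className'), '->', (cls.text or '').split())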

Happy New Year!

Classifier Visualization

I’m currently working on a new keywords API for uClassify. It will allow users to get information about which words are good discriminators for certain classes. To test the API, I spent last weekend building a visualization application for urlai.com.

Here is a screenshot of how the visualization prototype shows data:


I would very much like to get some feedback on this. You can try it here; please comment below.


API change: Moved textCoverage into ApiVersion 1.01

The last release of the API introduced a new feature called textCoverage. That release was a bit premature: textCoverage was supposed to go into API version 1.01 so as not to break any of our users’ response parsers.

If you have not changed anything in your parser during the last couple of days, this should not affect you. If anyone was quick enough to start using textCoverage under version ‘1.00’, this change means it will disappear from the responses, and you will need to bump the version to 1.01. I am really sorry about that.

Bumping the XML version

For XML, just change the version number from ‘1.00’ to ‘1.01’: <uclassify xmlns="http://api.uclassify.com/1/RequestSchema" version="1.01">

Bumping the URL API version

Here you need to add a new parameter, ‘version’, and set it to the version you want: http://uclassify.com/browse/uClassify/Text Language/ClassifyUrl?readkey=YOUR_READ_API_KEY_HERE&url=http%3a%2f%2fblog.uclassify.com&version=1.01

You can read more about version handling here.

I am really sorry for any disturbance this may have caused. Let me know if you need any support with this.

uClassify Corpus Tool BETA

With the uClassify Corpus Tool you can build and test classifiers locally without any programming involved. It’s included in the distribution of the uClassify server; you can download the server evaluation version here.

Classifier representation

This tool is really simple to use. To represent a classifier, simply create a directory on your hard drive with the name of the classifier. Then create subdirectories for each class belonging to the classifier.

For example:
c:\corpus\sentiment (classifier ‘sentiment’ directory)
c:\corpus\sentiment\positive (class ‘positive’ belonging to classifier ‘sentiment’)
c:\corpus\sentiment\negative (class ‘negative’ belonging to classifier ‘sentiment’)

Now you just fill each class directory with documents belonging to that class. For example, put positive Amazon reviews in the ‘c:\corpus\sentiment\positive’ folder and negative reviews in the ‘c:\corpus\sentiment\negative’ folder.
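
If you are scripting the corpus preparation, the layout is trivial to generate. Here is a minimal Python sketch; the documents are placeholders for whatever reviews you actually have:

import os

# Classifier 'sentiment' with classes 'positive' and 'negative',
# mirroring the c:\corpus\sentiment layout described above.
corpus = {
    'positive': ['Great product, works perfectly.'],
    'negative': ['Broke after two days, very disappointed.'],
}

for class_name, documents in corpus.items():
    class_dir = os.path.join(r'c:\corpus\sentiment', class_name)
    os.makedirs(class_dir, exist_ok=True)
    for i, text in enumerate(documents):
        path = os.path.join(class_dir, 'doc%05d.txt' % i)
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)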

Testing a classifier

To test this classifier you run the uClassify Corpus Tool:
uclassifytool.exe -test c:\corpus\sentiment

This will output some basic performance metrics such as accuracy, macro precision, macro recall and the F1 measure between precision and recall. Some per-class statistics are also shown.

In order to calculate custom metrics on a classifier you can export a confusion matrix with the ‘-outcm’ flag. This will allow you to calculate many other measurements on the classifier. You can also output per-class (one vs all) statistics with the ‘-outpc’ flag.
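
As an illustration of the kind of custom metrics you can derive, here is a small Python sketch that computes accuracy, macro precision, macro recall and F1 from a confusion matrix. The numbers are made up, and the rows-are-actual/columns-are-predicted orientation is my assumption rather than the documented ‘-outcm’ format:

classes = ['positive', 'negative']
cm = [[80, 20],   # actual positive: 80 predicted positive, 20 negative
      [10, 90]]   # actual negative: 10 predicted positive, 90 negative

n = len(classes)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

precisions, recalls = [], []
for i in range(n):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(n))  # column sum
    actual = sum(cm[i])                          # row sum
    precisions.append(tp / predicted if predicted else 0.0)
    recalls.append(tp / actual if actual else 0.0)

macro_p = sum(precisions) / n
macro_r = sum(recalls) / n
f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print('accuracy=%.3f macro precision=%.3f macro recall=%.3f F1=%.3f'
      % (accuracy, macro_p, macro_r, f1))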

Building a classifier

To build a classifier:
uclassifytool.exe -build c:\corpus\sentiment

This will create a binary file that the uClassify server can read. It’s basically a frequency distribution with some additional information. The resulting file will be called ‘sentiment.dat’ and placed in the root directory of the classifier (in this case ‘c:\corpus\sentiment\sentiment.dat’).

Now you can just copy this file to your local uClassify server classifier directory.

Public data sets

There are hundreds of public data sets that you can test classifiers on. Just download one, unzip it and put its documents in a directory structure that the uClassify Corpus Tool understands.

Future

For now the .dat files can only be used by your own local uClassify server. However, we are looking into ways to make it possible to upload .dat classifier files to uclassify.com so they can be used via the web API.

uClassify Tool Screenshot

Added text coverage score to classification responses

When you classify texts you get back class probabilities. Sometimes it’s hard to know what those are based on, so I’ve added a new score called ‘text coverage’.

Text coverage is the proportion of the words in the text being classified that are found in the training data. This helps users determine how trustworthy the probabilities are. For example, suppose you send a text with 10000 words to the language classifier and get back a high English probability but a low text coverage (say 0.01). This means that only 100 of the 10000 words were recognized by the language classifier. A reasonable cause could be that the text is written in an unknown language but contains some English words (quotations, borrowed words etc). It’s up to the user to decide how to handle this. Sometimes low text coverage scores are fine; it’s highly dependent on the domain.
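
In client code the check is a one-liner. A minimal sketch of how you might gate on the score; the 0.3 threshold is an arbitrary example, not a recommendation:

def trustworthy(text_coverage, min_coverage=0.3):
    # Only act on a classification when enough of the text was recognized.
    return text_coverage >= min_coverage

# 10000-word text, high English probability, but only 1% of words recognized.
print(trustworthy(text_coverage=0.01))  # False: treat the result with care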

The text coverage can be found as an attribute in the <classification> tag and is called ‘textCoverage’.

Let me know if you have any questions about this.

Feedback, anyone?

One of the most popular published classifiers is Language detection, which classified more than 800000 texts last week. However, this is just the tip of the iceberg, as most classifiers are unpublished (about 500 classifiers). Not all classifiers are active, of course, but a good number are, which brings me to my question: how is it working for you? I’m not getting much feedback or many support requests – I am not sure if this is good or bad.

If you read this and are using uClassify for a project, feel more than free to contact me with any positive or negative feedback on any aspect (classifier performance, documentation, response times or anything unclear). You can leave a comment or e-mail me at this address: contact at uclassify dot com. <– Are spambots able to read this nowadays?

Over and out!

Using published classifiers

We’ve just made it possible for everyone with a (free) uClassify account to access public classifiers.

Once a classifier is published, everyone can use it via the GUI or the web API, and in return authors get a link to their website from everyone who uses their classifiers. This should hopefully inspire more people to share their cool classifiers!

As an example of a published classifier check out the mood classifier by prfekt.se. Here is the list of all published classifiers.

More memory or smaller memories?

In order to build a classification server that can handle thousands of classifiers and process huge amounts of data, we knew we would eventually have to do some major optimizations. To avoid doing any work prematurely, we postponed all optimizations that would require design changes or reduce code readability until we were absolutely sure where to improve.

When we ran our first test in May it was obvious what our first bottleneck would be – the memory consumption of classifiers. It was really bad: raw classifier data expanded by a factor of about 5 when loaded into primary memory – a tiny classifier of 1Mb would take 5Mb as soon as it was fetched into memory. It was really easy to pinpoint the memory thieves.

Partners in crime – STL strings and maps

We were using STL maps to hold frequency distributions for tokens (features). Each token was mapped to its frequency, i.e. map<string, unsigned int>. This is a very convenient and straightforward way to do it, but the memory overhead is not very attractive.

VS2005 STL string memory overhead

The actual sizes of types vary between platforms and STL implementations (these numbers are from the STL that comes with VS2005 on 32 bit Windows XP).

Each string takes at least 32 bytes:
size_type _Mysize = 4 bytes (string size)
size_type _Myres = 4 bytes (reserve size)
_Elem _Buf = 16 bytes (internal buffer for strings shorter than 16 bytes)
_Elem* _Ptr = 4 bytes (pointer to strings that don’t fit in the buffer)
this* = 4 bytes (this pointer)

The best-case overhead for an STL string is 16 bytes, when the internal buffer is filled exactly. The worst case is for empty strings or strings longer than 15 bytes, which gives an overhead of 32 bytes. So the string overhead varies from 16 to 32 bytes.

VS2005 STL map memory overhead

Each entry in a map consists of an STL pair – the key and the value (first and second). A pair only has the memory overhead of the this pointer (4 bytes), plus whatever is inherited from the types it’s composed of. However, the map is a colored (red-black) tree consisting of linked nodes. Each pair is stored in a node, and nodes have quite a heavy memory overhead:

_Genptr _Left = 4 bytes (pointer to the left subtree)
_Genptr _Parent = 4 bytes (pointer to the parent)
_Genptr _Right = 4 bytes (pointer to the right subtree)
char _Color = 1 byte (the color of the node)
char _Isnil = 1 byte (true if the node is the head)
this* = 4 bytes (this pointer)

So there is an 18-byte overhead per node and 4 bytes per pair, which sums to 22 bytes.

Strings in maps

Now, inserting a string shorter than 16 bytes into a map<string, unsigned int> will consume 32+22+4=58 bytes. It could be even more if memory alignment kicks in for any of the allocations. In most cases this is perfectly fine and not even worth optimizing. In our case, however, a memory overhead factor of 5 was not acceptable. Our language classifier takes about 14Mb on disk and should not take much more when loaded into memory – yet it blew up to about 65Mb. As it consists of 43 languages with probably around 30000 unique words per class (language), it gets really bloated.

One solution

We needed to maintain the search and insertion speed of maps (time complexity O(log n)) but get rid of the overhead. Insertions are needed when classifiers are trained.

Maintaining search speed

Since we had already limited features to a maximum length of 32 bytes, we could use that information to create what we call memory lanes. A memory lane consists only of tokens of the same size, each followed by its frequency. In that manner we created 32 lanes: lane 1 with all tokens of size 1, lane 2 with all tokens of size 2 and so on. Tokens in memory lanes are sorted so we can use binary search.

Memory lane 1 could look like this (tokens of size 1, each followed by its frequency):
a0031i0018y0003

and memory lane 3 like this:
can0011far0004the0019zoo0001

By doing so we get rid of all the overhead while maintaining O(log n) search.
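
Our implementation is in C++, but the layout is easy to sketch in Python. Each lane is one contiguous string of sorted fixed-width records (a token plus, as in the examples above, a 4-digit frequency), so a record’s offset follows from its index and an ordinary binary search works:

def lane_lookup(lane, token, freq_digits=4):
    # Binary-search one memory lane; return the token's frequency or None.
    record = len(token) + freq_digits
    lo, hi = 0, len(lane) // record
    while lo < hi:
        mid = (lo + hi) // 2
        start = mid * record
        key = lane[start:start + len(token)]
        if key == token:
            return int(lane[start + len(token):start + record])
        elif key < token:
            lo = mid + 1
        else:
            hi = mid
    return None

lane3 = 'can0011far0004the0019zoo0001'  # memory lane 3 from above
print(lane_lookup(lane3, 'the'))  # 19
print(lane_lookup(lane3, 'cat'))  # None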

Maintaining insertion speed (almost)

Maps allow fast insertions in O(log n), so we kept an intermediate map for each memory lane. When a classifier is trained, new tokens go into the map, while the frequencies of tokens that already exist in the memory lane are increased in place. When the training session is over, the intermediate maps are merged into their respective memory lanes. This can be done in O(n) and is the major penalty. Note that explicit sorting is never required, since maps are ordered. Another penalty occurs when both the map and the memory lane contain tokens – at that point two lookups can happen (first in the memory lane, and if the token doesn’t exist there, a search through the map is required).
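
Here is a Python sketch of that merge step, with a dict standing in for the intermediate STL map. In C++ the map is already ordered, so the sort below is only needed because Python dicts aren’t:

import heapq

def merge_lane(lane, pending, token_size, freq_digits=4):
    # Decode the existing (already sorted) fixed-width records.
    record = token_size + freq_digits
    old = [(lane[i:i + token_size], int(lane[i + token_size:i + record]))
           for i in range(0, len(lane), record)]
    # O(n) merge of two sorted sequences; no explicit sorting of the lane.
    merged = heapq.merge(old, sorted(pending.items()))
    return ''.join('%s%0*d' % (tok, freq_digits, f) for tok, f in merged)

lane1 = 'a0031i0018y0003'                      # memory lane 1 from above
print(merge_lane(lane1, {'o': 2, 'u': 1}, 1))  # a0031i0018o0002u0001y0003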

This solution reduced memory consumption by a factor of 4-5, at the penalty of having to merge new training data into the memory lanes every now and then. This is perfectly fine for us, as training often decreases over time (the training data gets good enough) while classification increases.

A similar optimization for Java is described on the LingPipe blog.