Improved classifier accuracy

I am very happy to announce a performance update that gives classifiers better accuracy than before.

When I was building a new topic classifier based on the IAB taxonomy I noticed some weird behaviour for classes with much less training data than the others. As I started to investigate this I came to understand how classification could be improved overall, not only for classes with little training data. After weeks of testing different implementations I found a few improvements that gave significantly better results on the test datasets.

In short, classifiers are now much more robust and less sensitive to imbalanced data.

This update doesn’t affect any API endpoints; it will only give you better probabilities.

I might write a short post on the technicalities of this update.

JSON REST API

Since uClassify was launched back in 2008 we have seen many technological changes. Last year I modernised the site to use Bootstrap as a foundation. Now it’s time to take the API to a more modern format.

Initially the uClassify API only had an XML endpoint. Over the years, however, JSON has become more common and I have been getting more and more requests for REST endpoints with JSON format. The graph below shows Google Trends for ‘json api’ (red) vs ‘xml api’ (blue).

XML API VS JSON API

Today I launched a beta of the JSON REST API. Changes may still occur, but it will hopefully be finalised during March 2016.

You can find the documentation here. Please feel free to leave feedback.

The old XML and URL API endpoints will of course continue to work as before.
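To give a feel for what a JSON call could look like, here is a minimal Python sketch that builds (but does not send) a classify request. The URL pattern, the "Token" authorization header and the shape of the JSON body are assumptions made for illustration and are not stated in this post, and since the beta may still change, please check the documentation before relying on any of them.

```python
import json
import urllib.request

# Assumption: the base URL, path layout, "Token" scheme and {"texts": [...]}
# body are illustrative guesses, not details confirmed by this post.
BASE_URL = "https://api.uclassify.com/v1"


def build_classify_request(username, classifier, texts, read_key):
    """Build (but do not send) a POST request carrying the texts as JSON."""
    url = f"{BASE_URL}/{username}/{classifier}/classify"
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {read_key}",
        },
    )


request = build_classify_request(
    "uClassify",
    "Sentiment",
    ["I love the new Mad Max Fury road"],
    "YOUR_READ_API_KEY_HERE",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; the sketch stops short of that so it can be run without an API key.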

Update with limit changes

The last major update has been running very smoothly; this is the first patch since!

Max request size limit increased

After feedback from the community I’ve increased the maximum allowed request size from 1MB to 3MB. I will monitor the servers and make sure this works fine. Maybe it’s possible to increase it further.

Max query string length increase

After the last update, when I updated the IIS server, the default max query string length was lower than before. Thanks to Liz, who noticed this. I’ve now set the max size to 65kb.

Max free calls per day decreased

When I looked at the call statistics it didn’t make much sense to offer 5000 free calls per day. Most people aren’t even close to this. By lowering it to 1000 calls per day only a few will be affected, and most will not notice anything. This is also motivated by looking at competitors’ free limits; 1000 calls per day is still very generous. Let me know if you have any questions about this.

Bugs

Besides fixing some typos (thanks to everyone who reported them), I’ve made it so you can’t publish untrained classifiers and fixed the front page buttons so they work better on small displays. I’ll also unpublish previously published classifiers that are untrained.

Future

I am extremely happy with the performance of the new Sentiment classifier. It uses a new version of the classifier that looks at combinations of words, among other things. Tests show that this type of classifier improves performance on all tested data sets. I am therefore trying to figure out how to use it for all new classifiers, but that does require some work.

Let me know if you have any questions.

@jonkagstrom

Sentiment Analysis API

A sentiment analyzer tells you whether a text is positive or negative. For example “I love the new Mad Max Fury road” (positive) or “i am not impressed by the bike” (negative). The Sentiment classifier hosted by uClassify is very popular, so I decided to spend some time on improving it.

The goal was to improve the classification accuracy, especially for short texts such as Twitter messages, Facebook statuses or other snippets, while maintaining high quality results on texts with more information.

The old Sentiment classifier was built from 40k Amazon product reviews. The straightforward way to improve a classifier is to add more data. Thanks to the Internet we were able to find multiple data sources we could train our classifier on. In fact it’s now trained on 2.8 million documents!

The results are good, very good: the accuracy on large documents (reviews) went from about 75% to 83%. Tweets went from 63% to about 77%.

You can play with it here; there is also an API available (free to use).

Datasets used are from sentiment-140 (Twitter), Amazon product reviews and Rotten Tomatoes.

Image by Anna Gathu

New 64-bit local server

As a part of the uClassify upgrade I’ve recompiled the local server for 64-bit. This was necessary since I’m working on a huge classifier for sentiment and needed the corpus tool to be able to handle more than what 32-bit pointers could hold.

If you are running a local uClassify server, you can download the 64-bit (and 32-bit) here. The 64-bit server is already used in production for uClassify and should be pretty well tested by now.

You can read more about the local uClassify server here.

uClassify is updating

The old uClassify site has been set to read-only and the database & classifier migration has been done. Now we are just waiting for the DNS to propagate over the nets before the new site can be taken into use. This time it’s on an elastic IP, so hopefully we won’t have to do any more of those ‘waiting’ operations in the future.

Hopefully it will have fully propagated within 24h.

Let me know if you have any trouble with your account.

Next update and future plans

Lately uClassify has gotten a lot of attention and the user community has grown at a faster rate. We are getting more requests and inquiries from our customers, and it feels like machine learning is something that many people know of, not only the tech savvy geeks.

What’s in the next update

For a couple of months now I’ve been reworking both the front end and the back end to make it easier to go forward. It feels like a necessity to get some of the tech up to date. After all, the tech is at least six years old.

The site will be replaced with a modern Bootstrap powered front end, making it much more responsive and easy to maintain. There will also be a few new features in the first release, e.g. the ability to upload files to train classifiers. It will also be possible to log in via Google, Twitter and Facebook.

The backend is also being reworked to make it easier to work with. Those changes will be completely invisible to users and all public APIs will remain the same.

I hope to have all of this done before the end of June 2015.

Future plans

– Once all the new code is in place and the service is up and running I intend to add a complete JSON API for the service. For example, right now you need to use XML for batching calls.

– Open source C# and Java libs for calling the API.

– Add more and better classifiers. Today it’s easier to find good training data for classifiers.

– Classifier performance. I have a few ideas for how to improve the accuracy of the classifier further.

Possible to upgrade your account

uClassify has been around since October 2008, and to date has almost 15000 registered users and nearly 2000 classifiers. All this time the web API has been completely free with no restrictions whatsoever (number of calls, classifiers, size of classifiers etc). But lately we have gotten a lot more traffic over our web API. This is of course a lot of fun, but it also adds more server cost. Therefore I want to try to introduce payment options for those who can pay but keep it free for the majority of users.

Free, Indie, Professional and Enterprise Accounts

What I want to do is to introduce a pricing model that doesn’t affect the majority of users and hopefully only affects those who can afford to pay.

After analyzing the logs I’ve found that only a few percent of the users make more than 1000 calls per day. Therefore I decided to introduce a limit of max 5000 calls per day for free accounts. Keeping in mind that I want it to be affordable for everyone, I introduced both ‘Indie’ and ‘Professional’ accounts, both with a cap of 100.000 calls per day. The Indie account is for smaller companies (<100.000€ yearly revenue). The pricing for an Indie account is initially set to 9€/month and for Professional 99€/month.

On top of that there is an option to upgrade to 1.000.000 calls / day for a price of 299€/month for high end users. Also I will offer a free Academic account with 1.000.000 calls / day cap for researchers.

To sum up
Free 5000 calls/day
Indie* 9€/month -> 100.000 calls/day
Professional 99€/month -> 100.000 calls/day
Enterprise 299€/month -> 1.000.000 calls/day
Academic 1.000.000 calls/day

*Indie = for small companies and private persons with yearly revenue < 100.000€.

Subscription will be done via PayPal.
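The tiers summarised above can be read as a simple lookup from daily call volume to the cheapest plan that covers it. The Python sketch below is only an illustration of that reading: the function and constant names are made up for this example, and the Academic account is left out because it is granted to researchers rather than bought.

```python
# Monthly price in euros and daily call cap, copied from the summary above.
# Academic is omitted: it is a free account for researchers, not a paid tier.
PLANS = [
    ("Free", 0, 5_000),
    ("Indie", 9, 100_000),
    ("Professional", 99, 100_000),
    ("Enterprise", 299, 1_000_000),
]

# Indie is only for yearly revenue below this amount (in euros).
INDIE_MAX_REVENUE_EUR = 100_000


def cheapest_plan(calls_per_day, yearly_revenue_eur):
    """Return the cheapest plan name covering the call volume, or None."""
    for name, _price, cap in sorted(PLANS, key=lambda plan: plan[1]):
        if name == "Indie" and yearly_revenue_eur >= INDIE_MAX_REVENUE_EUR:
            continue  # not eligible for the Indie price
        if calls_per_day <= cap:
            return name
    return None  # more than 1.000.000 calls/day is not covered by any plan


print(cheapest_plan(3_000, 0))
print(cheapest_plan(50_000, 20_000))
print(cheapest_plan(50_000, 500_000))
```

With these numbers, a small company making 50.000 calls per day lands on Indie, while a larger company with the same volume needs Professional.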

What will happen to existing accounts?

If you already have a uClassify account it will be upgraded to an Enterprise account with X months expiry time. All of those who are likely to be affected will be emailed with a heads up. But most of you won’t notice this change.

The system will likely be implemented within the next few weeks.

I am open to suggestions, if you have feedback or think this sucks please let me know! (contact AT uclassify DOT com)

Sentiment analysis with keyword extraction

Lately we have been getting a lot of requests to our sentiment classifier, many of them from social media analysis companies. In fact our sentiment analysis is now the most popular classifier at uClassify!

I just wanted to share something that could be useful to you. By using our latest API call, ‘classifyKeywords’, you can see which keywords are the strongest triggers for the positive and negative classes. This could reveal additional valuable information for your clients.

For example, if you use the keyword analysis on a long product review, you could use the keywords to extract the sentences where the product is mentioned in a positive or negative way. Why not highlight it in green or red? Highlighting sentences will give a very good overview for human reviewers.
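The highlighting idea can be sketched in a few lines of Python. This is a rough illustration and not part of the API: the function name, the naive sentence splitting and the sample keyword lists are all assumptions, with the keyword lists shaped like the per-class output of ‘classifyKeywords’.

```python
import re


def label_sentences(text, keywords_by_class):
    """Pair each sentence with the classes whose keywords occur in it."""
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    labelled = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        classes = sorted(
            name
            for name, keywords in keywords_by_class.items()
            if words & {keyword.lower() for keyword in keywords}
        )
        labelled.append((sentence, classes))
    return labelled


# Hypothetical keywords, standing in for what classifyKeywords returns.
keywords = {"positive": ["love", "great"], "negative": ["broke", "slow"]}
review = "I love the screen. The hinge broke after a week. It ships in a box."
result = label_sentences(review, keywords)

for sentence, classes in result:
    print(classes or ["neutral"], sentence)
```

A front end could then colour sentences labelled positive in green and those labelled negative in red, leaving the rest untouched.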

Here is what an XML request looks like (just swap ‘classify’ for ‘classifyKeywords’):

<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/RequestSchema" version="1.01">
  <texts>
    <textBase64 id="tweet1">bm93IHNvbWV0aW1lcyBpIHdvbmRlciB3aGF0</textBase64>
  </texts>
  <readCalls readApiKey="YOUR_READ_API_KEY_HERE">
    <classifyKeywords id="ClassifyKeywords" username="uClassify" classifierName="Sentiment" textId="tweet1"/>
  </readCalls>
</uclassify>
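If you build the request from code, the text has to be base64 encoded before it goes into the textBase64 element. Below is a minimal Python sketch that produces a request like the one above using only the standard library. The function name and its parameters are made up for this example, and sending the document to the API is left out.

```python
import base64
import xml.etree.ElementTree as ET

REQUEST_NS = "http://api.uclassify.com/1/RequestSchema"


def build_keywords_request(text, text_id, read_key, username, classifier):
    """Return a classifyKeywords request as UTF-8 encoded XML bytes."""
    # Serialise the schema namespace as the default xmlns, as in the sample.
    ET.register_namespace("", REQUEST_NS)
    root = ET.Element(f"{{{REQUEST_NS}}}uclassify", {"version": "1.01"})
    texts = ET.SubElement(root, f"{{{REQUEST_NS}}}texts")
    text_node = ET.SubElement(texts, f"{{{REQUEST_NS}}}textBase64", {"id": text_id})
    # The text itself travels base64 encoded.
    text_node.text = base64.b64encode(text.encode("utf-8")).decode("ascii")
    calls = ET.SubElement(root, f"{{{REQUEST_NS}}}readCalls", {"readApiKey": read_key})
    ET.SubElement(
        calls,
        f"{{{REQUEST_NS}}}classifyKeywords",
        {
            "id": "ClassifyKeywords",
            "username": username,
            "classifierName": classifier,
            "textId": text_id,
        },
    )
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


payload = build_keywords_request(
    "now sometimes i wonder what",
    "tweet1",
    "YOUR_READ_API_KEY_HERE",
    "uClassify",
    "Sentiment",
)
print(payload.decode("utf-8"))
```

The sample text here is the decoded form of the base64 string used in the request above.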

You can find more info about ‘classifyKeywords’ here.

The sentiment classifier is described in more detail here.

Keywords API

With the keywords API you can extract relevant, discriminating words from texts, which opens up a lot of possibilities for developers. Keywords can be used for tag clouds and to answer questions such as why a text is classified into a class. Compared to ordinary tag clouds they bring an extra angle, as they are not the overall keywords but only those for a certain genre. For example, you can find out which parts of a text make it manly using the gender classifier, while at the same time running it through the mood classifier to find the keywords that indicate happy parts.

After having tested the keyword API for a while I’ve just made it public in the XML API. It works exactly like the classify API, but you will get back a list of keywords for each class as well.

In short this is how a call can look:

<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/RequestSchema" version="1.01">
  <texts>
    <textBase64 id="UnknownText1">bm93IHNvbWV0aW1lcyBpIHdvbmRlciB3aGF0</textBase64>
  </texts>
  <readCalls readApiKey="YOUR_READ_API_KEY_HERE">
    <classifyKeywords id="ClassifyKeywords" classifierName="MySpamClassifier" textId="UnknownText1"/>
  </readCalls>
</uclassify>

Example response:

<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">
  <status success="true" statusCode="2000"/>
  <readCalls>
    <classifyKeywords id="ClassifyKeywords">
      <classification textCoverage="0.96">
        <class className="Legitimate" p="0.12"/>
        <class className="Spam" p="0.88"/>
      </classification>
      <keywords>
        <class className="Legitimate">uclassify jon computer urlai</class>
        <class className="Spam">viagra cheap pills</class>
      </keywords>
    </classifyKeywords>
  </readCalls>
</uclassify>
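A response like the one above can be picked apart with any XML parser. As a minimal sketch, here is one way to do it in Python with the standard library; the function name is made up for this example and the sample document is the response shown above.

```python
import xml.etree.ElementTree as ET

NS = {"u": "http://api.uclassify.com/1/ResponseSchema"}

SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="utf-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">
  <status success="true" statusCode="2000"/>
  <readCalls>
    <classifyKeywords id="ClassifyKeywords">
      <classification textCoverage="0.96">
        <class className="Legitimate" p="0.12"/>
        <class className="Spam" p="0.88"/>
      </classification>
      <keywords>
        <class className="Legitimate">uclassify jon computer urlai</class>
        <class className="Spam">viagra cheap pills</class>
      </keywords>
    </classifyKeywords>
  </readCalls>
</uclassify>"""


def parse_keywords_response(xml_bytes):
    """Return ({className: probability}, {className: [keywords]})."""
    root = ET.fromstring(xml_bytes)
    call = root.find("u:readCalls/u:classifyKeywords", NS)
    probabilities = {
        node.get("className"): float(node.get("p"))
        for node in call.findall("u:classification/u:class", NS)
    }
    # Keywords arrive as one space separated string per class.
    keywords = {
        node.get("className"): (node.text or "").split()
        for node in call.findall("u:keywords/u:class", NS)
    }
    return probabilities, keywords


probabilities, keywords = parse_keywords_response(SAMPLE_RESPONSE)
print(probabilities)
print(keywords)
```

Note that the sketch does not check the status element; real code would want to look at success and statusCode before trusting the rest of the document.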

More info is available in the XML API documentation.

Happy New Year!