The feature extractor update

With the latest version of uClassify we have abandoned the old feature extractor for user trained classifiers. This means that you will get better performance for all new classifier created after 2017-03-26. All already created classifiers won’t be affected.

Background

In the beginning, around 2007, I thought it would be best to allow the users to do preprocessing and use a really simple feature extractor on the server. This feature extractor only separated words by the space character (32 decimal). The idea was that users could preprocess their texts to allow more delimiters (e.g. exchange ! to space on their side). You can also generate your own bigrams by combining them with underscore or something.

Classifiers made by uClassify have used other feature extractors, e.g. the sentiment classifier uses unigrams, bigrams, some stemming and converting to lower case. This improves the performance significantly.

Now, it’s not easy for someone who is new to machine learning and text classification to guess which delimiters to use. Also it can be very counter intuitive. For that reason, I decided to replace the old unigram feature extractor with a new high performance general purpose extractor.

The new feature extractor

The new feature extractor has been found heuristically by running extensive tests over a large set of corpora. The datasets are a part of our internal testing suite that contains over 80 different test sets for a wide range of problems. Therefore I am very confident that the new feature extractor will do really well.

In short here is what it does:

convert text to lower case
separate words on white spaces, exclamation mark, parentheses, period and slash.
generate uni grams
generate bi grams
generate parts of long unigrams

This is similar to what the Sentiment classifier has been using which allows it to differentiate between “I don’t like” and “I like”.

Let me know if you have any feedback, you can reach me at contact AT uclassify DOT com or @jonkagstrom on twitter!