With the latest version of uClassify we have abandoned the old feature extractor for user trained classifiers. This means that you will get better performance for all new classifier created after 2017-03-26. All already created classifiers won’t be affected.
In the beginning, around 2007, I thought it would be best to allow the users to do preprocessing and use a really simple feature extractor on the server. This feature extractor only separated words by the space character (32 decimal). The idea was that users could preprocess their texts to allow more delimiters (e.g. exchange ! to space on their side). You can also generate your own bigrams by combining them with underscore or something.
Classifiers made by uClassify have used other feature extractors, e.g. the sentiment classifier uses unigrams, bigrams, some stemming and converting to lower case. This improves the performance significantly.
Now, it’s not easy for someone who is new to machine learning and text classification to guess which delimiters to use. Also it can be very counter intuitive. For that reason, I decided to replace the old unigram feature extractor with a new high performance general purpose extractor.
The new feature extractor
The new feature extractor has been found heuristically by running extensive tests over a large set of corpora. The datasets are a part of our internal testing suite that contains over 80 different test sets for a wide range of problems. Therefore I am very confident that the new feature extractor will do really well.
In short here is what it does:
convert text to lower case
separate words on white spaces, exclamation mark, parentheses, period and slash.
generate uni grams
generate bi grams
generate parts of long unigrams
This is similar to what the Sentiment classifier has been using which allows it to differentiate between “I don’t like” and “I like”.
Let me know if you have any feedback, you can reach me at contact AT uclassify DOT com or @jonkagstrom on twitter!
Did you know there’s a way to classify texts without having to leave Excel? We have paired up with SeoTools for Excel, a Swiss army knife Excel-plugin, which offers a tailored “Connector” for all uClassify users.
In this blog post, we will show how SeoTools allows you to classify lists of texts or URLs with the classifiers of your choice, and having the results ready for analysis in a matter of seconds.
Don’t be worried if your Excel spreadsheet doesn’t look as the example above. The extra ribbon tab “SeoTools” is added when SeoTools for Excel is installed. At the end of this post you find all the links necessary to setup your uClassify account.
Selecting a classifier
The uClassify Connector is, as the name suggests, connected to uClassify library. Clicking on “Select” opens a window of all available classifiers. It is also possible to choose input type (Text or URL) and if the results include classification and probability.
When you are satisfied with your settings, click “Insert”, and SeoTools will generate the data in columns A and onwards.
Save time and automate the process
Exporting and filtering Excel data from web based platforms takes time, especially if it’s required on a daily or weekly basis. The filtering part of standardized files is also associated with human error. SeoTools solves this with saving and loading of “Configurations”:
Next time, just load a previous configuration and you will get classifications based on the same settings as last time.
Use Formula Mode to supercharge your classification
The beauty of combining uClassify with Excel is the ability to create large numbers of requests automatically. Instead of populating cells with values, select “Formula” before Inserting the results:
Next, you can change the formula to reference a cell and the uClassify Connector will generate results based on the value or text in that cell.
In the following example, company A has been mentioned 100 times on Twitter in the last week and we want to determine the text Language and Sentiment for these tweets.
First, select the Text Language Classifier and enter a random character in the Input field (we will change this in the formula to reference the tweets). Also, don’t forget select “Exclude headers in result” since we only want the values for each row.
When the formula has been inserted in cell C2, change the input “y” to B2, and SeoTools will return the language with the highest probability. Repeat the same steps for the Sentiment classifier, but insert it in cell D2. It should look like this:
To get the results for all rows, select cell C2 and D2 and drag the formula down and SeoTools will generate the classifications for all tweets. In the example below, we’ve started on row 16 to illustrate the results:
Do you want to try it with your uClassify account?
⦁ Sign up for a 14-Day Trial and follow the instructions to download and install the latest version of SeoTools.
⦁ Register your access key under “Upgrade to Pro” and access uClassify in the Connectors menu:
⦁ Next, go to API keys in the top menu of your uClassify account and copy the Read key
⦁ Finally, copy your API-key and paste it in the “Options” menu:
The complete documentation of the uClassify Connector features can be found here.
A brief introduction to our machine translation algorithm
We have implemented statistical machine translation (SMT). SMT is completely data driven. It works by calculating word and phrase probabilities from a large corpus. We have used OPUS and Wiktionary as our primary sources.
From the data sources (mostly bilingual parallel language sets) a dictionary of translations is constructed. For each translation we keep a count and parts of speech tags for both source and target, this is our translation model & pos models and it looks something like:
Translation & pos models
source word|source pos tags|translation count|target word |target pos tags
övermorgon|adv|3|the day after tomorrow|det noun prep noun
For the target language a language model and a grammar model is used. Each consists of 1-5 n grams. The language model consists of word sequences and a frequency, the grammar model of pos tags and their frequencies:
Building a graph
So we have data. Plenty of data. Now we just need to make use of it. When a text is translated a graph is built between all possible translations, most of the time each word has multiple translations and meanings, so the number of combinations grows very quickly. During the graph building we need to remember that source phrases can contract, e.g. ‘i morgon’=>’tomorrow’ and expand ‘övermorgon’=>’the day after tomorrow’.
We look at a maximum of 5 words. Once the graph is built, a traversal is initiated. As we traverse the graph encountered sub phrases are scored and the best path is chosen.
Graph for 'hej världen!'
hej världen !
hi world !
hi world !
hi universe !
hi earth !
hello world !
hello universe !
hello earth !
Unfortunately there is no way to examine all translations so we need to traverse the graph intelligently. We use a beam search with limited depth and width to get the scope down to manageable scales.
The scoring of each phrases combines the four aforementioned aspects of the language:
Translation model: This is the dictionary, source->target each entry has a frequency, from the frequency we can calculate a probability (p1) “the most likely translation for ‘hej’ this word is ‘hello'”
Source grammar model: The pos tag helps us to resolve ambiguity, a probability (p2) is calculated, basically saying “‘hej’/’hello’ is likely an interjection”.
Target language model: We look at 1-5 grams. A n-gram is a sequence of words, for example “hello world” is a 2-gram. Each n-gram has a frequency indicating how common it’s. Again a probability (p3) can be calculated, “the sequence ‘hello world’ is more likely than ‘hi world'”.
Target grammar model: just like the language model we do the same but with pos tags. A probability (p4) is calculated indicating “Yeah a verb followed by a preposition sounds better than two prepositions in a row” etc.
We use a sliding window moving over the phrase and combining probabilities using the chain rule into accumulated P1-P4. We end up with 4 parameters that are finally mixed with different weights according to
Working in log space makes life easier here. Then we just select the phrase with the highest score.
We estimate the weights (w1-w4) by a randomized search that tries to maximize a bleu-score for a test set. The estimation only needs to be rerun when the training data changes. As expected, the most important (highest weight) is assigned the translation model (w1=1), second highest the source grammar model (w3~0.6), third highest the language model (w2~0.3) and finally the target grammar model (w4~0.05). Yes, the as it turns out the target grammar model is not very important, it helps to resolve uncertainty in some cases by predicting pos tags. But I might actually nuke it to favor simplicity in future versions.
There were plenty of unmentioned problems to be solved along the way, but you get the overall idea. One thing that easily puts you off is the size of the data you are dealing with. E.g. downloading TB sized datasets like the google ngrams and processing those. At one point, after 4 days processing those huge zipfiles, Windows Update decided to restart the computer…
We get a lot of requests for classifiers in different languages and as a next step we are building a translation API. The idea is to have an affordable in-house machine translation service that can quickly translate requests to the classifier language, classify the request and send back the response. Since the majority of classifiers are in English, the primary focus will be to target English.
Initially we support French, Spanish and Swedish to English translations.
The API is accessible with your ordinary API read key and a GET/POST REST protocol.
Upon popular request I’ve built a new topics classifier based on the IAB taxonomy. EDIT: We also support the IAB Taxonomy V2 now.
The classifier has two levels of depth, a main category (sports, science…) and a sub category (soccer, physics…). In total there are about 360 different classes following the IAB Quality Assurance Guidelines (QAG) Taxonomy specification.
The class names are composed of 4 parts separated by an underscore, with the following structure:
main topic_sub topic_main id_sub id home and garden_flowers_5_4 sports_climbing_17_3 sports_volleyball_17_7
The last two ids are the IAB ids, this will make it easier for users tho map and integrate the result.
With a free uClassify account you can make 1000 free calls per day, if you need more there are affordable options from 9€ per month. You can sign up here.
List of topics
IAB12 News and IAB24 Uncategorized is not supported.
IAB1 Arts & Entertainment
IAB1-1 Books & Literature
IAB1-2 Celebrity Fan/Gossip
IAB1-3 Fine Art
IAB2-1 Auto Parts
IAB2-2 Auto Repair
IAB2-3 Buying/Selling Cars
IAB2-4 Car Culture
IAB2-5 Certified Pre-Owned
IAB2-10 Electric Vehicle
IAB2-16 Off-Road Vehicles
IAB2-17 Performance Vehicles
IAB2-19 Road-Side Assistance
IAB2-21 Trucks & Accessories
IAB2-22 Vintage Cars
IAB3-4 Business Software
IAB3-8 Green Solutions
IAB3-9 Human Resources
IAB4-1 Career Planning
IAB4-3 Financial Aid
IAB4-4 Job Fairs
IAB4-5 Job Search
IAB4-6 Resume Writing/Advice
IAB4-10 U.S. Military
IAB4-11 Career Advice
IAB5-1 7-12 Education
IAB5-2 Adult Education
IAB5-3 Art History
IAB5-4 College Administration
IAB5-5 College Life
IAB5-6 Distance Learning
IAB5-7 English as a 2nd Language
IAB5-8 Language Learning
IAB5-9 Graduate School
IAB5-11 Homework/Study Tips
IAB5-12 K-6 Educators
IAB5-13 Private School
IAB5-14 Special Education
IAB5-15 Studying Business
IAB6 Family & Parenting
IAB6-2 Babies & Toddlers
IAB6-3 Daycare/Pre School
IAB6-4 Family Internet
IAB6-5 Parenting – K-6 Kids
IAB6-6 Parenting teens
IAB6-8 Special Needs Kids
IAB7 Health & Fitness
IAB7-5 Alternative Medicine
IAB7-9 Bipolar Disorder
IAB7-10 Brain Tumor
IAB7-13 Chronic Fatigue Syndrome
IAB7-14 Chronic Pain
IAB7-15 Cold & Flu
IAB7-17 Dental Care
IAB7-22 GERD/Acid Reflux
IAB7-24 Heart Disease
IAB7-25 Herbs for Health
IAB7-26 Holistic Healing
IAB7-27 IBS/Crohn’s Disease
IAB7-28 Incest/Abuse Support
IAB7-31 Men’s Health
IAB7-34 Panic/Anxiety Disorders
IAB7-36 Physical Therapy
IAB7-38 Senior Health
IAB7-40 Sleep Disorders
IAB7-41 Smoking Cessation
IAB7-42 Substance Abuse
IAB7-43 Thyroid Disease
IAB7-44 Weight Loss
IAB7-45 Women’s Health
IAB8 Food & Drink
IAB8-1 American Cuisine
IAB8-2 Barbecues & Grilling
IAB8-4 Chinese Cuisine
IAB8-8 Desserts & Baking
IAB8-9 Dining Out
IAB8-10 Food Allergies
IAB8-11 French Cuisine
IAB8-12 Health/Low-Fat Cooking
IAB8-13 Italian Cuisine
IAB8-14 Japanese Cuisine
IAB8-15 Mexican Cuisine
IAB9 Hobbies & Interests
IAB9-2 Arts & Crafts
IAB9-5 Board Games/Puzzles
IAB9-6 Candle & Soap Making
IAB9-7 Card Games
IAB9-11 Comic Books
IAB9-13 Freelance Writing
IAB9-15 Getting Published
IAB9-17 Home Recording
IAB9-18 Investors & Patents
IAB9-19 Jewelry Making
IAB9-20 Magic & Illusion
IAB9-25 Roleplaying Games
IAB9-26 Sci-Fi & Fantasy
IAB9-29 Stamps & Coins
IAB9-30 Video & Computer Games
IAB10 Home & Garden
IAB10-3 Environmental Safety
IAB10-5 Home Repair
IAB10-6 Home Theater
IAB10-7 Interior Decorating
IAB10-9 Remodeling & Construction
IAB11 Law, Government, & Politics
IAB11-2 Legal Issues
IAB11-3 U.S. Government Resources
IAB12-1 International News
IAB12-2 National News
IAB12-3 Local News
IAB16-5 Large Animals
IAB16-7 Veterinary Medicine
IAB17-1 Auto Racing
IAB17-10 Figure Skating
IAB17-11 Fly Fishing
IAB17-13 Freshwater Fishing
IAB17-14 Game & Fish
IAB17-16 Horse Racing
IAB17-19 Inline Skating
IAB17-20 Martial Arts
IAB17-21 Mountain Biking
IAB17-22 NASCAR Racing
IAB17-25 Power & Motorcycles
IAB17-26 Pro Basketball
IAB17-27 Pro Ice Hockey
IAB17-32 Saltwater Fishing
IAB17-33 Scuba Diving
IAB17-39 Table Tennis/Ping-Pong
IAB17-44 World Soccer
IAB18 Style & Fashion
IAB18-2 Body Art
IAB19 Technology & Computing
IAB19-1 3-D Graphics
IAB19-3 Antivirus Software
IAB19-5 Cameras & Camcorders
IAB19-6 Cell Phones
IAB19-7 Computer Certification
IAB19-8 Computer Networking
IAB19-9 Computer Peripherals
IAB19-10 Computer Reviews
IAB19-11 Data Centers
IAB19-13 Desktop Publishing
IAB19-14 Desktop Video
IAB19-16 Graphics Software
IAB19-17 Home Video/DVD
IAB19-18 Internet Technology
IAB19-21 Mac Support
IAB19-23 Net Conferencing
IAB19-24 Net for Beginners
IAB19-25 Network Security
IAB19-27 PC Support
IAB19-32 Visual Basic
IAB19-33 Web Clip Art
IAB19-34 Web Design/HTML
IAB19-35 Web Search
IAB20-1 Adventure Travel
IAB20-3 Air Travel
IAB20-4 Australia & New Zealand
IAB20-5 Bed & Breakfasts
IAB20-6 Budget Travel
IAB20-7 Business Travel
IAB20-8 By US Locale
IAB20-13 Eastern Europe
IAB20-21 Mexico & Central America
IAB20-22 National Parks
IAB20-23 South America
IAB20-25 Theme Parks
IAB20-26 Traveling with Kids
IAB20-27 United Kingdom
IAB21 Real Estate
IAB21-3 Buying/Selling Homes
A new keywords API was released a few weeks ago. The old one was not really well designed and needed a revamp.
With the keywords API you can extract keywords from texts with respect to a classifier, for example, if you want to find words that make a text positive or negative you extract keywords with the sentiment classifier or if you want to generate tags for a blog post based on topic, you can run it through a topics classifier or maybe our new IAB Taxonomy classifier.
The result will be a lists of keywords where each keywords is associated with one of the classes. Also each keywords has a probablility, indicating how important each keyword is, a weight if you will. A high value (max 1) means the keyword is very important/relevant.
Example result when extracting text from the sentiment classifier:
I am very happy to announce this performance update that means that classification will have better accuracy than before.
When I was building a new topic classifier based on the IAB taxonomy I did notice some weird behaviour for classes with much less training data than the others. As I started to investigate this I was able to understand how the overall classification could be improved, not only those with low training data. After weeks of testing different implementations I found a few improvements that significantly gave better results on the test datasets.
In short classifiers are much more robust and less sensitive to imbalanced data.
This update doesn’t affect any api endpoints it will only give you better probabilities.
I might write a short post on the technicalities of this update.
Disclaimer: I made this experiment out of curiosity and not academia. I’vent double checked the results and I have used arbitrary-feel-good-in-my-guts constants when running the tests.
In the last post I built a classifier from subtitles of movies that had failed and passed the Bechdel test. I used a dataset with about 2400 movie subtitles labeled whether or not they had passed the Bechdel test. The list of labels was obtained from bechdeltest.com.
In this post I will explore the inner workings of the classifier. What words and phrases reveal if a movie will pass or fail?
Lets just quickly recap what the Bechdel test is, it tests a movie for
The movie has to have at least two women in it,
who talk to each other,
about something besides a man.
Keywords from subtitles
It’s possible to extract keywords from classifiers. Keywords are discriminating indicators (words, phrases) for a specific class (passed or failed). There are many ways to weight them. I let the classifier sort every keyword according to the probability of belonging to a class.
Common, strong, keywords
To get a general overview we can disregard the most extreme keywords and instead consider keywords that appears more frequently. I extracted keywords that had occurred at least in 100 different movies (which is about 5% of the entire dataset).
To start with I looked at unigrams (single words) and removed non alphanumerical characters and transformed the text to lower case. To visualize the result I created two word clouds. One with keywords that indicate a failed test. One with keywords that are discriminative for a passed test.
Bigger words means higher probability of either failing or passing.
Keywords like ‘lads’, ‘assault’, ‘rifle’, ’47’ (ak-47), and ‘russian’ seems to indicate a failed Bechdel test. Also words like ‘logic’, ‘solved’, ‘systems’, ‘capacity’ and ‘civilization’ are indicators of a failed Bechdel test.
The word ‘boobs’ appears a lot more in subtitles of movies that passed the Bechdel tests than those which failed. I don’t know why, but I’ve double checked it. Overall it’s a lot of ‘lipstick’, ‘babies’, ‘washing’, ‘dresses’ and so on.
Keywords only from 2014 and 2015, did anything change?
The word clouds above are generated from 1892 up until now. So I wanted to check if anything had changed since. Below are two word clouds from 2014 and 2015 only. There were less training data (97 and 142 movies) and I only looked at words that appeared in 20 or more titles to avoid extreme features.
Looking at the recent failed word cloud it seems like there are less lads, explosions and ak-47s. Also, Russia isn’t as scary anymore, goodbye the 80s. In general it’s less of the war stuff?
From a quick glance it seems like something is different in the passed cloud too, we find words like ‘math’, ‘invented’, ‘developed’, ‘adventure’ and ‘robert’. Wait what Robert? So it seems like ‘Robert’ occurs in 20 movies that passed and 3 that failed last two years. Robert is probably noise (too small dataset). Furthermore, words like ‘washing’, ‘mall’, ‘slut’ and ‘shopping’ have been neutralized. Interestingly a new modern keyword ‘texted’ is used a lot in movies that passed the Bechdel test.
From a very informal point of view, it looks like we are moving in the right direction. But I think for a better understanding of how language has changed over time with a Bechdel -perspective it’s necessary to set up a more controlled experiment. One where you can follow keywords over time as they gain and lose usage. Like google trends, please feel free to explore it and let me know what you find out 😉
Looking at a recent movie, Mad Max: Fury Road
I decided to examine the subtitles in a recent movie that had passed the test, Mad Max: Fury Road. Todo this I trained a classifier with all subtitles since 1892, except the ones from Mad Max movies. Then extracted the keywords from the Mad Max: Fury Road subtitles.
This movie passes the Bechdel test. An interesting point is that despite the anecdotic presence word such as ‘babies’, ‘girly’ and ‘flowers’ (in the passed class) the words that surface are not linked to traditional femininity -unlike many other movies that have passed the test. Overall it’s much harder to differentiate between the two clouds.
If you haven’t seen it yet go and watch it, it’s very good!
If my experiment is carried out correctly, or at least good enough (read disclaimer at the top:) passing the Bechdel test doesn’t imply a gender equal movie. Even if it certifies the movie has…
At least two woman
that speak to each other
about something else than men …
…unfortunately this ‘something else than men’ often seems to be something linked to ‘traditional femininity’. The good news, when only looking at more recent data the trend seems to be getting more neutral, ‘washing’ is falling down on the list while ‘adventure’ rises.
It would be interesting to come up with a test that also captures the content as well as how women (and others) are represented. Designing the perfect test will probably be infinitely hard, especially for us humans. It seems like we have hard times on settling whether or not any movie is gender equal (just google any movie discussions). Perhaps with enough data, machine learning can design a test that reveal a multidimensional score of how well and why a movie passes or fails certain tests, not only examining gender but looking at all kinds of diversities.
Finally, just for the sake of clarity, I don’t think the Bechdel test is bad, it certainly helps us to think about women’s representation in movies. But maybe don’t always expect a non sexist, gender equal movie just because it passes the Bechdel test.
For bigrams I also removed non alphanumeric characters, that is why you can see some weird stuff like ‘you-don’ which should be ‘you-don’t’. However I decided to keep this because it can capture some interesting features like ‘s-fault’ (e.g. ‘dad’s fault’)
Space has been replaced by ‘-‘ so the word cloud word make sense.
All time bigram keywords
One interesting thing here is the ‘your men’ vs ‘you girls’. I will leave the analysis to you 😉
Doesn’t sound so hard to pass, does it? This test was introduced by Alison Bechdel in 1985 in one of her comics, ‘The rule‘.
The largest database of movies that has been Bechdel tested is on bechdeltest.com. The database contains over 6000 titles from 1892 up until now. How many percent do you think pass the Bechdel test overall? As I write this about 58% of the movies has passed the test. Statistics from here.
Being interested in machine learning and data I thought it would maybe be possible to find a textual correlation between movies that fail and pass the test.
To build a classifier that figures this out requires data. It needs labeled samples to learn from. It should be a list of films that passes and fails the test. The more the better. Then for each movie we need to extract features. Features could be the cover, the title, the description, the subtitles, the audio or anything that is in the movie.
Data & features
I was very happy when I found bechdeltest.com, it has a pretty extensive list exceeding 6000 movie titles with information of whether it passed the Bechtel test or not. Even better, it has a Bechtel test rating of 0-3, where 0) means it fails the first part of the test and 3) that it passes all tests.
Since I am dealing with text classifiers the natural choices for features were:
– The description
– The subtitles
– The title
The descriptions were retrieved using omdbapi.com api which gets the plot from imdb. I retrieved plots from 2433 failed and 3281 passed movies.
The subtitles were a bit more cumbersome to find, I did use about 2400 movies selected randomly and spent some time downloading them from various sites. Pweii.
Finally the training data for the title was easily obtained by just creating samples with the only the movie titles for each class. In total 2696 and 3669 movie titles.
I setup an environment and ran 10-fold-cross-validation for all the data (train on 9/10 samples, test with 1/10 then rotate). For feature extraction I looked at case insensitive unigrams and bigrams.
I trained a classifier reading IMDB plots labeled whether or not the corresponding movie had passed the test. The classifier turned out to have an accuracy of 67% .
By only reading the subtitles uClassify was able to predict whether or not a movie would pass an accuracy of 68%.
One classifier was trained to only look on the movie titles. The accuracy of the classifier was about 55% and this is not surprising at all considering how small the training dataset is.
Finally, I mashed together the subtitles and plots into one classifier that showed a slight increase in accuracy of 69%.
The combined (subtitles+plots) classifier is available here, you can toss movie plots or subtitles (and probably scripts) at it and it will do its best to predict if it passes the Bechdel test or not.
The predictive accuracy of the classifier may not be the best, it certainly doesn’t figure out the 1-3 sub rules by just looking at unigrams and bigrams. But it does capture something to predict 70% correctly. I’m curious to find out exactly what it does make it decisions on and will make another blog post on this.