IAB taxonomy classifier

By popular request I’ve built a new topics classifier based on the IAB taxonomy.

The classifier has two levels of depth: a main category (sports, science, …) and a subcategory (soccer, physics, …). In total there are about 360 different classes, following the IAB Quality Assurance Guidelines (QAG) Taxonomy specification.

[Image: the uClassify interface for the IAB taxonomy classifier]

You can try the online demo here.

Class name format

The class names are composed of four parts separated by underscores, with the following structure:

{main topic}_{sub topic}_{main id}_{sub id}
home and garden_flowers_5_4
sports_climbing_17_3
sports_volleyball_17_7

The last two IDs are the IAB IDs, which makes it easier for users to map and integrate the results.
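
As a small illustration, here is a minimal Python sketch (the helper name is mine, not part of the API) that splits a returned class name into its four parts:

def parse_class_name(name):
    """Split a class name like 'sports_climbing_17_3' into its four parts.
    Assumes underscores only occur as separators (main topics may contain spaces)."""
    main_topic, sub_topic, main_id, sub_id = name.rsplit("_", 3)
    return {"main topic": main_topic, "sub topic": sub_topic,
            "main id": int(main_id), "sub id": int(sub_id)}

print(parse_class_name("sports_climbing_17_3"))
# {'main topic': 'sports', 'sub topic': 'climbing', 'main id': 17, 'sub id': 3}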

With a free uClassify account you can make 1000 free calls per day; if you need more, there are affordable options from 9€ per month. You can sign up here.

List of topics

IAB12 News and IAB24 Uncategorized are not supported.

IAB1 Arts & Entertainment
IAB1-1 Books & Literature
IAB1-2 Celebrity Fan/Gossip
IAB1-3 Fine Art
IAB1-4 Humor
IAB1-5 Movies
IAB1-6 Music
IAB1-7 Television

IAB2 Automotive
IAB2-1 Auto Parts
IAB2-2 Auto Repair
IAB2-3 Buying/Selling Cars
IAB2-4 Car Culture
IAB2-5 Certified Pre-Owned
IAB2-6 Convertible
IAB2-7 Coupe
IAB2-8 Crossover
IAB2-9 Diesel
IAB2-10 Electric Vehicle
IAB2-11 Hatchback
IAB2-12 Hybrid
IAB2-13 Luxury
IAB2-14 Minivan
IAB2-15 Motorcycles
IAB2-16 Off-Road Vehicles
IAB2-17 Performance Vehicles
IAB2-18 Pickup
IAB2-19 Road-Side Assistance
IAB2-20 Sedan
IAB2-21 Trucks & Accessories
IAB2-22 Vintage Cars
IAB2-23 Wagon

IAB3 Business
IAB3-1 Advertising
IAB3-2 Agriculture
IAB3-3 Biotech/Biomedical
IAB3-4 Business Software
IAB3-5 Construction
IAB3-6 Forestry
IAB3-7 Government
IAB3-8 Green Solutions
IAB3-9 Human Resources
IAB3-10 Logistics
IAB3-11 Marketing
IAB3-12 Metals

IAB4 Careers
IAB4-1 Career Planning
IAB4-2 College
IAB4-3 Financial Aid
IAB4-4 Job Fairs
IAB4-5 Job Search
IAB4-6 Resume Writing/Advice
IAB4-7 Nursing
IAB4-8 Scholarships
IAB4-9 Telecommuting
IAB4-10 U.S. Military
IAB4-11 Career Advice

IAB5 Education
IAB5-1 7-12 Education
IAB5-2 Adult Education
IAB5-3 Art History
IAB5-4 College Administration
IAB5-5 College Life
IAB5-6 Distance Learning
IAB5-7 English as a 2nd Language
IAB5-8 Language Learning
IAB5-9 Graduate School
IAB5-10 Homeschooling
IAB5-11 Homework/Study Tips
IAB5-12 K-6 Educators
IAB5-13 Private School
IAB5-14 Special Education
IAB5-15 Studying Business

IAB6 Family & Parenting
IAB6-1 Adoption
IAB6-2 Babies & Toddlers
IAB6-3 Daycare/Pre School
IAB6-4 Family Internet
IAB6-5 Parenting – K-6 Kids
IAB6-6 Parenting teens
IAB6-7 Pregnancy
IAB6-8 Special Needs Kids
IAB6-9 Eldercare

IAB7 Health & Fitness
IAB7-1 Exercise
IAB7-2 ADD
IAB7-3 AIDS/HIV
IAB7-4 Allergies
IAB7-5 Alternative Medicine
IAB7-6 Arthritis
IAB7-7 Asthma
IAB7-8 Autism/PDD
IAB7-9 Bipolar Disorder
IAB7-10 Brain Tumor
IAB7-11 Cancer
IAB7-12 Cholesterol
IAB7-13 Chronic Fatigue Syndrome
IAB7-14 Chronic Pain
IAB7-15 Cold & Flu
IAB7-16 Deafness
IAB7-17 Dental Care
IAB7-18 Depression
IAB7-19 Dermatology
IAB7-20 Diabetes
IAB7-21 Epilepsy
IAB7-22 GERD/Acid Reflux
IAB7-23 Headaches/Migraines
IAB7-24 Heart Disease
IAB7-25 Herbs for Health
IAB7-26 Holistic Healing
IAB7-27 IBS/Crohn’s Disease
IAB7-28 Incest/Abuse Support
IAB7-29 Incontinence
IAB7-30 Infertility
IAB7-31 Men’s Health
IAB7-32 Nutrition
IAB7-33 Orthopedics
IAB7-34 Panic/Anxiety Disorders
IAB7-35 Pediatrics
IAB7-36 Physical Therapy
IAB7-37 Psychology/Psychiatry
IAB7-38 Senior Health
IAB7-39 Sexuality
IAB7-40 Sleep Disorders
IAB7-41 Smoking Cessation
IAB7-42 Substance Abuse
IAB7-43 Thyroid Disease
IAB7-44 Weight Loss
IAB7-45 Women’s Health

IAB8 Food & Drink
IAB8-1 American Cuisine
IAB8-2 Barbecues & Grilling
IAB8-3 Cajun/Creole
IAB8-4 Chinese Cuisine
IAB8-5 Cocktails/Beer
IAB8-6 Coffee/Tea
IAB8-7 Cuisine-Specific
IAB8-8 Desserts & Baking
IAB8-9 Dining Out
IAB8-10 Food Allergies
IAB8-11 French Cuisine
IAB8-12 Health/Low-Fat Cooking
IAB8-13 Italian Cuisine
IAB8-14 Japanese Cuisine
IAB8-15 Mexican Cuisine
IAB8-16 Vegan
IAB8-17 Vegetarian
IAB8-18 Wine

IAB9 Hobbies & Interests
IAB9-1 Art/Technology
IAB9-2 Arts & Crafts
IAB9-3 Beadwork
IAB9-4 Bird-Watching
IAB9-5 Board Games/Puzzles
IAB9-6 Candle & Soap Making
IAB9-7 Card Games
IAB9-8 Chess
IAB9-9 Cigars
IAB9-10 Collecting
IAB9-11 Comic Books
IAB9-12 Drawing/Sketching
IAB9-13 Freelance Writing
IAB9-14 Genealogy
IAB9-15 Getting Published
IAB9-16 Guitar
IAB9-17 Home Recording
IAB9-18 Investors & Patents
IAB9-19 Jewelry Making
IAB9-20 Magic & Illusion
IAB9-21 Needlework
IAB9-22 Painting
IAB9-23 Photography
IAB9-24 Radio
IAB9-25 Roleplaying Games
IAB9-26 Sci-Fi & Fantasy
IAB9-27 Scrapbooking
IAB9-28 Screenwriting
IAB9-29 Stamps & Coins
IAB9-30 Video & Computer Games
IAB9-31 Woodworking

IAB10 Home & Garden
IAB10-1 Appliances
IAB10-2 Entertaining
IAB10-3 Environmental Safety
IAB10-4 Gardening
IAB10-5 Home Repair
IAB10-6 Home Theater
IAB10-7 Interior Decorating
IAB10-8 Landscaping
IAB10-9 Remodeling & Construction

IAB11 Law, Government, & Politics
IAB11-1 Immigration
IAB11-2 Legal Issues
IAB11-3 U.S. Government Resources
IAB11-4 Politics
IAB11-5 Commentary

IAB12 News*
IAB12-1 International News
IAB12-2 National News
IAB12-3 Local News

IAB13 Personal Finance
IAB13-1 Beginning Investing
IAB13-2 Credit/Debt & Loans
IAB13-3 Financial News
IAB13-4 Financial Planning
IAB13-5 Hedge Fund
IAB13-6 Insurance
IAB13-7 Investing
IAB13-8 Mutual Funds
IAB13-9 Options
IAB13-10 Retirement Planning
IAB13-11 Stocks
IAB13-12 Tax Planning

IAB14 Society
IAB14-1 Dating
IAB14-2 Divorce Support
IAB14-3 Gay Life
IAB14-4 Marriage
IAB14-5 Senior Living
IAB14-6 Teens
IAB14-7 Weddings
IAB14-8 Ethnic Specific

IAB15 Science
IAB15-1 Astrology
IAB15-2 Biology
IAB15-3 Chemistry
IAB15-4 Geology
IAB15-5 Paranormal Phenomena
IAB15-6 Physics
IAB15-7 Space/Astronomy
IAB15-8 Geography
IAB15-9 Botany
IAB15-10 Weather

IAB16 Pets
IAB16-1 Aquariums
IAB16-2 Birds
IAB16-3 Cats
IAB16-4 Dogs
IAB16-5 Large Animals
IAB16-6 Reptiles
IAB16-7 Veterinary Medicine

IAB17 Sports
IAB17-1 Auto Racing
IAB17-2 Baseball
IAB17-3 Bicycling
IAB17-4 Bodybuilding
IAB17-5 Boxing
IAB17-6 Canoeing/Kayaking
IAB17-7 Cheerleading
IAB17-8 Climbing
IAB17-9 Cricket
IAB17-10 Figure Skating
IAB17-11 Fly Fishing
IAB17-12 Football
IAB17-13 Freshwater Fishing
IAB17-14 Game & Fish
IAB17-15 Golf
IAB17-16 Horse Racing
IAB17-17 Horses
IAB17-18 Hunting/Shooting
IAB17-19 Inline Skating
IAB17-20 Martial Arts
IAB17-21 Mountain Biking
IAB17-22 NASCAR Racing
IAB17-23 Olympics
IAB17-24 Paintball
IAB17-25 Power & Motorcycles
IAB17-26 Pro Basketball
IAB17-27 Pro Ice Hockey
IAB17-28 Rodeo
IAB17-29 Rugby
IAB17-30 Running/Jogging
IAB17-31 Sailing
IAB17-32 Saltwater Fishing
IAB17-33 Scuba Diving
IAB17-34 Skateboarding
IAB17-35 Skiing
IAB17-36 Snowboarding
IAB17-37 Surfing/Body-Boarding
IAB17-38 Swimming
IAB17-39 Table Tennis/Ping-Pong
IAB17-40 Tennis
IAB17-41 Volleyball
IAB17-42 Walking
IAB17-43 Waterski/Wakeboard
IAB17-44 World Soccer

IAB18 Style & Fashion
IAB18-1 Beauty
IAB18-2 Body Art
IAB18-3 Fashion
IAB18-4 Jewelry
IAB18-5 Clothing
IAB18-6 Accessories

IAB19 Technology & Computing
IAB19-1 3-D Graphics
IAB19-2 Animation
IAB19-3 Antivirus Software
IAB19-4 C/C++
IAB19-5 Cameras & Camcorders
IAB19-6 Cell Phones
IAB19-7 Computer Certification
IAB19-8 Computer Networking
IAB19-9 Computer Peripherals
IAB19-10 Computer Reviews
IAB19-11 Data Centers
IAB19-12 Databases
IAB19-13 Desktop Publishing
IAB19-14 Desktop Video
IAB19-15 Email
IAB19-16 Graphics Software
IAB19-17 Home Video/DVD
IAB19-18 Internet Technology
IAB19-19 Java
IAB19-20 JavaScript
IAB19-21 Mac Support
IAB19-22 MP3/MIDI
IAB19-23 Net Conferencing
IAB19-24 Net for Beginners
IAB19-25 Network Security
IAB19-26 Palmtops/PDAs
IAB19-27 PC Support
IAB19-28 Portable
IAB19-29 Entertainment
IAB19-30 Shareware/Freeware
IAB19-31 Unix
IAB19-32 Visual Basic
IAB19-33 Web Clip Art
IAB19-34 Web Design/HTML
IAB19-35 Web Search
IAB19-36 Windows

IAB20 Travel
IAB20-1 Adventure Travel
IAB20-2 Africa
IAB20-3 Air Travel
IAB20-4 Australia & New Zealand
IAB20-5 Bed & Breakfasts
IAB20-6 Budget Travel
IAB20-7 Business Travel
IAB20-8 By US Locale
IAB20-9 Camping
IAB20-10 Canada
IAB20-11 Caribbean
IAB20-12 Cruises
IAB20-13 Eastern Europe
IAB20-14 Europe
IAB20-15 France
IAB20-16 Greece
IAB20-17 Honeymoons/Getaways
IAB20-18 Hotels
IAB20-19 Italy
IAB20-20 Japan
IAB20-21 Mexico & Central America
IAB20-22 National Parks
IAB20-23 South America
IAB20-24 Spas
IAB20-25 Theme Parks
IAB20-26 Traveling with Kids
IAB20-27 United Kingdom

IAB21 Real Estate
IAB21-1 Apartments
IAB21-2 Architects
IAB21-3 Buying/Selling Homes

IAB22 Shopping
IAB22-1 Contests & Freebies
IAB22-2 Couponing
IAB22-3 Comparison
IAB22-4 Engines

IAB23 Religion & Spirituality
IAB23-1 Alternative Religions
IAB23-2 Atheism/Agnosticism
IAB23-3 Buddhism
IAB23-4 Catholicism
IAB23-5 Christianity
IAB23-6 Hinduism
IAB23-7 Islam
IAB23-8 Judaism
IAB23-9 Latter-Day Saints
IAB23-10 Pagan/Wiccan

IAB24 Uncategorized*

IAB25 Non-Standard Content
IAB25-1 Unmoderated UGC
IAB25-2 Extreme Graphic/Explicit Violence
IAB25-3 Pornography
IAB25-4 Profane Content
IAB25-5 Hate Content
IAB25-6 Under Construction
IAB25-7 Incentivized

IAB26 Illegal Content
IAB26-1 Illegal Content
IAB26-2 Warez
IAB26-3 Spyware/Malware
IAB26-4 Copyright Infringement

* IAB12 News and IAB24 Uncategorized are not supported.

Keyword Extraction

A new keywords API was released a few weeks ago. The old one was not really well designed and needed a revamp.

With the keywords API you can extract keywords from texts with respect to a classifier. For example, if you want to find the words that make a text positive or negative, you extract keywords with the Sentiment classifier; or if you want to generate tags for a blog post based on its topic, you can run it through a topics classifier, or perhaps our new IAB taxonomy classifier.

The result will be a list of keywords where each keyword is associated with one of the classes. Each keyword also has a probability indicating how important it is, a weight if you will. A high value (max 1) means the keyword is very important/relevant.

Example result when extracting keywords from a text with the Sentiment classifier:

[
  [
    {
      "className": "positive",
      "p": 0.698862,
      "keyword": "happy"
    },
    {
      "className": "negative",
      "p": 0.831895,
      "keyword": "worse"
    },
    {
      "className": "negative",
      "p": 0.736696,
      "keyword": "bad"
    },
    {
      "className": "negative",
      "p": 0.914509,
      "keyword": "stinks"
    }
  ]
]

You can use the extracted keywords together with their probabilities to create word clouds, just like I did when I investigated the Bechdel test.
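
For example, here is a minimal Python sketch that turns a response in the format shown above into per-class word-cloud weights (the weighting choice is mine and purely illustrative):

import json
from collections import defaultdict

# A response in the format shown above: a list of keyword lists, one per input text.
response = json.loads("""
[[{"className": "positive", "p": 0.698862, "keyword": "happy"},
  {"className": "negative", "p": 0.831895, "keyword": "worse"},
  {"className": "negative", "p": 0.736696, "keyword": "bad"},
  {"className": "negative", "p": 0.914509, "keyword": "stinks"}]]
""")

# Group keywords by class and use the probability as the word-cloud weight.
weights = defaultdict(dict)
for keyword_list in response:
    for kw in keyword_list:
        weights[kw["className"]][kw["keyword"]] = kw["p"]

print(dict(weights))
# {'positive': {'happy': 0.698862},
#  'negative': {'worse': 0.831895, 'bad': 0.736696, 'stinks': 0.914509}}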

[Word cloud: keywords extracted from a movie’s subtitles, indicating a passed Bechdel test. Can you guess which movie?]

Here is the keywords documentation.

New URL REST API

The new URL REST API is our simplest API to use. You can copy-paste the API URL into the browser and get the result. The read API key and the text are passed as parameters in the URL.

Here is an example:
https://api.uclassify.com/v1/uClassify/Sentiment/classify/?readKey=YOUR_READ_API_KEY_HERE&text=I+am+so+happy+today

The result is simply a JSON dictionary with class=>probabilities:

{
  "negative": 0.133639,
  "positive": 0.866361
}

The only thing you need to do is sign up for a free account (which allows you 1000 calls per day) and replace ‘YOUR_READ_API_KEY_HERE’ with your read API key (found after you log in).
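
The same call can of course be made from code. Here is a minimal Python sketch of the example above (it uses the third-party requests library; the endpoint and parameters are exactly those of the example URL):

import requests

# Same request as the example URL above; substitute your own read API key.
response = requests.get(
    "https://api.uclassify.com/v1/uClassify/Sentiment/classify/",
    params={"readKey": "YOUR_READ_API_KEY_HERE", "text": "I am so happy today"},
)
print(response.json())  # e.g. {'negative': 0.133639, 'positive': 0.866361}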

Here is the documentation for the API. The API is a simplified subset of our standard JSON REST API; you can read more about the differences between the uClassify APIs here.

Happy classifying!

Improved classifier accuracy

I am very happy to announce a performance update, which means that classification will have better accuracy than before.

When I was building the new topic classifier based on the IAB taxonomy I noticed some weird behaviour for classes with much less training data than the others. As I started to investigate this, I came to understand how the overall classification could be improved, not only for classes with little training data. After weeks of testing different implementations I found a few improvements that gave significantly better results on the test datasets.

In short, classifiers are now much more robust and less sensitive to imbalanced data.

This update doesn’t affect any API endpoints; it will only give you better probabilities.

I might write a short post on the technicalities of this update.

What can machine learning teach us about the Bechdel test?

Disclaimer: I made this experiment out of curiosity, not for academic purposes. I haven’t double-checked the results and I have used arbitrary feel-good-in-my-guts constants when running the tests.

In the last post I built a classifier from the subtitles of movies that had failed and passed the Bechdel test. I used a dataset of about 2400 movie subtitles, labeled according to whether or not the movie had passed the Bechdel test. The list of labels was obtained from bechdeltest.com.

In this post I will explore the inner workings of the classifier. What words and phrases reveal if a movie will pass or fail?

Let’s just quickly recap what the Bechdel test is. To pass the test:

  1. The movie has to have at least two women in it,
  2. who talk to each other,
  3. about something besides a man.

Keywords from subtitles

It’s possible to extract keywords from classifiers. Keywords are discriminating indicators (words, phrases) for a specific class (passed or failed). There are many ways to weight them; I let the classifier sort every keyword according to the probability of it belonging to a class.

Common, strong keywords

To get a general overview we can disregard the most extreme keywords and instead consider keywords that appear more frequently. I extracted keywords that occurred in at least 100 different movies (which is about 5% of the entire dataset).

To start with I looked at unigrams (single words), removed non-alphanumeric characters and transformed the text to lower case. To visualize the result I created two word clouds: one with keywords that indicate a failed test, and one with keywords that are discriminative for a passed test.
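
As a rough sketch of the preprocessing and frequency filtering described above (not my exact pipeline, just the idea):

import re
from collections import Counter

def unigrams(subtitle_text):
    """Lowercase the text and keep only alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", subtitle_text.lower()))

def frequent_keywords(subtitles, min_movies=100):
    """Keep words that occur in at least `min_movies` different movies."""
    document_frequency = Counter()
    for text in subtitles:
        document_frequency.update(unigrams(text))
    return {w for w, df in document_frequency.items() if df >= min_movies}

# Toy example with a lowered threshold:
subtitles = ["A rifle and a Russian.", "The rifle again!", "Lipstick, dresses."]
print(frequent_keywords(subtitles, min_movies=2))  # {'rifle'}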

Bigger words mean a higher probability of either failing or passing.

[Word cloud: subtitle keywords indicating a failed Bechdel test]

Keywords like ‘lads’, ‘assault’, ‘rifle’, ’47’ (AK-47) and ‘russian’ seem to indicate a failed Bechdel test. Words like ‘logic’, ‘solved’, ‘systems’, ‘capacity’ and ‘civilization’ are also indicators of a failed Bechdel test.

[Word cloud: subtitle keywords indicating a passed Bechdel test]

The word ‘boobs’ appears a lot more in subtitles of movies that passed the Bechdel test than in those that failed. I don’t know why, but I’ve double-checked it. Overall it’s a lot of ‘lipstick’, ‘babies’, ‘washing’, ‘dresses’ and so on.

Keywords only from 2014 and 2015, did anything change?

The word clouds above are generated from movies from 1892 up until now, so I wanted to check if anything has changed recently. Below are two word clouds from 2014 and 2015 only. There was less training data (97 and 142 movies), and I only looked at words that appeared in 20 or more titles to avoid extreme features.

[Word cloud: recent subtitle keywords indicating a failed Bechdel test]

Looking at the recent ‘failed’ word cloud, it seems like there are fewer lads, explosions and AK-47s. Also, Russia isn’t as scary anymore, goodbye the 80s. In general, it’s less of the war stuff?

[Word cloud: recent subtitle keywords indicating a passed Bechdel test]

From a quick glance it seems like something is different in the ‘passed’ cloud too; we find words like ‘math’, ‘invented’, ‘developed’, ‘adventure’ and ‘robert’. Wait, what, Robert? It turns out ‘Robert’ occurs in 20 movies that passed and 3 that failed over the last two years. Robert is probably noise (too small a dataset). Furthermore, words like ‘washing’, ‘mall’, ‘slut’ and ‘shopping’ have been neutralized. Interestingly, a new, modern keyword, ‘texted’, is used a lot in movies that passed the Bechdel test.

From a very informal point of view, it looks like we are moving in the right direction. But I think that for a better understanding of how language has changed over time from a Bechdel perspective, it’s necessary to set up a more controlled experiment: one where you can follow keywords over time as they gain and lose usage, like Google Trends. Please feel free to explore this and let me know what you find out 😉

Looking at a recent movie, Mad Max: Fury Road

I decided to examine the subtitles of a recent movie that had passed the test, Mad Max: Fury Road. To do this I trained a classifier on all subtitles since 1892, except the ones from Mad Max movies. Then I extracted the keywords from the Mad Max: Fury Road subtitles.

[Word cloud: Mad Max: Fury Road subtitle keywords indicating a failed Bechdel test]
[Word cloud: Mad Max: Fury Road subtitle keywords indicating a passed Bechdel test]

This movie passes the Bechdel test. An interesting point is that despite the anecdotal presence of words such as ‘babies’, ‘girly’ and ‘flowers’ (in the passed class), the words that surface are not linked to traditional femininity, unlike in many other movies that have passed the test. Overall it’s much harder to differentiate between the two clouds.

If you haven’t seen it yet go and watch it, it’s very good!

Conclusion

If my experiment was carried out correctly, or at least well enough (read the disclaimer at the top :), passing the Bechdel test doesn’t imply a gender-equal movie. Even if it certifies that the movie has…

  1. At least two women
  2. who speak to each other
  3. about something other than men…

…unfortunately this ‘something other than men’ often seems to be something linked to ‘traditional femininity’. The good news: when looking only at more recent data, the trend seems to be getting more neutral; ‘washing’ is falling down the list while ‘adventure’ rises.

It would be interesting to come up with a test that also captures the content, as well as how women (and others) are represented. Designing the perfect test will probably be infinitely hard, especially for us humans. It seems like we have a hard time settling whether or not any given movie is gender equal (just google any movie discussion). Perhaps, with enough data, machine learning could design a test that reveals a multidimensional score of how well and why a movie passes or fails certain tests, not only examining gender but looking at all kinds of diversity.

Finally, just for the sake of clarity, I don’t think the Bechdel test is bad; it certainly helps us think about women’s representation in movies. But maybe don’t always expect a non-sexist, gender-equal movie just because it passes the Bechdel test.

Credits

Much kudos to bechdeltest.com for maintaining their database. Thanks to omdbapi.com for a simple-to-use API. The word clouds were generated with wordclouds.com.

Appendix

For the bigrams I also removed non-alphanumeric characters, which is why you can see some weird entries like ‘you-don’, which should be ‘you-don’t’. However, I decided to keep this because it can capture some interesting features like ‘s-fault’ (e.g. ‘dad’s fault’).

The space within each bigram has been replaced by ‘-‘ so that the word cloud makes sense.
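
A minimal sketch of how such bigrams can be produced (illustrative only, not the exact code I used):

import re

def bigrams(text):
    """Lowercased bigrams with non-alphanumeric characters removed and the
    space inside each bigram replaced by '-', as in the word clouds below."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{a}-{b}" for a, b in zip(tokens, tokens[1:])]

print(bigrams("It's your men, not you girls."))
# ['it-s', 's-your', 'your-men', 'men-not', 'not-you', 'you-girls']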

All time bigram keywords

[Word cloud: subtitle bigrams indicating a failed Bechdel test]
[Word cloud: subtitle bigrams indicating a passed Bechdel test]

One interesting thing here is the ‘your men’ vs ‘you girls’. I will leave the analysis to you 😉

2014 and 2015 bigram keywords

[Word cloud: recent subtitle bigrams indicating a failed Bechdel test]
[Word cloud: recent subtitle bigrams indicating a passed Bechdel test]

Can machine learning predict if a movie passes the Bechdel test?

To pass the Bechdel test

  1. The movie has to have at least two women in it,
  2. who talk to each other,
  3. about something besides a man.

Doesn’t sound so hard to pass, does it? The test was introduced by Alison Bechdel in 1985 in one of her comic strips, ‘The Rule’.

The largest database of Bechdel-tested movies is on bechdeltest.com. It contains over 6000 titles from 1892 up until now. What percentage of movies do you think pass the Bechdel test overall? As I write this, about 58% of the movies have passed. Statistics from here.

Being interested in machine learning and data, I thought it might be possible to find a textual correlation between movies that fail and movies that pass the test.

Building a classifier that figures this out requires data. It needs labeled samples to learn from: a list of films that pass and fail the test, the more the better. Then, for each movie, we need to extract features. Features could be the cover, the title, the description, the subtitles, the audio or anything else that is in the movie.

Data & features

I was very happy when I found bechdeltest.com; it has a pretty extensive list exceeding 6000 movie titles, with information on whether each movie passed the Bechdel test or not. Even better, it has a Bechdel test rating of 0-3, where 0 means the movie fails the first part of the test and 3 means it passes all parts.

Since I am dealing with text classifiers the natural choices for features were:

– The description

– The subtitles

– The title

The descriptions were retrieved using the omdbapi.com API, which gets the plot from IMDb. I retrieved plots for 2433 failed and 3281 passed movies.

The subtitles were a bit more cumbersome to find. I used about 2400 randomly selected movies and spent some time downloading their subtitles from various sites. Phew.

Finally, the training data for the titles was easily obtained by creating samples with only the movie titles for each class: in total, 2696 and 3669 movie titles.

Results

I set up an environment and ran 10-fold cross-validation on all the data (train on 9/10 of the samples, test on the remaining 1/10, then rotate). For feature extraction I looked at case-insensitive unigrams and bigrams.
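
uClassify itself is not built on scikit-learn, but an equivalent experiment could be sketched roughly like this (assuming texts holds one subtitle or plot per movie and labels holds the corresponding passed/failed label):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def evaluate(texts, labels, folds=10):
    """10-fold cross-validation with case-insensitive unigram and bigram counts."""
    pipeline = make_pipeline(
        CountVectorizer(lowercase=True, ngram_range=(1, 2)),
        MultinomialNB(),
    )
    scores = cross_val_score(pipeline, texts, labels, cv=folds)
    return scores.mean()

# e.g. print(evaluate(subtitle_texts, passed_or_failed_labels))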

I trained a classifier reading IMDb plots labeled according to whether or not the corresponding movie had passed the test. The classifier turned out to have an accuracy of 67%.

By only reading the subtitles, uClassify was able to predict whether or not a movie would pass with an accuracy of 68%.

One classifier was trained to look only at the movie titles. Its accuracy was about 55%, which is not surprising at all considering how small the training data is.

Finally, I mashed together the subtitles and plots into one classifier, which showed a slight increase in accuracy, to 69%.

Dataset          #Failed  #Passed  Accuracy
Plots            2433     3281     67%
Subtitles        1024     1262     68%
Titles           2696     3669     55%
Subtitles+plot   3457     4543     69%

The combined (subtitles+plots) classifier is available here; you can toss movie plots or subtitles (and probably scripts) at it and it will do its best to predict whether or not the movie passes the Bechdel test.

Conclusion

The predictive accuracy of the classifier may not be the best; it certainly doesn’t figure out the three sub-rules by just looking at unigrams and bigrams. But it does capture something, predicting about 70% correctly. I’m curious to find out exactly what it bases its decisions on and will write another blog post about this.

Update: Here is an analysis.

Credits

Much kudos to bechdeltest.com for maintaining their database. Thanks to omdbapi.com for a simple-to-use API.

JSON REST API

Since uClassify was launched back in 2008 we have seen many technological changes. Last year I modernised the site to use Bootstrap as a foundation. Now it’s time to bring the API to a more modern format.

Initially the uClassify API only had an XML endpoint. However, over the years JSON has become more common and I have been getting more and more requests for REST endpoints with JSON format. The graph below shows Google Trends for ‘json api’ (red) vs ‘xml api’ (blue).

[Image: Google Trends, ‘json api’ (red) vs ‘xml api’ (blue)]

Today I have launched a beta of the JSON REST API. Changes may still occur, but it will hopefully be finalised during March 2016.

You can find the documentation here, please feel free to leave feedback.
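
To give an idea of what a JSON call might look like, here is a minimal Python sketch. The endpoint path follows the same pattern as the existing URL API, but the body shape and the authorization header are assumptions on my part, so check the documentation for the exact format:

import requests

# Sketch only: the body shape ({"texts": [...]}) and the Token auth header are assumptions;
# see the JSON REST API documentation for the final request format.
response = requests.post(
    "https://api.uclassify.com/v1/uClassify/Sentiment/classify",
    headers={"Authorization": "Token YOUR_READ_API_KEY_HERE"},
    json={"texts": ["I am so happy today"]},
)
print(response.json())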

The old XML and URL API endpoints will of course continue to work as before.

Latest version

Machine learning is certainly picking up; we are getting a lot more users and requests, and we are really excited about this. So we started 2016 with a maintenance update, mostly fixes:

  • Fixed broken xml schema links
  • Fixed keyword extraction for classifiers using features other than unigrams (e.g. the Sentiment classifier)
  • Fixed Twitter external login
  • Updated backend libs to latest
  • Added service terms
  • Pricing adjustments: Indie local server 99 EUR -> 299 EUR, Enterprise 5999 EUR -> 3999 EUR

Right now we are working on the much wished-for JSON API, which will be in the next major release.

Update with limit changes

The last major update has been running very smoothly; this is the first patch since then!

Max request size limit increased

After feedback from the community I’ve increased the maximum allowed request size from 1MB to 3MB. I will monitor the servers and make sure this works fine. Maybe it’s possible to increase it further.

Max query string length increased

After the last update, when I updated the IIS server, the default max request URL length was lower than before. Thanks to Liz, who noticed this. I’ve now set the max size to 65 KB.

Max free calls per day decreased

When I looked at the call statistics it didn’t make much sense to offer 5000 free calls per day. Most people aren’t even close to this, so by lowering it to 1000 calls per day only a few will be affected, and most will not notice anything. This is also motivated by looking at competitors’ free limits, and 1000 calls per day is still very generous. Let me know if you have any questions about this.

Bugs

Besides fixing some typos (thanks to everyone who reported them), I’ve made it so you can’t publish untrained classifiers, and fixed the front page buttons so they work better on small displays. I’ve also unpublished previously published classifiers that are untrained.

Future

I am extremely happy with the performance of the new Sentiment classifier. It uses a new version of the classifier that, among other things, looks at combinations of words. Tests show that this type of classifier improves performance on all tested datasets, so I am trying to figure out how to use it for all new classifiers, but it does require some work.

Let me know if you have any questions.

@jonkagstrom

Sentiment Analysis Api

A sentiment analyzer tells you if a text is positive or negative, for example "I love the new Mad Max Fury road" (positive) or "i am not impressed by the bike" (negative). The Sentiment classifier hosted by uClassify is very popular, so I decided to spend some time improving it.

The goal was to improve classification accuracy, especially for short texts such as Twitter messages, Facebook statuses and other snippets, while maintaining high-quality results on texts with more information.

The old Sentiment classifier was built from 40k Amazon product reviews. The straightforward way to improve a classifier is to add more data, and thanks to the Internet we were able to find multiple data sources to train the classifier on. In fact, it’s now trained on 2.8 million documents!

The results are very good: the accuracy on large documents (reviews) went from about 75% to 83%, and tweets went from 63% to about 77%.

You can play with it here; there is also an API available (free to use).

The datasets used are from Sentiment140 (Twitter), Amazon product reviews and Rotten Tomatoes.

Image by Anna Gathu