Language Detector for 370+ major and rare languages

We have constructed a language detector that covers about 374 languages.

It can detect both living and extinct languages (e.g. English and Tupí), identify ancient and constructed languages (e.g. Latin and Klingon) and even distinguish different dialects.

Each language class is named with its English name followed by an underscore and the corresponding ISO 639-3 three-letter code, for example:

  • Swedish_swe
  • English_eng
  • Chinese_zho
  • Mesopotamian Arabic_acm

You can try it here; it needs a few words of text to make accurate detections.

Some of the rare languages (about 30) may have insufficient training data; the idea is to improve the classifier as more documents are gathered. We may also add more languages in the future, so make sure your code can handle that.
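If you consume the class names programmatically, a minimal Python sketch like the one below (my own illustration, not part of the product) splits a class name into its English name and ISO 639-3 code, and lets languages added later pass through unchanged:

def parse_language_class(class_name: str) -> tuple[str, str]:
    # Split on the *last* underscore so names containing spaces
    # ("Mesopotamian Arabic_acm") stay intact.
    name, _, iso_code = class_name.rpartition("_")
    if not name or len(iso_code) != 3:
        raise ValueError(f"Unexpected class name format: {class_name!r}")
    return name, iso_code

print(parse_language_class("Swedish_swe"))              # ('Swedish', 'swe')
print(parse_language_class("Mesopotamian Arabic_acm"))  # ('Mesopotamian Arabic', 'acm')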

Here is the full list of supported languages:

Language Name ISO 639-3 Type
Abkhazian abk living
Achinese ace living
Adyghe ady living
Afrihili afh constructed
Afrikaans afr living
Ainu ain living
Akan aka living
Albanian sqi living
Algerian Arabic arq living
Amharic amh living
Ancient Greek grc historical
Arabic ara living
Aragonese arg living
Armenian hye living
Arpitan frp living
Assamese asm living
Assyrian Neo-Aramaic aii living
Asturian ast living
Avaric ava living
Awadhi awa living
Aymara aym living
Azerbaijani aze living
Balinese ban living
Bambara bam living
Banjar bjn living
Bashkir bak living
Basque eus living
Bavarian bar living
Baybayanon bvy living
Belarusian bel living
Bengali ben living
Berber ber living
Bhojpuri bho living
Bishnupriya bpy living
Bislama bis living
Bodo brx living
Bosnian bos living
Breton bre living
Bulgarian bul living
Buriat bua living
Burmese mya living
Catalan cat living
Cebuano ceb living
Central Bikol bcl living
Central Huasteca Nahuatl nch living
Central Khmer khm living
Central Kurdish ckb living
Central Mnong cmo living
Chamorro cha living
Chavacano cbk living
Chechen che living
Cherokee chr living
Chinese zho living
Choctaw cho living
Chukot ckt living
Church Slavic chu ancient
Chuvash chv living
Coastal Kadazan kzj living
Cornish cor living
Corsican cos living
Cree cre living
Crimean Tatar crh living
Croatian hrv living
Cuyonon cyo living
Czech ces living
Danish dan living
Dhivehi div living
Dimli diq living
Dungan dng living
Dutch nld living
Dutton World Speedwords dws constructed
Dzongkha dzo living
Eastern Mari mhr living
Egyptian Arabic arz living
Emilian egl living
English eng living
Erzya myv living
Esperanto epo constructed
Estonian est living
Ewe ewe living
Extremaduran ext living
Faroese fao living
Fiji Hindi hif living
Finnish fin living
French fra living
Friulian fur living
Fulah ful living
Gagauz gag living
Galician glg living
Gan Chinese gan living
Ganda lug living
Garhwali gbm living
Georgian kat living
German deu living
Gilaki glk living
Gilbertese gil living
Goan Konkani gom living
Gothic got ancient
Guarani grn living
Guerrero Nahuatl ngu living
Gujarati guj living
Gulf Arabic afb living
Haitian hat living
Hakka Chinese hak living
Hausa hau living
Hawaiian haw living
Hebrew heb living
Hiligaynon hil living
Hindi hin living
Hmong Daw mww living
Hmong Njua hnj living
Ho hoc living
Hungarian hun living
Iban iba living
Icelandic isl living
Ido ido constructed
Igbo ibo living
Iloko ilo living
Indonesian ind living
Ingrian izh living
Interlingua ina constructed
Interlingue ile constructed
Iranian Persian pes living
Irish gle living
Italian ita living
Jamaican Creole English jam living
Japanese jpn living
Javanese jav living
Jinyu Chinese cjy living
Judeo-Tat jdt living
K’iche’ quc living
Kabardian kbd living
Kabyle kab living
Kadazan Dusun dtp living
Kalaallisut kal living
Kalmyk xal living
Kamba kam living
Kannada kan living
Kara-Kalpak kaa living
Karachay-Balkar krc living
Karelian krl living
Kashmiri kas living
Kashubian csb living
Kazakh kaz living
Kekchí kek living
Keningau Murut kxi living
Khakas kjh living
Khasi kha living
Kinyarwanda kin living
Kirghiz kir living
Klingon tlh constructed
Kölsch ksh living
Komi kom living
Komi-Permyak koi living
Komi-Zyrian kpv living
Kongo kon living
Korean kor living
Kotava avk constructed
Kumyk kum living
Kurdish kur living
Ladin lld living
Ladino lad living
Lakota lkt living
Lao lao living
Latgalian ltg living
Latin lat ancient
Latvian lav living
Laz lzz living
Lezghian lez living
Láadan ldn constructed
Ligurian lij living
Lingala lin living
Lingua Franca Nova lfn constructed
Literary Chinese lzh historical
Lithuanian lit living
Liv liv living
Livvi olo living
Lojban jbo constructed
Lombard lmo living
Louisiana Creole lou living
Low German nds living
Lower Sorbian dsb living
Luxembourgish ltz living
Macedonian mkd living
Madurese mad living
Maithili mai living
Malagasy mlg living
Malay zlm living
Malay msa living
Malayalam mal living
Maltese mlt living
Mambae mgm living
Mandarin Chinese cmn living
Manx glv living
Maori mri living
Marathi mar living
Marshallese mah living
Mazanderani mzn living
Mesopotamian Arabic acm living
Mi’kmaq mic living
Middle English enm historical
Middle French frm historical
Min Nan Chinese nan living
Minangkabau min living
Mingrelian xmf living
Mirandese mwl living
Modern Greek ell living
Mohawk moh living
Moksha mdf living
Mon mnw living
Mongolian mon living
Morisyen mfe living
Moroccan Arabic ary living
Na nbt living
Narom nrm living
Nauru nau living
Navajo nav living
Neapolitan nap living
Nepali npi living
Nepali nep living
Newari new living
Ngeq ngt living
Nigerian Fulfulde fuv living
Niuean niu living
Nogai nog living
North Levantine Arabic apc living
North Moluccan Malay max living
Northern Frisian frr living
Northern Luri lrc living
Northern Sami sme living
Norwegian nor living
Norwegian Bokmål nob living
Norwegian Nynorsk nno living
Novial nov constructed
Nyanja nya living
Occitan oci living
Official Aramaic arc ancient
Ojibwa oji living
Old Aramaic oar ancient
Old English ang historical
Old Norse non historical
Old Russian orv historical
Old Saxon osx historical
Oriya ori living
Orizaba Nahuatl nlv living
Oromo orm living
Ossetian oss living
Ottoman Turkish ota historical
Palauan pau living
Pampanga pam living
Pangasinan pag living
Panjabi pan living
Papiamento pap living
Pedi nso living
Pennsylvania German pdc living
Persian fas living
Pfaelzisch pfl living
Picard pcd living
Piemontese pms living
Pipil ppl living
Pitcairn-Norfolk pih living
Polish pol living
Pontic pnt living
Portuguese por living
Prussian prg living
Pulaar fuc living
Pushto pus living
Quechua que living
Quenya qya constructed
Romanian ron living
Romansh roh living
Romany rom living
Rundi run living
Russia Buriat bxr living
Russian rus living
Rusyn rue living
Samoan smo living
Samogitian sgs living
Sango sag living
Sanskrit san ancient
Sardinian srd living
Saterfriesisch stq living
Scots sco living
Scottish Gaelic gla living
Serbian srp living
Serbo-Croatian hbs living
Seselwa Creole French crs living
Shona sna living
Shuswap shs living
Sicilian scn living
Silesian szl living
Sindarin sjn constructed
Sindhi snd living
Sinhala sin living
Slovak slk living
Slovenian slv living
Somali som living
South Azerbaijani azb living
Southern Sami sma living
Southern Sotho sot living
Spanish spa living
Sranan Tongo srn living
Standard Latvian lvs living
Standard Malay zsm living
Sumerian sux ancient
Sundanese sun living
Swabian swg living
Swahili swa living
Swahili swh living
Swati ssw living
Swedish swe living
Swiss German gsw living
Tagal Murut mvv living
Tagalog tgl living
Tahitian tah living
Tajik tgk living
Talossan tzl constructed
Talysh tly living
Tamil tam living
Tarifit rif living
Tase Naga nst living
Tatar tat living
Telugu tel living
Temuan tmw living
Tetum tet living
Thai tha living
Tibetan bod living
Tigrinya tir living
Tok Pisin tpi living
Tokelau tkl living
Tonga ton living
Tosk Albanian als living
Tsonga tso living
Tswana tsn living
Tulu tcy living
Tupí tpw extinct
Turkish tur living
Turkmen tuk living
Tuvalu tvl living
Tuvinian tyv living
Udmurt udm living
Uighur uig living
Ukrainian ukr living
Umbundu umb living
Upper Sorbian hsb living
Urdu urd living
Urhobo urh living
Uzbek uzb living
Venda ven living
Venetian vec living
Veps vep living
Vietnamese vie living
Vlaams vls living
Vlax Romani rmy living
Volapük vol constructed
Võro vro living
Walloon wln living
Waray war living
Welsh cym living
Western Frisian fry living
Western Mari mrj living
Western Panjabi pnb living
Wolof wol living
Wu Chinese wuu living
Xhosa xho living
Xiang Chinese hsn living
Yakut sah living
Yiddish yid living
Yoruba yor living
Yue Chinese yue living
Zaza zza living
Zeeuws zea living
Zhuang zha living
Zulu zul living

Attribution

The classifier has been trained by reading texts in many different languages. Finding high-quality, non-noisy texts is really difficult. Many thanks to

  1. Wikipedia, which exists in so many languages
  2. Tatoeba, which is a great resource for clean sentences in many languages

About the translation algorithm

A brief introduction to our machine translation algorithm

We have implemented statistical machine translation (SMT). SMT is completely data-driven: it works by calculating word and phrase probabilities from a large corpus. We have used OPUS and Wiktionary as our primary sources.

Data models

From the data sources (mostly bilingual parallel language sets) a dictionary of translations is constructed. For each translation we keep a count and part-of-speech tags for both source and target. These are our translation and POS models, and they look something like this:

Translation & pos models
source word|source pos tags|translation count|target word|target pos tags
om|conj|12|if|conj
om|adv|7|again|adj
övermorgon|adv|3|the day after tomorrow|det noun prep noun
...
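In code, the rows above could be pictured roughly like this (an illustrative Python sketch, not our actual implementation):

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class TranslationEntry:
    source_pos: str   # POS tags of the source phrase
    count: int        # how often this translation was seen in the corpus
    target: str       # target phrase (may be longer or shorter than the source)
    target_pos: str   # POS tags of the target phrase

# dictionary: source phrase -> list of possible translations
translation_model: dict[str, list[TranslationEntry]] = defaultdict(list)
translation_model["om"].append(TranslationEntry("conj", 12, "if", "conj"))
translation_model["om"].append(TranslationEntry("adv", 7, "again", "adj"))
translation_model["övermorgon"].append(
    TranslationEntry("adv", 3, "the day after tomorrow", "det noun prep noun"))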

For the target language, a language model and a grammar model are used. Each consists of n-grams of length 1-5. The language model consists of word sequences and their frequencies, the grammar model of POS tag sequences and their frequencies:

Language model
phrase|count
hello world|493920
hi world|19444
...
Grammar model
pos tags|count
prep noun|454991
prep prep|3183
...
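Both models can be handled the same way in code: a count table over n-grams, queried for relative frequencies. A rough sketch:

from collections import Counter

class NgramModel:
    def __init__(self):
        self.counts = Counter()   # n-gram -> count
        self.totals = Counter()   # n-gram length -> total count

    def add(self, ngram: tuple[str, ...], count: int) -> None:
        self.counts[ngram] += count
        self.totals[len(ngram)] += count

    def probability(self, ngram: tuple[str, ...]) -> float:
        # relative frequency among n-grams of the same length
        total = self.totals[len(ngram)]
        return self.counts[ngram] / total if total else 0.0

language_model = NgramModel()
language_model.add(("hello", "world"), 493920)
language_model.add(("hi", "world"), 19444)
print(language_model.probability(("hello", "world")))  # about 0.96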

Building a graph

So we have data. Plenty of data. Now we just need to make use of it. When a text is translated, a graph is built between all possible translations. Most of the time each word has multiple translations and meanings, so the number of combinations grows very quickly. During graph building we also need to remember that source phrases can contract, e.g. 'i morgon' => 'tomorrow', and expand, e.g. 'övermorgon' => 'the day after tomorrow'.

We look at a maximum of 5 words at a time. Once the graph is built, a traversal is initiated. As we traverse the graph, the sub-phrases we encounter are scored and the best path is chosen.

Graph for 'hej världen!'
hej       världen       !
--------------------------
Translations:
hi        world         !
hello     universe
howdy     earth
hey

Combinations:
hi        world         !
hi        universe      !
hi        earth         !
hello     world         !
hello     universe      !
hello     earth         !
...
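A toy sketch of the graph-building step, using a tiny made-up dictionary and allowing source spans of up to 5 words so that contractions and expansions are covered:

MAX_SPAN = 5

# toy dictionary: source phrase -> possible target phrases (purely illustrative)
dictionary = {
    "hej": ["hi", "hello", "howdy", "hey"],
    "världen": ["world", "universe", "earth"],
    "i morgon": ["tomorrow"],
    "övermorgon": ["the day after tomorrow"],
    "!": ["!"],
}

def build_lattice(words: list[str]) -> dict[int, list[tuple[int, str]]]:
    """For each start index, list (end index, translation) edges."""
    lattice = {i: [] for i in range(len(words))}
    for start in range(len(words)):
        for end in range(start + 1, min(start + MAX_SPAN, len(words)) + 1):
            source_phrase = " ".join(words[start:end])
            for target in dictionary.get(source_phrase, []):
                lattice[start].append((end, target))
    return lattice

print(build_lattice(["hej", "världen", "!"]))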

Unfortunately there is no way to examine all translations, so we need to traverse the graph intelligently. We use a beam search with limited depth and width to get the search space down to a manageable scale.
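Continuing the sketch above, a beam search keeps only the most promising partial translations at each step; the scoring function is a stand-in for the phrase scoring described in the next section:

BEAM_WIDTH = 8

def beam_search(lattice, n_words, score_prefix):
    # each hypothesis: (score, position reached in the source, target words so far);
    # assumes every source position has at least one outgoing edge
    beam = [(0.0, 0, [])]
    finished = []
    while beam:
        candidates = []
        for _, pos, target in beam:
            for end, translation in lattice[pos]:
                new_target = target + translation.split()
                hyp = (score_prefix(new_target), end, new_target)
                (finished if end == n_words else candidates).append(hyp)
        # prune: keep only the best few hypotheses (the "beam")
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:BEAM_WIDTH]
    return max(finished, key=lambda f: f[0])

# toy scorer (prefers shorter output) just to make the sketch run,
# using build_lattice from the previous sketch
words = ["hej", "världen", "!"]
print(beam_search(build_lattice(words), len(words), lambda t: -len(t)))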

Scoring phrases

The scoring of each phrase combines the four aforementioned aspects of the language:

Translation model: This is the dictionary, source -> target. Each entry has a frequency, and from the frequency we can calculate a probability (p1): "the most likely translation for 'hej' is 'hello'".

Source grammar model: The POS tags help us resolve ambiguity. A probability (p2) is calculated, basically saying "'hej'/'hello' is likely an interjection".

Target language model: We look at 1-5 grams. An n-gram is a sequence of words, for example "hello world" is a 2-gram. Each n-gram has a frequency indicating how common it is. Again a probability (p3) can be calculated: "the sequence 'hello world' is more likely than 'hi world'".

Target grammar model: Just like the language model, but with POS tags. A probability (p4) is calculated, indicating things like "yes, a verb followed by a preposition sounds better than two prepositions in a row".

We use a sliding window moving over the phrase, combining probabilities with the chain rule into accumulated P1-P4. We end up with four parameters that are finally mixed with different weights according to

score = P1^w1 * P2^w2 * P3^w3 * P4^w4

Working in log space makes life easier here. Then we just select the phrase with the highest score.
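In log space the weighted product becomes a weighted sum, which is both faster and numerically safer. A sketch with made-up probabilities (the weights roughly follow the estimates in the next paragraph):

import math

weights = {"translation": 1.0, "source_grammar": 0.6, "language": 0.3, "target_grammar": 0.05}

def score(probabilities: dict[str, float]) -> float:
    # log(P1^w1 * P2^w2 * P3^w3 * P4^w4) = w1*log P1 + w2*log P2 + w3*log P3 + w4*log P4
    return sum(weights[name] * math.log(p) for name, p in probabilities.items())

candidate_a = {"translation": 0.4, "source_grammar": 0.7, "language": 0.02, "target_grammar": 0.3}
candidate_b = {"translation": 0.3, "source_grammar": 0.5, "language": 0.001, "target_grammar": 0.4}
print(score(candidate_a) > score(candidate_b))  # True: candidate_a wins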

We estimate the weights (w1-w4) by a randomized search that tries to maximize the BLEU score on a test set. The estimation only needs to be rerun when the training data changes. As expected, the highest weight is assigned to the translation model (w1 = 1), the second highest to the source grammar model (w2 ~ 0.6), the third highest to the target language model (w3 ~ 0.3) and finally the target grammar model (w4 ~ 0.05). As it turns out, the target grammar model is not very important; it helps to resolve uncertainty in some cases by predicting POS tags, but I might actually nuke it to favor simplicity in future versions.
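The weight estimation itself can be sketched as a plain randomized search; the evaluation step (translate a held-out test set, compute BLEU) is assumed to be supplied by the caller:

import random

def estimate_weights(evaluate, iterations=1000):
    """evaluate(weights) -> BLEU score on a test set; supplied by the caller."""
    best_weights, best_bleu = None, -1.0
    for _ in range(iterations):
        weights = [1.0] + [random.uniform(0.0, 1.0) for _ in range(3)]  # keep w1 fixed at 1
        bleu = evaluate(weights)
        if bleu > best_bleu:
            best_weights, best_bleu = weights, bleu
    return best_weights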

There were plenty of unmentioned problems to be solved along the way, but you get the overall idea. One thing that easily puts you off is the size of the data you are dealing with, e.g. downloading and processing TB-sized datasets like the Google n-grams. At one point, after 4 days of processing those huge zip files, Windows Update decided to restart the computer…

Translation API

We get a lot of requests for classifiers in different languages, and as a next step we are building a translation API. The idea is to have an affordable in-house machine translation service that can quickly translate requests into the classifier's language, classify the request and send back the response. Since the majority of classifiers are in English, the primary focus will be on translating to English.

Initially we support French, Spanish and Swedish to English translations.

Translation demo

The API is accessible with your ordinary API read key over a GET/POST REST protocol.
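Just to illustrate the read-key-plus-REST idea, here is a purely hypothetical Python call; the URL and parameter names are invented placeholders, not the documented endpoint (see the link below for the real protocol):

import requests  # third-party: pip install requests

def translate(text, read_key, to_lang="en"):
    # placeholder URL and parameters, for illustration only
    response = requests.post(
        "https://example.invalid/translate",
        data={"readkey": read_key, "text": text, "to": to_lang},
    )
    return response.text

# translate("hej världen!", "your-read-key")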

You can test and read all about the translation API here.


Please don’t hesitate to report any weirdness to me!

What can machine learning teach us about the Bechdel test?

Disclaimer: I made this experiment out of curiosity, not academia. I haven't double-checked the results and I have used arbitrary, feel-good-in-my-guts constants when running the tests.

In the last post I built a classifier from the subtitles of movies that had failed and passed the Bechdel test. I used a dataset of about 2400 movie subtitles, each labeled with whether or not the movie had passed the Bechdel test. The list of labels was obtained from bechdeltest.com.

In this post I will explore the inner workings of the classifier. What words and phrases reveal if a movie will pass or fail?

Let's just quickly recap what the Bechdel test is. It tests a movie for the following:

  1. The movie has to have at least two women in it,
  2. who talk to each other,
  3. about something besides a man.

Keywords from subtitles

It’s possible to extract keywords from classifiers. Keywords are discriminating indicators (words, phrases) for a specific class (passed or failed). There are many ways to weight them; I let the classifier sort every keyword according to the probability of it belonging to a class.
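As a toy illustration of the idea (not the uClassify internals), words can be ranked by a smoothed estimate of how likely they are to come from the 'failed' class:

from collections import Counter

# made-up counts of how many movies in each class contain a word
failed_counts = Counter({"rifle": 40, "lads": 25, "lipstick": 2})
passed_counts = Counter({"rifle": 5, "lads": 3, "lipstick": 30})

def p_failed(word: str, alpha: float = 1.0) -> float:
    # Laplace-smoothed probability that a word indicates the 'failed' class
    f = failed_counts[word] + alpha
    p = passed_counts[word] + alpha
    return f / (f + p)

vocabulary = set(failed_counts) | set(passed_counts)
for word in sorted(vocabulary, key=p_failed, reverse=True):
    print(f"{word:10s} {p_failed(word):.2f}")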

Common, strong keywords

To get a general overview, we can disregard the most extreme keywords and instead consider keywords that appear more frequently. I extracted keywords that occurred in at least 100 different movies (which is about 5% of the entire dataset).

To start with I looked at unigrams (single words), removed non-alphanumeric characters and transformed the text to lower case. To visualize the result I created two word clouds: one with keywords that indicate a failed test, and one with keywords that are discriminative for a passed test.
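The preprocessing amounts to something like this small sketch, which also shows why '47' turns up as a keyword on its own:

import re

def unigrams(text: str) -> list[str]:
    # lowercase and keep only runs of alphanumeric characters
    return re.findall(r"[a-z0-9]+", text.lower())

print(unigrams("Hand me the AK-47, lads!"))  # ['hand', 'me', 'the', 'ak', '47', 'lads']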

Bigger words mean a higher probability of either failing or passing.

Subtitle keywords indicating a failed Bechdel test

Keywords like ‘lads’, ‘assault’, ‘rifle’, ’47’ (AK-47) and ‘russian’ seem to indicate a failed Bechdel test. Words like ‘logic’, ‘solved’, ‘systems’, ‘capacity’ and ‘civilization’ are also indicators of a failed Bechdel test.

Subtitle keywords indicating a passed Bechdel test

The word ‘boobs’ appears a lot more in subtitles of movies that passed the Bechdel test than in those that failed. I don’t know why, but I’ve double-checked it. Overall it’s a lot of ‘lipstick’, ‘babies’, ‘washing’, ‘dresses’ and so on.

Keywords only from 2014 and 2015, did anything change?

The word clouds above are generated from movies from 1892 up until now, so I wanted to check whether anything has changed recently. Below are two word clouds from 2014 and 2015 only. There was less training data (97 and 142 movies), and I only looked at words that appeared in 20 or more titles to avoid extreme features.

Recent subtitle keywords indicating a failed Bechdel test

Looking at the recent failed word cloud, it seems like there are fewer lads, explosions and AK-47s. Also, Russia isn’t as scary anymore; goodbye, the 80s. In general it’s less of the war stuff?

Recent subtitle keywords indicating a passed Bechdel test

From a quick glance it seems like something is different in the passed cloud too; we find words like ‘math’, ‘invented’, ‘developed’, ‘adventure’ and ‘robert’. Wait, what, Robert? It seems ‘Robert’ occurs in 20 movies that passed and 3 that failed over the last two years. Robert is probably noise (the dataset is too small). Furthermore, words like ‘washing’, ‘mall’, ‘slut’ and ‘shopping’ have been neutralized. Interestingly, a new, modern keyword, ‘texted’, is used a lot in movies that passed the Bechdel test.

From a very informal point of view, it looks like we are moving in the right direction. But I think that for a better understanding of how language has changed over time from a Bechdel perspective, it’s necessary to set up a more controlled experiment, one where you can follow keywords over time as they gain and lose usage, like Google Trends. Please feel free to explore it and let me know what you find out 😉

Looking at a recent movie, Mad Max: Fury Road

I decided to examine the subtitles of a recent movie that had passed the test, Mad Max: Fury Road. To do this I trained a classifier on all subtitles since 1892, except the ones from Mad Max movies, and then extracted the keywords from the Mad Max: Fury Road subtitles.

Mad Max Fury Road subtitle keywords indicating a failed Bechdel test
Mad Max Fury Road subtitle keywords indicating a passed Bechdel test

This movie passes the Bechdel test. An interesting point is that despite the anecdotal presence of words such as ‘babies’, ‘girly’ and ‘flowers’ (in the passed class), the words that surface are not linked to traditional femininity, unlike in many other movies that have passed the test. Overall it’s much harder to differentiate between the two clouds.

If you haven’t seen it yet go and watch it, it’s very good!

Conclusion

If my experiment is carried out correctly, or at least well enough (read the disclaimer at the top), passing the Bechdel test doesn’t imply a gender-equal movie. Even if it certifies the movie has…

  1. At least two women
  2. that speak to each other
  3. about something else than men …

…unfortunately this ‘something else than men’ often seems to be something linked to ‘traditional femininity’. The good news is that, looking only at more recent data, the trend seems to be getting more neutral: ‘washing’ is falling down the list while ‘adventure’ rises.

It would be interesting to come up with a test that captures the content as well as how women (and others) are represented. Designing the perfect test will probably be infinitely hard, especially for us humans; it seems like we have a hard time settling whether or not any given movie is gender equal (just google any movie discussion). Perhaps, with enough data, machine learning can design a test that reveals a multidimensional score of how well and why a movie passes or fails certain tests, not only examining gender but looking at all kinds of diversity.

Finally, just for the sake of clarity, I don’t think the Bechdel test is bad; it certainly helps us think about women’s representation in movies. But maybe don’t always expect a non-sexist, gender-equal movie just because it passes the Bechdel test.

Credits

Much kudos to bechdeltest.com for maintaining their database. Thanks to omdbapi.com for a simple-to-use API. The word clouds were generated with wordclouds.com.

Appendix

For bigrams I also removed non-alphanumeric characters, which is why you can see some weird stuff like ‘you-don’, which should be ‘you-don’t’. However, I decided to keep this because it can capture some interesting features like ‘s-fault’ (e.g. ‘dad’s fault’).

Spaces have been replaced by ‘-‘ so that the word cloud bigrams make sense.

All time bigram keywords

Subtitle bigrams indicating a failed Bechdel test
Subtitle bigrams indicating a passed Bechdel test

One interesting thing here is the ‘your men’ vs ‘you girls’. I will leave the analysis to you 😉

2014 and 2015 bigram keywords

Recent subtitle bigrams indicating a failed Bechdel test
Recent subtitle bigrams indicating a passed Bechdel test

Can machine learning predict if a movie passes the Bechdel test?

To pass the Bechdel test

  1. The movie has to have at least two women in it,
  2. who talk to each other,
  3. about something besides a man.

Doesn’t sound so hard to pass, does it? The test was introduced by Alison Bechdel in 1985 in one of her comics, ‘The Rule’.

The largest database of Bechdel-tested movies is bechdeltest.com. The database contains over 6000 titles from 1892 up until now. What percentage do you think passes the Bechdel test overall? As I write this, about 58% of the movies have passed the test. Statistics from here.

Being interested in machine learning and data, I thought it might be possible to find a textual correlation between movies that fail and pass the test.

Building a classifier that figures this out requires data. It needs labeled samples to learn from: a list of films that pass and fail the test, the more the better. Then for each movie we need to extract features. Features could be the cover, the title, the description, the subtitles, the audio or anything else that is in the movie.

Data & features

I was very happy when I found bechdeltest.com; it has a pretty extensive list exceeding 6000 movie titles, with information on whether each passed the Bechdel test or not. Even better, it has a Bechdel test rating of 0-3, where 0 means the movie fails the first part of the test and 3 means it passes all three parts.

Since I am dealing with text classifiers the natural choices for features were:

– The description

– The subtitles

– The title

The descriptions were retrieved using the omdbapi.com API, which gets the plot from IMDb. I retrieved plots from 2433 failed and 3281 passed movies.

The subtitles were a bit more cumbersome to find. I used about 2400 randomly selected movies and spent some time downloading their subtitles from various sites. Phew.

Finally, the training data for the titles was easily obtained by just creating samples with only the movie titles for each class: in total 2696 and 3669 movie titles.

Results

I set up an environment and ran 10-fold cross-validation on all the data (train on 9/10 of the samples, test on the remaining 1/10, then rotate). For feature extraction I looked at case-insensitive unigrams and bigrams.
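For reference, a rough scikit-learn equivalent of this setup might look like the sketch below; the original experiments used uClassify, so this is only an approximation, and subtitle_texts/pass_fail_labels are assumed to be loaded elsewhere:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def evaluate(texts, labels):
    """Mean accuracy over 10-fold cross-validation on case-insensitive unigrams + bigrams."""
    model = make_pipeline(
        CountVectorizer(lowercase=True, ngram_range=(1, 2)),  # unigrams and bigrams
        MultinomialNB(),
    )
    return cross_val_score(model, texts, labels, cv=10).mean()

# evaluate(subtitle_texts, pass_fail_labels)  # something in the ~0.68 range, per the results below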

I trained a classifier on IMDb plots labeled by whether or not the corresponding movie had passed the test. The classifier turned out to have an accuracy of 67%.

By only reading the subtitles, uClassify was able to predict whether or not a movie would pass with an accuracy of 68%.

One classifier was trained to look only at the movie titles. Its accuracy was about 55%, which is not surprising at all considering how little training text the titles provide.

Finally, I mashed together the subtitles and plots into one classifier, which showed a slight increase in accuracy, to 69%.

Dataset #Failed #Passed Accuracy
Plots 2433 3281 67%
Subtitles 1024 1262 68%
Titles 2696 3669 55%
Subtitles+plot 3457 4543 69%

The combined (subtitles+plots) classifier is available here; you can toss movie plots or subtitles (and probably scripts) at it and it will do its best to predict whether the movie passes the Bechdel test or not.

Conclusion

The predictive accuracy of the classifier may not be the best; it certainly doesn’t figure out the three sub-rules just by looking at unigrams and bigrams. But it does capture something, predicting about 70% correctly. I’m curious to find out exactly what it bases its decisions on and will write another blog post about this.

Update: Here is an analysis.

Credits

Much kudos to bechdeltest.com for maintaining their database. Thanks to omdbapi.com for a simple to use API.

Classifier Visualization

I’m currently working on a new keywords API for uClassify. This will allow users to get information about which words are good discriminators for certain classes. To test this API I spent last weekend building a visualization application for urlai.com.

Here is a screenshot of how the visualization prototype shows data:

 

I would very much like to get some feedback on this; you can try it here and please comment below.

 

Sentiment API

Understanding whether a text is positive or negative is easy for humans, but a lot harder for computers. We can “read between the lines”, get jokes and identify irony. Computers aren’t quite there yet, but the gap is quickly closing.

Our contribution is a free Sentiment API that can help users do market research, run brand surveys and see trends around their campaigns. The API will not only reveal whether a document is positive or negative, it will also indicate how positive or negative it is.

Demo

You can try it directly here and get your own API key by signing up.

Dataset credits

The Amazon reviews dataset was collected and formatted by Mark Dredze at Johns Hopkins University. Many thanks for sharing!

Mattias Östmar manually built the Mood classifier with his sweat and tears; thanks for letting me use it for this classifier. A big hug! (He really likes hugs.)

Techy description

The API is available in XML and REST, with XML/JSON responses.

The sentiment classifier is based on circa 40000 Amazon reviews from 25 different product genres. Documents from the genres are merged into one classifier with two classes, positive and negative.

On top of this I’ve integrated the Mood classifier. My hope is that it will help capture some more emotional traits in texts, and it also complements the Amazon training data since it is taken from another domain (an obvious difference is that it contains swear words).

The expected accuracy is about 78% (macro-precision: 77% and macro-recall: 77%) running 10-fold cross validation.

Image by Anna Gathu.

Category classifier

We’ve received a lot of requests for a topic/category classifier, that is, a classifier that labels a text or webpage with a topic (e.g. ‘Computers’, ‘Sports’ or ‘Games’). The basic idea of uClassify is that users build and share classifiers, and I had been hoping that this classifier would pop up eventually. When I looked through the list of 800+ private classifiers I found a couple of category classifiers, but they were usually used for labeling within a specific domain (e.g. only ‘Sports’ or some narrower topic set). However, no one had yet built a public general topic classifier.

Finding topics

Building a topic classifier is not something you just sneeze out of your nose (as we say in Sweden); it takes some preprocessing. First of all you need to decide which categories to use. Luckily, people have already constructed good structures such as the Yahoo Directory and the Open Directory Project (ODP).

I decided to go with ODP and create a set of hierarchical classifiers describing the first two levels of ODP. The top-level classifier consists of the following topics: Arts, Business, Computers, Games, Health, Home, Recreation, Science, Society and Sports. Note that I’ve removed some from the original ODP (World, Reference, Regional, Shopping and News, for various reasons).

Each topic in the top-level classifier has a corresponding child classifier that in turn consists of all level-2 topics. For example, the ‘Computers’ classifier includes, among others: Algorithms, Artificial Intelligence, Artificial Life, …, Virtual Reality.
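At query time the two levels can simply be chained. Here is a sketch where classify() is a placeholder for a call to the named classifier; the classifier names are illustrative, with the child classifiers following the 'Business Topics' naming mentioned below:

def classify(classifier_name: str, text: str) -> str:
    """Placeholder: call the named classifier and return its best class."""
    raise NotImplementedError

def categorize(text: str) -> tuple[str, str]:
    top = classify("Topics", text)          # e.g. "Computers"
    sub = classify(top + " Topics", text)   # e.g. "Artificial Intelligence"
    return top, sub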

Finding data

ODP provides RDF dumps of their directory: huge XML files (2 GB+) that include the entire directory with topic titles, descriptions and external links. I decided to try making use of this directory, so I wrote a SAX parser that extracts the topics and links. Then I downloaded and cleaned 60 links from each category and used that as training data.
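A sketch of such a SAX handler is below; the element and attribute names are assumptions about the dump format and may need adjusting against the real files:

import xml.sax

class OdpHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current_topic = None
        self.links = []   # list of (topic, url) pairs

    def startElement(self, name, attrs):
        if name == "Topic":                       # assumed element name
            self.current_topic = attrs.get("r:id")
        elif name == "link" and self.current_topic:
            self.links.append((self.current_topic, attrs.get("r:resource")))

handler = OdpHandler()
# xml.sax.parse("content.rdf.u8", handler)   # streaming, since the dump is too big to load at once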

Result

You can try the general topic classifier here, and you can find the sub-classifiers here, named ‘Business Topics’ etc.

Hierarchy

Click on the image to get it in full scale.

Artificial Intelligence to determine an authors age

Young and old people

We have just released ageanalyzer.com, a site that reads a blog and guesses the age of the author!

Background

Our writing style reflects us in many ways; for example, texts written in anger probably differ from words written in joy. Reading a text intuitively gives us a clue about the author, as you start forming a picture in your head. Sometimes it’s easy to pinpoint how you got this picture, and at other times it’s harder.

We wanted to know if we could give computers the same intuition. In this particular project we are interested in finding out whether a computer can tell the age of an author, given only a text.

To do this experiment we collected 7000 blogs that had age information in the profile and split them into 6 different age groups: 13-17, 18-25, 26-35, 36-50, 51-65 and 65+. We then created a classifier on uClassify and fed it the training data. Voila!
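The bucketing itself is simple; a small sketch of mapping a profile age to one of the six classes:

AGE_GROUPS = [(13, 17), (18, 25), (26, 35), (36, 50), (51, 65), (66, 200)]

def age_group(age: int) -> str:
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return "65+" if low == 66 else f"{low}-{high}"
    raise ValueError(f"age {age} is outside the training range")

print(age_group(23))  # '18-25'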

Expected results

After running tests on the training data (10-fold cross-validation) it was clear that our classifier was able to find differences between the six age groups. We expect the proportion of correctly classified blogs to be around 30%, compared to the baseline of about 17% (1 in 6) that would be expected if the classifier were guessing out of the blue.

We have added a poll to the site to help us see how well (or poorly) it works!

Try AgeAnalyzer out here!

Stock prediction results week 20

In short, a much longer evaluation period is needed to verify whether there is any significance in these classifications. It did quite well on Tuesday and Wednesday (60% and 83% correct), badly on Thursday (36%) and about what can be expected by chance on Friday (48%). I may set up a site that feeds predictions automatically over a long period of time. What would also be really interesting is to feed stock news into the training data. Below is detailed information on how the classifications turned out, if anyone is interested!

Friday 15/5

Predicted winners, actual winners:TXI, AVP, AAWW, DD, GEF, WCC, CTSH, ACTG, HIT, AKS, DRQ, AVT, ALB, ACXM, BWS, LTD, CBT, CHRS, DWSN, DUG, FXP, BMS, GYMB, PETM, AFFX, COL, ALOG, ADVNB, AXE, AN, WIT, MDC, DY, OIS, JOSB, IBI, ASPM, TEX

Predicted winners, actual losers: GPI, ANAD, FWRD, KMT, ATPG, JBLU, ESL, USG, ALV, KBR, ANF, AYR, STAA, ADPT, TSO, LINTA, ME, GIB, BC, AIR, BGC, USTR, CBI, EXPE, COG, FCG, ATI, EAC, DOW, ETH, AYI, CIM, SCHN, FNM, ARW, TER, TKTM, AMSG, CEDC, AUO, TRW, ENER, GLBL, CHL, CONN, OI, JOE, HXL, PLL, ADBE, CHU, EMN, COH, APC, CPX, GGB, EEQ, TIE, ENG, AFL, A, FO, X, JRCC, AAI, KBW, CHRD, CAJ, CTX

Predicted losers, actual losers:UNT, GT, ABT, WAB, AWI, WGOV, EQIX, CLR, CRBC, CAM, UEPS, CLF, IP, DEI, UYM, DVN, LDSH, WRES, UBS, UFPI, GLF, WHR, DVR, EGN, NOA, CHK, JCI, CEO, CBR, FIW, DSI, ACV, XEC, HTCH, ENZN, OC, AMR, CRS, KWK, WIN

Predicted losers, actual winners:TTMI, UTSI, TWI, ID, DCI, CNTF, GNK, AAPL, WWW, GGG, COMS, UA, GWR, ARQL

Accuracy (78/161) : 48%

Thursday 14/5

Predicted winners, actual winners:ESL, MDC, PLL, PETM, DVN, COG, LDSH, ACV, TSO, CRS, DLB, PTNR

Predicted winners, actual losers:FXP, DUG, ADPT

Predicted losers, actual losers:KMT, WAB, CRBC, DWSN, ADVNB

Predicted losers, actual winners:CPX, TTMI, UNT, FNM, HIT, UTSI, GT, AWI, GIB, GPI, DD, ME, ANAD, CHRD, TXI, WGOV, EQIX, ATPG, DOW, AIR, ALV, AVP, USG, CLR, ACTG, ATI, CIM, ACXM, TWI, CAM, ID, UEPS, COH, AYR, CLF, BWS, ALOG, CEDC, ANF, IP, DCI, GNK, UYM, AFL

Accuracy (17/47) : 36%

Wednesday 13/5

Predicted winners, actual winners:

Predicted winners, actual losers: HIT, EGN, GIB, DLB, TTMI, GT

Predicted losers, actual losers: FWRD, DWSN, CEDC, UTSI, FNM, KMT, AWI, TXI, ANAD, ME, WGOV, GPI, ATPG, AVP, EQIX, DOW, AIR, ALV, CLR, USG, CRBC, ATI, ACTG, CIM, ACXM, ID, TWI, CAM, AYR, CLF, BWS, USTR, FCG, EAC, DEI, WCC, EXPE, CBI, ALB, GLF, UFPI, WHR, AVT

Predicted losers, actual winners: ESL, ADPT, ANF

Accuracy (43/52) : 83%

Tuesday 12/5

Predicted winners, actual winners: DWSN, UFPI, FXP, ESL, DUG, EXPE, DLB, CONN, DD, COMS

Predicted winners, actual losers: CEDC, DCI, AVT, FWRD, CTSH, GEF, ALV, AUO, DRQ, UNT, AKS, XEC, TXI, FCG, EGN, BWS, AAPL, GIB, COG, AWI, DOW, WCC, EAC, CIM, FIW, EQIX, HIT, USG, ETH, AYR, UEPS, AAWW

Predicted losers, actual losers: WWW, ATI, CP, CRBC, ECA, WHR, TRW, DISCA, GGG, USTR, GWR, CBI, DISH, CLF, ALB, CRS, AYI, AMSG, FFIV, CHL, ARW, DVR, TER, CHRS, GLS, CBT, ACXM, GLBL, UBS, DSI, GT, TTMI, CVC, ANF, TWI, FNM, CKP, ADPT, TSO, CLR, GPI, ANAD, AVP, CAM, GLF, ATPG, CBB, ACTG

Predicted losers, actual winners: GLT, TKTM, ENER, DEI, CBR, UTSI, AIR

Accuracy (58/97) : 60%