Archive for the ‘News’ Category

We moved to Amazon EC2 after a big crash

Monday, January 5th, 2009

During Christmas some unfortunate events occurred: on the 26th of December Ultimahosts (who we were paying to maintain our servers) had a crash and managed to wipe out all our servers. This was very frustrating, but I expected everything to be online again soon, recovered from their backups.

On the 28th they let me know that they had accidentally destroyed all backups. How is it possible for a single datacenter to screw up so much?? I don’t know.

Most classifiers are intact and users registered 17-25 can be recovered

Luckily I had taken manual backups myself: one of all the classifiers on the 25th of December and one of the user database on the 17th of December. This means that most classifiers are intact, but users who registered between the 17th and the 25th of December are gone. If you are one of them, you can re-register with the same username and I will attach it to your old classifiers (send me an e-mail). I am really sorry about this and for the inconvenience it has caused.

New servers on Amazon EC2

I spent over 60 hours reinstalling and moving uclassify to Amazon EC2. This feels really good (now that it’s done). We can easily scale and we have our own solid backup system using Amazon EBS + daily offsite backups.

I’m really sorry for any inconvenience,

Jon Kågström

Ps. Thanks to Google cache I was able to recover all posts for this blog…

LibraryThing announces uClassify competition

Monday, December 22nd, 2008

On LibraryThing you can add your own books to a personal library. By doing this you start to get recommendations, either from other users who have read the same book or automatically from the system. There are also several forums where users can discuss books - just like a really, really big book club. At the time I signed up there were over 34 million books added. I added a couple of books I have recently read and to my surprise all of them already existed in the system, even the Swedish ones. After adding them I was immediately getting lots of recommendations, such as “The Satanic Verses” and “Robot : mere machine to transcendent mind”. Really cool!

Now with all these books some kind of categorization could help.

Competition

LibraryThing is encouraging its users to create something cool with uClassify. The prize is a $100 Amazon gift certificate and Toby Segaran’s “Programming Collective Intelligence”. LibraryThing also presents a couple of cool ideas you can use, such as fiction vs non-fiction. The competition ends on February 1, 2009, so what are you waiting for?

Buzz & Development

Monday, December 8th, 2008

Yesterday we were mentioned on ReadWriteWeb, which generated a lot of visits and, more importantly, classifiers. 30 new classifiers were created within 10 hours. Even though many were created just out of curiosity to quickly test the system, some will hopefully mature and have web applications built around them.

What’s going on techwise

As you have noticed, we are continuously improving our system by carefully adding new features. The following tasks are planned for the GUI:

We will soon install a new, more flexible menu system.

Users will be able to create profiles with descriptions and links. Classifiers should also be able to link to the web site where they are implemented.

Better information about training - right now there is no feedback on how much training has been done or is required. We want to give users an idea of how the training data performs.

What’s going on commercialwise

Everything is free on uClassify and that is how it will stay.

Our commercial idea is to offer companies the possibility to buy their own classification servers. For large databases with texts that need to be classified it’s impractical to send every text for a roundtrip to uclassify.com. Instead, companies could be interested in doing this efficiently locally. A products page with server information will appear soon.

What’s your mood?

Tuesday, December 2nd, 2008

Today, 2 months after our launch, our users have created over 200 classifiers. Most are unpublished and under construction. PRfekt, the team behind the popular Typealyzer, recently published a new classifier that determines the mood of a text - whether a text is happy or upset. You can try it for yourself here!

So let’s test some snippets!

Jamis is (justly) upset and writes:

Is anyone else annoyed by the “just speak your choice” automation in so many telephone menus? I feel like an idiot mumbling “YES!” or “CHECK BALANCE!” into my phone. Maybe it’s the misanthrope in me coming to the front, but I’d much rather push buttons than talk to a pretend person.

The mood classifier says 98.1% upset.

Spam is no fun either, or as Ed-Anger notes:

“I’m madder than a rooster in an empty hen house at Internet spammers and I won’t take it anymore. Those creeps clutter up my e-mail with their junk, everything from penis enlargement pills to some lady telling me she’ll give me a million dollars if I’ll help her get her money out of Africa. “Rush me 10 grand quick as possible and we’ll get the whole thing started,” she says.”

The mood classifier says 97.0% upset.

Now over to some happy blogs. amour-amour has a confession:

“I love my iphone in a way I never thought possible!! When my fiance got his and spent 23 hours gazing at it lovingly, uploading (or is it downloading??) apps and buying accessories for it I put it down to him just being a technology geek.”

The mood classifier says 79.8% happy.

Finally, Nitwik Nastik comments on a Ricky Gervais routine:

“This is a hilarious stand-up routine by British Comedian Ricky Gervais on Bible and Creationism. It’s really funny how he ridicules the creationist stories from the book of Genesis (the book of genesis can be found here)and point out to it’s obvious logical blunders. Sometimes it may be difficult to understand his accent and often he will make some funny comments under his breath, so try to listen carefully.”

The mood classifier says 69.7% happy.

The classifier’s author recommends at least two hundred words (more text than my samples), which seems reasonable!

GenderAnalyzer thoughts

Saturday, November 22nd, 2008

First, thanks to everyone who is testing GenderAnalyzer, we have had incredible feedback. We received emails from many people who are fascinated and a few who think it sucks =) GenderAnalyzer is still generating a lot of traffic and people are blogging about it.

Our learnings

Determining the gender of an author is not easy. Besides the classification itself, there is a chain of technical steps that must all work in order to get a reliable result. As many of you have noticed, the accuracy has dropped to 53%, which is far lower than expected based on our tests. There may be several reasons for this low accuracy and I will mention some of them here.

  • Our training data of 2000 blogs is automatically collected from blogspot. Running internal tests (10-fold cross-validation) on this data gives us an accuracy of 75%. This effectively means “Given that the corpus is a perfect representation of real world data, the classifier is able to give any real world data the correct label with a probability of 75%”. So our training data is probably not very representative; as a matter of fact it’s very stereotypical (see for yourself here). Using data from all kinds of sources should give us a better model.
  • When someone is testing a blog we are not crawling through posts on the blog to get a good amount of text. We are only hitting the given URL and using the text (and HTML) that appears there as test data. So a page with mostly images or frames will give bad test data. Does anyone know a nice library that, given a URL, crawls blog posts? Via RSS perhaps?
  • We are trying to convert test data to UTF-8, which is the encoding of the training data. It could be that we are failing to handle some source encodings.
  • And of course, maybe the difference between male and female writing is simply not significant?
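For readers unfamiliar with the 10-fold cross-validation mentioned in the list above, here is a minimal Python sketch of the idea: split the labelled texts into k folds, train on k-1 of them, test on the held-out fold, and count how often the prediction is right. This is not our actual test code. The function names, the toy word-vote classifier and the toy data are all made up for illustration, and the real classifier works differently.

```python
from collections import Counter
from typing import Callable, Sequence

# A trained model is simply a function from a text to a predicted label.
Predictor = Callable[[str], str]
Trainer = Callable[[Sequence[str], Sequence[str]], Predictor]


def k_fold_accuracy(texts: Sequence[str], labels: Sequence[str],
                    train: Trainer, k: int = 10) -> float:
    """Fraction of samples labelled correctly when each one is held out once."""
    if len(texts) != len(labels) or len(texts) < k:
        raise ValueError("need at least k labelled samples")
    correct = 0
    for fold in range(k):
        # Fold number `fold` holds every k-th sample, starting at index `fold`.
        test_idx = set(range(fold, len(texts), k))
        train_x = [t for i, t in enumerate(texts) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        predict = train(train_x, train_y)
        correct += sum(predict(texts[i]) == labels[i] for i in test_idx)
    return correct / len(texts)


def train_word_votes(texts: Sequence[str], labels: Sequence[str]) -> Predictor:
    """Hypothetical stand-in classifier: each known word votes for the
    class it was seen with most often during training."""
    counts: dict = {}
    for text, label in zip(texts, labels):
        for word in text.lower().split():
            counts.setdefault(word, Counter())[label] += 1
    default = Counter(labels).most_common(1)[0][0]

    def predict(text: str) -> str:
        votes = Counter()
        for word in text.lower().split():
            if word in counts:
                votes[counts[word].most_common(1)[0][0]] += 1
        # Fall back to the most common training label for unseen words.
        return votes.most_common(1)[0][0] if votes else default

    return predict


# Toy data (invented): two perfectly separable classes.
texts = ["alpha beta"] * 10 + ["gamma delta"] * 10
labels = ["a"] * 10 + ["b"] * 10
print(k_fold_accuracy(texts, labels, train_word_votes, k=10))
```

Note that the number this produces only says how well the classifier does on data that looks like the training corpus. That is exactly why a good cross-validation score on stereotypical blogs can sit next to a much lower accuracy in the live poll.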

What’s next?

We are currently collecting a new set of training data that is much more representative. We will switch to this classifier during the next week and start a new poll for it. It’s going to be very exciting!

GenderAnalyzer showdown + server upgrade

Monday, November 3rd, 2008

Today genderanalyzer.com was featured on BoingBoing, and as a result our server could not handle all the requests. We have now upgraded the server and it should be happy to serve all requests.

While the server was unable to respond to all requests, accuracy in the poll dropped from 63% to 55% (since the error message made people vote that it was not guessing right). However, the accuracy is now slowly recovering!

Sorry for any inconvenience this might have caused.

Spam, huh?

Thursday, October 23rd, 2008

We are currently working on a prototype to identify spam blogs - splogs. Spam blogs can be really tricky to identify even for the human eye, as i-trepreneur.com writes in a recent post:

Why? These Splogs are user friendly. They were not made for search engines but for real visitors. There’s excellent design, well organized sections, working RSS feed. All the information on such Splogs is manually selected from the most popular resources on the net and is properly referenced. Only fresh content is used so it is not identified as duplicate instantly.

The post points out that madconomist dot com and business-opportunities dot biz are two well-made splogs which people are commenting on and linking to. I can’t tell by just looking at them with my bare eyes, so is it spam, huh? A later post will cover that philosophical aspect!

A prototype

We have set up a prototype to identify spam blogs. Right now it’s really rudimentary but shows potential. In the future, by using clusters of classifiers hosted here at uclassify, we think we can create a sufficiently good splog classifier.

Check out the project here, www.spamhuh.com. Remember that it’s only an early prototype!

As for the two hard-to-detect spam blogs above, spamhuh.com is able to correctly identify one of them :)

Try it out and let us know what you think!!

Everybody can classify

Sunday, October 19th, 2008

Creating your own classifiers has never been easier: we have developed a Click’n’Classify Graphical User Interface (GUI). This means that you can manually create and train your classifiers without knowing any programming at all. This is a very good way to test an idea. If the classifier works well, build your web site around it or use it for whatever purpose.

The GUI allows you to do everything that you can do via our Application Programming Interface (API). Also, just like phpMyAdmin shows the SQL queries, our uClassify GUI will show the XML queries so you can easily understand and use the API from your site.

Features

  • Create and remove classifiers
  • Add and remove classes
  • Train and untrain classes
  • See basic information about your classifiers

Screenshot - Create a classifier

This screenshot shows what it looks like when you are about to create a classifier. Just log in and try it yourself!

Creating a classifier is easy

Screenshot - Training a classifier

Just copy and paste the texts you want to use as training data.

Training a classifier is easy

Happy classifying!

oFaust.com - another site using uclassify

Friday, October 3rd, 2008

We are very happy to announce that yet another site is using the uclassify web service! ofaust.com is a literature expert that finds out which classical author a text most resembles. The developers let us know that it has been trained on over 80 different works of classical authors such as Plato, Shakespeare, Tolstoy and of course Goethe.

o'Faust

The beta is now up and running, please sign up and create your own web site using cool classifications!

uClassify beta!

Wednesday, October 1st, 2008

Today we are very pleased to announce the beta release of a new web service that allows everyone to access text classifiers for free. In short, by using a web API (much like the one Google Maps offers), everyone can create and train their own classifiers.

Two sites using the API already exist. Be inspired and come up with your own classifiers:

Typealyzer.com - Analyzes the personality of a blog author.

GenderAnalyzer.com - Figures out if a text is written by a man or woman.

During beta we will test the server for usability, stability, scalability and performance.

All comments and feedback are much appreciated!!

Best regards,

Jon Kågström, Roger Karlsson and Emil Kågström.