<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>uClassify blog - free text classifier web service</title>
	<atom:link href="http://blog.uclassify.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.uclassify.com</link>
	<description>Free text classifier web service</description>
	<pubDate>Mon, 05 Jan 2009 11:35:47 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>We moved to Amazon EC2 after a big crash</title>
		<link>http://blog.uclassify.com/amazon-ec2/</link>
		<comments>http://blog.uclassify.com/amazon-ec2/#comments</comments>
		<pubDate>Mon, 05 Jan 2009 11:30:41 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[amazon]]></category>

		<category><![CDATA[backup]]></category>

		<category><![CDATA[classifiers]]></category>

		<category><![CDATA[crash]]></category>

		<category><![CDATA[ec2]]></category>

		<category><![CDATA[recovery]]></category>

		<guid isPermaLink="false">http://blog.uclassify.com/?p=64</guid>
		<description><![CDATA[During Christmas some unfortunate events occurred - on the 26th of December Ultimahosts (who we were paying to maintain our servers) had a crash and managed to wipe out all our servers. This was very frustrating, but I expected it to be online again soon, recovered from their backups.
On the 28th they let me know [...]]]></description>
			<content:encoded><![CDATA[<p>During Christmas some unfortunate events occurred - on the 26th of December Ultimahosts (who we were paying to maintain our servers) had a crash and managed to wipe out all our servers. This was very frustrating, but I expected it to be online again soon, recovered from their backups.</p>
<p>On the 28th they let me know that they had accidentally destroyed all backups. How is it possible for a single datacenter to screw up so much?? I don&#8217;t know.</p>
<h3>Most classifiers are intact and users registered 17-25 can be recovered</h3>
<p>Luckily I had taken manual backups myself - one of all the classifiers on the 25th of December and one of the user database on the 17th of December. This means that most classifiers are intact, but users who registered between the 17th and the 25th of December are gone. You guys can re-register with the same username and I will attach it to your old classifiers (send me an e-mail). I am really sorry about this and for the inconvenience it has caused.</p>
<h3>New servers on Amazon EC2</h3>
<p>I spent over 60 hours reinstalling and moving uclassify to Amazon EC2. This feels really good (now that it&#8217;s done). We can easily scale and we have our own solid backup system using Amazon EBS plus daily offsite backups.</p>
<p>I&#8217;m really sorry for any inconvenience,</p>
<p><strong>Jon Kågström</strong><br />
<em><br />
Ps. Thanks to Google cache I was able to recover all posts for this blog&#8230;</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/amazon-ec2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>LibraryThing announces uClassify competition</title>
		<link>http://blog.uclassify.com/librarything-annouces-uclassify-competition/</link>
		<comments>http://blog.uclassify.com/librarything-annouces-uclassify-competition/#comments</comments>
		<pubDate>Tue, 23 Dec 2008 02:36:15 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[categorization]]></category>

		<category><![CDATA[competition]]></category>

		<category><![CDATA[LibraryThing]]></category>

		<category><![CDATA[prize]]></category>

		<category><![CDATA[service]]></category>

		<category><![CDATA[uclassify]]></category>

		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://localhost/?p=20</guid>
		<description><![CDATA[On LibraryThing you can add your own books to a personal library. By doing this you start to get recommendations from either other users who have read the same book or automatically by the system. There are also several forums where users can discuss books - just like a really really big book club. At [...]]]></description>
			<content:encoded><![CDATA[<p>On <a title="LibraryThing" href="http://librarything.com">LibraryThing</a> you can add your own books to a personal library. By doing this you start to get recommendations from either other users who have read the same book or automatically by the system. There are also several forums where users can discuss books - just like a really really big book club. At the time I signed up there were over 34 million books added. I added a couple of books I have recently read and to my surprise all of them already existed in the system, even the Swedish ones. After adding them I was immediately getting lots of recommendations, such as &#8220;The Satanic Verses&#8221; and &#8220;Robot : mere machine to transcendent mind&#8221;. Really cool!</p>
<p>Now with all these books some kind of categorization could help.</p>
<h3>Competition</h3>
<p><a href="http://www.librarything.com/thingology/2008/12/uclassify-library-mashup-with-prize.php" title="uClassify mashup competition">LibraryThing are encouraging</a> their users to create something cool with uClassify. The prize is a $100 Amazon gift certificate and Toby Segaran&#8217;s &#8220;Programming Collective Intelligence&#8221;. LibraryThing also presents a couple of cool ideas which you can use, such as fiction vs non-fiction. The competition ends on February 1, 2009, so what are you waiting for?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/librarything-annouces-uclassify-competition/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tutorial - Creating your own classifier</title>
		<link>http://blog.uclassify.com/tutorial-creating-your-own-classifier/</link>
		<comments>http://blog.uclassify.com/tutorial-creating-your-own-classifier/#comments</comments>
		<pubDate>Thu, 18 Dec 2008 13:36:03 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[Tutorial]]></category>

		<category><![CDATA[add]]></category>

		<category><![CDATA[categorize]]></category>

		<category><![CDATA[categorizer]]></category>

		<category><![CDATA[category]]></category>

		<category><![CDATA[class]]></category>

		<category><![CDATA[classifier]]></category>

		<category><![CDATA[classify]]></category>

		<category><![CDATA[create]]></category>

		<category><![CDATA[News]]></category>

		<category><![CDATA[train]]></category>

		<guid isPermaLink="false">http://localhost/?p=56</guid>
		<description><![CDATA[This is a brief tutorial on how to create your own classifier. I use the term class synonymously with category, and classifier synonymously with categorizer.
1. Determine the classifier domain
Before a classifier can start to classify it needs to be created and trained. First you should ask yourself what you want the classifier to do, is it [...]]]></description>
			<content:encoded><![CDATA[<p>This is a brief tutorial on how to create your own classifier. I use the term class synonymously with category, and classifier synonymously with categorizer.</p>
<h3>1. Determine the classifier domain</h3>
<p>Before a classifier can start to classify it needs to be created and trained. First you should ask yourself what you want the classifier to do: is it a spam filter? A news categorizer? Let&#8217;s assume it&#8217;s a news categorizer for this tutorial. So we create a news classifier with the name &#8216;Example News Categorizer&#8217;.</p>
<div id="attachment_211" class="wp-caption alignnone" style="width: 549px"><img class="size-medium wp-image-211" style="border: 1px solid black;" title="Create a classifier" src="http://blog.uclassify.com/wp-content/uploads/2008/12/createclassifier.jpg" alt="" width="539" height="118" />
<p class="wp-caption-text">Fig 1. Create the classifier</p>
</div>
<h3>2. Define the relevant classes</h3>
<p>Second, you need to define what classes your classifier should include. Choosing relevant classes is straightforward - just ask yourself what categories are relevant for the domain you have chosen. Once you have selected the classes you want the classifier to distinguish between, you create them. This is easy in our Graphical User Interface but can also be done via our web API. For our small example we create the following three classes: Science, Sports and Entertainment. You can create as many classes as you want.</p>
<div id="attachment_217" class="wp-caption alignnone" style="width: 560px"><img class="size-medium wp-image-217" style="border: 1px solid black;" title="Create the classes" src="http://blog.uclassify.com/wp-content/uploads/2008/12/createclasses.jpg" alt="" width="550" height="369" />
<p class="wp-caption-text">Fig 2. Create the classes (categories)</p>
</div>
<p>You can also add and remove classes dynamically - so don&#8217;t worry if you aren&#8217;t 100% sure that you have included them all.</p>
<h3>3. Collect training data</h3>
<p>Before the classifier can start to categorize texts into the classes we need to teach it what texts belonging to the different classes look like. This is the hardest part as it requires you to collect actual training data. You can collect it from any source you find appropriate.</p>
<h4>3.1 Amount of training data</h4>
<p>It&#8217;s hard to generalize about the amount of text needed for a classifier to work, as it&#8217;s highly dependent on the domain. Simple domains, such as classifying the language of a text, only require a small amount, while harder problems, such as telling apart texts written by males and females, require much more training data. However, to test an idea I suggest at least 20 documents per category, with each document in the same format as those that will be classified later (e.g. for a spam filter you train it on e-mails). 20 is the bare minimum - from there the classifier generally gets more accurate as you add more training data.</p>
<p>For our news categorizer I collected 20 plain text articles per class from random sources on the Internet.</p>
<h4>3.2 Automate the collecting!</h4>
<p>In some cases you can automate the data collection by finding trusted sources on the Internet. For example, for our news classifier I could tap into three RSS feeds for Science, Sports and Entertainment and automatically gather the data. Ahhh, no manual collecting!! Nice.</p>
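<p>As a rough sketch of that idea, the Python snippet below pulls the title and description out of every item in an RSS 2.0 document using only the standard library. The feed is an inline, made-up sample so the snippet runs without network access; in practice you would first download each feed, for example with urllib.request.urlopen. The function name texts_from_rss and the sample content are illustrative assumptions, not part of uClassify.</p>

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for a real Sports feed (made-up content).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample sports feed</title>
    <item>
      <title>Home team wins the final</title>
      <description>A late goal decided the match.</description>
    </item>
    <item>
      <title>Record broken in the marathon</title>
      <description>The winner finished two minutes ahead.</description>
    </item>
  </channel>
</rss>"""


def texts_from_rss(feed_xml):
    """Return one training text per item: title plus description."""
    root = ET.fromstring(feed_xml)
    texts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        texts.append((title + " " + description).strip())
    return texts


# One feed per class; here all we have is the sample Sports feed.
training_data = {"Sports": texts_from_rss(SAMPLE_FEED)}
for text in training_data["Sports"]:
    print(text)
```

<p>Feed descriptions are often short excerpts, so for real training data you would probably want to follow each item link and use the full article text instead.</p>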
<h3>4. Train the classifier</h3>
<p>So you have collected training data in some form (perhaps text files on your hard drive or lists of urls or some feeds), and now it&#8217;s time to train the classifier. This can be done manually in the GUI or automated if you have some basic programming skills. For our tutorial I found 20 news articles per class and copied and pasted them manually into the GUI; it took me about 30 minutes.</p>
<div id="attachment_224" class="wp-caption alignnone" style="width: 595px"><a href="http://blog.uclassify.com/wp-content/uploads/2008/12/train.jpg"><img class="size-medium wp-image-224" style="border: 1px solid black;" title="Training" src="http://blog.uclassify.com/wp-content/uploads/2008/12/train.jpg" alt="Screenshot of training" width="585" height="361" /></a>
<p class="wp-caption-text">Fig 3. Training the classifier via the GUI</p>
</div>
<h4>4.1 Automate the training! (requires novice programming skills)</h4>
<p>Training a classifier through the GUI can be cumbersome if you have large amounts of training data. My suggestion is to create a small script in your favorite language that automatically trains the classifier. If your training data is lying around on your machine locally (perhaps automatically collected? =) you can just batch it into our web API. If you haven&#8217;t collected the training data yet you could create a script that automatically collects it and trains the classifier with it!</p>
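<p>A minimal sketch of such a script in Python is shown below. It assumes one sub-folder per class, each holding plain .txt files, and splits every class into small batches. The actual API request is deliberately left out: send_batch is a hypothetical hook where the code that talks to the web API would go, and the folder layout and all function names are assumptions made for this example.</p>

```python
import tempfile
from pathlib import Path


def load_training_texts(root):
    """Map each class name (sub-folder) to the texts of its .txt files."""
    data = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            files = sorted(class_dir.glob("*.txt"))
            data[class_dir.name] = [f.read_text(encoding="utf-8") for f in files]
    return data


def batches(texts, size):
    """Yield consecutive chunks of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def train_all(data, send_batch, batch_size=10):
    """Call send_batch(class_name, chunk) for every chunk of every class.

    send_batch is a hypothetical hook: plug in whatever code sends the
    training request to the web API. Returns the number of calls made.
    """
    calls = 0
    for class_name, texts in data.items():
        for chunk in batches(texts, batch_size):
            send_batch(class_name, chunk)
            calls += 1
    return calls


# Demo with a throwaway folder and a stand-in sender that only prints.
with tempfile.TemporaryDirectory() as root:
    for class_name in ("Science", "Sports"):
        folder = Path(root) / class_name
        folder.mkdir()
        for i in range(3):
            (folder / f"article{i}.txt").write_text(f"{class_name} text {i}", encoding="utf-8")
    data = load_training_texts(root)

train_all(data, lambda name, chunk: print(name, len(chunk)), batch_size=2)
```

<p>Keeping the sending step behind a single function makes it easy to dry-run the script (as the demo does) before pointing it at the real service.</p>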
<h3>5. Start classifying</h3>
<p>This is the fun part: once you have created and trained your classifier you can start to use it. You can always test it in our GUI. Further, you can (and should) build your own web site around it via our web API - providing the world with more semantics and cool classifications that have never been seen before! Also - remember that you can use your classifiers commercially and make money from them!</p>
<p>I&#8217;ve published the example classifier, don&#8217;t expect it to work perfectly - it has only been trained on 20 articles per class! Test it here - <a title="Example News Categorizer" href="http://www.uclassify.com/browse/uClassify/Example-News-Categorizer">Example News Categorizer</a></p>
<h3>Summary</h3>
<ul>
<li>Find out what you want to classify on and create a classifier</li>
<li>Define and create the categories</li>
<li>Collect training data for each category</li>
<li>Train each category on the gathered data</li>
<li>Build a really cool web site around it!</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/tutorial-creating-your-own-classifier/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Buzz &#038; Development</title>
		<link>http://blog.uclassify.com/buzz-development/</link>
		<comments>http://blog.uclassify.com/buzz-development/#comments</comments>
		<pubDate>Tue, 09 Dec 2008 02:35:25 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[classification]]></category>

		<category><![CDATA[commercial]]></category>

		<category><![CDATA[server]]></category>

		<category><![CDATA[uclassify]]></category>

		<guid isPermaLink="false">http://localhost/?p=18</guid>
		<description><![CDATA[Yesterday we were mentioned on ReadWriteWeb, which generated a lot of visits and, more importantly, classifiers. 30 new classifiers were created within 10 hours. Many were created out of curiosity to quickly test the system, but some will hopefully mature and have web applications built around them.
What&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday we were mentioned on <a title="uClassify on ReadWriteWeb" href="http://www.readwriteweb.com/archives/uclassify_create_your_own_text_classifiers.php" target="_blank">ReadWriteWeb</a>, which generated a lot of visits and, more importantly, classifiers. 30 new classifiers were created within 10 hours. Many were created out of curiosity to quickly test the system, but some will hopefully mature and have web applications built around them.</p>
<h3>What&#8217;s going on techwise</h3>
<p>As you have noticed, we are continuously improving our system by carefully adding new features. The following tasks are planned for the GUI:</p>
<p>We will soon install a new, more <strong>flexible menu</strong> system.</p>
<p>Users will be able to <strong>create profiles</strong> with descriptions and links. Classifiers should also be able to link to the web site where they are implemented.</p>
<p><strong>Better information about training</strong> - right now there is no feedback on how much training has been done or is required. We want to give users an idea of how the training data performs.</p>
<h3>What&#8217;s going on commercialwise</h3>
<p>Everything is free on uClassify and that is how it will stay.</p>
<p>Our commercial idea is to offer companies the possibility to buy their own classification servers. For large databases of texts that need to be classified it&#8217;s impractical to send every text on a round trip to uclassify.com. Instead, companies could be interested in doing this efficiently locally. A products page with server information will appear soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/buzz-development/feed/</wfw:commentRss>
		</item>
		<item>
		<title>What’s your mood?</title>
		<link>http://blog.uclassify.com/whats-your-mood/</link>
		<comments>http://blog.uclassify.com/whats-your-mood/#comments</comments>
		<pubDate>Wed, 03 Dec 2008 02:34:03 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[classification]]></category>

		<category><![CDATA[classifier]]></category>

		<category><![CDATA[happy]]></category>

		<category><![CDATA[mood]]></category>

		<category><![CDATA[prfekt]]></category>

		<category><![CDATA[sentiment]]></category>

		<category><![CDATA[uclassify]]></category>

		<category><![CDATA[upset]]></category>

		<guid isPermaLink="false">http://localhost/?p=16</guid>
		<description><![CDATA[Today, 2 months after our launch, our users have created over 200 classifiers. Most are unpublished and under construction. PRfekt, the team behind the popular Typealyzer, recently published a new classifier that determines the mood of a text - whether a text is happy or upset. You can try it for yourself here!
So let&#8217;s test [...]]]></description>
			<content:encoded><![CDATA[<p>Today, 2 months after our launch, our users have created over 200 classifiers. Most are unpublished and under construction. <a title="PRfekt" href="http://www.prfekt.se" target="_blank">PRfekt</a>, the team behind the popular <a title="Typealyzer Myers Briggs classifier" href="http://www.typealyzer.com" target="_blank">Typealyzer</a>, recently published a new classifier that determines the mood of a text - whether a text is happy or upset. You can <a title="PRFekt's mood classifier" href="http://www.uclassify.com/browse/prfekt/Mood" target="_blank">try it for yourself</a> here!</p>
<p>So let&#8217;s test some snippets!</p>
<p><a title="Jamis is upset" href="http://www.37signals.com/svn/posts/1425-when-hi-tech-is-too-much-tech" target="_blank">Jamis</a> is (justly) upset and writes:</p>
<p>&#8220;<em>Is anyone else annoyed by the “just speak your choice” automation in so many telephone menus? I feel like an idiot mumbling “YES!” or “CHECK BALANCE!” into my phone. Maybe it’s the misanthrope in me coming to the front, but I’d much rather push buttons than talk to a pretend person.</em>&#8221;</p>
<p>The <a title="Mood classifier" href="http://www.uclassify.com/browse/prfekt/Mood" target="_blank">mood classifier</a> says <strong>98.1% upset</strong>.</p>
<p>Spam is no fun either, or as <a title="Ed Angers" href="http://www.weeklyworldnews.com/opinion/internet-spammers-are-driving-me-crazy/" target="_blank">Ed-Anger</a> notes:</p>
<p><em>&#8220;I’m madder than a rooster in an empty hen house at Internet spammers and I won’t take it anymore. Those creeps clutter up my e-mail with their junk, everything from penis enlargement pills to some lady telling me she’ll give me a million dollars if I’ll help her get her money out of Africa. “Rush me 10 grand quick as possible and we’ll get the whole thing started,” she says.&#8221;</em></p>
<p>The <a title="Mood classifier" href="http://www.uclassify.com/prfekt/Mood" target="_blank">mood classifier</a> says <strong>97.0% upset</strong>.</p>
<p>Now over to some happy blogs, <a title="Amour-Amour happy" href="http://amouramourblog.blogspot.com/2008/11/i-have-confession.html" target="_blank">amour-amour</a> has a confession:</p>
<p><em>&#8220;I love my iphone in a way I never thought possible!! When my fiance got his and spent 23 hours gazing at it lovingly, uploading (or is it downloading??) apps and buying accessories for it I put it down to him just being a technology geek.&#8221;</em></p>
<p>The <a title="Mood classifier" href="http://www.uclassify.com/browse/prfekt/Mood" target="_blank">mood classifier</a> says <strong>79.8% happy</strong>.</p>
<p>Finally, <a title="Nitwit and Gervais" href="http://nitwitnastik.wordpress.com/2008/11/14/ricky-gervais-bible-and-creationism/" target="_self">Nitwit Nastik</a> comments on a Ricky Gervais routine:</p>
<p><em>&#8220;This is a hilarious stand-up routine by British Comedian Ricky Gervais on Bible and Creationism. It’s really funny how he ridicules the creationist stories from the book of Genesis (the book of genesis can be found here)and point out to it’s obvious logical blunders. Sometimes it may be difficult to understand his accent and often he will make some funny comments under his breath, so try to listen carefully.&#8221;</em></p>
<p>The <a title="Mood classifier" href="http://www.uclassify.com/browse/prfekt/Mood" target="_self">mood classifier</a> says <strong>69.7% happy</strong>.</p>
<p>The author recommends at least two hundred words (more text than my samples) which seems reasonable!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/whats-your-mood/feed/</wfw:commentRss>
		</item>
		<item>
		<title>GenderAnalyzer thoughts</title>
		<link>http://blog.uclassify.com/genderanalyzer-thoughts/</link>
		<comments>http://blog.uclassify.com/genderanalyzer-thoughts/#comments</comments>
		<pubDate>Sun, 23 Nov 2008 02:31:05 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[accuracy]]></category>

		<category><![CDATA[analyzer]]></category>

		<category><![CDATA[classifier]]></category>

		<category><![CDATA[female]]></category>

		<category><![CDATA[gender]]></category>

		<category><![CDATA[genderanalyzer]]></category>

		<category><![CDATA[male]]></category>

		<guid isPermaLink="false">http://localhost/?p=12</guid>
		<description><![CDATA[First, thanks to everyone who is testing GenderAnalyzer. We have had incredible feedback. We received emails from many people who are fascinated and a few who think it sucks =) GenderAnalyzer is still generating a lot of traffic and people are blogging about it.
Our learnings
Determining the gender of an author is not easy, besides the [...]]]></description>
			<content:encoded><![CDATA[<p>First, thanks to everyone who is testing <a title="Gender Analyzer" href="http://www.genderanalyzer.com" target="_blank">GenderAnalyzer</a>. We have had incredible feedback. We received emails from many people who are fascinated and a few who think it sucks =) GenderAnalyzer is still generating a lot of traffic and people are blogging <a title="Gender Analyzer blog search on Google" href="http://blogsearch.google.com/blogsearch?q=genderanalyzer&amp;um=1&amp;ie=UTF-8&amp;scoring=d" target="_blank">about it</a>.</p>
<h3>Our learnings</h3>
<p>Determining the gender of an author is not easy. Besides the classification itself, there is a chain of technical steps that must all work in order to get a reliable result. As many of you have noticed, the accuracy has dropped to 53%, which is far lower than expected based on our tests. There may be several reasons for this low accuracy and I will mention some of them here.</p>
<ul>
<li>Our <strong>training data</strong> of 2000 blogs is automatically collected from blogspot. Running internal tests (10-fold cross validation) on this data gives us an accuracy of 75%. This effectively means <em>“Given that the corpus is a perfect representation of real world data, the classifier is able to give any real world data the correct label with a probability of 75%”</em>. So our training data is probably not very representative; as a matter of fact it&#8217;s very stereotypical (see for yourself <a title="Test Gender Analyzer on word basis" href="http://www.uclassify.com/Browse.aspx" target="_blank">here</a>). Using data from all kinds of sources should give us a better model.</li>
<li>When someone is testing a blog we are not crawling through posts on the blog to get a good <strong>amount of text</strong>. We only fetch the given URL and use the text (and HTML) that appears there as test data. So a page with mostly images or frames will give bad test data. Does anyone know a nice library that, given a URL, crawls blog posts? Via RSS perhaps?</li>
<li>We are trying to <strong>encode</strong> test data to UTF-8, which is the format of the training data - it could be that we are missing some encodings.</li>
<li>And of course, it could be that the difference between male and female writing is <strong>not significant</strong>.</li>
</ul>
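<p>For readers unfamiliar with the 10-fold cross validation mentioned in the first point, here is a small Python sketch of the procedure: the data is split into 10 parts, and each part in turn is held out for testing while the model is trained on the other 9. The word-count classifier and the two-class sample data are toy stand-ins invented for this example; they are not the algorithm or the data behind GenderAnalyzer.</p>

```python
from collections import Counter


def k_fold_accuracy(samples, k, train, predict):
    """Fraction of samples labelled correctly when each is held out once.

    samples is a list of (text, label) pairs.
    """
    correct = 0
    for fold in range(k):
        # Every k-th sample (offset by fold) is held out for testing.
        held_out = samples[fold::k]
        training = [s for i, s in enumerate(samples) if i % k != fold]
        model = train(training)
        correct += sum(1 for text, label in held_out if predict(model, text) == label)
    return correct / len(samples)


def train(samples):
    """Toy model: per-label word counts (not uClassify's algorithm)."""
    model = {}
    for text, label in samples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def predict(model, text):
    """Pick the label whose training words overlap the text the most."""
    words = text.lower().split()
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


# Made-up, perfectly separable data: 10 samples per class.
samples = [("football goal match", "sports"), ("planet star telescope", "science")] * 10
print(k_fold_accuracy(samples, 10, train, predict))  # prints 1.0
```

<p>The toy data is trivially separable, so the score is perfect. The point made above still applies: a cross validation score only tells you how well the classifier does on data that looks like the training corpus.</p>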
<h3>What&#8217;s next?</h3>
<p>We are currently collecting a new set of training data that is much more representative. We will switch to this classifier during the next week and start a new poll for it. It&#8217;s going to be very exciting!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/genderanalyzer-thoughts/feed/</wfw:commentRss>
		</item>
		<item>
		<title>GenderAnalyzer showdown + server upgrade</title>
		<link>http://blog.uclassify.com/genderanalyzer-showdown/</link>
		<comments>http://blog.uclassify.com/genderanalyzer-showdown/#comments</comments>
		<pubDate>Tue, 04 Nov 2008 02:32:32 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[boingboing]]></category>

		<category><![CDATA[genderanalyzer]]></category>

		<guid isPermaLink="false">http://localhost/?p=14</guid>
		<description><![CDATA[Today genderanalyzer.com was featured on BoingBoing, and as a result our server could not handle all the requests. We have now upgraded the server and it should be happy to serve all requests.
While the server was unable to respond to all requests - accuracy in the poll dropped from 63% to 55% (since the error [...]]]></description>
			<content:encoded><![CDATA[<p>Today <a title="Gender Analyzer" href="http://www.genderanalyzer.com">genderanalyzer.com</a> was featured on <a title="GenderAnalyzer on BoingBoing" href="http://www.boingboing.net/2008/11/03/gender-analyzer-did.html" target="_blank">BoingBoing</a>, and as a result our server could not handle all the requests. We have now upgraded the server and it should be happy to serve all requests.</p>
<p>While the server was unable to respond to all requests, accuracy in the poll dropped from 63% to 55% (since the error message made people vote that it&#8217;s not guessing right). However, the accuracy is now slowly recovering!</p>
<p>Sorry for any inconvenience this might have caused.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/genderanalyzer-showdown/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Spam, huh?</title>
		<link>http://blog.uclassify.com/spam-huh/</link>
		<comments>http://blog.uclassify.com/spam-huh/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 02:46:40 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[blog]]></category>

		<category><![CDATA[classifier]]></category>

		<category><![CDATA[spam]]></category>

		<category><![CDATA[spamhuh]]></category>

		<category><![CDATA[splog]]></category>

		<category><![CDATA[uclassify]]></category>

		<guid isPermaLink="false">http://localhost/?p=11</guid>
		<description><![CDATA[We are currently working on a prototype to identify spam blogs - splogs. Spam blogs can be really tricky to identify, even for the human eye, as i-trepreneur.com writes in a recent post:
Why? These Splogs are user friendly. They were not made for search engines but for real visitors. There’s excellent design, well organized sections, [...]]]></description>
			<content:encoded><![CDATA[<p>We are currently working on a prototype to identify spam blogs - splogs. Spam blogs can be really tricky to identify, even for the human eye, as <a title="Evil Spam" href="http://i-trepreneur.com/2008/10/21/these-evil-splogs/" target="_blank">i-trepreneur.com</a> writes in a recent post:</p>
<p><em>Why? These Splogs are user friendly. They were not made for search engines but for real visitors. There’s excellent design, well organized sections, working RSS feed. All the information on such Splogs is manually selected from the most popular resources on the net and is properly referenced. Only fresh content is used so it is not identified as duplicate instantly.</em></p>
<p>The post points out that <strong>madconomist dot com</strong> and <strong>business-opportunities dot biz</strong> are two well-made splogs which people are commenting on and linking to. I can&#8217;t tell by just looking at them with my bare eyes - so is it spam, huh? More on that philosophical aspect in a later post!</p>
<h3>A prototype</h3>
<p>We have set up a prototype to identify spam blogs. Right now it&#8217;s really rudimentary but shows potential. In the future by using clusters of classifiers hosted here at uclassify we think we can create a sufficiently good splog classifier.</p>
<p>Check out the project here, <a href="http://www.spamhuh.com/" title="Spam, huh?" target="_blank">www.spamhuh.com</a>. Remember that it&#8217;s only an early prototype!</p>
<p>Concerning the two hard-to-detect spam blogs above, spamhuh.com is able to correctly identify one of them <img src='http://blog.uclassify.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><a href="http://www.spamhuh.com/" title="Spam, huh?" target="_blank">Try it out</a> and let us know what you think!!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/spam-huh/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Everybody can classify</title>
		<link>http://blog.uclassify.com/everybody-can-classify/</link>
		<comments>http://blog.uclassify.com/everybody-can-classify/#comments</comments>
		<pubDate>Mon, 20 Oct 2008 02:28:02 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[classifier]]></category>

		<category><![CDATA[classify]]></category>

		<category><![CDATA[gui]]></category>

		<category><![CDATA[interface]]></category>

		<category><![CDATA[site]]></category>

		<category><![CDATA[uclassify]]></category>

		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://localhost/?p=8</guid>
		<description><![CDATA[Creating your own classifiers has never been easier: we have developed a Click’n’Classify Graphical User Interface (GUI). This means that you can manually create and train your classifiers without knowing any programming at all. This is a very good way to test an idea. If the classifier works well, build your web site around it [...]]]></description>
			<content:encoded><![CDATA[<p>Creating your own classifiers has never been easier: we have developed a Click’n’Classify Graphical User Interface (GUI). This means that you can manually create and train your classifiers without knowing any programming at all. This is a very good way to test an idea. If the classifier works well, build your web site around it or use it for whatever purpose.</p>
<p>The GUI allows you to do everything that you can do via our <a href="http://www.uclassify.com/ApiDocumentation.aspx" title="API Documentation">Application Programming Interface (API)</a>. Also, just like phpMyAdmin shows the SQL queries, our uClassify GUI will show the XML queries so you can easily understand and use the API from your site.</p>
<h3>Features</h3>
<ul>
<li>Create and remove classifiers</li>
<li>Add and remove classes</li>
<li>Train and untrain classes</li>
<li>See basic information about your classifiers</li>
</ul>
<h3>Screenshot - Create a classifier</h3>
<p>This screenshot shows what it looks like when you are about to create a classifier. Just <a href="http://www.uclassify.com/Login.aspx" title="Log in / Sign up">log in</a> and try it yourself!</p>
<p><img style="border: 1px solid #999" src="http://blog.uclassify.com/wp-content/uploads/2008/12/createclassifier.jpg" alt="Creating a classifier is easy" /></p>
<h3>Screenshot - Training a classifier</h3>
<p>Just copy and paste the texts you want to use as training data.</p>
<p><img style="border: 1px solid #999" src="http://blog.uclassify.com/wp-content/uploads/2008/12/train.jpg" alt="Training a classifier is easy" /></p>
<p><strong>Happy classifying!</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/everybody-can-classify/feed/</wfw:commentRss>
		</item>
		<item>
		<title>More memory or smaller memories?</title>
		<link>http://blog.uclassify.com/more-memory-or-smaller-memories/</link>
		<comments>http://blog.uclassify.com/more-memory-or-smaller-memories/#comments</comments>
		<pubDate>Sun, 19 Oct 2008 02:50:26 +0000</pubDate>
		<dc:creator>Jon</dc:creator>
		
		<category><![CDATA[Optimization]]></category>

		<category><![CDATA[classification]]></category>

		<category><![CDATA[insert]]></category>

		<category><![CDATA[map]]></category>

		<category><![CDATA[memory]]></category>

		<category><![CDATA[overhead]]></category>

		<category><![CDATA[search]]></category>

		<category><![CDATA[stl]]></category>

		<category><![CDATA[string]]></category>

		<guid isPermaLink="false">http://localhost/?p=33</guid>
		<description><![CDATA[In order to build a classification server that can handle thousands of classifiers and process huge amounts of data, we were sure that we would eventually have to do some major optimizations. To avoid doing any work prematurely, we postponed all optimizations that require design changes or reduce code readability until we [...]]]></description>
			<content:encoded><![CDATA[<p>In order to build a classification server that can handle thousands of classifiers and process huge amounts of data, we were sure that we would eventually have to do some major optimizations. To avoid doing any work prematurely, we postponed all optimizations that require design changes or reduce code readability until we were absolutely sure where to improve.</p>
<p>When we ran our first test in May, our first bottleneck was obvious: the memory consumption of classifiers. It was really bad. Raw classifier data expanded by a factor of about 5 when loaded into primary memory, so a tiny classifier of 1 MB would take 5 MB as soon as it was fetched into memory. It was really easy to pinpoint the memory thieves.</p>
<h3>Partners in crime - STL strings and maps</h3>
<p>We were using STL maps to hold frequency distributions for tokens (features). Each token was mapped to its frequency, i.e. a map&lt;string, unsigned int&gt;. This is a very convenient and straightforward way to do it, but the memory overhead is not very attractive.</p>
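<p>As a point of reference, the straightforward approach described above can be sketched roughly as follows. This is only an illustration: the <code>FrequencyMap</code> typedef and the <code>countTokens</code> helper are hypothetical names, and splitting on whitespace is an assumption, not how the uClassify tokenizer necessarily works.</p>

```cpp
#include <map>
#include <sstream>
#include <string>

// Token -> frequency, the map<string, unsigned int> layout from the post.
typedef std::map<std::string, unsigned int> FrequencyMap;

// Hypothetical helper: split the text on whitespace and count each token.
FrequencyMap countTokens(const std::string& text) {
    FrequencyMap freq;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        ++freq[token];  // operator[] inserts a zero count for unseen tokens
    }
    return freq;
}
```

<p>Every distinct token becomes its own tree node holding a separately managed string, which is where the overhead discussed below comes from.</p>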
<h4>VS2005 STL string memory overhead</h4>
<p>The actual sizes of types vary between platforms and STL implementations (these numbers are from the STL that comes with VS2005 on 32 bit Windows XP).</p>
<p>Each string takes at least 32 bytes:<br />
<strong>size_type _Mysize</strong> = 4 bytes (string size)<br />
<strong>size_type _Myres</strong> = 4 bytes (reserve size)<br />
<strong>_Elem _Buf</strong> = 16 bytes (internal buffer for strings shorter than 16 bytes)<br />
<strong>_Elem* _Ptr</strong> = 4 bytes (pointer to strings that don’t fit in the buffer)<br />
<strong>this*</strong> = 4 bytes (this pointer)</p>
<p>The best-case overhead for STL strings is 16 bytes, when the internal buffer is filled exactly. The worst case is an empty string or a string longer than 15 bytes, which gives an overhead of 32 bytes. String overhead therefore varies from <strong>16 to 32 bytes</strong>.</p>
<h4>VS2005 STL map memory overhead</h4>
<p>Each entry in a map consists of an STL pair: the key and the value (first and second). A pair only has the memory overhead of the this pointer (4 bytes), plus whatever it inherits from the types it’s composed of. However, the map is a red-black tree and consists of linked nodes. Each pair is stored in a node, and nodes have quite a heavy memory overhead:</p>
<p><strong>_Genptr _Left</strong> = 4 bytes (points to the left subtree)<br />
<strong>_Genptr _Parent</strong> = 4 bytes (pointer to parent)<br />
<strong>_Genptr _Right</strong> = 4 bytes (points to the right subtree)<br />
<strong>char _Color</strong> = 1 byte (the color of the node)<br />
<strong>char _Isnil</strong> = 1 byte (true if node is head)<br />
<strong>this*</strong> = 4 bytes (this pointer)</p>
<p>So there is an 18-byte overhead per node and 4 bytes per pair, which sums to <strong>22 bytes</strong>.</p>
<h3>Strings in maps</h3>
<p>Inserting a string shorter than 16 bytes into a map&lt;string, unsigned int&gt; will therefore consume 32+22+4=58 bytes. It could be even more if memory alignment kicks in for any of the allocations. In most cases this is perfectly fine and not even worth optimizing. In our case, a memory overhead factor of 5 was not acceptable. Our language classifier takes about 14 MB on disk and should not take much more when loaded into memory, yet it blew up to about 65 MB. As it consists of 43 languages with probably around 30000 unique words per class (language), it gets really bloated.</p>
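<p>The arithmetic above can be written out as a small back-of-the-envelope calculation. The constants are simply the figures quoted in this post for VS2005 on 32-bit Windows XP, not measurements, and the function names are made up for illustration.</p>

```cpp
#include <cstddef>

// Figures quoted in the post (VS2005, 32-bit Windows XP); not measured here.
const std::size_t kStringBytes       = 32;  // std::string holding a short token
const std::size_t kNodeOverheadBytes = 18;  // tree node bookkeeping
const std::size_t kPairOverheadBytes = 4;   // overhead attributed to the pair
const std::size_t kValueBytes        = 4;   // unsigned int frequency

// Bytes consumed by one short token stored in map<string, unsigned int>.
std::size_t bytesPerEntry() {
    return kStringBytes + kNodeOverheadBytes + kPairOverheadBytes + kValueBytes;
}

// Rough total for a classifier with the given number of classes and tokens per class.
std::size_t estimateMapBytes(std::size_t classes, std::size_t tokensPerClass) {
    return classes * tokensPerClass * bytesPerEntry();
}
```

<p>With 43 classes and a guess of 30000 tokens per class, <code>estimateMapBytes(43, 30000)</code> gives 74,820,000 bytes. Since the token count is only a guess, that is best read as being in the same ballpark as the roughly 65 MB we observed.</p>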
<h3>One solution</h3>
<p>We needed to maintain the search and insertion speed of maps (time complexity O(log n)) but get rid of the overhead. Insertions are needed when classifiers are trained.</p>
<h4>Maintaining search speed</h4>
<p>Since we had already limited features to a maximum length of 32 bytes, we could use that information to create what we call memory lanes. A memory lane consists only of tokens of the same size, each followed by its frequency. In that manner we created 32 lanes: lane 1 with all tokens of size 1, lane 2 with all tokens of size 2, and so on. Tokens in memory lanes are sorted so we can use binary search.</p>
<p>Memory lane 1 could look like this (tokens of size 1 followed by the frequency)<br />
a0031i0018y0003<br />
…<br />
and memory lane 3 like this<br />
can0011far0004the0019zoo0001</p>
<p>By doing so we get rid of all the overhead while maintaining search at O(log n).</p>
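<p>A minimal sketch of one memory lane with binary search might look like the following. It is an illustration under assumptions, not our production code: the <code>MemoryLane</code> class and its method names are hypothetical, the frequency is stored as a raw <code>unsigned int</code> right after the token bytes, and <code>appendSorted</code> trusts the caller to add tokens in ascending byte order.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// One lane holds every token of a single length, sorted, each immediately
// followed by its frequency. No per-token pointers or node headers.
class MemoryLane {
public:
    explicit MemoryLane(std::size_t tokenLen) : tokenLen_(tokenLen) {}

    // Append one record. Assumption: tokens arrive in ascending byte order.
    void appendSorted(const std::string& token, unsigned int freq) {
        assert(token.size() == tokenLen_);
        data_.insert(data_.end(), token.begin(), token.end());
        const char* p = reinterpret_cast<const char*>(&freq);
        data_.insert(data_.end(), p, p + sizeof(freq));
    }

    // Number of records in the lane.
    std::size_t size() const { return data_.size() / recordSize(); }

    // Binary search over fixed-size records; returns 0 if the token is absent.
    unsigned int frequency(const std::string& token) const {
        if (token.size() != tokenLen_) return 0;
        std::size_t lo = 0, hi = size();
        while (lo < hi) {
            std::size_t mid = lo + (hi - lo) / 2;
            const char* rec = &data_[mid * recordSize()];
            int cmp = std::memcmp(rec, token.data(), tokenLen_);
            if (cmp == 0) {
                unsigned int freq;
                std::memcpy(&freq, rec + tokenLen_, sizeof(freq));
                return freq;
            }
            if (cmp < 0) lo = mid + 1; else hi = mid;
        }
        return 0;
    }

private:
    std::size_t recordSize() const { return tokenLen_ + sizeof(unsigned int); }

    std::size_t tokenLen_;
    std::vector<char> data_;  // packed records: token bytes, then frequency
};
```

<p>Because every record in a lane has the same size, the position of record <em>i</em> is just <em>i</em> times the record size, which is what makes binary search possible without any index structure.</p>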
<h4>Maintaining insertion speed (almost)</h4>
<p>Maps allow fast insertions in O(log n), so we kept an intermediate map for each memory lane. When a classifier is trained, new tokens go into the map, and the frequency of those that already exist in the memory lane is increased. When the training session is over, the intermediate maps are merged into their respective memory lanes. This can be done in O(n) and is the major penalty. Note that explicit sorting is never required since maps are ordered. Another penalty occurs when both the map and the memory lane are filled with tokens: at this point two lookups can happen (first in the memory lane and, if the token doesn’t exist there, a search through the map).</p>
<p>This solution reduced memory consumption by a factor of 4-5, at the cost of having to merge new training data into the memory lanes every now and then. This is perfectly fine for us, since training tends to decrease over time (once the training data is good enough) while classification requests increase.</p>
<p>A similar optimization for Java is described on the <a title="Optimizing Feature Extraction" href="http://lingpipe-blog.com/2008/10/09/optimizing-feature-extraction/" target="_blank">LingPipe blog</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.uclassify.com/more-memory-or-smaller-memories/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
