A text classifier places documents into their relevant classes (categories), for example by putting spam in the spam folder or filing web pages about Artificial Intelligence under the AI category. There are different types of text classifiers; the one I will be addressing here is a machine learning one!
Training
To make the classifier understand where documents should go, you must first train it. To train it, you manually set up two or more classes (e.g. spam and legitimate) and describe each class by showing the classifier typical documents. In the case of a spam classifier you would train the classifier on spam and legitimate documents, basically saying to Mrs. Classifier “Hey, look at this bunch of documents, they are all spam!” and then showing her the legitimate documents: “and these are legitimate!”
By doing so the classifier learns the characteristics of each class. This is called supervised training, because you supply the correct class for every training document. The training documents are often referred to as the training corpus.
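To make the idea of "learning characteristics" a bit more concrete, here is a minimal sketch in Python. It assumes the characteristics are simply word counts per class, which is only one possible choice and not necessarily what any particular classifier does. The names `tokenize`, `train` and `training_corpus` are made up for this illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(corpus):
    """Count words per class from (class_name, document) pairs.

    Assumption: word counts stand in for the 'characteristics' of a class.
    """
    word_counts = defaultdict(Counter)  # class -> word -> occurrences
    doc_counts = Counter()              # class -> number of training documents
    for label, document in corpus:
        word_counts[label].update(tokenize(document))
        doc_counts[label] += 1
    return word_counts, doc_counts


# A tiny, invented training corpus with two classes.
training_corpus = [
    ("spam", "Win money now"),
    ("spam", "Cheap money offer"),
    ("legitimate", "Meeting agenda for Monday"),
    ("legitimate", "Lunch on Monday"),
]

word_counts, doc_counts = train(training_corpus)
print(word_counts["spam"]["money"])         # 2
print(word_counts["legitimate"]["monday"])  # 2
print(word_counts["spam"]["monday"])        # 0
```

After training, "money" is characteristic of the spam class and "monday" of the legitimate class, which is exactly the kind of evidence a classifier can use later on.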
Classifying
Once a classifier has been trained, it can be used to find out to which of the predefined classes a previously unseen document most likely belongs. You ask Mrs. Classifier something like “To which of the classes (I have trained you on) is this document most likely to belong?” She would then kindly answer something like “I am 96% certain that it should go into the spam folder.”
It’s not necessary to stop training a classifier when you start classifying. Training and classifying can take place at the same time.
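Here is a rough sketch of how asking that question could look in Python, including some training in between two classifications. It uses a small naive Bayes model with add-one smoothing purely for illustration; this text does not say which algorithm the classifier actually uses, and `TinyClassifier` with its `train` and `classify` methods is a hypothetical stand-in, not the XML API.

```python
import math
from collections import Counter, defaultdict


class TinyClassifier:
    """Hypothetical multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class -> word -> occurrences
        self.doc_counts = Counter()              # class -> number of training documents

    def train(self, label, document):
        """Add one labelled document to the training corpus."""
        self.word_counts[label].update(document.lower().split())
        self.doc_counts[label] += 1

    def classify(self, document):
        """Return {class: probability} for a previously unseen document."""
        if not self.doc_counts:
            return {}  # nothing has been trained yet
        words = document.lower().split()
        vocabulary = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Class prior: the share of training documents in this class.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
            log_scores[label] = score
        # Turn log scores into probabilities that sum to 1.
        highest = max(log_scores.values())
        exp_scores = {label: math.exp(s - highest) for label, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {label: s / norm for label, s in exp_scores.items()}


classifier = TinyClassifier()
classifier.train("spam", "win money now")
classifier.train("legitimate", "meeting agenda for monday")

first = classifier.classify("cheap money")
best = max(first, key=first.get)
print(best, round(first[best], 2))

# Training continues after classifying has started.
classifier.train("spam", "cheap money offer")
updated = classifier.classify("cheap money")
print(round(updated["spam"], 2))
```

The second answer is more confident about spam than the first, because the classifier saw one more spam document between the two questions. Nothing had to be reset or rebuilt to get there.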
Using our XML API you can communicate with “Mrs. Classifier”!