When you classify texts you get back class probabilities. Sometimes it’s hard to know what those probabilities are based on, so I’ve added a new score called ‘text coverage’.
Text coverage is the proportion of words in the text to classify that are found in the training data. This helps users determine how trustworthy the probabilities are. For example, suppose you send a text with 10,000 words to the language classifier and get back a high English probability but a low text coverage (say 0.01). This means that only 100 of the 10,000 words were recognized by the language classifier. A reasonable cause could be that the text is written in an unknown language but contains some English words (quotations, borrowed words, etc.). It’s up to the user to decide how to handle this. Sometimes low text coverage scores are fine; it’s highly dependent on the domain.
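To make the definition concrete, here is a minimal sketch of the ratio described above. It assumes a simple whitespace-style word list and a plain set as the training vocabulary; the classifier's actual tokenization and vocabulary lookup may well differ.

```python
def text_coverage(words, known_vocabulary):
    """Proportion of words that appear in the training vocabulary."""
    if not words:
        return 0.0
    recognized = sum(1 for word in words if word in known_vocabulary)
    return recognized / len(words)

# Toy example: 3 of 5 words are known, so coverage is 0.6.
vocab = {"the", "cat", "sat"}
print(text_coverage(["the", "cat", "sat", "på", "mattan"], vocab))  # 0.6
```

With 10,000 words of which only 100 are recognized, this ratio comes out as 100 / 10000 = 0.01, matching the example above.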
The text coverage can be found as an attribute called ‘textCoverage’ in the <classification> tag.
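Reading the attribute could look something like the sketch below. The `<classification>` tag and the `textCoverage` attribute come from the description above; the rest of the response snippet (the `<class>` element and its attributes) and the 0.5 warning threshold are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; only <classification> and textCoverage
# are confirmed names, the inner element is made up for the example.
response = """
<classification textCoverage="0.01">
  <class name="en" probability="0.97"/>
</classification>
"""

root = ET.fromstring(response)
coverage = float(root.attrib["textCoverage"])
if coverage < 0.5:  # the right threshold is domain-dependent
    print(f"low text coverage ({coverage}); probabilities may be unreliable")
```

In practice you would pick the threshold (if any) based on your own domain, as noted above.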
Let me know if you have any questions about this.