|Developer Information > Design Decisions||2007-03-19 09:47:52 v1.0|
|NGramJ, smart scanning for document properties.
|Download NgramJ | Sourceforge Project Summary | NgramJ Online|
What is NGramJ?
How Does it Work?
How to Contribute?
Files and Directories
I document this, cause this is IMHO the hardest thing to determine ad hoc given a bunch of more or less quality code.
I use to be a performance geek. (byte) NGramJ wasn't that bad in it times, but its basic design is still Java 1.1.8 some decisions are possibly to revisit after such a long time.
CNgram contains a new mechanism to be efficient at task which require many documents to be matched against a set of profiles.
This technique is based on (slightly costly) preprocessing of a set of profiles. Therefore the commandline version is actually slightly slower. But after preprocessing new texts can be classified without additional memory and without actually ever creating any ngram objects. The algorithm immediately competitively scores the text against the given sample profiles. It is needs time O(textlength * number of languages). With a relatively small proportional factor. Whereas the classical method to first convert a given a text into a profile, costs memory proprotional to the different ngrams in the text. And only afterward this profile has to be matched against prescribed profiles which costs time depending on the number of profiles and the number of total profiles involved. If you have very many sample languages and very short texts, the classical method might be superior to the new one.
While the new technique was mainly implemented to achieve high performance in tasks where many documents have to be language classified, it turns out that it can provide different, probably better results:
To avoid memory allocations all ngram scoring has to be "on the fly" based on a currently seen short segment of text. This restrictrion actually allows for some "context" to creep into evaluation. That is a actual ngram can score different depending on preceeding ngrams. This does actually happen. Look at the word "question". When it is analyzed alone it recognized as being French with some tendencies to Italian, Spanish and English. However analyzed as part of the phrase "To be or not to be, this is the question" the preceeding strong English context scores it as being English with a littlebit of French.
Note the new technique of profile preprocessing could be easily ported back to (byte) NGramJ, but this hasn't been done yet.
Ternary Search Tries reimplemented
The competitive sampling procedure relies on a fast implementation of ternary search tries. This trie is "reentrant" in the sense, that you can proceed descending the tree after you found a match.
E.g. say we are parsing the String
This is a library and the executables are mere demos and samples how to embed the library.
|A Spieleck Project||top|