NGramJ, smart ngram algorithms, Design Decisions

	Developer Information > Design Decisions	2007-03-19 09:47:52 v1.0
	NGramJ, smart scanning for document properties. Design Decisions

Download NgramJ | Sourceforge Project Summary | NgramJ Online

Developer Information

Files and Directories

Building NGramJ

Javadocs

Changelog

License

Design Decisions

Other Information

Shortcuts:
References

I document this, cause this is IMHO the hardest thing to determine ad hoc given a bunch of more or less quality code.

Good performance

I use to be a performance geek. (byte) NGramJ wasn't that bad in it times, but its basic design is still Java 1.1.8 some decisions are possibly to revisit after such a long time.

top

Competitive Scoring

CNgram contains a new mechanism to be efficient at task which require many documents to be matched against a set of profiles.

This technique is based on (slightly costly) preprocessing of a set of profiles. Therefore the commandline version is actually slightly slower. But after preprocessing new texts can be classified without additional memory and without actually ever creating any ngram objects. The algorithm immediately competitively scores the text against the given sample profiles. It is needs time O(textlength * number of languages). With a relatively small proportional factor. Whereas the classical method to first convert a given a text into a profile, costs memory proprotional to the different ngrams in the text. And only afterward this profile has to be matched against prescribed profiles which costs time depending on the number of profiles and the number of total profiles involved. If you have very many sample languages and very short texts, the classical method might be superior to the new one.

While the new technique was mainly implemented to achieve high performance in tasks where many documents have to be language classified, it turns out that it can provide different, probably better results:

To avoid memory allocations all ngram scoring has to be "on the fly" based on a currently seen short segment of text. This restrictrion actually allows for some "context" to creep into evaluation. That is a actual ngram can score different depending on preceeding ngrams. This does actually happen. Look at the word "question". When it is analyzed alone it recognized as being French with some tendencies to Italian, Spanish and English. However analyzed as part of the phrase "To be or not to be, this is the question" the preceeding strong English context scores it as being English with a littlebit of French.

Note the new technique of profile preprocessing could be easily ported back to (byte) NGramJ, but this hasn't been done yet.

top

Ternary Search Tries reimplemented

The competitive sampling procedure relies on a fast implementation of ternary search tries. This trie is "reentrant" in the sense, that you can proceed descending the tree after you found a match.

E.g. say we are parsing the String quest and we are currently proceeding the t. Then we have to look at the following ngrams in sequence: the 1-gram t, then the 2-gram st, then the 3-gram est. In a search trie of reversed ngrams we can first determine the leaf for t and then proceed into the subtrees for first ts and second tse, thus the lookup of all ngrams ending up with t can be done by only one traversal down the search trie.

top

No GUI

This is a library and the executables are mere demos and samples how to embed the library.