NGramJ, smart ngram algorithms, What is NGramJ?

		2007-03-19 09:47:52 v1.0
	NGramJ, smart scanning for document properties. What is NGramJ?


What is NGramJ?


Ngrams, sequences of n consecutive items such as bytes or characters, are a classic instrument in Natural Language Processing (NLP) applications.

NGramJ is a Java-based library containing two types of ngram-based applications. Its major focus is to provide robust, state-of-the-art language recognition (or language guessing, as some more accurately call it). Both types are meant to be embedded into larger applications.

Language recognition is not the only NLP application of ngrams, and NGramJ can be used as a building block in many different kinds of applications. However, language recognition was my major application, and NGramJ is therefore somewhat streamlined for it.

NGramJ: This uses ngrams of bytes to determine both the language and the encoding of a byte sequence. In symbols: NGramJ : byte[] --> (Language, Encoding)
CNgram: This uses ngrams of characters to determine the language of a character sequence. In symbols: CNgram : char[] --> Language
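
To make the difference between the two inputs concrete, here is a minimal sketch of counting ngrams over characters and over bytes. It uses only the JDK and is not NGramJ's API: the class NgramSketch and its methods charNgrams and byteNgrams are hypothetical names chosen for illustration. It only shows what the raw material of each approach looks like, not how NGramJ or CNgram turn it into a language guess.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical helper class, not part of NGramJ.
class NgramSketch {

    /** Counts every character ngram of length n in the given text. */
    static Map<String, Integer> charNgrams(CharSequence text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.subSequence(i, i + n).toString(), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts every byte ngram of length n; each ngram is keyed by its hex form. */
    static Map<String, Integer> byteNgrams(byte[] data, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= data.length; i++) {
            StringBuilder key = new StringBuilder();
            for (int j = i; j < i + n; j++) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                key.append(String.format("%02x", data[j] & 0xff));
            }
            counts.merge(key.toString(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Character bigrams depend only on the characters themselves.
        System.out.println(charNgrams("banana", 2));

        // The same character produces different bytes under different encodings,
        // which is why byte ngrams carry information about the encoding as well.
        String eAcute = "\u00e9";
        System.out.println(byteNgrams(eAcute.getBytes(StandardCharsets.UTF_8), 2));
        System.out.println(byteNgrams(eAcute.getBytes(StandardCharsets.ISO_8859_1), 1));
    }
}
```

The byte version sees the encoding of the text as part of its input, while the character version never does, which matches the two signatures given above.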

If you want to apply ngrams to files: NGramJ is the right choice if you do not know which encoding the files use. If, on the other hand, you know the encoding, it is better to read the file explicitly with that encoding and apply CNgram afterward.
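
When the encoding is known, the decoding step might look like the following sketch. It relies only on standard JDK calls; the class KnownEncodingSketch and the method readAs are hypothetical names, and the call into CNgram itself is deliberately left out because its API is not described here.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: hypothetical helper class, not part of NGramJ.
class KnownEncodingSketch {

    /** Reads the whole file and decodes it with the encoding the caller already knows. */
    static String readAs(Path file, Charset encoding) throws IOException {
        return new String(Files.readAllBytes(file), encoding);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("ngram-sketch", ".txt");
        try {
            // Write a small ISO-8859-1 sample so the example is self-contained.
            Files.write(file, "Gr\u00fc\u00dfe aus K\u00f6ln".getBytes(StandardCharsets.ISO_8859_1));

            String text = readAs(file, StandardCharsets.ISO_8859_1);

            // 'text' is now a plain character sequence; this is the point at which
            // a character-based recognizer such as CNgram would take over.
            System.out.println(text.length() + " characters decoded");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Decoding first means the recognizer only has to decide the language, not the encoding as well.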

Once you are inside a program and dealing with Strings and other kinds of character sequences, CNgram is the only reasonable way to go.

The CNgram library has been developed with multithreading and performance requirements in mind. CNgram also has a language recognition mechanism which (to some extent) successfully recognizes mixed-language documents.

Caution: For historical reasons, the name NGramJ sometimes refers only to the (older) byte-based ngrams, excluding the newer CNgram addition. I'm sorry about the confusion.

There are alternative Java implementations of n-Grams.