References (2007-03-19 09:47:52, v1.0)
NGramJ, smart scanning for document properties.


The original text_cat Perl version, which works with byte-based n-grams like (byte) NGramJ. The .lm profiles that NGramJ uses for language/encoding evaluation are mostly taken from the text_cat sources. The text_cat page also links to many more interesting things!
Part of the theoretical basis of this program.
Comparing Two Language Identification Schemes.
(PDF - PS) In the Proceedings of the 3rd International Conference on the Statistical Analysis of Textual Data (JADT'95), Rome, Italy, Dec. 1995.
Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text,
(PDF) Kenneth R. Beesley, 1988. Published in Proceedings of the 29th Annual Conference of the American Translators Association, 12-16 October 1988, pp. 47-54.
Nutch contains a language identification algorithm that sits somewhere between (byte) NGramJ and CNgram. Note that CNgram can read Nutch language reference profiles, because the CNgram format is backward compatible. In fact, some CNgram language profiles have been taken from the Nutch project.
W3C about n_grams
Hmm, we didn't know there was a standard; we read a proprietary format ... yet.
