References
-
http://odur.let.rug.nl/~vannoord/TextCat/
-
The original text_cat Perl version, which works on byte-based n-grams like the byte-based NGramJ. The .lm profiles that NGramJ uses for language/encoding evaluation are mostly taken from the text_cat sources. The page also links to many more interesting things!
-
http://www.nonlineardynamics.com/trenkle/papers/sdair-94-bc.ps.gz
-
Part of the theoretical basis of this program.
-
Comparing Two Language Identification Schemes.
-
(PDF - PS) In the Proceedings of the 3rd International Conference on the Statistical Analysis of Textual Data (JADT'95), Rome, Italy, Dec. 1995.
-
Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text,
-
(PDF) Kenneth R. Beesley, 1988. Published in Proceedings of the 29th Annual Conference of the American Translators Association, 12-16 October 1988, pp. 47-54.
-
Nutch
-
Nutch contains a language identification algorithm that falls somewhere between the byte-based NGramJ and CNgram. Note that CNgram can read Nutch language reference profiles, because the CNgram format is backward compatible. In fact, some CNgram language profiles were taken from the Nutch project.
-
W3C about n_grams
-
Hmm, we didn't know there was a standard; we read a proprietary format ... yet.