An ngram is a (short) sequence of atoms such as bytes, characters, or words. In this setting we only care about bytes and characters, although ngrams of words are a more recent development.
As it turns out, many properties of an underlying text (for example language or style, but even thematic focus) have a statistically stable impact on the ngram profile of that text. The ngram profile is the (statistical) distribution of ngrams, that is, how often each ngram appears in a given sequence.
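To make the notion of an ngram profile concrete, here is a minimal sketch (not NGramJ's actual code) that counts character trigrams in a string:

```java
import java.util.HashMap;
import java.util.Map;

public class NgramProfile {
    // Count how often each character ngram of length n occurs in the text.
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> p = profile("the theme", 3);
        // "the" occurs twice: once at the start and once inside "theme".
        System.out.println(p.get("the")); // 2
    }
}
```

The map of counts is the (unnormalized) ngram profile; dividing each count by the total number of ngrams yields the statistical distribution mentioned above.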
Once you know you are hunting for a set of properties that is well reflected by a set of corresponding profiles, you can set up an automated search for these properties with the following steps:

1. Build a reference profile for each property from a suite of texts known to have that property.
2. Compute the ngram profile of the text under examination.
3. Assign the text the property whose reference profile is closest to its own profile.
The above description is far from being an algorithm. In fact, careful investigation is needed to establish whether a property is reflected in ngram profiles at all. You have to build reference profiles from (as large as possible) reference suites of texts with that property. And you have to make precise what you mean by the closeness of two ngram profiles.
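One simple way to make "closeness" precise is cosine similarity between the ngram count vectors of two profiles. This is only an illustrative sketch under that assumption; it is not the measure NGramJ itself uses:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ProfileDistance {
    // Build a character ngram profile (same idea as an ngram frequency table).
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two ngram count vectors:
    // 1.0 means the profiles point in the same direction, 0.0 means no shared ngrams.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());
        double dot = 0;
        for (String k : shared) dot += a.get(k) * (double) b.get(k);
        double na = 0, nb = 0;
        for (int v : a.values()) na += (double) v * v;
        for (int v : b.values()) nb += (double) v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> reference = profile("the quick brown fox jumps over the lazy dog", 3);
        Map<String, Integer> sample    = profile("the dog jumps", 3);
        Map<String, Integer> unrelated = profile("zzzzqqqq", 3);
        // The sample sharing trigrams with the reference scores higher.
        System.out.println(cosine(reference, sample) > cosine(reference, unrelated)); // true
    }
}
```

With a measure like this in hand, classification reduces to picking the reference profile with the highest similarity to the text's profile.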
NGramJ only cares about byte- and character-based ngrams. While there are other applications, the major one is the recognition of the language of a document. This is a comparatively easy ngram application, for several reasons. To name two: the language of a text has a very strong and robust influence on its character ngram distribution, and reference texts for building language profiles are easy to obtain in large quantities.
The CNgram part of NGramJ, however, uses a somewhat more elaborate notion of competitive ngram closeness which, in the end, assigns language percentages to a piece of text. This measure is good enough that not only monolingual texts but also texts with two major languages have been successfully classified with NGramJ!