NGramJ: How Does it Work?
2007-03-19 09:47:52 1.0

How Does it Work?

General Setting

An ngram is a (short) sequence of atoms such as bytes, characters, or words. In this setting we only care about bytes or characters, even though ngrams of words are a more recent technology.

As it turns out, many properties of an underlying text (for example language, style, and even thematic focus) have a statistically stable impact on the ngram profile of that text. The ngram profile is the (statistical) distribution of ngrams, that is, how often each ngram appears in a given sequence.
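To make the notion of a profile concrete, here is a minimal sketch that counts character ngrams and turns the counts into relative frequencies. It is written in Python for brevity and is not NGramJ's Java API; the function name `ngram_profile` and the choice of relative frequencies (rather than raw counts or ranks) are assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text: str, n: int) -> dict[str, float]:
    """Return the relative frequency of every character ngram of length n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the text and count each ngram.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # Normalise counts to relative frequencies; too-short texts give {}.
    return {gram: count / total for gram, count in counts.items()} if total else {}


# "banana" contains the bigrams ba, an, na, an, na.
profile = ngram_profile("banana", 2)
```

Here `profile` maps each of the three distinct bigrams to its share of the five bigram occurrences, so the values sum to 1.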

Once you know that you are hunting for a set of properties that is well reflected by a set of corresponding profiles, you can set up an automated search for these properties with the following steps.

  1. For each property create a reference profile from sample texts.
  2. For a text with unknown property, determine its text profile and check whether it is close to a reference profile.
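The two steps above can be sketched as follows. This is only an illustration in Python, not NGramJ code: the helper names, the toy sample texts, and in particular the closeness measure (the sum of absolute differences between relative bigram frequencies) are assumptions. Real reference profiles would be built from far larger samples.

```python
from collections import Counter


def profile(text: str, n: int = 2) -> dict[str, float]:
    """Relative frequencies of the character ngrams of length n in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Assumed closeness measure: sum of absolute frequency differences."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in p.keys() | q.keys())


def closest_property(text: str, references: dict[str, dict[str, float]]) -> str:
    """Step 2: profile the unknown text and pick the nearest reference."""
    unknown = profile(text)
    return min(references, key=lambda name: distance(unknown, references[name]))


# Step 1: one reference profile per property, built from (toy) sample texts.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}
references = {name: profile(text) for name, text in samples.items()}

guess = closest_property("the dog and the fox", references)
```

With these toy samples the unknown text shares most of its bigrams with the English sample, so `guess` comes out as `"english"`.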

The above description is far from being an algorithm. In practice, you first have to investigate carefully whether a property is reflected by ngram profiles at all. You have to set up reference profiles from reference suites of texts with that property, and these suites should be as large as possible. And you have to make precise what you mean by closeness of two ngram profiles.
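One well-known way to make closeness precise is the rank-based "out-of-place" measure from the text categorization literature: sort the ngrams of each profile by frequency and add up how far each ngram's rank is displaced between the two lists. The sketch below, in Python, assumes this measure purely for illustration; this page does not say that NGramJ uses it, and the function names and the `top` cutoff of 300 are hypothetical.

```python
from collections import Counter


def ranked_ngrams(text: str, n: int = 2, top: int = 300) -> list[str]:
    """Return the `top` most frequent character ngrams, most frequent first."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ordered[:top]]


def out_of_place(document: list[str], reference: list[str]) -> int:
    """Sum, over the document's ngrams, of how far each rank is displaced."""
    position = {gram: rank for rank, gram in enumerate(reference)}
    max_penalty = len(reference)  # charged for ngrams missing from the reference
    return sum(
        abs(rank - position[gram]) if gram in position else max_penalty
        for rank, gram in enumerate(document)
    )
```

A distance of 0 means both rankings agree exactly; larger values mean the document's frequent ngrams sit at very different ranks in the reference, or are missing from it.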

NGramJ only cares about byte- or character-based ngrams. While there are other applications, the major one is recognizing the language of a document. This is a somewhat easier application of ngrams, for several reasons. To name two:

The CNgram part of NGramJ, however, uses a somewhat more elaborate notion of competitive ngram closeness, which in the end assigns language percentages to a piece of text. This measure is good enough that not only texts in one language but also texts with two major languages have been classified successfully with NGramJ!
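This page does not spell out how CNgram computes its percentages. The Python sketch below is therefore not CNgram's method; it only illustrates how letting reference profiles compete for each ngram can yield per-language percentages, including for mixed text. The voting rule, the function name `language_shares`, and the toy profiles `lang_a` and `lang_b` are all assumptions.

```python
from collections import Counter


def language_shares(
    text: str, references: dict[str, dict[str, float]], n: int = 2
) -> dict[str, float]:
    """Assumed rule: each ngram occurrence votes for the reference profile in
    which it is most frequent; ties split the vote. Returns percentages."""
    if not references:
        return {}
    votes: Counter = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        scores = {name: ref.get(gram, 0.0) for name, ref in references.items()}
        best = max(scores.values())
        if best == 0.0:
            continue  # ngram unknown to every reference: no vote
        winners = [name for name, score in scores.items() if score == best]
        for name in winners:
            votes[name] += 1.0 / len(winners)
    total = sum(votes.values())
    return {name: (100.0 * votes[name] / total if total else 0.0) for name in references}


# Hypothetical hand-written reference profiles for two made-up languages.
toy_references = {
    "lang_a": {"aa": 0.9, "ab": 0.1},
    "lang_b": {"bb": 0.9, "ab": 0.1},
}
shares = language_shares("aaaabb", toy_references)
```

In this toy run the three `aa` bigrams vote for `lang_a`, the single `bb` votes for `lang_b`, and the tied `ab` is split, so a text that mixes both "languages" receives a share for each.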
