lc4j, a language categorization Java library

On this page you will find the "language categorization library" for the Java programming language.
You can download the library (with source code) directly here.

Requirements

Any system with JDK 1.4.1 or later should be able to compile and run this library.
However, please note that the following libraries are needed for this application to compile/run:

and they must be placed in your $CLASSPATH.

Features

lc4j has been designed to be a compact, fast and scalable Java library that implements the algorithms described in:


Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization"
In Proceedings of Third Annual Symposium on Document Analysis and
Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics,
pp. 161-175, 11-13 April 1994.

(Citeseer entry)

to categorize texts using n-grams. A directory with sample language models is provided to implement a language guesser from scratch.
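To make the idea concrete, here is a minimal sketch of the ranking-and-distance scheme described in the paper: build a frequency-ranked list of n-grams for a text, then compare it with a language model by summing rank differences (the "out-of-place" measure). This is not lc4j's actual code. The class and method names are hypothetical, the tokenization rule and the penalty for missing n-grams are assumptions, and the sketch needs Java 8 or later rather than the JDK 1.4.1 the library itself targets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative n-gram ranking sketch; the names here are hypothetical, not lc4j's API. */
class NGramSketch {

    /** Ranks the 1- to 5-grams of the text by frequency and keeps the top maxSize. */
    static List<String> profile(String text, int maxSize) {
        final Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter or apostrophe (an assumption of this sketch).
        for (String token : text.split("[^\\p{L}']+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Pad with '_' so that word boundaries take part in the n-grams.
            String padded = "_" + token + "_";
            for (int n = 1; n <= 5; n++) {
                for (int i = 0; i + n <= padded.length(); i++) {
                    counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Most frequent first; ties are broken alphabetically to keep the result deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(maxSize, ranked.size())));
    }

    /** Out-of-place distance: sum of rank differences, with a fixed penalty for missing n-grams. */
    static int distance(List<String> document, List<String> model) {
        Map<String, Integer> modelRank = new HashMap<>();
        for (int i = 0; i < model.size(); i++) {
            modelRank.put(model.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer rank = modelRank.get(document.get(i));
            // Assumption: an n-gram absent from the model costs the model's full length.
            total += (rank == null) ? model.size() : Math.abs(rank - i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy "models" built from one sentence each; real models need far more text.
        List<String> english = profile("the quick brown fox jumps over the lazy dog and the cat", 300);
        List<String> italian = profile("la volpe veloce salta sopra il cane pigro e il gatto", 300);
        List<String> sample = profile("the dog and the fox", 300);
        System.out.println("distance to english: " + distance(sample, english));
        System.out.println("distance to italian: " + distance(sample, italian));
    }
}
```

The model with the smallest distance is the best guess. With models this small the numbers are only indicative.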

The idea behind this program and most of the sample texts have their roots in TextCat, a free Perl library which implements the text categorization algorithm.

Examples

Let’s say that you want to determine the language of this phrase:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Morbi id eros
You just fire up lc4j (yes, it is a library but also has a main) with the one-line command:
echo "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Morbi id eros" | java net.olivo.lc4j.LanguageCategorization

and the program will reply in a way similar to this:

time taken to load all available language models: 0.371s
time taken to effectively determine the language: 0.027s
probable language(s): [latin.lm]

If more than one language matches, the other languages are also reported in the list, in decreasing order of probability. If the number of matching languages exceeds the configured maximum, an "UNKNOWN" language is reported instead. This limit and other settings can be changed with command-line parameters.
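The reporting rule can be sketched as follows. This is only an illustration under stated assumptions: the page does not say how lc4j decides that a language "matches", so the sketch assumes a language matches when its distance is within a given ratio of the best one. The class name, the method, and the values 1.05 and 10 are all hypothetical, and the code needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical candidate selection; not taken from lc4j's source. */
class CandidateSelection {

    /**
     * Returns the languages whose distance is within ratio * best distance,
     * closest first, or a single "UNKNOWN" entry if there are too many of them.
     */
    static List<String> select(Map<String, Integer> distances, double ratio, int maxCandidates) {
        if (distances.isEmpty()) {
            return Collections.singletonList("UNKNOWN");
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(distances.entrySet());
        // Smaller distance means more probable, so sort in ascending order.
        sorted.sort(Map.Entry.comparingByValue());
        double limit = sorted.get(0).getValue() * ratio;
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : sorted) {
            if (entry.getValue() <= limit) {
                matches.add(entry.getKey());
            }
        }
        // Too many near-ties: the text is treated as not recognizable.
        if (matches.size() > maxCandidates) {
            return Collections.singletonList("UNKNOWN");
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Integer> distances = new LinkedHashMap<>();
        distances.put("english.lm", 200); // made-up distances for illustration
        distances.put("latin.lm", 100);
        distances.put("italian.lm", 103);
        System.out.println(select(distances, 1.05, 10)); // prints [latin.lm, italian.lm]
        System.out.println(select(distances, 1.05, 1));  // prints [UNKNOWN]
    }
}
```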

lc4j is cool, since it is packed with lots of different languages (about half of them are those from TextCat) and creating your own is easier than ever.
As a general rule of thumb, please remember that the longer the text, the better the output.
This version of lc4j ships with over 150 language models, but since many of them are just the same language in various common encodings, the actual number of distinct languages recognized is roughly 80-90. Adding a new language is easy, however: just follow the instructions provided in the accompanying files.

Obviously, being a Java library, lc4j can be easily integrated inside your Java application (say a crawler, or a data-mining app) and can give you all that you need to determine the language of any document.
Moreover, as explained in the paper cited above, this algorithm can also be used to cluster documents by topic, although better algorithms exist in this field. For the sake of demonstration, a sample agglomerative-clustering algorithm that uses the distance between n-gram tables as its measure is provided in the JAR file.
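For readers unfamiliar with agglomerative clustering, the following is a minimal sketch of the general technique, not the implementation shipped in the JAR. It assumes the pairwise distances between documents (for example, distances between their n-gram tables) have already been computed into a matrix, and it uses single linkage, which is an arbitrary choice for this illustration. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Generic single-linkage agglomerative clustering sketch; not lc4j's own code. */
class AgglomerativeSketch {

    /** Repeatedly merges the two closest clusters until targetClusters remain. */
    static List<List<Integer>> cluster(double[][] dist, int targetClusters) {
        // Start with every document in a cluster of its own.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > Math.max(1, targetClusters)) {
            int bestA = 0;
            int bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(dist, clusters.get(a), clusters.get(b));
                    if (d < best) {
                        best = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // bestB > bestA, so removing bestB does not shift the position of bestA.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    /** Single linkage: the distance between the two closest members of the clusters. */
    static double linkage(double[][] dist, List<Integer> a, List<Integer> b) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) {
            for (int j : b) {
                min = Math.min(min, dist[i][j]);
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // Made-up symmetric distances: documents 0 and 1 are close, as are 2 and 3.
        double[][] dist = {
            {0, 1, 10, 10},
            {1, 0, 10, 10},
            {10, 10, 0, 1},
            {10, 10, 1, 0},
        };
        System.out.println(cluster(dist, 2)); // prints [[0, 1], [2, 3]]
    }
}
```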

Download

Click here to download the library and its source code (about 748 KB)

Bugs

Although this is far from being a bug, there is one thing that you should be careful about: language models and encodings. In order to recognize a language in a given encoding, lc4j requires a corresponding file that contains some sentences of that language in that encoding.
So, if you have a Chinese text in UTF-8 encoding and the program has been trained only on Chinese in BIG5 encoding, the results are unlikely to be correct.
A bug in how multi-byte encodings are read was fixed in version 0.4, released on Dec 18th, 2011, thanks to the input of David Dahan.
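The short sketch below shows why the encoding matters: the same text becomes a different byte sequence in each encoding, so a model built from one encoding does not line up with input in another. It uses ISO-8859-1 and UTF-8 rather than BIG5 only because those two charsets are guaranteed to be present in every Java runtime. The class name is hypothetical and the code is unrelated to lc4j's internals.

```java
import java.nio.charset.StandardCharsets;

/** Shows that one string yields different bytes in different encodings. */
class EncodingBytes {

    /** Formats bytes as space-separated lowercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "perch\u00e9"; // "perché", written with an escape to avoid source-encoding issues
        System.out.println(hex(text.getBytes(StandardCharsets.ISO_8859_1))); // prints 70 65 72 63 68 e9
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));      // prints 70 65 72 63 68 c3 a9
    }
}
```

The last character is one byte in ISO-8859-1 and two bytes in UTF-8, so any n-gram that covers it differs between the two files.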

Apart from that, there are currently no known bugs. If you find any, please contact me.
Thanks!

If you want further information about my projects, please visit my software section.

  1. #1 by lobotommy on 22/06/2012 - 21:27

    Of course the phrase is not actually Latin. It originally was (a Cicero quote), but “lorem” was clipped from “dolorem,” and “consectetuer adipiscing elit” are nonsense words in every language. It devolved that way, per the folklore, to represent font samples more conveniently.
