Uncovering Languages from written documents

Nikitas N. Karanikolas
TEI of Athens, nnk@teiath.gr
and
Panagiotis Ouranos
TEI of Athens, pouran24@gmail.com
PCI 2014, Athens – Greece, October 2 – 4, 2014
Motivation
The task is to determine which language is used in a machine-readable electronic document.
Cases where the input text mixes several languages are more problematic; this is common in Web documents.
Language identification is a prerequisite for NLP tasks such as full-text indexing, summarization, classification and computer-assisted assessment.
PCI 2014, 3-Oct-2014
Background: coding systems, codepages
Automatic language identification of text can be decomposed into two steps: coding system identification, followed by language identification.
Coding systems include ASCII (7-bit), EBCDIC (8-bit), extended ASCII (8-bit) and Unicode.
Codepages are used in 8-bit coding systems.
They define the location (the code) of the basic graphemes (English letters, punctuation symbols and digits) and the location of other international or region-specific graphemes (characters).
Codepage-based files can therefore contain only a small set of languages, typically a pair (e.g. English and Greek).
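To illustrate why the codepage has to be known before the text can be read, the following Python sketch decodes the same two bytes under three codepages from Python's standard codec set. The byte values and the helper name `decode_with` are chosen for illustration only and are not taken from the authors' system.

```python
# Decode the same two bytes under different 8-bit codepages.
# Byte 0x41 lies in the 7-bit ASCII range shared by all three codepages;
# byte 0xC1 lies in the region-specific upper half.
SAMPLE = bytes([0x41, 0xC1])


def decode_with(codec: str) -> str:
    """Decode the sample bytes with the given Python codec name."""
    return SAMPLE.decode(codec)


if __name__ == "__main__":
    for codec in ("iso8859_1", "iso8859_5", "iso8859_7"):
        text = decode_with(codec)
        # Print the code points so the result is readable on any terminal.
        print(codec, [f"U+{ord(ch):04X}" for ch in text])
```

The first byte decodes to the Latin letter A in every case, while the second becomes a Latin, Cyrillic or Greek letter depending on the codepage.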
Background: some codepages
Standard      Informal name          Similar Microsoft codepage
ISO-8859-1    Latin 1                Windows 1252
ISO-8859-2    Latin 2                Windows 1250
ISO-8859-5    Latin/Cyrillic         Windows 1251
ISO-8859-6    Latin/Arabic           Windows 1256
ISO-8859-7    Latin/Greek            Windows 1253
ISO-8859-8    Latin/Hebrew           Windows 1255
ISO-8859-9    Latin-5 or Turkish     Windows 1254
ISO-8859-13   Latin-7 or Baltic Rim  Windows 1257
Background: Unicode
Unicode is a newer coding system designed to represent text-based data written in any language.
Its code points range from U+0000 to U+10FFFF, so a character occupies at most 32 bits, depending on the encoding.
It can be implemented by different character encodings.
UTF-32 is a 32-bit fixed-width encoding, able to encode every Unicode character.
UTF-16 is a variable-width encoding that uses either 16 or 32 bits per character; it is also able to encode every Unicode character.
UTF-8 uses one byte for any ASCII character (the same code in both UTF-8 and ASCII) and up to four bytes for other characters.
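The widths of the three encodings can be checked directly. The sketch below is a small Python illustration; the function name `encoded_sizes` and the sample characters are assumptions made for the example.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    # The "-le" codec variants are used so that no byte-order mark
    # is prepended to the encoded output.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # An ASCII letter, a Greek letter, and a character outside the
    # Basic Multilingual Plane (U+1F600).
    for ch in ("A", "\u0398", "\U0001F600"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```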
Background: the extent of the problem
Identifying the codepage, in the case of single-byte encoding, is already very helpful.
For example, identifying ISO-8859-7 is enough to know that the text contains only Greek and (possibly) English words.
Identifying ISO-8859-5 restricts the candidates to a few languages (Bulgarian, Belarusian, Russian, Serbian, Ukrainian and English).
In the case of Unicode encoding, the alphabet of each character is directly identified, but distinguishing between languages that share the same alphabet remains the same problem as in the previous example (languages using the Cyrillic alphabet).
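The point that Unicode reveals the alphabet but not the language can be sketched in Python. The heuristic below takes the first word of a character's Unicode name as its script; this is a simplification assumed for the example (it is not the official Unicode Script property), and the function names are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Approximate the script of a character by the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(text: str) -> set:
    """Collect the approximate scripts of all alphabetic characters in the text."""
    return {script_of(ch) for ch in text if ch.isalpha()}


if __name__ == "__main__":
    # A Ukrainian and a Russian phrase both report only the Cyrillic script,
    # so the script alone cannot tell the two languages apart.
    print(scripts_in("Для організації"))
    print(scripts_in("Об этом"))
```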
Our approach: Assumptions
The goal is language identification of multi-lingual Web documents.
We do not assume that a Web document is written in a single language to be identified.
We assume that the same document can contain several languages in different segments.
Regarding the granularity of segmentation, we assume that each paragraph is written in a single language.
Our approach: Input and Output
Input: a URL.
Output: a file with the .mnt file extension.
It is plain text encoded in UTF-8, with a language identifier tag before every paragraph.
The tag has the form \langxxxx, where xxxx is a decimal number, the same one used in RTF files.
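A consumer of this format could read the tags back as sketched below in Python. The exact layout is an assumption inferred from the description: each tag starts a line, and the paragraph text follows until the next tag. The function name `parse_mnt` is hypothetical.

```python
import re

# A tag is "\lang" followed by a decimal language identifier at the start of a line.
TAG = re.compile(r"^\\lang(\d+)\b")


def parse_mnt(text: str) -> list:
    """Split .mnt content into (language_id, paragraph) pairs."""
    segments = []
    for line in text.splitlines():
        match = TAG.match(line)
        if match:
            # A new tag opens a new segment; anything after the tag
            # on the same line is ignored.
            segments.append((int(match.group(1)), []))
        elif segments and line.strip():
            # Non-empty lines belong to the most recent tag.
            segments[-1][1].append(line)
    return [(lang_id, " ".join(body)) for lang_id, body in segments]


if __name__ == "__main__":
    sample = "\\lang1032\nΚαλημέρα\n\\lang1045\nDzień dobry\nwszystkim\n"
    for lang_id, paragraph in parse_mnt(sample):
        print(lang_id, paragraph)
```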
Command
C:\> MultiNationalText http://store4share.net/test5.html c:\log\test5 1
File test5.mnt
\lang1058 (Ukrainian)
Для організації саме такої роботи Микола Азаров доручив внести пропозиції щодо бюджетного
фінансування сервісного обслуговування дорогої апаратури, забезпечення витратними.........
\lang1049 (Russian / Belarusian)
Об этом 17 марта на пресс-конференции в Минске заявил лидер кампании "Говори правду"
Владимир Некляев. "Народный референдум" не является коалицией – это политическая кампания.
"Мы будем делать все, чтобы в 2015 году оппозиция выдвинула .........
\lang1049 (Russian)
Сборная Финляндии смогла пробиться в раунд плей-офф турнира благодаря сборной Швейцарии,
которая не пустила туда сборную Латвии. .........
\lang1049 (Russian / Bulgarian)
Фирмата провежда специална политика за предоставяне на услугите на преференциални цени за
образованието, науката, медицината, армията и полицията. .........
\lang3098 (Serbian)
Крунска улица је крајем 1900. године указом о категоризацији одређена као улица за подизање вила,
али тек после Првог светског рата добиће резиденцијални карактер – када су изграђена .........
\lang1032 (Greek)
Κατηγορείτε τους άλλους: Οι άνθρωποι κάνουν λάθη. Οι συνάδελφοι δεν κάνουν τη δουλειά τους καλά. Οι
courier δεν φέρνουν τα πακέτα σας στην ώρα τους. .........
\lang1029 (Czech)
Nespisovná mluva obyvatelstva je rozlišena územně. V Čechách převládá interdialekt (nadnářeční útvar),
zvaný obecná čeština , který se vyvinul na podkladě hlavních rysů nářeční .........
\lang1045 (Polish)
Poprzedzający Wielkanoc tydzień, w czasie którego Kościół i wierni wspominają najważniejsze dla wiary
chrześcijańskiej wydarzenia, nazywany jest Wielkim Tygodniem. .........
Our approach: Algorithm
1. Request site data (allocate and fill RawBuffer)
2. Create output folder
3. Crawl and save external dependencies and main file, using sockets
4. Parse the HTML head tags and update the global variables (content_type, content_language, isUTF, CodepageID)
5. Remove all HTML tags and save the result (in TextData)
6. Validate that the values of content_type and content_language are acceptable
7. Create an array (pParagraphs) with the borders of paragraphs (the attributes of pParagraphs elements are references into TextData)
8. FOR each item in array pParagraphs
9.     IF it is recognized as a codepage-based buffer
10.        Convert paragraph to UTF-8
11.    End IF
12.    Replace escapes with UTF-8 encodings
13.    Identify the language of the item (paragraph)
14.    Save the language identifier
15.    Save (append) the paragraph into the .mnt file
16. End FOR
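The per-paragraph loop (steps 8 to 16) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "escapes" are taken to mean HTML character references, the language identifier is supplied by a caller-provided `identify` function, and the names `normalize_paragraph` and `build_mnt` are hypothetical.

```python
import html


def normalize_paragraph(raw: bytes, is_utf: bool, codec: str) -> str:
    """Steps 9-12: bring one paragraph into Unicode text without escapes."""
    # Steps 9-11: a codepage-based buffer is decoded with its codepage;
    # otherwise the bytes are taken to be UTF-8 already.
    text = raw.decode("utf-8" if is_utf else codec, errors="replace")
    # Step 12 (assumed meaning): resolve HTML escapes such as &amp; or &#920;.
    return html.unescape(text)


def build_mnt(paragraphs, is_utf, codec, identify) -> str:
    """Steps 8-16: tag every paragraph and return the .mnt content."""
    out = []
    for raw in paragraphs:                       # step 8
        text = normalize_paragraph(raw, is_utf, codec)
        lang_id = identify(text)                 # step 13
        out.append(f"\\lang{lang_id}")           # step 14
        out.append(text)                         # step 15
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    raw = "Ελληνικά &amp; &#920;".encode("iso8859_7")
    # A stand-in identifier that always answers 1032 (Greek).
    print(build_mnt([raw], False, "iso8859_7", lambda text: 1032))
```

Writing the returned string to a file opened with `encoding="utf-8"` would give the UTF-8 output described for the .mnt format.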
Our approach: Algorithm – Step 4
Given the following HTML head tags:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Content-Language" content="el">
Step 4 updates the global variables as follows:
content_type := "utf-8";
content_language := "el";
isUTF := True;
CodepageID := 65001;
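Step 4 could look roughly like the following Python sketch, which uses the standard `html.parser` module. The class and function names are hypothetical, and the small `CODEPAGE_IDS` table (Windows code page identifiers) is an assumption added for the example; only the value 65001 for UTF-8 appears in the slides.

```python
import re
from html.parser import HTMLParser

# Illustrative subset of Windows code page identifiers (assumed table).
CODEPAGE_IDS = {
    "utf-8": 65001,
    "windows-1251": 1251,
    "windows-1253": 1253,
    "iso-8859-7": 28597,
}


class HeadMetaParser(HTMLParser):
    """Collect the charset and language declared in <meta http-equiv=...> tags."""

    def __init__(self):
        super().__init__()
        self.content_type = None
        self.content_language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        equiv = (attributes.get("http-equiv") or "").lower()
        content = attributes.get("content") or ""
        if equiv == "content-type":
            match = re.search(r"charset=([\w-]+)", content, re.IGNORECASE)
            if match:
                self.content_type = match.group(1).lower()
        elif equiv == "content-language":
            self.content_language = content.strip().lower()


def parse_head(html_text: str) -> dict:
    """Return the four variables that step 4 updates."""
    parser = HeadMetaParser()
    parser.feed(html_text)
    charset = parser.content_type
    return {
        "content_type": charset,
        "content_language": parser.content_language,
        "isUTF": charset == "utf-8",
        "CodepageID": CODEPAGE_IDS.get(charset),
    }
```

Feeding the two meta tags shown on this slide to `parse_head` yields the same four values listed there.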
Our approach: Algorithm – Step 13
Step 13 checks for character patterns that are unique (or almost unique) to one of the supported languages.
Each such pattern found increases the weight (the likelihood) of the corresponding language.
These patterns can be short words or suffixes (word endings). [positives]
They can also be single letters, when these letters are used by only one of the languages that share the same alphabet. [positives]
Conversely, the presence of a letter that a given language has dropped from the shared alphabet automatically rules out that language. [negatives]
Our approach: Algorithm – Step 13 – positives
Pattern   Language     Type
Љ         Serbian      Letter
Њ         Serbian      Letter
они       Russian      Short word
ет        Russian      Postfix
ў         Belarusian   Letter
гэта      Belarusian   Short word
да        Bulgarian    Short word
що        Ukrainian    Short word
Θ         Greek        Letter
že        Czech        Short word
się       Polish       Short word
Our approach: Algorithm – Step 13 – negatives
Step 13 also checks for letters that a given language has dropped from the alphabet it shares with other languages.
The presence of such a letter automatically rules out that language.
For example, the letters Ѥ, Я, Ю, Ѱ and Ъ (which exist in the Cyrillic alphabet) are not used in the Serbian language.
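The positive and negative rules of step 13 can be combined as in the Python sketch below. It uses only the patterns listed on the slides; the equal weight of one point per pattern, the case-insensitive matching, the tie-breaking and the function name `identify` are assumptions made for illustration, since the slides do not specify them.

```python
import re

# (pattern, language, type) rows as listed on the "positives" slide.
POSITIVES = [
    ("Љ", "Serbian", "letter"),
    ("Њ", "Serbian", "letter"),
    ("они", "Russian", "word"),
    ("ет", "Russian", "postfix"),
    ("ў", "Belarusian", "letter"),
    ("гэта", "Belarusian", "word"),
    ("да", "Bulgarian", "word"),
    ("що", "Ukrainian", "word"),
    ("Θ", "Greek", "letter"),
    ("že", "Czech", "word"),
    ("się", "Polish", "word"),
]

# Letters whose presence rules a language out ("negatives" slide).
NEGATIVES = {"Serbian": "ѤЯЮѰЪ"}


def identify(paragraph: str):
    """Return the best-scoring language name, or None if no pattern matches."""
    text = paragraph.lower()
    words = re.findall(r"\w+", text)
    scores = {}
    for pattern, language, kind in POSITIVES:
        needle = pattern.lower()
        if kind == "letter":
            hit = needle in text
        elif kind == "word":
            hit = needle in words
        else:  # postfix: a word ending that is not the whole word
            hit = any(w.endswith(needle) and w != needle for w in words)
        if hit:
            # Assumed weighting: one point per matching pattern.
            scores[language] = scores.get(language, 0) + 1
    for language, letters in NEGATIVES.items():
        if any(ch in text for ch in letters.lower()):
            # A forbidden letter removes the language from the candidates.
            scores.pop(language, None)
    if not scores:
        return None
    return max(scores, key=scores.get)
```

Mapping the returned language name to its \langxxxx number would be a separate lookup step.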
Utilization
So far, our approach supports eight languages: Belarusian, Bulgarian, Czech, Greek, Polish, Russian, Serbian and Ukrainian.
It can be used as a preparatory step before full-text indexing, summarization, classification or any other NLP task.
It is also intended for archiving and offline reproduction of web pages.
In addition to the .mnt file, it creates an index.htm file with resolved external dependencies.
Evaluation
We have created a rather small set of seven multi-language documents.
Each document usually contains seven to nine paragraphs.
Each paragraph is written in one of the eight supported languages.
Each document usually mixes four to five of the eight supported languages.
In this way, we have a set of fifty-eight paragraphs.
Our approach correctly identifies forty of them, so the success rate is about sixty-nine per cent (40/58 ≈ 69%).
Conclusions
With a small set of rules and a compact design, we have achieved an acceptable success rate.
Increasing this rate by introducing some more patterns appears to be an easy target.
The success rate could approach one hundred per cent with the introduction of more elaborate rules (e.g. with collocations).
The interesting and novel point is that our approach identifies the language per paragraph, not a single language for the whole document.
The other interesting point is the introduction of the .mnt file type, which can be used for the interchange of multi-language documents.
Uncovering Languages
from written documents

Thank you for your attention

We will try to answer your Questions
Nikitas N. Karanikolas
nnk@teiath.gr
http://users.teiath.gr/nnk/