Uncovering Languages from Written Documents

Nikitas N. Karanikolas, TEI of Athens, [email protected]
Panagiotis Ouranos, TEI of Athens, [email protected]

PCI 2014, Athens, Greece, October 2-4, 2014

Motivation

- Identifying the language used in a machine-processable electronic document.
- More problematic are cases where the input text is composed of several languages, a common situation in Web documents.
- Language identification is a prerequisite for NLP tasks such as full-text indexing, summarization, classification, and computer-assisted assessment.

PCI 2014, 3-Oct-2014

Background: coding systems, codepages

- Automatic language identification of text can be decomposed into coding-system identification followed by language identification.
- Coding systems are: ASCII (7-bit), EBCDIC (8-bit), extended ASCII (8-bit), and Unicode.
- Codepages are used in 8-bit coding systems. They define the location (the code) of the basic graphemes (English letters, punctuation symbols, and numbers) and the location of other international or region-specific graphemes (characters).
- Codepage-based files can therefore contain only a pair of languages (e.g. English and Greek).

Background: some codepages

Standard      Informal name          Microsoft's similar
ISO-8859-1    Latin-1                Windows-1252
ISO-8859-2    Latin-2                Windows-1250
ISO-8859-5    Latin/Cyrillic         Windows-1251
ISO-8859-6    Latin/Arabic           Windows-1256
ISO-8859-7    Latin/Greek            Windows-1253
ISO-8859-8    Latin/Hebrew           Windows-1255
ISO-8859-9    Latin-5 or Turkish     Windows-1254
ISO-8859-13   Latin-7 or Baltic Rim  Windows-1257

Background: Unicode

- A newer coding system designed to represent text-based data written in any language.
- Each character can occupy up to 32 bits.
- Unicode can be implemented by different character encodings:
- UTF-32 is a 32-bit fixed-width encoding, able to encode every Unicode character.
- UTF-16 is a variable-width encoding; it uses either 16-bit or 32-bit units and is able to encode every Unicode character.
- UTF-8 uses one byte for any ASCII character (the same code in both UTF-8 and ASCII) and up to four bytes for other characters.

Background: the extent of the problem

- Identifying the codepage, in the case of a single-byte encoding, is in itself very helpful. For example, identifying ISO-8859-7 is enough to know that the text contains only Greek and (possibly) English words. Identifying ISO-8859-5 restricts the languages to a few (Bulgarian, Belarusian, Russian, Serbian, Ukrainian, and English).
- In the case of Unicode encoding, the alphabet of each character is directly identified, but distinguishing between languages that share the same alphabet remains the same problem as in the previous example (languages using the Cyrillic alphabet).

Our approach: Assumptions

- Language identification of multi-lingual Web documents.
- We do not assume that a web document is written in a single language that we are trying to identify.
- We assume that one document can contain many languages in different segments.
- Regarding the granularity of segmentation, we assume that each paragraph is written in a single language.

Our approach: Input and Output

- Input: a URL.
- Output: a file with the .mnt extension: plain text, encoded in UTF-8, with a language-identifier tag before every paragraph.
- The tag has the form \langxxxx, where xxxx is a decimal number, the same one used in RTF files.

Command

C:\> MultiNationalText http://store4share.net/test5.html c:\log\test5 1

File test5.mnt

\lang1058 (Ukrainian)
Для організації саме такої роботи Микола Азаров доручив внести пропозиції щодо бюджетного фінансування сервісного обслуговування дорогої апаратури, забезпечення витратними .........

\lang1049 (Russian / Belarusian)
Об этом 17 марта на пресс-конференции в Минске заявил лидер кампании "Говори правду" Владимир Некляев. "Народный референдум" не является коалицией – это политическая кампания.
"Мы будем делать все, чтобы в 2015 году оппозиция выдвинула .........

\lang1049 (Russian)
Сборная Финляндии смогла пробиться в раунд плей-офф турнира благодаря сборной Швейцарии, которая не пустила туда сборную Латвии. .........

\lang1049 (Russian / Bulgarian)
Фирмата провежда специална политика за предоставяне на услугите на преференциални цени за образованието, науката, медицината, армията и полицията. .........

\lang3098 (Serbian)
Крунска улица је крајем 1900. године указом о категоризацији одређена као улица за подизање вила, али тек после Првог светског рата добиће резиденцијални карактер – када су изграђена .........

\lang1032 (Greek)
Κατηγορείτε τους άλλους: Οι άνθρωποι κάνουν λάθη. Οι συνάδελφοι δεν κάνουν τη δουλειά τους καλά. Οι courier δεν φέρνουν τα πακέτα σας στην ώρα τους. .........

\lang1029 (Czech)
Nespisovná mluva obyvatelstva je rozlišena územně. V Čechách převládá interdialekt (nadnářeční útvar), zvaný obecná čeština , který se vyvinul na podkladě hlavních rysů nářeční .........

\lang1045 (Polish)
Poprzedzający Wielkanoc tydzień, w czasie którego Kościół i wierni wspominają najważniejsze dla wiary chrześcijańskiej wydarzenia, nazywany jest Wielkim Tygodniem. .........

Our approach: Algorithm

1. Request site data (allocate and fill RawBuffer)
2. Create output folder
3. Crawl and save external dependencies and the main file, using sockets
4. Parse the HTML head tags and update global variables (content_type, content_language, isUTF, CodepageID)
5. Remove any HTML tag and save the result (in TextData)
6. Validate that the values of content_type and content_language are acceptable
7. Create an array (pParagraphs) with the borders of the paragraphs (attributes of the pParagraphs elements are references into TextData)
8. FOR each item in array pParagraphs
9.    IF it is recognized as a codepage-based buffer
10.      Convert the paragraph to UTF-8
11.   END IF
12.   Replace HTML escapes (entities) with their UTF-8 encodings
13.   Identify the language of the item (paragraph)
14.   Save the language identifier
15.   Save (append) the paragraph into the .mnt file
16. END FOR

Our approach: Algorithm – Step 4

Given the following HTML head tags:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Content-Language" content="el">

Step 4 updates the global variables as follows:

content_type := "utf-8";
content_language := "el";
isUTF := True;
CodepageID := 65001;

Our approach: Algorithm – Step 13

- It checks for the existence of character patterns that are unique (or almost unique) to one of the supported languages. The existence of each such pattern increases the weight (the likelihood) of the corresponding language.
- These patterns can be short words or postfixes (word endings). [positives]
- They can also be single letters, when those letters are used by only one language among the languages that share the same alphabet. [positives]
- The existence of letters that have been removed, for a given language, from the alphabet shared by many languages automatically nullifies the likelihood of that language. [negatives]

Our approach: Algorithm – Step 13 – positives

Pattern   Language     Type
Љ         Serbian      Letter
Њ         Serbian      Letter
они       Russian      Short word
ет        Russian      Postfix
ў         Belarusian   Letter
гэта      Belarusian   Short word
да        Bulgarian    Short word
що        Ukrainian    Short word
Θ         Greek        Letter
že        Czech        Short word
się       Polish       Short word

Our approach: Algorithm – Step 13 – negatives

It also checks for the existence of letters that have been removed, for a given language, from the shared alphabet. The existence of such a letter automatically nullifies the likelihood of that language.
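The positive/negative scheme of Step 13 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the pattern lists are a small excerpt of the tables on these slides, and the function name, dictionaries, and the "unknown" fallback are our own assumptions.

```python
# Sketch of Step 13: positive patterns (unique letters, short words,
# word endings) add weight to a language; negative letters (removed
# from that language's version of the shared alphabet) veto it.

POSITIVES = {
    "Serbian":    ["љ", "њ"],        # letters unique to Serbian Cyrillic
    "Russian":    [" они ", "ет "],  # short word / word ending
    "Belarusian": ["ў", " гэта "],
    "Bulgarian":  [" да "],
    "Ukrainian":  [" що "],
    "Greek":      ["θ"],
    "Czech":      [" že "],
    "Polish":     [" się "],
}

NEGATIVES = {
    # Letters present in the shared Cyrillic alphabet but not used in
    # Serbian; their presence rules Serbian out.
    "Serbian": ["я", "ю", "ъ"],
}

def identify_language(paragraph: str) -> str:
    """Return the most likely language of one paragraph."""
    text = " " + paragraph.lower() + " "   # pad so word patterns match at edges
    weights = {}
    for lang, patterns in POSITIVES.items():
        weights[lang] = sum(text.count(p) for p in patterns)
    for lang, letters in NEGATIVES.items():
        if any(ch in text for ch in letters):
            weights[lang] = 0              # negative evidence nullifies the language
    best = max(weights, key=weights.get)
    return best if weights[best] > 0 else "unknown"
```

A production version would weight pattern types differently and use the full pattern tables; this sketch only shows how positive counts and negative vetoes combine per paragraph.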
For example, the letters Ѥ, Я, Ю, Ѱ and Ъ (which exist in the Cyrillic alphabet) are not used in the Serbian language.

Utilization

- So far, our approach supports eight languages: Belarusian, Bulgarian, Czech, Greek, Polish, Russian, Serbian and Ukrainian.
- It can be used as a preparatory step before full-text indexing, summarization, classification, or any other NLP task.
- We also aim for it to be used for archiving and offline reproduction of web pages. Besides the .mnt file, it creates an index.htm file with resolved external dependencies.

Evaluation

- We have created a rather small set of seven multi-language documents. Each document usually contains seven to nine paragraphs; each paragraph is written in one of the eight supported languages, and each document usually intermixes four to five of them.
- In this way, we have a set of fifty-eight paragraphs. Our system correctly identifies forty of them, so the success rate is about sixty-nine per cent (69%).

Conclusions

- With a small set of rules and a compact design, we have achieved an acceptable success rate.
- Increasing this rate by introducing some more patterns seems an easy target.
- With more elaborate rules (e.g. collocations), the success rate could approach one hundred per cent.
- An interesting point is that our approach identifies a language per paragraph, not one language for the whole document.
- Another interesting point is the introduction of the .mnt file type, which can be used for the interchange of multi-language documents.

Uncovering Languages from Written Documents

Thank you for your attention. We will try to answer your questions.

Nikitas N. Karanikolas
[email protected]
http://users.teiath.gr/nnk/
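As a closing illustration, the .mnt format described earlier can be consumed with a few lines of Python. The helper below is our own sketch, not part of the presented system; the language-identifier values are the RTF/Windows ones that appear in the test5.mnt sample.

```python
import re

# Windows/RTF language identifiers used by the \langXXXX tags
# (only the values appearing in the sample output are listed).
LANG_IDS = {
    1029: "Czech",
    1032: "Greek",
    1045: "Polish",
    1049: "Russian",
    1058: "Ukrainian",
    3098: "Serbian (Cyrillic)",
}

def read_mnt(text: str):
    """Split .mnt content into (language, paragraph) pairs."""
    pairs = []
    # Each paragraph is preceded by a \langXXXX tag; paragraph text
    # runs until the next backslash-introduced tag.
    for m in re.finditer(r"\\lang(\d+)\s+([^\\]+)", text):
        lang = LANG_IDS.get(int(m.group(1)), "unknown")
        pairs.append((lang, m.group(2).strip()))
    return pairs
```

Because the tags reuse standard RTF language identifiers, the same mapping lets an .mnt file be exchanged with any RTF-aware tool.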