Writing semantic resources for the Russian language

Russian Module for NooJ:
Semantic annotation
Conception and realisation
of semantic tags
for the Russian language
for Max Silberztein’s NooJ software
NOOJ Conference
Inalco, Saarbruecken
June 5th, 2013
Vincent BÉNET
INALCO
CREE
Recherche assistée par ordinateur
1
Russian Module for NooJ:
design and implementation
of lexical and grammatical resources
- one main dictionary (95,000 entries)
- two annex dictionaries
  - one for proper nouns
  - one for noun-adjectives
2
Russian Module for NooJ:
design and implementation
of basic semantic resources
How?
- by adding tags to the general dictionary
- by writing grammars
Semantic tagging or annotation?
3
Writing semantic resources for the
Russian language
The semantic tags
of the Russian National Corpus:
Taxonomy (a lexeme's thematic class) – for nouns, verbs,
adjectives and adverbs.
Mereology (“part – whole” and “element – aggregate”
relationships) – for concrete and abstract nouns
Topology – for concrete nouns
Causation – for verbs
Evaluation – for abstract and concrete nouns, adjectives
and adverbs
4
Writing semantic resources for the
Russian language
27 semantic taxonomic tags for verbs
t:move — movement (бежать, дергаться, бросить, нести)
t:be — sphere of existence (жить, возникнуть, убить)
t:loc — location (лежать, стоять, положить)
t:poss — sphere of possession (иметь, дать, подарить, приобрести)
t:ment — mental sphere (знать, верить, догадаться, помнить)
t:perc — perception (смотреть, слышать, нюхать, чуять)
t:speech — speech (говорить, советовать, спорить, каламбурить)
t:sound — sounds (гудеть, шелестеть)
t:light — light (гаснуть, лучиться)
5
Semantic information in the Russian
National Corpus (Verbs)
6
Writing semantic resources for the
Russian language
khodit’,V+Mvt+Indet+ipf+intr+FLX=ходить
Idti,V+Mvt+Det+ipf+intr+FLX=идти
Vkhodit’,V+Mvt+Pvb+ipf+intr+FLX=ходить
Vojti’,V+Mvt+Pvb+pf+intr+FLX=идти
Vykhodit’,V+Mvt+Pvb+ipf+intr+FLX=ходить
Priezzhat’,V+Mvt+Pvb+ipf+intr+FLX=акать
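The entries above follow the pattern lemma, category, then `+`-separated features, with `FLX=` naming the inflection paradigm. As a minimal sketch of how such lines can be read outside NooJ, the Python below splits an entry into its parts and filters on the `Mvt` tag. The function name `parse_entry` and the dictionary layout it returns are assumptions made for illustration; they are not part of NooJ.

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line into lemma, category and features."""
    lemma, _, info = line.partition(",")
    fields = [f.strip() for f in info.split("+") if f.strip()]
    category, rest = fields[0], fields[1:]
    # Bare tags (Mvt, ipf, ...) versus name=value properties (FLX=...)
    flags = [f for f in rest if "=" not in f]
    props = dict(f.split("=", 1) for f in rest if "=" in f)
    return {"lemma": lemma.strip(), "cat": category, "flags": flags, "props": props}

# Sample lines copied from the slide
ENTRIES = [
    "khodit’,V+Mvt+Indet+ipf+intr+FLX=ходить",
    "Idti,V+Mvt+Det+ipf+intr+FLX=идти",
    "Vkhodit’,V+Mvt+Pvb+ipf+intr+FLX=ходить",
    "Vojti’,V+Mvt+Pvb+pf+intr+FLX=идти",
]

parsed = [parse_entry(e) for e in ENTRIES]

# Perfective verbs of motion among the sample entries
motion_pf = [p["lemma"] for p in parsed
             if p["cat"] == "V" and "Mvt" in p["flags"] and "pf" in p["flags"]]
print(motion_pf)
```

Keeping bare tags and `name=value` properties apart makes it easy to combine a semantic tag such as `Mvt` with aspect or prefixation tags in one filter.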
7
Grammar to locate the verbs of motion
8
Searching for « verbs of motion »
with NooJ
9
Searching for « verbs of motion »
with NooJ
10
Writing semantic resources for the
Russian language
— concrete nouns (девочка, стол, молоко)
— abstract nouns (вождение, яркость, время)
— proper names (Иван, Эйнштейн, Петроград)
— person (человек, учитель)
— ethnonyms (эфиоп, итальянка)
— kinship terms (брат, бабушка)
— supernatural creatures (русалка, инопланетянин)
— animals (корова, жираф, сорока, ящерица, муравей)
— plants (береза, роза, трава)
and so on
11
Semantic information in the Russian
National Corpus (Nouns)
12
Semantic information in the Russian
National Corpus (Adjectives)
13
Semantic information in the Russian
National Corpus (Adverbs)
14
Writing basic semantic resources for the
Russian language
NooJ properties.def file
N_Genre = m | f | n ;
N_SGenr = an | inan ;
N_Nombre = s | p;
N_Cas = Im | Vi | Ro | R2 | Da | Tv | Pr | P2 | Zv ;
…
V_Type = Mvt;
V_Morph = Pref | Suff;
15
Writing basic semantic resources for the
Russian language
NooJ properties.def file
A_Sem = Animal; Color ( Hum = App)
N_Sem = Hum | Prof | Parents | Body |
Conc | Abstr | Org | Text |
Animal | Food | Health | Arts | Lit | Music | Sports |
Topo | Country | River | City | Mount | Lake |
Posit | Time | Color ;
ADV_Sem = Time | Topo | Modal ;
V_Sem = Color | Topo | Posit | Modal ;
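Each definition pairs a category-prefixed property name with the values it allows, separated by `|` and closed by `;`. The sketch below, in Python, reads a few of the definitions shown on these slides into a lookup table and reports which property of a category declares a given value. The helper names `parse_properties` and `properties_for` are hypothetical, and the parser handles only the simple `Name = a | b ;` form seen here; it is not a full reader for NooJ's file format.

```python
import re

# A few definitions copied from the slides
PROPERTIES_DEF = """
N_Genre = m | f | n ;
N_SGenr = an | inan ;
ADV_Sem = Time | Topo | Modal ;
V_Sem = Color | Topo | Posit | Modal ;
"""

def parse_properties(text):
    """Return {property name: set of allowed values} from 'Name = a | b ;' definitions."""
    table = {}
    for name, values in re.findall(r"(\w+)\s*=\s*([^;]+);", text):
        table[name] = {v.strip() for v in values.split("|") if v.strip()}
    return table

def properties_for(table, category, value):
    """List the properties of a category (the prefix before '_') that allow a value."""
    return sorted(name for name, allowed in table.items()
                  if name.split("_", 1)[0] == category and value in allowed)

table = parse_properties(PROPERTIES_DEF)
print(properties_for(table, "N", "inan"))  # noun property that declares 'inan'
print(properties_for(table, "V", "Hum"))   # empty: 'Hum' is not declared for verbs here
```

A check of this kind can flag dictionary tags that no property declares for the entry's category before the dictionary is compiled.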
16
Writing semantic resources for the
Russian language
mal’chik,N+an+Hum+FLX=bul’dog
pered tem kak,CONJ+UNAMB+Time
Moskva,N+f+inan+City+FLX=Москва
Don,N+m+inan+River+FLX=Дон
Katar,N+Country+m+s+FLX=Ленинград
Nora,N+Forename+Hum+f+an+FLX=Лена
17
Writing semantic resources for the
Russian language
zelënyj,A+Color+FLX=novyj
zelenovatyj,A+Color+FLX=
zelënen’kij,A+Color+FLX=novyj
temno-zelënyj,A+Color+FLX=novyj
zelen’,N+f+inan+Color+FLX=smes’
zelenet’,V+intr+ipf+Color+FLX=belet’
zazelenet’,V+intr+pf+Color+FLX=belet’
zazelenet’sja,V+sja+pf+Color+FLX=….
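The same `Color` tag is attached to adjectives, nouns and verbs, so one semantic query can reach a whole derivational family. As a rough illustration in Python, the sketch below groups a handful of the entries above by part of speech for a given tag. The function name `lemmas_by_category` is an assumption for this example, and the `River` entry is included only to show that untagged lines are skipped.

```python
from collections import defaultdict

# Sample lines copied from the slides
ENTRIES = [
    "zelënyj,A+Color+FLX=novyj",
    "zelen’,N+f+inan+Color+FLX=smes’",
    "zelenet’,V+intr+ipf+Color+FLX=belet’",
    "zazelenet’,V+intr+pf+Color+FLX=belet’",
    "Don,N+m+inan+River+FLX=Дон",
]

def lemmas_by_category(entries, tag):
    """Group the lemmas carrying `tag` by their grammatical category."""
    groups = defaultdict(list)
    for line in entries:
        lemma, _, info = line.partition(",")
        fields = info.split("+")
        # fields[0] is the category; the tag must appear among the features
        if tag in fields[1:]:
            groups[fields[0]].append(lemma)
    return dict(groups)

color = lemmas_by_category(ENTRIES, "Color")
for category, lemmas in color.items():
    print(category, lemmas)
```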
18
Writing basic semantic resources for the
Russian language
Prof = 900
Parent = 160 items
Forenames = 2280
Animal = 370
Food = 280 (Liquid = 25 )
Body = 285
Health = 175
Arts = 65
Lit = 40
Music = 155
Sport = 65
Topo = 40
Country = 180
River = 15
City = 175
Mount = 5
Lake = 5
Posit = 25
Time = 135
Modal = 15
Color = 275
19
Searching for « colors » with NooJ
20
Searching for « body parts » with NooJ
21
Searching for « parents (relatives) »
words with NooJ
22
Writing basic semantic resources for the
Russian language
NEXT WORK TO BE DONE…
- Completing the dictionary for concrete nouns
using thematic dictionaries
- Adding a new parameter, +Translation=, to the dictionary
so that NooJ can serve as a resource for building basic
dictionaries for parallel corpora.
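To make the planned `+Translation=` parameter concrete, the Python sketch below extracts lemma and translation pairs from entries that carry it. Everything about the parameter's use here is an assumption: the slides only name it, so its position after `FLX=` and the English glosses are illustrative, and `bilingual_pairs` is a hypothetical helper rather than a NooJ function.

```python
# Hypothetical entries: the +Translation= values and their placement are
# assumptions added for illustration, not taken from the actual dictionary.
ENTRIES = [
    "mal’chik,N+an+Hum+FLX=bul’dog+Translation=boy",
    "Moskva,N+f+inan+City+FLX=Москва+Translation=Moscow",
    "Don,N+m+inan+River+FLX=Дон",  # no translation yet
]

def bilingual_pairs(entries, key="Translation"):
    """Collect {lemma: translation} for entries that carry the given property."""
    pairs = {}
    for line in entries:
        lemma, _, info = line.partition(",")
        for field in info.split("+"):
            name, sep, value = field.partition("=")
            if sep and name == key and value:
                pairs[lemma] = value
    return pairs

pairs = bilingual_pairs(ENTRIES)
for lemma, gloss in pairs.items():
    print(lemma, "->", gloss)
```

Entries without the property are simply skipped, so a word list for a parallel corpus could grow as translations are added to the dictionary.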
23
Russian Module for NooJ:
Semantic annotation
Thank you for your attention
NOOJ Conference Inalco,
Saarbruecken
June 5th, 2013
24