Title:
|
Tagging and parsing Icelandic text
|
Natural language processing (NLP) is a very young discipline in Iceland. Therefore,
there is a lack of publicly available basic tools for processing the morphologically
complex Icelandic language.
In this thesis, we investigate the effectiveness and viability of using (mainly)
rule-based methods for analysing the syntax of Icelandic text. For this purpose,
and because our work has a practical focus, we develop an NLP toolkit, IceNLP. The
toolkit consists of a tokeniser, the morphological analyser IceMorphy, the part-of-speech
tagger IceTagger, and the shallow parser IceParser.
The task of the tokeniser is to split a sequence of characters into linguistic units
and identify where one sentence ends and another one begins.
IceMorphy is used for guessing part-of-speech tags for unknown words and
filling in tag profile gaps in a dictionary.
IceTagger is a linguistic rule-based tagger which achieves considerably higher
tagging accuracy than previously reported results using taggers based on data-driven
techniques. Furthermore, by using several tagger integration and combination
methods, we substantially increase the tagging accuracy of Icelandic text
relative to previous work.
Our shallow parser, IceParser, is an incremental finite-state parser, the first
parser published for the Icelandic language. It produces shallow syntactic annotation,
using an annotation scheme specifically developed in this work. Furthermore,
we create a grammar definition corpus, a representative collection of sentences
annotated using the annotation scheme.
The development of our toolkit is a step towards the goal of building a Basic
Language Resource Kit (BLARK) for the Icelandic language. Our toolkit has been
made available for use in the research community, and should therefore encourage
further research and development of NLP tools.
|