
I have moved to the Spoken Language Group at Saarland University: my webpage at UdS
I am a postgraduate student at the NCLT. I work on treebank-based acquisition of multilingual LFG resources. This research is part of the larger multilingual SFI-funded GramLab project. My supervisor is Prof. Josef van Genabith.
Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using a Maximum Entropy classifier. A third module dynamically combines the predictions of the Maximum Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module recasts lemmatization as a classification task by using class labels which encode mappings from word forms to lemmas. Experimental evaluation and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources.
The distinction between raising and subject-control verbs, although crucial for the construction of semantics, is not easy to make given access to only the local syntactic configuration of the sentence. In most contexts raising verbs and control verbs display identical superficial syntactic structure. Linguists apply grammaticality tests to distinguish these verb classes. Our idea is to learn to predict the raising-control distinction by simulating such grammaticality judgments by means of pattern searches. Experiments with regression tree models show that pattern counts from large unannotated corpora can be used to assess how likely a verb form is to appear in raising vs. control constructions. For this task it is beneficial to use the much larger but also noisier Web corpus rather than the smaller and cleaner Gigaword corpus. A similar methodology can be useful for detecting other lexical semantic distinctions: it could be used whenever a test employed to make linguistically interesting distinctions can be reduced to a pattern search in an unannotated corpus.
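As a rough illustration of reducing a grammaticality test to a pattern search, the sketch below counts how often a verb form occurs with an expletive "there" subject ("there seems to be"), a construction that raising verbs allow and control verbs normally do not. The specific pattern, the function name `expletive_there_ratio`, and the toy corpus are assumptions made for this example; they are not the patterns or data used in the paper, and the regression tree step is not shown.

```python
import re

def expletive_there_ratio(verb_form, sentences):
    """Share of '<verb_form> to' occurrences that appear as 'there <verb_form> to be'.

    Hypothetical feature for illustration: a higher ratio suggests that the
    verb form tolerates an expletive subject, as raising verbs do.
    """
    verb = re.escape(verb_form)
    any_use = re.compile(r"\b%s to\b" % verb, re.IGNORECASE)
    there_use = re.compile(r"\bthere %s to be\b" % verb, re.IGNORECASE)
    total = sum(len(any_use.findall(s)) for s in sentences)
    hits = sum(len(there_use.findall(s)) for s in sentences)
    # Avoid division by zero for verb forms that never occur in the corpus.
    return hits / total if total else 0.0

# Toy stand-in for a large unannotated corpus (invented sentences).
TOY_CORPUS = [
    "There seems to be a problem.",
    "She seems to like it.",
    "He tries to leave early.",
]
```

In practice such counts would come from a very large corpus, and several patterns would be combined as features for a learned model rather than read off directly.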
Function labels enrich constituency parse tree nodes with information
about their abstract syntactic and semantic roles. A common way to
obtain function-labeled trees is to use a two-stage architecture where
first a statistical parser produces the constituent structure and then
a second component such as a classifier adds the missing function
tags.
In order to achieve optimal results, training examples for
machine-learning-based classifiers should be as similar as possible to
the instances seen during prediction. However, the method which has
been used so far to obtain training examples for the function labeling
classifier suffers from a serious drawback: the training examples come
from perfect treebank trees, whereas test examples are derived from
parser-produced, imperfect trees.
We show that extracting training instances from the reparsed training
part of the treebank results in better training material as measured
by similarity to test instances. We show that our training method
achieves statistically significantly higher f-scores on the function
labeling task for the English Penn Treebank. Currently our method
achieves a 91.47% f-score on section 23 of the WSJ, the highest score
reported in the literature so far.
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. An SES describes the transformations that have to be applied to the input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages.
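The idea of using an edit script as a class label can be sketched as follows. This is a minimal illustration and not the implementation from the paper: the LCS-based dynamic program, the `("I"/"D", position, character)` encoding, and all function names are assumptions chosen for the example. Reversing both strings means positions are counted from the end of the word, so word forms of different lengths that share a suffix change receive the same label.

```python
def shortest_edit_script(source, target):
    """Return a tuple of insert/delete edits that turns source into target.

    Each edit is (op, position, char), where op is "I" or "D" and position
    indexes into the original source string. Built from an LCS table, so the
    number of inserts plus deletes is minimal.
    """
    n, m = len(source), len(target)
    # lcs[i][j] = length of the longest common subsequence of
    # source[i:] and target[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    script = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and source[i] == target[j]:
            i += 1
            j += 1
        elif j < m and (i == n or lcs[i][j + 1] >= lcs[i + 1][j]):
            script.append(("I", i, target[j]))
            j += 1
        else:
            script.append(("D", i, source[i]))
            i += 1
    return tuple(script)


def apply_script(source, script):
    """Apply an edit script to source; raise ValueError if it does not fit."""
    inserts, deletes = {}, set()
    for op, pos, char in script:
        if op == "I":
            if pos > len(source):
                raise ValueError("insert position outside the string")
            inserts.setdefault(pos, []).append(char)
        else:
            if pos >= len(source) or source[pos] != char:
                raise ValueError("delete does not match the string")
            deletes.add(pos)
    out = []
    for pos in range(len(source) + 1):
        out.extend(inserts.get(pos, []))  # inserts precede source[pos]
        if pos < len(source) and pos not in deletes:
            out.append(source[pos])
    return "".join(out)


def lemma_class(word_form, lemma):
    """Class label: the edit script between the reversed strings."""
    return shortest_edit_script(word_form[::-1], lemma[::-1])


def lemmatize(word_form, label):
    """Recover a lemma by applying a class label to a word form."""
    return apply_script(word_form[::-1], label)[::-1]
```

With this encoding, `lemma_class("walked", "walk")` and `lemma_class("asked", "ask")` yield the same label, which a classifier could then predict for unseen word forms such as "jumped".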
Data-driven grammatical function tag assignment has been studied for English using the Penn-II Treebank data. In this paper we address the question of whether such methods can be applied successfully to other languages and treebank resources. In addition to tag assignment accuracy and f-scores we also present results of a task-based evaluation. We use three machine-learning methods to assign Cast3LB function tags to sentences parsed with Bikel's parser trained on the Cast3LB treebank. The best performing method, SVM, achieves an f-score of 86.87% on gold-standard trees and 66.67% on parser output, a statistically significant improvement of 6.74% over the baseline. In a task-based evaluation we generate LFG functional structures from the function-tag-enriched trees. On this task we achieve an f-score of 75.67%, a statistically significant 3.4% improvement over the baseline.
Treebank utilities (21-03-2007): A few commands useful when dealing with trees in Penn Treebank format. README file. Linux x86 binary
Some photos: here and here. I (very) occasionally write something in my blog.