Initially, automatic definition extraction was addressed mainly with pattern-based approaches. These approaches are often based on regular expressions that capture patterns common in definitions (Joho and Sanderson, 2000; Prager et al., 2002) or on a more complex representation of the lexico-syntactic patterns of sentences (Muresan and Klavans, 2002; Saggion, 2004; Storrer and Wellinghoff, 2006; Walter and Pinkal, 2006). More recently, pattern-based approaches have been complemented with, or completely replaced by, machine learning approaches (Blair-Goldensohn et al., 2004; Miliaraki and Androutsopoulos, 2004; Fahmi and Bouma, 2006). A pattern-based approach inspired by the work of Muresan and Klavans (2002) has been adopted (Westerhout and Monachesi, 2007a). During the development of this approach, a number of shortcomings emerged that are difficult to address with patterns alone. The most important problem is that definition patterns also occur frequently in non-definitions.
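The overgeneration problem can be illustrated with a minimal sketch. The regular expression below is a hypothetical "X is a Y" pattern written for this illustration only; it is not taken from any of the cited systems, and the example sentences are invented.

```python
import re

# Hypothetical "X is/are a/an/the Y" pattern (an assumption for illustration,
# not a pattern from any of the cited systems).
IS_A_PATTERN = re.compile(
    r"^(?P<term>[A-Z][\w -]*?) (?:is|are) (?:a|an|the) (?P<gloss>.+)\.$"
)

def match_definition(sentence):
    """Return (term, gloss) if the sentence fits the pattern, else None."""
    m = IS_A_PATTERN.match(sentence)
    if m is None:
        return None
    return m.group("term"), m.group("gloss")

# Invented example sentences.
sentences = [
    "A browser is a program used to view web pages.",  # genuine definition
    "This is a problem we return to later.",           # same surface form, no definition
    "Open the file before editing it.",                # does not fit the pattern
]
results = [match_definition(s) for s in sentences]
```

The first two sentences both match, although only the first is a definition; the surface pattern alone cannot tell them apart.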
To solve this problem, a machine learning approach is applied after the pattern-based approach as a filtering or refining step (Westerhout and Monachesi, 2007b, 2008; Westerhout, 2009a,b). The advantage of applying machine learning techniques is that the system can be used to investigate which of a set of manually selected features are best at distinguishing definitions from non-definitions.
Work carried out on the automatic creation of glossaries tends to be rule-based, relying mainly on part-of-speech (POS) information as the main linguistic feature. Park et al. (2002) extract glosses from technical texts (such as computer manuals), using additional linguistic information added to the texts to identify possible glosses. For this purpose they use several external tools, including POS tagging and morphological analysis, to add the linguistic information to the texts. They manually identify the linguistic sequences that could constitute glosses and describe the sequences as cascaded finite-state transducers. Since the sequences sometimes also capture non-gloss items, Park et al. include rules to discard certain forms from the candidate set.
Muresan & Klavans (2002) propose a rule-based system, DEFINDER, to extract definitions from technical medical texts, which can then be fed into a dictionary. The corpora used consist of consumer-oriented texts, in which a medical term is paraphrased in general-language words.
Malaise et al. (2004) attempt definition extraction with the purpose of extracting the semantic relations present in definitions. In this work the extraction is carried out on an anthropological corpus consisting of different formats, and the evaluation corpus is in the field of dietetics, a medical type of resource. They apply lexico-syntactic patterns in addition to cue phrases, focusing on hypernym and synonym relations in sentences. The authors conclude that the lexical markers used to extract definitions and their relations are sometimes limited to domain-specific contexts, and thus can be reused only within a particular domain. The rules can be applied to a new domain to try to discover new pairs of definitions and relations; however, they might not be as effective there as in the domain for which they were created.
Storrer & Wellinghoff (2006) report on work carried out in definition extraction from technical texts using valency frames. Valency frames contain linguistic information indicating what arguments a verb takes, such as object, subject, position and prepositions. Frames are used to match the structures of sentences, so that definitions are extracted using rules centred around verbs. This is a rule-based, expert-driven approach, with all information (valency frames, definition categorisation) being provided by human experts. The evaluation was carried out on a corpus containing manually identified definitions.
Walter & Pinkal (2006) work on definition extraction from German court decisions using parsing based on linguistic information. Their corpus consists of court decisions, restricted to environmental law because of the domain-specific terminology in legal texts. They identify two styles of definitions: (i) normative knowledge is described as “text that connects legal consequences to descriptions of certain facts”, and (ii) terminological knowledge “consists in definitions of some of the concepts used in the first descriptions”. They also observe that definitions contain broad linguistic variation and that simple keyword or pattern matching would not suffice to extract such sentences. On the other hand, by recognising the similarities between the structural elements, and with added linguistic knowledge, the task of definition extraction becomes more effective.
Przepiorkowski et al. (2007) attempt definition extraction for a group of Slavic languages (Bulgarian, Czech and Polish). They use eLearning texts and apply rule-based techniques to extract definitions. They classify sentences according to linguistic types, and derive three sets of grammar rules for each language (sharing similarities between them). The result for Czech is higher owing to a technique that enables the capture of multi-sentence definitions. The authors note that, since there is no established evaluation methodology for definition extraction, they decided to count sentences that were only partially captured as definitions. Thus a grammar rule does not need to capture a definitional sentence in full; it can capture only part of the sentence, and the whole sentence is then presented as a definition. By counting such sentences as well, the results for Czech increased in precision and recall.
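The effect of counting partially captured sentences can be sketched as follows. This is an illustrative sentence-level evaluation written for this section, not the authors' evaluation code; the function name, the character-offset representation and the toy data are all assumptions.

```python
def sentence_level_eval(sentences, rule_spans, gold, lenient=True):
    """Score rule matches at sentence level.

    sentences:  list of (start, end) character offsets, one per sentence
    rule_spans: list of (start, end) offsets matched by grammar rules
    gold:       set of indices of sentences annotated as definitions
    lenient:    if True, any overlap marks the whole sentence as a definition;
                if False, a match must cover the sentence completely
    """
    predicted = set()
    for i, (s_start, s_end) in enumerate(sentences):
        for m_start, m_end in rule_spans:
            overlaps = m_start < s_end and m_end > s_start
            covers = m_start <= s_start and m_end >= s_end
            if (lenient and overlaps) or (not lenient and covers):
                predicted.add(i)
                break
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return predicted, precision, recall

# Toy data (invented): three sentences, two partial rule matches.
toy_sentences = [(0, 40), (40, 90), (90, 130)]
toy_matches = [(5, 20), (95, 100)]
toy_gold = {0, 1}

lenient_result = sentence_level_eval(toy_sentences, toy_matches, toy_gold, lenient=True)
strict_result = sentence_level_eval(toy_sentences, toy_matches, toy_gold, lenient=False)
```

On this toy input the strict setting predicts no sentence at all, whereas the lenient setting recovers one of the two gold definitions, which is the kind of difference the lenient counting is meant to capture.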
Fahmi & Bouma (2006) use medical pages from the Dutch Wikipedia as their corpus, starting with a rule-based approach to definition extraction and then turning to machine learning techniques to try to improve their results. They start by extracting all sentences containing the verb “to be”, together with some other grammatical restrictions, as the initial set of possible definition sentences. These sentences are then manually categorised into definitions and non-definitions. From these sentences they observe which features can be important for distinguishing definitional sentences from non-definitional ones. The features identified include (i) text properties (n-grams, root forms, and parentheses), (ii) sentence position, (iii) syntactic properties (position of the subject of the sentence and whether the subject contains a determiner) and (iv) named entities (location, organisation and person).
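A few of these feature types can be approximated with a short sketch. The code below is not Fahmi & Bouma's implementation: it uses an invented English sentence instead of Dutch, the function and feature names are hypothetical, and the determiner check on the first token is only a crude stand-in for a syntactic analysis of the subject. Root forms and named entities are omitted because they require external tools.

```python
import re

# English stand-ins chosen for illustration; the original work is on Dutch.
DETERMINERS = {"a", "an", "the"}
COPULA_FORMS = {"is", "are", "was", "were"}

def sentence_features(sentence, position):
    """Build a small feature dictionary for one candidate sentence.

    position is the index of the sentence in its document.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    bigrams = list(zip(tokens, tokens[1:]))
    return {
        "position": position,
        "has_parentheses": "(" in sentence and ")" in sentence,
        # Crude proxy for "the subject contains a determiner".
        "starts_with_determiner": bool(tokens) and tokens[0] in DETERMINERS,
        "first_bigram": bigrams[0] if bigrams else None,
        "has_copula": any(t in COPULA_FORMS for t in tokens),
    }

# Invented example sentence.
features = sentence_features(
    "Malaria is an infectious disease (caused by parasites).", 0
)
```

Feature dictionaries of this kind could then be passed to any standard classifier trained on the manually categorised sentences.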