The Glawinette Lexicon

Nabil Hathout

Glawinette is a French derivational lexicon created from the GLAWI electronic dictionary. Glawinette’s entries are pairs of morphologically related lexemes like accomplir_V:accomplissement_N, where V and N are the parts of speech of the lexemes. For each entry, Glawinette provides the word family (morphological family) it belongs to and a characterization of the derivational relation between the two lexemes of the pair.

The relations are characterized by means of two patterns. The first is a broad alternation pattern (BAP), made up of two regular expressions which describe the most general form relation that exists between the two words, such as ^(.+)r:^(.+)ssement for accomplir_V:accomplissement_N, where the sequence (.+) represents the character string accompli. The second is a fine-grained alternation pattern (FAP), made up of two regular expressions which describe a form relation between the two words that uses better motivated derivational exponents, such as ^(.+)ir:^(.+)issement for accomplir_V:accomplissement_N, where the sequence (.+) represents the character string accompl.
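The difference between the two patterns can be illustrated with a minimal Python sketch. It assumes that a pattern is stored as two regular expressions separated by a colon, as in the examples above, and that each side must match its whole word; the helper name shared_stem is hypothetical and is not part of the resource.

```python
import re

def shared_stem(pattern, base, derived):
    """Return the string captured by (.+) on both sides of the pattern,
    or None if the word pair does not fit the pattern."""
    left, right = pattern.split(":")
    # Assumption: each regular expression must cover its whole word.
    m_base = re.fullmatch(left, base)
    m_derived = re.fullmatch(right, derived)
    if m_base is None or m_derived is None:
        return None
    # Both words must share the same captured string.
    if m_base.group(1) != m_derived.group(1):
        return None
    return m_base.group(1)

BAP = "^(.+)r:^(.+)ssement"
FAP = "^(.+)ir:^(.+)issement"

print(shared_stem(BAP, "accomplir", "accomplissement"))  # accompli
print(shared_stem(FAP, "accomplir", "accomplissement"))  # accompl
```

Both patterns accept the pair; they differ only in where they place the boundary between the shared string and the exponents.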

Glawinette contains 156090 lexeme pairs which are divided into 15843 word families and 5384 derivational series.

The resource was built from the relations in the morphological sections of GLAWI and from the morphological definitions of the dictionary, i.e., definitions like that of accomplissement, which contain a member of the derivational family of the defined word (in this case accomplir):

accomplissement : action d’accomplir ou résultat de cette action

The creation of Glawinette actually uses all the pairs, made up of a defined word and a word of its definition, that can be formed from the definitions of GLAWI. This list is then filtered so that only morphologically related word pairs are kept. Pair filtering and pattern computation are performed by means of proportional analogies. In a first step, an analogy signature (BAP) is assigned to each pair. These signatures are used to exclude the pairs that form fewer than 5 analogies with other pairs extracted from the dictionary. In a second step, we refine the remaining pairs and calculate the FAPs as follows: 1. the pairs of each analogy series are separated into two sets of words; 2. we compute for each set of words the patterns (regular expressions) that describe at least 5 words of the set; 3. word patterns are aligned to form pattern pairs; 4. only word pairs whose patterns describe at least 10% of the series are retained; 5. FAPs are computed by selecting for each word pair the pattern that is most connected to patterns of other pairs.
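The first filtering step can be sketched as follows. This is a simplified illustration and not the procedure used to build the resource: it assumes that a signature is obtained by stripping the longest common prefix of the two words (so it only covers suffix alternations), and that two pairs form an analogy when they share the same signature. The function names and the toy candidate list are hypothetical.

```python
from collections import defaultdict

MIN_ANALOGIES = 5  # threshold stated in the text

def signature(w1, w2):
    """Build a BAP-like signature by removing the longest common prefix.
    Assumption: only suffix alternations are handled in this sketch."""
    i = 0
    while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
        i += 1
    return f"^(.+){w1[i:]}:^(.+){w2[i:]}"

def filter_pairs(pairs, min_analogies=MIN_ANALOGIES):
    """Keep the pairs that share their signature with at least
    min_analogies other pairs."""
    series = defaultdict(list)
    for w1, w2 in pairs:
        series[signature(w1, w2)].append((w1, w2))
    kept = []
    for members in series.values():
        # Each pair forms one analogy with every other pair of its series.
        if len(members) - 1 >= min_analogies:
            kept.extend(members)
    return kept

# Toy candidates: six pairs in -ir / -issement and one unrelated pair
# taken from the definition of accomplissement.
candidates = [
    ("accomplir", "accomplissement"),
    ("abrutir", "abrutissement"),
    ("agrandir", "agrandissement"),
    ("amortir", "amortissement"),
    ("avertir", "avertissement"),
    ("blanchir", "blanchissement"),
    ("accomplissement", "action"),
]

kept = filter_pairs(candidates)
print(len(kept))  # 6
```

In this toy run, the pair (accomplissement, action) has a signature shared with no other pair and is discarded, while the six verb and noun pairs support one another and are kept.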

A manual evaluation of Glawinette performed on 200 randomly selected word pairs shows that more than 99% of the word pairs in the resource are morphologically related and that 75% of them have a FAP that matches what a linguist would assign as exponents to both words in the pair.

Glawinette is made available in TSV and JSON formats. It is released under a Creative Commons BY-SA 3.0 license.