Organizing and improving the Démonette database using FCA (formal concept analysis)

Nyoman Juniarta, Yannick Toussaint

Download the programs

Introduction

In this work, we applied FCA to Démonette with two objectives. The first is to systematically represent the relation among derivational families. This representation should allow us to observe which families share the same set of derivations, and to see which family’s derivation set is more complex than another family’s. The second objective is to detect families having anomalies, i.e. families having either missing or incorrect derivations.

Results

Each family is represented as a derivational graph. Each node carries two attributes: a lexeme and its part of speech. An edge corresponds to a derivation between two lexemes and is described by its orientation (direct, indirect, or undecidable) and its morphological pattern (e.g. X-Xeur). Démonext currently contains 25,444 families. For each family, we also define a “fingerprint”: a graph with the same structure as the family’s derivational graph but without the lexemes. Consequently, several families can share the same fingerprint. The 25,444 families yield 6,657 unique fingerprints. We applied FCA, more specifically an AOC-poset (the partially ordered set of attribute-concepts and object-concepts), to obtain the poset of families and fingerprints.
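As a rough illustration of the fingerprint idea, the Python sketch below strips the lexemes from each derivation and groups families whose remaining edge descriptions coincide. The tuple layout, the function names, and the orientation and pattern given to cramer and roder are assumptions made for this example. It is also a simplification: the fingerprint is reduced to a multiset of edges, so it ignores which edges share a node, whereas a fingerprint as defined above keeps the full graph structure.

```python
from collections import defaultdict

# Assumed layout of one derivation (not the actual Démonext schema):
# (source lexeme, source POS, target lexeme, target POS, orientation, pattern)

def fingerprint(derivations):
    """Drop the lexemes; keep POS, orientation and pattern of every edge.

    A sorted tuple is used so that repeated edges are preserved (a multiset).
    """
    return tuple(sorted(
        (src_pos, tgt_pos, orientation, pattern)
        for _src, src_pos, _tgt, tgt_pos, orientation, pattern in derivations
    ))

def group_by_fingerprint(families):
    """Map each fingerprint to the list of families that share it."""
    groups = defaultdict(list)
    for name, derivations in families.items():
        groups[fingerprint(derivations)].append(name)
    return dict(groups)

families = {
    # Orientation and pattern of these two edges are illustrative guesses.
    "cramer": [("cramer", "V", "cramage", "N", "direct", "X-Xage")],
    "roder": [("roder", "V", "rodage", "N", "direct", "X-Xage")],
    # Two direct X-Xion derivations from the same verb.
    "détracter": [
        ("détracter", "V", "détraction", "N", "direct", "X-Xion"),
        ("détracter", "V", "détractation", "N", "direct", "X-Xion"),
    ],
}

groups = group_by_fingerprint(families)
for fp, names in groups.items():
    print(len(names), sorted(names), fp)
```

A faithful implementation would probably compare the lexeme-free graphs up to isomorphism instead of comparing edge multisets.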

Figure 1: AOC-poset (in black) and the fingerprints (in blue).

An example of an AOC-poset built from five families is shown in Figure 1. We see that f1 “grows” by adding an indirect X-X derivation to become f2, or by adding two derivations to become f3. From the poset in Figure 1, we can observe the relations among families. The family cramer (with only one derivation, between cramerV and cramageN) and the family roder (with only one derivation, between roderV and rodageN) share the same set of derivations. These two families are less complex than haubaner and jaunir. Finally, the derivations of ajouter are a combination of the derivations of haubaner and jaunir.
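To make the construction concrete, here is a minimal Python sketch of an AOC-poset computed from a formal context in which families are the objects and derivations are the attributes. The attribute labels d1 to d4 are placeholders chosen only to reproduce the inclusions described above; they are not the actual derivations of these families, and the function names are hypothetical.

```python
def aoc_poset(context):
    """Return the object-concepts and attribute-concepts of a formal context.

    `context` maps each object (family) to a set of attributes (derivations).
    Each concept is a pair (extent, intent) of frozensets.
    """
    context = {g: frozenset(attrs) for g, attrs in context.items()}
    objects = set(context)
    attributes = set().union(*context.values())

    def extent(attrs):
        # All objects that have every attribute in `attrs`.
        return frozenset(g for g in objects if attrs <= context[g])

    def intent(objs):
        # All attributes shared by every object in `objs`.
        shared = set(attributes)
        for g in objs:
            shared &= context[g]
        return frozenset(shared)

    concepts = set()
    for g in objects:        # object-concepts
        concepts.add((extent(context[g]), context[g]))
    for m in attributes:     # attribute-concepts
        ext = extent(frozenset([m]))
        concepts.add((ext, intent(ext)))
    return concepts

def covers(concepts):
    """Edges (lower, upper) of the Hasse diagram, ordered by extent inclusion."""
    return [
        (lo, hi)
        for lo in concepts
        for hi in concepts
        if lo[0] < hi[0]
        and not any(lo[0] < mid[0] < hi[0] for mid in concepts)
    ]

# Placeholder attributes: d1..d4 stand for unspecified derivations.
context = {
    "cramer": {"d1"},
    "roder": {"d1"},
    "haubaner": {"d1", "d2"},
    "jaunir": {"d1", "d3", "d4"},
    "ajouter": {"d1", "d2", "d3", "d4"},
}

concepts = aoc_poset(context)
edges = covers(concepts)
for ext, itt in sorted(concepts, key=lambda c: len(c[1])):
    print(sorted(itt), "<-", sorted(ext))
```

In this toy context, cramer and roder fall into the same concept, haubaner and jaunir each refine it, and ajouter sits below both, which mirrors the reading of Figure 1 given above.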

By exploring the poset and the number of families corresponding to each fingerprint, we can detect anomalies (missing or false derivations). One missing derivation that we found concerns the family orpailleur, which has only two lexemes, orpaillageN and orpailleurN, with an indirect derivation between them. This is considered an anomaly since many families having that derivation also contain a verb. We therefore propose adding the lexeme orpaillerV and the corresponding derivations.
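One possible heuristic for surfacing such candidates, not necessarily the procedure used in this work, is to look for rare fingerprints that are strictly contained in a much more frequent one and to report the derivations that separate them. In the Python sketch below, the thresholds, the derivation labels, and the family counts are all invented for illustration, and fingerprints are simplified to sets of derivation labels.

```python
def suggest_missing(counts, rare_max=2, frequent_min=10, max_added=2):
    """Propose derivations that a rare fingerprint may be missing.

    `counts` maps a fingerprint (frozenset of derivation labels) to the
    number of families that have it. All thresholds are assumptions.
    """
    suggestions = []
    for small, n_small in counts.items():
        if n_small > rare_max:
            continue  # only rare fingerprints are suspicious
        for big, n_big in counts.items():
            added = big - small
            if small < big and n_big >= frequent_min and len(added) <= max_added:
                suggestions.append((small, sorted(added), n_big))
    return suggestions

# Invented counts: a noun-only pair versus the same pair with its verb.
noun_pair = frozenset({"N-N indirect Xage-Xeur"})
with_verb = frozenset({
    "N-N indirect Xage-Xeur",
    "V-N direct X-Xage",
    "V-N direct X-Xeur",
})
toy_counts = {noun_pair: 1, with_verb: 40}

for small, added, support in suggest_missing(toy_counts):
    print(sorted(small), "may lack", added, "seen in", support, "families")
```

Any suggestion produced this way is only a candidate; whether the derivation is really missing remains a linguistic judgment.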

Furthermore, we observe a case of possibly false derivations in the family détracter. This family has two direct X-Xion derivations from the same verb: détracterV → détractionN and détracterV → détractationN, while several other families contain only one X-Xion derivation from a verb. However, we found that certain families have additional valid derivations that reflect spelling variants, e.g. essuyement-essuiement, débuscage-débusquage, etc., and these should not be regarded as incorrect derivations.

These findings were presented to linguists for validation. We built a web application, which can be accessed via https://github.com/nyomanjuniarta/Demonext-web. It displays a family with anomalies and a “normal” family side by side, to help linguists decide whether derivations are actually missing or incorrect.