Objectives

Abstract

The Demonext project aims to build a French morphological database (MDB) that describes the derivational properties of words systematically. The MDB will meet multiple needs, such as the empirical confirmation of morphological hypotheses and the elaboration of new ones, the design of natural language processing (NLP) tools, vocabulary teaching, and the treatment of developmental or acquired language disorders.

The lexicon of a language like French is composed mainly of morphologically complex words: prefixed, suffixed, converted or compound. This structural information is generally available in the etymological sections of dictionaries, but the variability of its formulation makes it difficult to exploit. For languages such as English, German, Dutch or Czech, there are morphological databases (MDBs) that describe the derivational properties of words in a systematic way: CELEX, CatVar, DerivBase, etc. This information is essential because much other information can be inferred from it, most importantly the meaning of these words. For French, there is currently a prototype MDB, the Demonette database, developed by the two main partners of the project, which can be regarded as an exploratory study for the present project. A wide-coverage French MDB with rich and reliable descriptions would make it possible to meet multiple needs, such as empirical confirmation and hypothesis development in morphology, the development of NLP tools, vocabulary teaching, and the diagnosis and treatment of developmental or acquired lexical disorders.

To meet these challenges, we propose to build the Demonext MDB. This large-scale resource will provide rich descriptions of lexemes (i.e. lexical units), of derivational relations, and of the paradigms in which they fit. It will represent information explicitly and uniformly, ensure systematic traceability of all the information it provides, and be compatible with the main current morphological theories (morpheme-based, lexeme-based and paradigm-based).

Methods / Approaches

The principles underlying this resource will give it an original organization compared to existing MDBs. An entry in Demonext corresponds to a morphological relation between two lexemes. The set of relations that a lexeme shares with its morphological relatives defines its derivational family. For example, NATION forms a family with NATIONAL, INTERNATIONAL, NATIONALITY, NATIONALIZATION, INTERNATIONALIZATION, etc. An even more original feature of Demonext is that it will describe, on a large scale, the derivational paradigms that structure the lexicon and organize it into interconnected networks. For example, any relation that follows the X ↔ XAL schema, where X is a noun, is part of a network that can be generalized as a quadruplet {X, XAL, XALISER, XALISATION}.
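The organization described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual Demonext format: the relation list, the function names and the inclusion of NATIONALIZE are assumptions made for the example. Each entry is a pair of lexemes, a derivational family is computed as the set of lexemes reachable through such relations, and the paradigm {X, XAL, XALISER, XALISATION} is instantiated by naive string substitution, which ignores any stem allomorphy.

```python
from collections import defaultdict

# Hypothetical entries: each one is a morphological relation between two lexemes.
RELATIONS = [
    ("NATION", "NATIONAL"),
    ("NATIONAL", "INTERNATIONAL"),
    ("NATIONAL", "NATIONALITY"),
    ("NATIONAL", "NATIONALIZE"),
    ("NATIONALIZE", "NATIONALIZATION"),
    ("INTERNATIONAL", "INTERNATIONALIZATION"),
    ("REGION", "REGIONAL"),
]

# The paradigm quoted in the text, with X standing for the base noun.
PARADIGM = ("X", "XAL", "XALISER", "XALISATION")


def build_graph(relations):
    """Index the relations as an undirected graph over lexemes."""
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def derivational_family(lexeme, relations):
    """Return every lexeme connected to `lexeme` by a chain of relations."""
    graph = build_graph(relations)
    family, stack = set(), [lexeme]
    while stack:
        current = stack.pop()
        if current in family:
            continue
        family.add(current)
        stack.extend(graph[current] - family)
    return family


def instantiate_paradigm(base):
    """Fill the paradigm cells for a base noun by plain concatenation."""
    return tuple(cell.replace("X", base, 1) for cell in PARADIGM)


if __name__ == "__main__":
    print(sorted(derivational_family("NATION", RELATIONS)))
    print(instantiate_paradigm("NATION"))
```

Treating the family as a connected component keeps the entry format minimal (one relation per entry) while still allowing the whole family to be recovered on demand.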

Demonext also differs from existing MDBs in another notable respect: each entry will carry semantic information. Morphological relations will be semantically annotated, and the words they link will be assigned semantic types. Relations will be annotated by means of glosses that define one of the words relative to the meaning of the other. For example, NATIONALIZATION can be defined in relation to NATIONALIZE by a gloss such as “action of nationalizing”. The morpho-semantic typing of the lexemes connected by a relation (such as CAUSE_CHANGE for NATIONALIZE or ACTION for NATIONALIZATION) will be based on the content of the FrameNet network, which has an extensive set of types.
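A semantically annotated entry of this kind could be represented as follows. This is a minimal sketch under stated assumptions: the class names, field names and the category codes "V" and "N" are hypothetical, and only the lexemes, semantic types and the idea of a gloss come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    lemma: str
    category: str       # part of speech (hypothetical codes: "V", "N")
    semantic_type: str  # morpho-semantic type, e.g. "ACTION"


@dataclass(frozen=True)
class Relation:
    source: Lexeme  # the lexeme the gloss refers to
    target: Lexeme  # the lexeme being defined
    gloss: str      # defines the target relative to the source


nationalize = Lexeme("NATIONALIZE", "V", "CAUSE_CHANGE")
nationalization = Lexeme("NATIONALIZATION", "N", "ACTION")
entry = Relation(nationalize, nationalization, "action of nationalizing")


def describe(relation):
    """Render one annotated entry as a single readable line."""
    return (
        f"{relation.target.lemma} [{relation.target.semantic_type}]: "
        f"{relation.gloss} (from {relation.source.lemma} "
        f"[{relation.source.semantic_type}])"
    )


if __name__ == "__main__":
    print(describe(entry))
```

Keeping the semantic type on the lexeme and the gloss on the relation mirrors the distinction drawn in the text between typing words and annotating the relations that link them.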

One of the principles guiding the design of Demonext is that it can be fed by a variety of French lexical resources, as long as they can be freely redistributed. These resources will be integrated cumulatively into Demonext, the format of the knowledge they contain will be unified, and important missing information will be computed automatically when possible.
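Cumulative integration with a unified format can be illustrated with a short sketch. The resource names, the normalization step (upper-casing lemmas) and the function name are assumptions made for this example and do not describe the project's actual pipeline. The point shown is that each merged relation keeps the list of resources it came from, so that its origin remains traceable.

```python
def merge_resources(resources):
    """Merge relation lists from several resources into one table.

    `resources` maps a resource name to an iterable of (lexeme_a, lexeme_b)
    pairs. The result maps each normalized pair to the list of resources
    that attest it, in the order in which the resources were supplied.
    """
    merged = {}
    for name, pairs in resources.items():
        for a, b in pairs:
            key = (a.upper(), b.upper())  # unify the notation of lemmas
            sources = merged.setdefault(key, [])
            if name not in sources:
                sources.append(name)
    return merged


if __name__ == "__main__":
    # Hypothetical input resources with different casing conventions.
    resources = {
        "resource_A": [("nation", "national"), ("national", "nationaliser")],
        "resource_B": [("NATION", "NATIONAL")],
    }
    for pair, sources in merge_resources(resources).items():
        print(pair, sources)
```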

Expected Results

Demonext will thus be a large-scale MDB with an original structure of interconnected networks, whose edges and vertices will carry a variety of information: morphosemantic, morphophonological, derivational, statistical, etc. Demonext will also be able to offer a wide range of services. A second outcome of the project is a set of teaching tools and materials, such as collections of exercises and tests. These by-products of Demonext will serve as examples of its possible uses and of its expected societal impact for primary and secondary teachers, students and higher-education teaching staff, speech-language pathologists, and specialists in constructional morphology and in statistical modelling of the lexicon. Demonext will be distributed under a free Creative Commons license and will be made accessible to the various categories of users through interfaces suited to the intended use: querying, editing and visualization interfaces for specialized audiences, and simplified, ergonomic access for the general public. It will be available for download via the EQUIPEX Ortolang and the REDAC platform.

Because Demonext is a database hosting an annotated morphological network with derivational, formal, semantic and frequency descriptions, we expect it to have an impact in several scientific and social fields. Demonext will offer linguists (morphologists, psycholinguists, L1 or L2 didacticians) an experimental field with extensive coverage, together with a wide range of information, from statistical measurements to semantic properties, morphological decompositions, and categorial and phonological characteristics.

Perspectives

In morphology research, Demonext will contribute to the emergence of a more quantitative and experimental morphology, by enabling large-scale testing of hypotheses and the development of new ones. It will also make it possible to improve the visibility of the results of studies on derivation in French and probably lead to more formalized analyses.

The statistical modelling of competition between processes will bring not only a better understanding of the structure and dynamics of the French derivational system, but also the tools and methods to explore and model this system.

In higher education, the production of representations in a variety of formalisms will allow the development of exercises for MOOCs.

In NLP, the breadth of its coverage and the richness of its content will favour its integration into processing chains for information retrieval, data mining, sentiment analysis, etc. Semantic descriptions will be useful for creating terminologies and exploiting corpora.

In pedagogy, Demonext will help diversify the vocabulary teaching techniques available to primary school teachers, through the introduction of specific vocabulary acquisition techniques based on research data.

Finally, in speech and language therapy, the resource will enable the development of evaluation and therapy materials focused on the morphological level, either to improve this level of processing when it is deficient or, when it is preserved, to mobilize it in the development of compensatory strategies.