Here's a list of my academic outputs, sorted in reverse chronological order within each section:
Thesis
-
Dependency as Modality, Parsing as Permutation: A Neurosymbolic Perspective on Categorial Grammars (PhD Thesis).
LOT Dissertation Series.
June 2023.
[ paper ]
Since their inception, categorial grammars have been front runners in the quest for a formally elegant, computationally attractive and adequately flexible theory of linguistic form and meaning. As a result of developments in theoretical computer science, Lambek-style categorial grammars have gradually been recognized for what they truly are: type systems proper. Words enact typed constants, and interact with one another by means of grammatical rules enacted by type inferences, composing larger phrases in the process. The end result is at the same time a parse, a proof and a program, bridging the seemingly disparate fields of linguistics, formal logic and computer science; a testament to the holy triptych of language, logic and computation. The transition from form to meaning is traditionally handled in a Montague-style fashion via a series of homomorphic translations that gradually remove or simplify nuances of the syntactic type calculus to move towards a uniform and expressive semantic calculus. Alluring as this might be, it poses pragmatic problems for the whole programme to come to fruition. For the setup to work on the semantic level, one has no choice but to start from the hardest part, namely the type-theoretic treatment of natural language syntax. Phenomena like movement, word-order variation, discontinuities and the like require careful treatment that needs to be both general enough to encompass the full range of grammatical utterances, yet strict enough to ward off ungrammatical derivations.

Breaking away from tradition, this thesis takes an operational shortcut in targeting a “deeper” calculus of grammatical composition, engaging only minimally with surface syntax. Where previously functional syntactic types would be position-conscious, requiring their arguments in predetermined positions upon a binary tree, they are now agnostic to both tree structure and sequential order, alleviating the need for fine-grained syntactic refinements. This simplification comes at the cost of a misalignment between provability and grammaticality: the laxer semantic calculus permits more proofs than linguistically allowed. To partially circumvent this underspecification, the thesis takes an additional step away from the established norm, proposing the incorporation of unary type operators, extending the analytical axis from plain function-argument structures to function-argument structures with fixed grammatical roles. The new type calculus produces mixed unary/n-ary trees, each unary tree denoting a dependency domain, and each n-ary tree underneath it denoting the phrases which together form that domain. Although still underspecified, these peculiar structures directly subsume non-projective labeled dependency trees. More than that, they have their roots set firmly in type theory, paving the way to their meaningful semantic interpretation.
On more practical grounds, and in order to investigate the formalism’s expressive adequacy, an extraction algorithm is designed and employed to convert syntactic analyses of Dutch sentences represented as dependency graphs (stemming from the Lassy Small corpus) into proofs of the target logic. The vast majority of input analyses are successfully handled, giving rise to a large and versatile proofbank, a collection of sentences paired with tectogrammatic theorems and their corresponding programs, and an elaborate type lexicon, providing type assignments to almost one million lexical tokens within a given linguistic context.
The proofbank and the underlying lexicon both find use as training data in the design and implementation of a neurosymbolic proof search system able to efficiently navigate the logic’s expansive theorem space. The system consists of three major components that alternate roles within the processing pipeline. Component number one is a supertagger responsible for assigning a type to each input word. The tagger is formulated on the basis of a hyper-efficient heterogeneous graph convolution kernel that boasts state-of-the-art accuracy among categorial grammar datasets. Rather than produce type assignments in the form of conditional probabilities over a predefined type vocabulary, the supertagger instead constructs types dynamically, following their algebraic decomposition. As such, it is unconstrained by sparsity and data under-representation, generalizing well to rare assignments and even producing correct assignments for types never seen during training. Component number two is a neural permutation module that exploits the linearity constraint of the target logic in order to recast proof search as optimal transport learning, associating resources (conditional validities) with the processes that require them (conditions). This reformulation allows for a massively parallel and easily optimizable implementation, unobstructed by the structure-manipulation breaks common in conventional parsers. Component number three is the type system itself, responsible for navigating the produced structures and thus asserting their well-formedness. Results suggest efficiency superior to, and performance on par with, established baselines across categorial formalisms, despite the ambiguity inherent to the logic.
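To give a concrete flavour of what the third component does, here is a minimal sketch of type checking as implication elimination over simplified function types. This is an illustrative toy, not the thesis implementation: the real calculus is order-agnostic and decorated with unary dependency modalities, both omitted here.

```python
# Toy type checker: a function type consumes a matching argument
# (A -> B applied to A yields B). Names and syntax are illustrative
# stand-ins, not the thesis's actual code.
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    name: str                      # atomic type, e.g. 'np' or 's'

@dataclass(frozen=True)
class Arrow:
    argument: Type
    result: Type

Type = Atom | Arrow

def apply(fn: Type, arg: Type) -> Type:
    """Implication elimination: applying A -> B to A yields B."""
    if isinstance(fn, Arrow) and fn.argument == arg:
        return fn.result
    raise TypeError(f'cannot apply {fn} to {arg}')

# A transitive verb typed np -> np -> s; two applications yield s.
loves = Arrow(Atom('np'), Arrow(Atom('np'), Atom('s')))
assert apply(apply(loves, Atom('np')), Atom('np')) == Atom('s')
```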
Conference & Workshop Proceedings
-
Nominal Class Assignment in Swahili: A Computational Account.
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024).
October 2024.
[ paper ]
We discuss the open question of the relation between semantics and nominal class assignment in Swahili. We approach the problem from a computational perspective, aiming first to quantify the extent of this relation, and then to explicate its nature, taking extra care to suppress morphosyntactic confounds. Our results are the first of their kind, providing a quantitative evaluation of the semantic cohesion of each nominal class, as well as a nuanced taxonomic description of its semantic content.
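One straightforward way to put a number on semantic cohesion, offered here as a hedged sketch rather than the paper's exact protocol, is to train a linear probe from semantic vectors to nominal classes and read its held-out accuracy as a measure of semantic predictability:

```python
# Hypothetical probe: random vectors stand in for real noun embeddings and
# class labels; with real data, accuracy above chance would indicate that
# semantics is predictive of nominal class.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300))  # one vector per noun lemma
classes = rng.integers(0, 15, size=1000)   # nominal class labels (placeholder)

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, embeddings, classes, cv=5)
print(f'mean held-out accuracy: {scores.mean():.3f}')  # ~ chance (1/15) here
```
-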
Learning Structure-Aware Representations of Dependent Types.
Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024, to appear).
December 2023.
[ paper ] [ code ]
Agda is a dependently-typed programming language and a proof assistant, pivotal in proof formalization and programming language theory. This paper extends the Agda ecosystem into machine learning territory, and, vice versa, makes Agda-related resources available to machine learning practitioners. We introduce and release a novel dataset of Agda program-proofs that is elaborate and extensive enough to support various machine learning applications, the first of its kind. Leveraging the dataset’s ultra-high resolution, detailing proof states at the sub-type level, we propose a novel neural architecture targeted at faithfully representing dependently-typed programs on the basis of structural rather than nominal principles. We instantiate and evaluate our architecture in a premise selection setup, where it achieves strong initial results.
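As a hedged illustration of the premise selection setup (not the paper's architecture; random vectors stand in for the structural representations it learns), selection reduces to ranking candidate lemmas against the current goal:

```python
# Premise selection as retrieval: embed the goal and all candidate lemmas,
# then rank lemmas by similarity to the goal. Purely illustrative.
import torch

goal = torch.randn(128)             # embedding of the current proof state
lemmas = torch.randn(1000, 128)     # embeddings of candidate premises

scores = torch.nn.functional.cosine_similarity(lemmas, goal.unsqueeze(0))
print(scores.topk(5).indices)       # the five most promising premises
```
-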
Algebraic Positional Encodings.
Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024, to appear).
December 2023.
[ paper ] [ code ]
We introduce a novel positional encoding strategy for Transformer-style models, addressing the shortcomings of existing, often ad hoc, approaches. Our framework provides a flexible mapping from the algebraic specification of a domain to an interpretation as orthogonal operators. This design preserves the algebraic characteristics of the source domain, ensuring that the model upholds the desired structural properties. Our scheme can accommodate various structures, including sequences, grids and trees, as well as their compositions. We conduct a series of experiments to demonstrate the practical applicability of our approach. Results suggest performance on par with or surpassing the current state of the art, without hyperparameter optimization or “task search” of any kind. Code will be made available at https://github.com/konstantinosKokos/unitaryPE.
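The core idea admits a compact demonstration. In the simplified setting of a single 2x2 rotation as generator (an assumption for illustration; the paper's construction is more general), sequence positions become operator powers, and relative offsets fall out algebraically:

```python
# Position n is interpreted as Q^n for an orthogonal Q, so the product
# Q^m (Q^n)^T depends only on the offset m - n. Sketch only; see the
# repository above for the actual implementation.
import numpy as np

def rotation(theta: float) -> np.ndarray:
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

Q = rotation(0.1)                                    # orthogonal generator
pos = {n: np.linalg.matrix_power(Q, n) for n in range(8)}

m, n = 6, 2
np.testing.assert_allclose(pos[m] @ pos[n].T, pos[m - n], atol=1e-12)
```
-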
OYXOY: A Modern NLP Test Suite for Modern Greek.
Findings of the Association for Computational Linguistics: EACL 2024.
September 2023.
[ paper ] [ code ]
This paper serves as a foundational step towards the development of a linguistically motivated and technically relevant evaluation suite for Greek NLP. We initiate this endeavor by introducing four expert-verified evaluation tasks, specifically targeted at natural language inference, word sense disambiguation (through example comparison or sense selection) and metaphor detection. More than language-adapted replicas of existing tasks, we contribute two innovations which will resonate with the broader resource and evaluation community. Firstly, our inference dataset is the first of its kind, marking not just one, but rather all possible inference labels, accounting for possible shifts due to e.g. ambiguity or polysemy. Secondly, we demonstrate a cost-efficient method to obtain datasets for under-resourced languages. Using ChatGPT as a language-neutral parser, we transform the Dictionary of Standard Modern Greek into a structured format, from which we derive the other three tasks through simple projections. Alongside each task, we conduct experiments using currently available state-of-the-art machinery. Our experimental baselines affirm the challenging nature of our tasks and highlight the need for expedited progress in order for the Greek NLP ecosystem to keep pace with contemporary mainstream research.
-
SPINDLE: Spinning Raw Text into Lambda Terms with Graph Attention.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.
March 2023.
[ paper ] [ code ]
This paper describes SPINDLE, an open-source Python module providing an efficient and accurate parser for written Dutch that transforms raw text input to programs for meaning composition expressed as λ-terms. The parser integrates a number of breakthrough advances made in recent years. Its output consists of hi-res derivations of a multimodal type-logical grammar, capturing two orthogonal axes of syntax, namely deep function-argument structures and dependency relations. These are produced by three interdependent systems: a static type-checker asserting the well-formedness of grammatical analyses, a state-of-the-art, structurally-aware supertagger based on heterogeneous graph convolutions, and a massively parallel proof search component based on Sinkhorn iterations. Packed in the software are also handy utilities and extras for proof visualization and inference, intended to facilitate end-user utilization.
-
Diamonds Are Forever -- Theoretical and Empirical Support for a Dependency-Enhanced Type Logic.
Logic and Algorithms in Computational Linguistics 2021.
March 2023.
[ paper ] [ code ]
Extended Lambek calculi enlarge the type language with adjoint pairs of unary modalities. In previous work, modalities have been used as licensors for controlled forms of restructuring, reordering and copying. Here, we study a complementary use of the modalities as dependency features coding for grammatical roles. The result is a multidimensional type logic simultaneously inducing dependency and function-argument structure on the linguistic material. We discuss the new perspective on constituent structure suggested by the dependency-enhanced type logic, and we experimentally evaluate how well a neural language model like BERT can deal with the subtle interplay between logical and structural reasoning that this type logic gives rise to.
-
Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions.
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD).
May 2022.
[ paper ] [ code ]
The syntactic categories of categorial grammar formalisms are structured units made of smaller, indivisible primitives, bound together by the underlying grammar’s category formation rules. In the trending approach of constructive supertagging, neural models are increasingly made aware of the internal category structure, which in turn enables them to more reliably predict rare and out-of-vocabulary categories, with significant implications for grammars previously deemed too complex to find practical use. In this work, we revisit constructive supertagging from a graph-theoretic perspective, and propose a framework based on heterogeneous dynamic graph convolutions aimed at exploiting the distinctive structure of a supertagger’s output space. We test our approach on a number of categorial grammar datasets spanning different languages and grammar formalisms, achieving substantial improvements over previous state-of-the-art scores.
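To make “constructive” concrete, here is a hedged toy sketch of the structured output space: a category is a tree over a small primitive vocabulary, and its serialization is the symbol sequence a constructive tagger predicts, which is why unseen categories remain reachable:

```python
# Illustrative category syntax and prefix serialization; the notation is
# CCG-flavoured and not tied to any one of the paper's datasets.
from dataclasses import dataclass

@dataclass(frozen=True)
class Prim:
    name: str              # atomic category, e.g. 'np'

@dataclass(frozen=True)
class Slash:
    op: str                # '/' or '\\'
    left: object
    right: object

def serialize(cat) -> list[str]:
    if isinstance(cat, Prim):
        return [cat.name]
    return [cat.op] + serialize(cat.left) + serialize(cat.right)

# (s\np)/np, a transitive verb:
tv = Slash('/', Slash('\\', Prim('s'), Prim('np')), Prim('np'))
print(serialize(tv))       # ['/', '\\', 's', 'np', 'np']
```
-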
Discontinuous Constituency and BERT: A Case Study of Dutch.
Findings of the Association for Computational Linguistics: ACL 2022.
May 2022.
[ paper ] [ code ]
In this paper, we set out to quantify the syntactic capacity of BERT in the evaluation regime of non-context-free patterns, as occurring in Dutch. We devise a test suite based on a mildly context-sensitive formalism, from which we derive grammars that capture the linguistic phenomena of control verb nesting and verb raising. The grammars, paired with a small lexicon, provide us with a large collection of naturalistic utterances, annotated with verb-subject pairings, that serve as the evaluation test bed for an attention-based span selection probe. Our results, backed by extensive analysis, suggest that the models investigated fail in the implicit acquisition of the dependencies examined.
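A minimal sketch of what such a probe can look like; the bilinear design, dimensions and names below are illustrative assumptions rather than the paper's exact probe:

```python
# Score each candidate subject span against the verb with a trainable
# bilinear map over frozen encoder vectors; only the probe is trained.
import torch

class SpanProbe(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.bilinear = torch.nn.Bilinear(dim, dim, 1)

    def forward(self, verb_vec: torch.Tensor, span_vecs: torch.Tensor):
        # verb_vec: (dim,); span_vecs: (n_candidates, dim)
        return self.bilinear(verb_vec.expand_as(span_vecs), span_vecs).squeeze(-1)

probe = SpanProbe(dim=768)
scores = probe(torch.randn(768), torch.randn(4, 768))
print(scores.softmax(-1))  # distribution over candidate subject spans
```
-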
A Logic-Based Framework for Natural Language Inference in Dutch.
Computational Linguistics in the Netherlands.
February 2022.
[ paper ] [ code ]
We present a framework for deriving inference relations between Dutch sentence pairs. The proposed framework relies on logic-based reasoning to produce inspectable proofs leading up to inference labels; its judgements are therefore transparent and formally verifiable. At its core, the system is powered by two λ-calculi, used as syntactic and semantic theories, respectively. Sentences are first converted to syntactic proofs and terms of the linear λ-calculus using a choice of two parsers: an Alpino-based pipeline, or Neural Proof Nets. The syntactic terms are then converted to semantic terms of the simply typed λ-calculus, via a set of hand-designed type- and term-level transformations. Pairs of semantic terms are then fed to an automated theorem prover for natural logic, which reasons with them while using the lexical relations found in the Open Dutch WordNet. We evaluate the reasoning pipeline on the recently created Dutch natural language inference dataset, and achieve promising results, remaining within a 1.1–3.2% performance margin of strong neural baselines. To the best of our knowledge, the reasoning pipeline is the first logic-based system for Dutch.
-
Fighting the COVID-19 Infodemic with a Holistic BERT Ensemble.
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda.
June 2021.
[ paper ] [ code ]
This paper describes the TOKOFOU system, an ensemble model for misinformation detection tasks based on six different transformer-based pre-trained encoders, implemented in the context of the COVID-19 Infodemic Shared Task for English. We fine-tune each model on each of the task’s questions and aggregate their prediction scores using a majority voting approach. TOKOFOU obtains an overall F1 score of 89.7%, ranking first.
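The aggregation step is simple enough to spell out. A sketch of the majority vote over the six per-model predictions (tie-breaking is left unspecified here; the model identities are placeholders):

```python
# Majority voting: the label predicted by most ensemble members wins.
from collections import Counter

def majority_vote(predictions: list[int]) -> int:
    (label, _), = Counter(predictions).most_common(1)
    return label

assert majority_vote([1, 0, 1, 1, 0, 1]) == 1  # six voters, label 1 wins
```
-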
Improving BERT Pretraining with Syntactic Supervision.
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD).
April 2021.
[ paper ] [ code ]
Bidirectional masked Transformers have become a core theme in the current NLP landscape. Despite their impressive benchmark performance, a recurring theme in recent research has been to question such models’ capacity for syntactic generalization. In this work, we seek to address this question by adding a supervised, token-level supertagging objective to standard unsupervised pretraining, enabling the explicit incorporation of syntactic biases into the network’s training dynamics. Our approach is straightforward to implement, incurs only a marginal computational overhead, and is general enough to adapt to a variety of settings. We apply our methodology on Lassy Large, an automatically annotated corpus of written Dutch. Our experiments suggest that our syntax-aware model performs on par with established baselines, despite Lassy Large being one order of magnitude smaller than commonly used corpora.
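A hedged sketch of the combined objective: the usual masked-language-modeling cross-entropy plus a token-level supertagging cross-entropy, jointly optimized. Shapes and the mixing weight alpha are illustrative assumptions:

```python
# Joint pretraining loss: MLM term plus supertagging term. Positions that
# should not contribute (unmasked tokens, padding) carry the target -100.
import torch
import torch.nn.functional as F

def joint_loss(mlm_logits, mlm_targets, tag_logits, tag_targets, alpha=1.0):
    # mlm_logits: (batch, seq, vocab); tag_logits: (batch, seq, n_tags)
    l_mlm = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_targets, ignore_index=-100)
    l_tag = F.cross_entropy(tag_logits.transpose(1, 2), tag_targets, ignore_index=-100)
    return l_mlm + alpha * l_tag

loss = joint_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)),
                  torch.randn(2, 8, 50), torch.randint(0, 50, (2, 8)))
print(loss)
```
-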
Neural Proof Nets.
Proceedings of the 24th Conference on Computational Natural Language Learning.
October 2020.
[ paper ] [ code ]
Linear logic and the linear λ-calculus have a long-standing tradition in the study of natural language form and meaning. Among the proof calculi of linear logic, proof nets are of particular interest, offering an attractive geometric representation of derivations that is unburdened by the bureaucratic complications of conventional proof-theoretic formats. Building on recent advances in set-theoretic learning, we propose a neural variant of proof nets based on Sinkhorn networks, which allows us to cast parsing as the problem of extracting syntactic primitives and permuting them into alignment. Our methodology induces a batch-efficient, end-to-end differentiable architecture that actualizes a formally grounded yet highly efficient neuro-symbolic parser. We test our approach on ÆTHEL, a dataset of type-logical derivations for written Dutch, where it manages to correctly transcribe raw text sentences into proofs and terms of the linear λ-calculus with an accuracy as high as 70%.
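The permutation machinery at the heart of the parser can be sketched compactly. Sinkhorn iterations in the log domain push an arbitrary score matrix towards a doubly-stochastic (soft permutation) matrix; the code below is illustrative, not the paper's implementation:

```python
# Log-domain Sinkhorn normalization: alternately normalize rows and columns.
import torch

def sinkhorn(log_scores: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """log_scores: (n, n) matching scores; returns a soft permutation."""
    z = log_scores
    for _ in range(n_iters):
        z = z - torch.logsumexp(z, dim=1, keepdim=True)  # rows sum to 1
        z = z - torch.logsumexp(z, dim=0, keepdim=True)  # columns sum to 1
    return z.exp()

p = sinkhorn(torch.randn(5, 5))
print(p.sum(0), p.sum(1))  # both close to 1: approximately doubly stochastic
```
-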
ÆTHEL: Automatically Extracted Typelogical Derivations for Dutch.
Proceedings of the 12th Language Resources and Evaluation Conference.
May 2020.
[ paper ] [ code ]
We present ÆTHEL, a semantic compositionality dataset for written Dutch. ÆTHEL consists of two parts. First, it contains a lexicon of supertags for about 900,000 words in context. The supertags correspond to types of the simply typed linear lambda-calculus, enhanced with dependency decorations that capture grammatical roles supplementary to function-argument structures. On the basis of these types, ÆTHEL further provides 72,192 validated derivations, presented in four formats: natural-deduction and sequent-style proofs, linear logic proof nets and the associated programs (lambda terms) for meaning composition. ÆTHEL’s types and derivations are obtained by means of an extraction algorithm applied to the syntactic analyses of LASSY Small, the gold standard corpus of written Dutch. We discuss the extraction algorithm and show how ‘virtual elements’ in the original LASSY annotation of unbounded dependencies and coordination phenomena give rise to higher-order types. We suggest some example use cases highlighting the benefits of a type-driven approach at the syntax-semantics interface. The following resources are open-sourced with ÆTHEL: the lexical mappings between words and types, a subset of the dataset consisting of 7,924 semantic parses, and the Python code that implements the extraction algorithm.
-
Constructive Type-Logical Supertagging with Self-Attention Networks.
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019).
August 2019.
[ paper ] [ code ]
We propose a novel application of self-attention networks towards grammar induction. We present an attention-based supertagger for a refined type-logical grammar, trained on constructing types inductively. In addition to achieving a high overall type accuracy, our model is able to learn the syntax of the grammar’s type system along with its denotational semantics. This lifts the closed-world assumption commonly made by lexicalized grammar supertaggers, greatly enhancing their generalization potential. This is evidenced both by the model’s adequate accuracy over sparse word types and by its ability to correctly construct complex types never seen during training, which, to the best of our knowledge, had not been accomplished before.
-
Towards a 2-Multiple Context-Free Grammar for the 3-Dimensional Dyck Language.
At the Intersection of Language, Logic, and Information.
July 2019.
[ paper ] [ code ]
We discuss the open problem of parsing the Dyck language of 3 symbols, D3, using a 2-Multiple Context-Free Grammar. We attempt to tackle this problem by implementing a number of novel meta-grammatical techniques and present the associated software packages we developed.
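Assuming the standard definition of the three-dimensional Dyck language (strings over {a, b, c} whose counts end equal and where every prefix satisfies #a >= #b >= #c), membership is easy to decide; the open problem concerns capturing the language with a 2-MCFG, not recognizing it:

```python
# Membership check for D3 under the definition assumed above.
def in_d3(word: str) -> bool:
    a = b = c = 0
    for ch in word:
        if ch == 'a': a += 1
        elif ch == 'b': b += 1
        elif ch == 'c': c += 1
        else: return False
        if not (a >= b >= c):
            return False
    return a == b == c

assert in_d3('abcabc') and in_d3('aabbcc') and not in_d3('acb')
```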
Drafts & Preprints
-
On Tables with Numbers, with Numbers.
August 2024.
[ paper ]
This paper is a critical reflection on the epistemic culture of contemporary computational linguistics, framed in the context of its growing obsession with tables with numbers. We argue against tables with numbers on the basis of their epistemic irrelevance, their environmental impact, their role in enabling and exacerbating social inequalities, and their deep ties to commercial applications and profit-driven research. We substantiate our arguments with empirical evidence drawn from a meta-analysis of computational linguistics research over the last decade.
-
Deductive Parsing with an Unbounded Type Lexicon.
August 2019.
[ paper ]
We present a novel deductive parsing framework for categorial type logics, modeled as the composition of two components. The first is an attention-based neural supertagger, which assigns words dependency-decorated, contextually informed linear types. It requires no predefined type lexicon, instead utilizing the type syntax to construct types inductively, enabling the use of a richer and more precise typing environment. The type annotations produced are used by the second component, a computationally efficient hybrid system that emulates the inference process of the type logic, iteratively producing a bottom-up reconstruction of the input’s derivation proof and the associated program for compositional meaning assembly. Initial experiments yield promising results for each of the components.