Here you can find software produced by our research group. We produce open-source software for research purposes, generally for Unix-based machines. Some of our software is also available as a webservice through our webservice portal, where you can register for a free account.
CLAM allows you to quickly and transparently transform your Natural Language Processing application into a RESTful webservice, with which both human end-users and automated clients can interact.
Colibri Core is software consisting of command-line tools as well as programming libraries. It lets you quickly and efficiently count and extract patterns from large corpus data, extract various statistics on the extracted patterns, and compute relations between the extracted patterns.
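To illustrate the kind of pattern counting described above, here is a minimal sketch in plain Python. It is not Colibri Core's API or implementation: the function names `count_ngrams`, `contains` and `parents` are hypothetical, and the sketch keeps everything in an ordinary in-memory counter, so it does not scale to large corpora the way a dedicated tool is meant to.

```python
from collections import Counter


def count_ngrams(tokens, max_n=3):
    """Count every contiguous n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def contains(larger, smaller):
    """True if `smaller` occurs as a contiguous sub-sequence of `larger`."""
    n = len(smaller)
    return any(larger[i:i + n] == smaller for i in range(len(larger) - n + 1))


def parents(pattern, counts):
    """One simple relation between patterns: the longer counted patterns
    that subsume `pattern`."""
    return sorted(p for p in counts if len(p) > len(pattern) and contains(p, pattern))


tokens = "to be or not to be".split()
counts = count_ngrams(tokens, max_n=2)
print(counts[("to", "be")])       # frequency of the bigram "to be"
print(parents(("or",), counts))   # bigrams that contain "or"
```

The subsumption relation shown here is only one example of a relation that can be computed between extracted patterns.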
A Machine Translation framework that wraps around the Moses Decoder and enables k-NN classifier techniques to be used for modelling source-side context.
Colibrita is a proof-of-concept translation assistance system, translating L1 fragments in an L2 context, using machine learning and statistical machine translation techniques.
Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. Flat allows users to view annotated FoLiA documents and enrich these documents with new annotations. A wide variety of linguistic annotation types is supported through the FoLiA paradigm.
FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA’s intended use is as a format for storing and/or exchanging language resources, including corpora.
Fowlt is a spelling correction system for English.
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Most modules were created in the 1990s at the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium). Over the years they have been integrated into a single text processing tool, which is currently maintained and developed by the Language Machines Research Group and the Centre for Language and Speech Technology at Radboud University Nijmegen. A dependency parser, a base phrase chunker, and a named-entity recognizer module were added more recently. Where possible, Frog makes use of multi-processor support to run subtasks in parallel.
Gecco is a generic, modular and distributed framework for spelling correction. It is aimed at building a complete context-aware spelling correction system from your own data set. Most modules will be language-independent and trainable from a source corpus. Training is explicitly included in the framework. The framework aims to be easily extensible; modules can be written in Python 3. Moreover, the framework is scalable and distributable over multiple servers. Given an input text, Gecco will add various suggestions for correction. The system can be invoked from the command line, as a Python binding, as a RESTful webservice, or through the web application (two interfaces).
LaMachine is not a single tool but a distribution of almost all our software, bundled in three different ways to facilitate use on a wide variety of systems. LaMachine can be used as a Virtual Machine (the easiest option, allowing you to run our software on any host OS), as a Docker application, or as a compilation/installation script in a virtual environment. It contains software such as Timbl, ucto, Frog, Colibri Core and all the Python bindings.
Lama Events is a calendar application listing events in the near future. The events are detected in and selected from the Dutch Twitter stream (courtesy of Twiqs.nl) by a fully automatic procedure. Tweets referring to the same future events are clustered based on the frequent co-occurrence of words (names, phrases) and temporal expressions that characterize the event. The date and time of the event are automatically determined based on direct and indirect time references in the texts of the tweets in a cluster. The demo shows a day-by-day ranked list of automatically detected events in the Dutch language area (Netherlands and Flanders).
MBT is a memory-based tagger-generator and tagger in one. The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences. MBT can, for instance, be used to generate part-of-speech taggers or chunkers for natural language processing. It has also been used for named-entity recognition, information extraction in domain-specific texts, and disfluency chunking in transcribed speech.
Oersetter is a Frisian-Dutch, Dutch-Frisian Machine Translation system developed in collaboration with the Fryske Akademy.
PyNLPl, pronounced as "pineapple", is a Python (2 & 3) library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks.
T-scan is an analysis tool for Dutch texts that assesses the complexity of a text. It is based on original work by Rogier Kraf (Utrecht University) [See: Kraf et al., 2009]. The code has been reimplemented and extended by Ko van der Sloot (Tilburg University), and is currently maintained and further developed by Martijn van der Klis (Utrecht University).
TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases. For over fifteen years TiMBL has been mostly used in natural language processing as a machine learning classifier component, but its use extends to virtually any supervised machine learning domain. Due to its particular decision-tree-based implementation, TiMBL is in many cases far more efficient in classification than a standard k-nearest neighbor algorithm would be.
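The idea behind IB1-IG can be sketched in a few lines of plain Python: store the training instances, weight each symbolic feature by its information gain, and classify a new case by the stored instances with the smallest weighted overlap distance. This is an illustrative sketch only, not TiMBL's code or API; all function names are hypothetical, the nearest neighbours are found by a plain linear scan rather than TiMBL's tree-based implementation, and `k` here counts nearest instances, which is a simplifying assumption.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain_weights(instances, labels):
    """Weight of each feature = class entropy minus the expected class
    entropy after splitting the training set on that feature's values."""
    base = entropy(labels)
    weights = []
    for f in range(len(instances[0])):
        by_value = defaultdict(list)
        for instance, label in zip(instances, labels):
            by_value[instance[f]].append(label)
        remainder = sum(
            len(subset) / len(labels) * entropy(subset) for subset in by_value.values()
        )
        weights.append(base - remainder)
    return weights


def classify(instance, instances, labels, weights, k=1):
    """Majority vote among the k stored instances with the smallest
    weighted overlap distance (sum of weights of mismatching features)."""
    scored = sorted(
        (sum(w for w, a, b in zip(weights, instance, stored) if a != b), i)
        for i, stored in enumerate(instances)
    )
    votes = Counter(labels[i] for _, i in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data: the first feature determines the class, the second is noise.
train_x = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
train_y = ["P", "P", "Q", "Q"]
weights = information_gain_weights(train_x, train_y)
print(weights)
print(classify(("b", "z"), train_x, train_y, weights, k=1))
```

In this toy example the uninformative second feature receives no weight, so an unseen value for it does not affect the classification.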
Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation.
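As a rough illustration of these steps, the following plain-Python sketch separates words from punctuation with a single regular expression, optionally lowercases the tokens, and splits sentences on end-of-sentence punctuation. It does not use Ucto or reproduce its rules: the pattern `TOKEN_RE` and the functions `tokenize` and `split_sentences` are hypothetical, and the sketch makes no attempt to handle abbreviations, URLs or language-specific cases.

```python
import re

# A word (optionally with internal hyphens or apostrophes), or a single
# punctuation character. This pattern is an assumption for illustration.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Split text into word and punctuation tokens, optionally lowercased."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


tokens = tokenize("Hello, world! It's fine.")
print(tokens)
print(split_sentences(tokens))
```

A naive splitter like this will break on abbreviations such as "e.g.", which is one reason tokenisation is less trivial than it appears.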
Valkuil is a Dutch spelling correction system.
This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging, lemmatisation, morphological analysis, named entity recognition, shallow parsing, and dependency parsing. The tool itself is implemented in C++.
python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. With this module, all functionality exposed through the C++ interface is also available to Python scripts. Being able to access the API from Python greatly facilitates prototyping TiMBL-based applications.
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression based, extensible, and advanced tokeniser written in C++.