MLRS Corpora
The MLRS Project currently hosts two different (though related) corpora, both of which are accessible from our Corpus Portal, powered by the IMS Open Corpus Workbench:
- MLRS Corpus, v1.0: an opportunistic text collection of nearly 100 million tokens, mostly created from publicly available documents, as well as a limited amount of user-contributed material.
- MLRS Corpus, v2.0 BETA: an extension of the MLRS Corpus v1.0 that combines texts from the older corpus with several additional texts, totalling ca. 130 million tokens.
Texts for this version of the corpus were preprocessed as follows:
- Removal of duplicate material (using the Onion corpus de-duplication tool);
- Removal of long (> 1 sentence) stretches of non-Maltese text;
- Simple, dictionary-based spelling correction;
- Part of Speech tagging, carried out using the TnT Tagger, trained on ca. 26k of manually annotated text, reaching an accuracy of ca. 95-96%. The Maltese version of TnT can be used online; see our tools page for more information.
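To give a rough idea of what the first and third preprocessing steps involve, here is a minimal Python sketch. It is not the code used to build the corpus: the de-duplication shown is a simple exact-match filter on whole paragraphs, which is much cruder than what the Onion tool does, and the function names and the toy correction dictionary are hypothetical, chosen only for illustration.

```python
import hashlib


def remove_duplicate_paragraphs(paragraphs):
    """Keep only the first occurrence of each paragraph.

    Paragraphs are compared after collapsing whitespace and lowercasing.
    This is an illustrative stand-in, not the method used by Onion.
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        normalised = " ".join(paragraph.split()).lower()
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique


def correct_spelling(tokens, corrections):
    """Replace each token that has an entry in the correction dictionary."""
    return [corrections.get(token, token) for token in tokens]


# Toy data, assumed for this example only.
paragraphs = ["Dan huwa test.", "dan  huwa test.", "Test ieħor."]
corrections = {"ghal": "għal"}  # hypothetical dictionary entry

deduplicated = remove_duplicate_paragraphs(paragraphs)
corrected = correct_spelling(["ghal", "Malta"], corrections)
```

With the toy data above, the second paragraph is dropped because it differs from the first only in case and spacing, and the token "ghal" is replaced by its dictionary entry while "Malta" is left unchanged.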
Creating a user account
The corpora can be browsed and searched online after creation of a free user account. If you do not have an account, you can create one from the registration and login panel at the top of this page. You will be sent an email with your new password and instructions on how to access the corpus.
Learn more about it
The MLRS Corpus v1.0 was presented in a talk hosted by the Institute of Linguistics as part of the Linguistics Circle series on 3rd June 2011. You can watch a video of the talk below.
The slides for the talk can be downloaded from here (pptx format).
Documentation
The corpus interface is documented here (pdf format).