The University of Macau Portuguese Learner Corpus (UMPLC) is an annotated longitudinal learner
corpus of Portuguese. The corpus comprises 933 compositions (totaling 209,097 tokens) produced by
121 Chinese undergraduate students over six consecutive semesters (from their first to third year
of study).
All the compositions were produced in exam contexts that imposed time constraints and length limits
and prohibited access to reference tools. Eleven collections were administered over the three-year
period, with at least one collection in each semester. In each collection, students were required to
produce one or two compositions in response to specific prompts. Note that students in different
classes may have received similar but not identical prompts. In total, there are 18 prompts,
covering a wide range of topics and genres appropriate to the students’ levels and learning content.
Three text versions are now available on this website: a transcribed version, a spell-checked
version, and an annotated version with part-of-speech (POS) and lemma tags. The transcribed version
preserves all the errors made by the learners. The spell-checked version corrects all orthographic
errors resulting in non-words. The annotated version is based on the spell-checked version: POS and
lemma annotation was performed automatically on the spell-checked texts using Stanza, following the
Universal Dependencies annotation scheme. To ensure data quality, all three text versions were
manually reviewed.
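As an illustration of the kind of automatic annotation described above, the following is a minimal
sketch of POS and lemma tagging with Stanza. It is not the corpus authors' actual script: the
processor list, the default Portuguese model, the helper names (build_pipeline, annotate, to_tsv),
and the tab-separated output layout are all assumptions made for this example, and the distributed
annotated files may be formatted differently.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, str, str]  # (word form, UPOS tag, lemma)


def build_pipeline():
    """Create a Portuguese Stanza pipeline (Stanza is imported lazily)."""
    import stanza  # third-party package: pip install stanza

    # Assumption: the default Portuguese models; the corpus may have used others.
    stanza.download("pt")
    return stanza.Pipeline("pt", processors="tokenize,mwt,pos,lemma")


def annotate(text: str, nlp: Callable) -> List[Row]:
    """Return one (form, UPOS, lemma) triple per syntactic word in `text`."""
    doc = nlp(text)
    return [
        (word.text, word.upos, word.lemma or "_")  # "_" marks a missing lemma
        for sentence in doc.sentences
        for word in sentence.words
    ]


def to_tsv(rows: Iterable[Row]) -> str:
    """Serialize triples as tab-separated lines (hypothetical layout)."""
    return "\n".join("\t".join(row) for row in rows)
```

A call such as to_tsv(annotate("Eu gosto de estudar português.", build_pipeline())) would tag one
spell-checked sentence. The tags come from the Universal Dependencies UPOS inventory, and
contractions are split into their component words by the multi-word token (mwt) processor.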
The file names are standardized using a five-part convention, with the parts separated by
underscores. The convention encodes the collection point, class, learner identification number, sex,
and composition prompt; as the example below shows, the prompt code also carries the year and
semester and the exam type. For instance, the file name “1_03_011_F_11M” signifies:
- Collection point: 1
- Class: 03
- Learner’s identification number: 011
- Sex: female
- Year and semester: 1st year, 1st semester
- Exam type: midterm exam
- Prompt: 11M
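The convention above can be decoded programmatically. The sketch below is only an illustration
based on the single example given here: the function and class names are hypothetical, and it
assumes that the first two characters of the prompt code are the year and the semester and that
the remainder is the exam-type code ("M" for midterm in the example). Other exam-type codes are
not documented here, so they are returned unchanged rather than interpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UmplcFileName:
    collection_point: int
    learner_class: str   # kept as a string to preserve leading zeros
    learner_id: str      # kept as a string to preserve leading zeros
    sex: str
    year: int
    semester: int
    exam_code: str       # e.g. "M" (midterm in the documented example)
    prompt: str          # full prompt code, e.g. "11M"


def parse_umplc_name(name: str) -> UmplcFileName:
    """Split a UMPLC file name (without extension) into its five parts."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated parts, got {len(parts)}: {name!r}")
    collection, learner_class, learner_id, sex, prompt = parts
    # Assumption: prompt code = <year digit><semester digit><exam-type code>.
    if len(prompt) < 3 or not prompt[:2].isdigit():
        raise ValueError(f"unexpected prompt code: {prompt!r}")
    return UmplcFileName(
        collection_point=int(collection),
        learner_class=learner_class,
        learner_id=learner_id,
        sex=sex,
        year=int(prompt[0]),
        semester=int(prompt[1]),
        exam_code=prompt[2:],
        prompt=prompt,
    )
```

For example, parse_umplc_name("1_03_011_F_11M") yields collection point 1, class "03", learner
"011", sex "F", year 1, semester 1, and exam code "M". Keeping the class and learner number as
strings avoids losing the leading zeros used in the file names.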
The UMPLC is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
This license requires that reusers give credit to the creator. It allows reusers to distribute, remix,
adapt, and build upon the material in any medium or format, for noncommercial purposes only. If others
modify or adapt the material, they must license the modified material under identical terms.