The University of Macau Portuguese Learner Corpus (UMPLC) is an annotated longitudinal learner corpus of Portuguese. The corpus comprises 933 compositions (totaling 209,097 tokens) produced by 121 Chinese undergraduate students over six consecutive semesters (from their first to third year of study).

All compositions were produced under exam conditions that imposed time and length limits and prohibited access to reference tools. Eleven collections were administered over the three-year period, with at least one collection in each semester. In each collection, students were required to produce one or two compositions in response to specific prompts. Students in different classes may have received similar but not identical prompts. In total, there are 18 prompts, covering a wide range of topics and genres appropriate to the students’ levels and learning content.

Three text versions are now available at this website: a transcribed version, a spell-checked version, and an annotated version with part-of-speech (POS) and lemma tags. The transcribed version preserves all the errors made by the learners. The spell-checked version corrects all orthographic errors resulting in non-words. The annotated version is based on the spell-checked version: POS tags and lemmas were assigned automatically using Stanza, following the Universal Dependencies annotation scheme. To ensure data quality, all three text versions were manually reviewed.
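
For readers who want to reproduce a comparable annotation on their own texts, the following is a minimal sketch of POS and lemma tagging with Stanza. It is not the corpus authors' actual pipeline: the language code "pt", the processor list, and the helper names build_pipeline and annotate are assumptions made for illustration.

```python
def build_pipeline():
    """Create a Portuguese Stanza pipeline (assumed settings, not the authors')."""
    import stanza  # imported lazily so that annotate() can be used without it

    stanza.download("pt")  # fetches the default Portuguese models on first use
    return stanza.Pipeline(lang="pt", processors="tokenize,mwt,pos,lemma")


def annotate(text, nlp):
    """Return one (word, UPOS tag, lemma) triple per word in `text`.

    `nlp` is any callable that returns a Stanza-style document, i.e. an
    object with `.sentences`, each holding `.words` that expose `.text`,
    `.upos`, and `.lemma`.
    """
    doc = nlp(text)
    return [
        (word.text, word.upos, word.lemma)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

With Stanza installed, annotate("Eu gosto de estudar português.", build_pipeline()) returns the triples for a single sentence. UPOS is the universal part-of-speech tag set defined by Universal Dependencies.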

File names follow a five-part convention that encodes the collection point, class, learner identification number, sex, year and semester, exam type, and composition prompt. For instance, the file name “1_03_011_F_11M” signifies:
  • Collection point: 1
  • Class: 03
  • Learner’s identification number: 011
  • Sex: female
  • Year and semester: 1st year, 1st semester
  • Exam type: midterm exam
  • Prompt: 11M
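
The convention above can be decoded programmatically. The sketch below is a hypothetical helper, not part of the corpus distribution. It assumes that the five parts are separated by underscores and that the last part consists of a year digit, a semester digit, and an exam-type code, in that order. Only the codes shown in the example ("F" for female, "M" for midterm exam) are mapped; any other code is returned unchanged.

```python
from dataclasses import dataclass

# Only these two codes are confirmed by the example file name;
# unknown codes are passed through as-is rather than guessed.
SEX_LABELS = {"F": "female"}
EXAM_LABELS = {"M": "midterm exam"}


@dataclass(frozen=True)
class CompositionName:
    collection_point: int
    class_id: str
    learner_id: str
    sex: str
    year: int
    semester: int
    exam_type: str
    prompt: str


def parse_name(name: str) -> CompositionName:
    """Split a file name such as '1_03_011_F_11M' into its fields."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated parts, got {len(parts)}")
    collection, class_id, learner_id, sex, prompt = parts
    # Assumed layout of the last part: <year digit><semester digit><exam code>
    if len(prompt) < 3 or not prompt[:2].isdigit():
        raise ValueError(f"unexpected prompt code: {prompt!r}")
    exam_code = prompt[2:]
    return CompositionName(
        collection_point=int(collection),
        class_id=class_id,      # kept as a string to preserve leading zeros
        learner_id=learner_id,  # kept as a string to preserve leading zeros
        sex=SEX_LABELS.get(sex, sex),
        year=int(prompt[0]),
        semester=int(prompt[1]),
        exam_type=EXAM_LABELS.get(exam_code, exam_code),
        prompt=prompt,
    )
```

Calling parse_name("1_03_011_F_11M") yields the fields listed above. Class and learner numbers stay as strings so that values such as "03" and "011" can be matched against the file names directly.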


The UMPLC is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. This license requires that reusers give credit to the creator. It allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only. If others modify or adapt the material, they must license the modified material under identical terms.


To register or download the corpus files, please contact us at umplcorpus@gmail.com


You, M., Zhang, J., Wong, D. F., & Lan, K. (2025). UMPLC: The first longitudinal learner corpus of Portuguese. Language Resources and Evaluation, 1-20. https://doi.org/10.1007/s10579-025-09811-w


Zhang, J., & You, M. (2023). Corpus de aprendizes de português da Universidade de Macau e ensino de português L2. Texto Livre, 17, e47754. https://doi.org/10.1590/1983-3652.2024.47754