Published November 2, 2023 | Version 1.0.0
Model Open

CATMuS Medieval

  • 1. Histoire, Archéologie, Littératures des Mondes Chrétiens et Musulmans Médiévaux
  • 2. French Institute for Research in Computer Science and Automation
  • 3. École Nationale des Chartes
  • 4. Institut de Recherche et d'Histoire des Textes
  • 5. Centre Jean Mabillon
  • 6. École Normale Supérieure de Lyon
  • 7. Institute for Computational Linguistics “A. Zampolli”
  • 8. Ca' Foscari University of Venice
  • 9. University of Toronto
  • 10. University of Geneva
  • 1. Ca' Foscari University of Venice
  • 2. rescribe.xyz
  • 3. Centre Jean Mabillon
  • 4. Università degli Studi di Sassari
  • 5. École Nationale des Chartes
  • 6. French Institute for Research in Computer Science and Automation
  • 7. Histoire, Archéologie, Littératures des Mondes Chrétiens et Musulmans Médiévaux
  • 8. Jean Moulin University Lyon 3
  • 9. Lumière University Lyon 2
  • 10. Université de Genève
  • 11. University of Geneva
  • 12. Université Libre de Bruxelles

Description

CATMuS (Consistent Approach to Transcribing ManuScript) Medieval is a Kraken HTR model trained on strictly graphematic transcriptions in four languages (in descending order of importance in the dataset: Old and Middle French, Latin, Spanish and other languages of Spain, and Italian). No abbreviations are resolved.

This model is the result of a collaboration between researchers from the CREMMA, GalliCorpora, HTRomance and DEEDS projects. It follows the CREMMA Guidelines (supplemented by the CREMMA Medii Aevi guidelines) and will be consolidated under the CATMuS Medieval Guidelines in an upcoming paper.

The model is trained with NFD Unicode normalization: each diacritic (including superscripts) is transcribed as its own character, separately from the "main" character.
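As a minimal sketch of what NFD normalization means in practice, the following Python snippet uses only the standard library's unicodedata module. The helper names (to_nfd, codepoint_names) and the sample strings are illustrative assumptions, not taken from the dataset or the model's tooling. It shows a precomposed letter being split into a base character followed by a combining mark, and a base letter with a combining superscript letter, which NFD leaves as two separate code points.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Decompose precomposed characters into base character + combining marks."""
    return unicodedata.normalize("NFD", text)


def codepoint_names(text: str) -> list[str]:
    """Return the Unicode name of each code point (hex fallback if unnamed)."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text]


# Illustrative sample (an assumption, not dataset content): precomposed "é".
precomposed = "\u00e9"
decomposed = to_nfd(precomposed)

# One code point before normalization, two after.
print(len(precomposed), len(decomposed))
print(codepoint_names(decomposed))

# A base letter followed by a combining superscript letter (U+0363) is
# already two separate code points, so NFD leaves it unchanged.
superscript_sample = "q\u0363"
print(to_nfd(superscript_sample) == superscript_sample)
```

Text that will be compared against the model's output (for example, ground truth used for evaluation) would presumably need the same normalization; otherwise visually identical strings can differ at the code point level.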

Metrics

  • 3,361,410 characters
  • 113,228 lines
  • 1,602 files (each either a single page or a double page)
  • 7,560 regions

All source datasets and papers are referenced in the related works section; all transcribers are mentioned in the collaborators section; all partner-project members are listed as authors.

Funding

  • CREMMA, DIM MAP, Région Île-de-France
  • CremmaLab, DIM MAP, Région Île-de-France
  • GalliCorpora, Datalab, Bibliothèque nationale de France
  • HTRomance, Datalab, Bibliothèque nationale de France
  • Text as Image, Image as Text: Charter integrity and topic modelling, SSHRCC 1350911
  • Les Décades de Bersuire, première traduction française de l'Histoire romaine de Tite-Live – LiBer, ANR 21-CE27-0008
  • Projet Fabliaux, Biblissima+, ANR 21-ESRE-0005

Files

metadata.json

Files (22.9 MB)

  • md5:11f45c4d63038bd5fd932e5df6c3ae7e, 22.9 MB
  • md5:582c3ff89f880cc0c03de8fecb7ebaac, 3.4 kB

Additional details

Dates

Created: 2023-11-01

References

  • Ariane Pinche. Guide de transcription pour les manuscrits du Xe au XVe siècle. 2022. ⟨hal-03697382⟩
  • Thibault Clérice, Malamatenia Vlachou-Efstathiou, Alix Chagué. CREMMA Medii Aevi: Literary manuscript text recognition in Latin. Journal of Open Humanities Data, 2023, 9, pp.4. ⟨10.5334/johd.97⟩. ⟨hal-03828353v5⟩