This is a parallel corpus built from PDF documents from the European Medicines Agency. All files were converted automatically from PDF to plain text using pdftotext with the command-line arguments -layout -nopgbrk -eol unix. Tables and multi-column layouts cause some known conversion problems; some of these have been fixed in the current version.
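The conversion step described above can be sketched as a small Python wrapper around pdftotext. This is only an illustration, not the script actually used to build the corpus: the function names `pdftotext_command` and `convert`, and the choice to write the .txt file next to the PDF, are assumptions. Only the three pdftotext options come from the description.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build a pdftotext invocation with the options named in the description.

    -layout      keep the physical page layout as far as possible
    -nopgbrk     do not insert page-break characters between pages
    -eol unix    use Unix line endings in the output
    """
    return [
        "pdftotext",
        "-layout",
        "-nopgbrk",
        "-eol", "unix",
        str(pdf_path),
        str(txt_path),
    ]


def convert(pdf_path):
    """Convert one PDF to plain text (hypothetical helper).

    Assumes pdftotext is installed and on PATH; the output location
    (same directory, .txt suffix) is an assumption for this sketch.
    """
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling `convert("report.pdf")` on a hypothetical file would write `report.txt` alongside it. Because -layout preserves column positions, tables and multi-column pages can still come out interleaved, which matches the known problems mentioned above.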
source: http://www.emea.europa.eu/
NEW: Dutch EMEA Treebank (parsed with Alpino)
Complete download (XML): EMEA0.3.tar.gz (5.0 GB)
Bottom-left triangle: download files
Upper-right triangle: sample files