A parallel corpus extracted from the European Parliament web site by Philipp Koehn (University of Edinburgh), with approximately 40 million words per language. Its main intended use is to aid statistical machine translation research.
More information can be found at http://www.statmt.org/europarl/. Compared with the first release (2002) and the second release (2003), this release is larger and comes with improved processing tools that allow the creation of parallel corpora between any two of the 11 languages. Some data is now tagged with the language in which the text was originally spoken.
11 languages, 55 bitexts
Old version: Europarl2.
NEW: Dutch Europarl3 Treebank (parsed with Alpino).
Complete download (XML): Europarl3.tar.gz (3.5 GB)
Bottom-left triangle: download files. Upper-right triangle: sample files.

| | da | de | el | en | es | fi | fr | it | nl | pt | sv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| da | | view | view | view | view | view | view | view | view | view | view | da |
| de | ces | | view | view | view | view | view | view | view | view | view | de |
| el | ces | ces | | view | view | view | view | view | view | view | view | el |
| en | ces | ces | ces | | view | view | view | view | view | view | view | en |
| es | ces | ces | ces | ces | | view | view | view | view | view | view | es |
| fi | ces | ces | ces | ces | ces | | view | view | view | view | view | fi |
| fr | ces | ces | ces | ces | ces | ces | | view | view | view | view | fr |
| it | ces | ces | ces | ces | ces | ces | ces | | view | view | view | it |
| nl | ces | ces | ces | ces | ces | ces | ces | ces | | view | view | nl |
| pt | ces | ces | ces | ces | ces | ces | ces | ces | ces | | view | pt |
| sv | ces | ces | ces | ces | ces | ces | ces | ces | ces | ces | | sv |
| | da | de | el | en | es | fi | fr | it | nl | pt | sv | |


| language | files | tokens | sentences | da | de | el | en | es | fi | fr | it | nl | pt | sv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| da | 659 | 37.5M | 1.6M | | 1.3M | 0.7M | 1.3M | 1.3M | 1.2M | 1.3M | 1.2M | 1.3M | 1.2M | 1.1M |
| de | 659 | 37.5M | 1.5M | 1.3M | | 0.6M | 1.3M | 1.3M | 1.2M | 1.3M | 1.2M | 1.3M | 1.2M | 1.1M |
| el | 537 | 26.3M | 1.0M | 0.7M | 0.7M | | 0.6M | 0.7M | 0.6M | 0.7M | 0.6M | 0.7M | 0.6M | 0.6M |
| en | 658 | 39.4M | 1.5M | 1.3M | 1.3M | 0.7M | | 1.3M | 1.2M | 1.3M | 1.2M | 1.3M | 1.3M | 1.1M |
| es | 660 | 40.5M | 1.5M | 1.3M | 1.3M | 0.7M | 1.3M | | 1.2M | 1.3M | 1.2M | 1.3M | 1.2M | 1.1M |
| fi | 611 | 26.4M | 1.4M | 1.2M | 1.2M | 0.6M | 1.3M | 1.2M | | 1.2M | 1.2M | 1.2M | 1.2M | 1.1M |
| fr | 662 | 43.7M | 1.5M | 1.3M | 1.3M | 0.7M | 1.3M | 1.3M | 1.3M | | 1.2M | 1.3M | 1.3M | 1.1M |
| it | 658 | 39.0M | 1.4M | 1.2M | 1.2M | 0.6M | 1.2M | 1.2M | 1.2M | 1.3M | | 1.2M | 1.2M | 1.1M |
| nl | 659 | 39.5M | 1.6M | 1.3M | 1.3M | 0.7M | 1.3M | 1.3M | 1.2M | 1.3M | 1.2M | | 1.3M | 1.1M |
| pt | 660 | 40.9M | 1.4M | 1.3M | 1.3M | 0.7M | 1.3M | 1.3M | 1.2M | 1.3M | 1.2M | 1.3M | | 1.1M |
| sv | 611 | 33.4M | 1.5M | 1.2M | 1.2M | 0.7M | 1.2M | 1.1M | 1.2M | 1.2M | 1.1M | 1.2M | 1.1M | |
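
The figure of 55 bitexts follows from pairing each of the 11 languages with every other one exactly once (11 × 10 / 2 = 55), which is also why only one triangle of the pair matrix is needed per file type. The following is a minimal Python sketch of that count; the function name `bitext_pairs` is hypothetical and not part of the Europarl tools, and the alphabetical ordering within each pair is an assumption made for illustration only.

```python
from itertools import combinations

# Language codes as listed in the tables above.
LANGUAGES = ["da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"]


def bitext_pairs(languages):
    """Return every unordered language pair exactly once.

    Hypothetical helper: the codes are sorted first, so each pair is
    ordered alphabetically (an assumption, not a documented convention).
    """
    return list(combinations(sorted(languages), 2))


pairs = bitext_pairs(LANGUAGES)
print(len(pairs))  # 55
print(pairs[0], pairs[-1])
```

Because the pairs are unordered, `("en", "fr")` appears in the result but `("fr", "en")` does not.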