CC100
Attribute | Value |
---|---|
pretty_name | CC 100 |
annotations_creators | |
language_creators | |
languages | |
licenses | |
multilinguality | |
size_categories | |
source_datasets | |
task_categories | |
task_ids | |
paperswithcode_id | cc100 |
Dataset Description
Homepage: CC 100
Dataset Summary
Raw text data for 100+ different languages.
Download and prepare data
The dataset can be loaded directly via the squirrel Catalog API. Install squirrel-dataset-core via pip first; installing it registers this dataset with the catalog. The homepage https://data.statmt.org/cc-100/ lists all available languages. The following code loads the Scottish Gaelic ("gd") data:
from squirrel.catalog import Catalog

# Build a catalog from all installed squirrel plugins (including squirrel-dataset-core)
plugin_catalog = Catalog.from_plugins()
# Select the language by its code; "gd" is Scottish Gaelic
it_gd = plugin_catalog["cc100"].get_driver().select("gd").get_iter()
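Some languages in the corpus are large, so it can be useful to inspect only the first few samples instead of exhausting the iterator. The following is a minimal sketch, assuming the object returned by get_iter() behaves like a standard Python iterable of dicts. The helper `peek` and the stand-in `fake_iter` are hypothetical and exist only so the snippet runs without squirrel installed.

```python
from itertools import islice


def peek(samples, n=3):
    """Return the first n items of an iterable, leaving the rest unconsumed."""
    return list(islice(samples, n))


# Hypothetical stand-in for the iterator returned by get_iter();
# each item mimics a CC100 sample with a single 'text' field.
fake_iter = iter({"text": f"line {i}"} for i in range(1000))

first = peek(fake_iter, n=2)
for sample in first:
    print(sample["text"])
```

In real use, `fake_iter` would be replaced by the iterator obtained from the catalog driver.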
Dataset Structure
Data Instances
A sample from the training set is provided below:
{
'text': 'Èireannaich air an sguad ainmeachadh...',
}
Dataset Schema
text: Contains the raw text without annotations.
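Because each sample carries a single field, a lightweight check can catch malformed records before they reach downstream processing. This is a sketch under the assumption that samples arrive as Python dicts with a string-valued `text` key, as in the instance shown above. The function name `is_valid_sample` and the example records are illustrative and not part of the dataset or of squirrel.

```python
def is_valid_sample(sample):
    """Return True if sample is a dict whose 'text' field is a string."""
    return isinstance(sample, dict) and isinstance(sample.get("text"), str)


# Illustrative records: the first follows the documented schema,
# the other two are deliberately malformed.
samples = [
    {"text": "Èireannaich air an sguad ainmeachadh..."},
    {"text": None},
    {"body": "wrong key"},
]

valid = [s for s in samples if is_valid_sample(s)]
print(f"{len(valid)} of {len(samples)} samples match the schema")
```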
Data Splits
Visit the homepage for a list of all splits.