AllenAI C4¶
Attribute | Value |
---|---|
pretty_name | Allenai C4 |
annotations_creators | |
language_creators | |
languages | |
licenses | ODC-By |
multilinguality | |
size_categories | 1B<n<10B |
source_datasets | |
task_categories | |
task_ids | |
paperswithcode_id | |
Dataset Description¶
Homepage: Allenai C4
Licenses: Open Data Commons Attribution License (ODC-By) v1.0
Dataset Summary¶
Raw text data in 101 different languages.
Download and prepare data¶
The dataset can be loaded directly via the squirrel Catalog API. First install squirrel-datasets-core with pip; installing the package registers this dataset in the catalog. See the TensorFlow Datasets catalog page https://www.tensorflow.org/datasets/catalog/c4 for all available languages. Use the following code to load the data:
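from squirrel.catalog import Catalog

# Build a catalog from all installed squirrel plugins (this registers "c4")
plugin_catalog = Catalog.from_plugins()

# Each of the 101 languages has a train and a valid split;
# here we iterate over the Afrikaans ("af") valid split
it_af_val = plugin_catalog["c4"].get_driver().select("af", "valid").get_iter()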
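Once an iterator is available, it is usually worth previewing a few records before running a full pass over a split. The sketch below assumes that the object returned by `get_iter()` can be consumed like a regular Python iterable of dicts with a `text` key, as in the sample shown in this card. The helper name `preview_texts` and the stand-in records are hypothetical and exist only so the example runs without downloading any data.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def preview_texts(records: Iterable[Dict[str, Any]], n: int = 3, width: int = 60) -> List[str]:
    """Return the first `n` texts, each flattened to one line and cut to `width` characters."""
    previews = []
    # islice stops after n records, so a large split is never fully consumed
    for record in islice(records, n):
        text = record["text"].replace("\n", " ")
        previews.append(text[:width] + ("..." if len(text) > width else ""))
    return previews


# Stand-in records with the same keys as the sample in this card (hypothetical data).
# With the catalog installed, `demo_records` would be replaced by the iterator
# obtained from `get_iter()`, assuming it behaves like a plain iterable.
demo_records = [
    {"text": "first document", "timestamp": "2018-12-16T09:17:27Z", "url": "http://example.com/a"},
    {"text": "second document\nwith a line break", "timestamp": "2018-12-16T09:17:27Z", "url": "http://example.com/b"},
]

for line in preview_texts(demo_records, n=2, width=20):
    print(line)
```

Using `islice` rather than collecting the whole iterator keeps the preview cheap even for splits with very many examples.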
Dataset Structure¶
Data Instances¶
A sample from the training set is provided below:
{
    'text': 'Stehe vielleicht kurz vorm Wechsel, ein paar Frage...',
    'timestamp': '2018-12-16T09:17:27Z',
    'url': 'http://www.pokertips.org/forums/showthread.php?t=49769&page=3'
}
Dataset Schema¶
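text: Contains the raw text without annotations.
timestamp: Timestamp of the document as an ISO 8601 UTC string, for example 2018-12-16T09:17:27Z.
url: URL of the web page the text was extracted from.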
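The sample above also carries `timestamp` and `url` fields next to `text`. A minimal sketch of how such a record could be typed and post-processed follows. It assumes every record has the same three string fields as the sample and that timestamps always use the `YYYY-MM-DDTHH:MM:SSZ` form seen there. The names `C4Record`, `parse_timestamp`, and `source_host` are hypothetical helpers, not part of squirrel.

```python
from datetime import datetime, timezone
from typing import TypedDict
from urllib.parse import urlparse


class C4Record(TypedDict):
    """Shape of one record, inferred from the sample in this card (an assumption)."""
    text: str
    timestamp: str
    url: str


def parse_timestamp(record: C4Record) -> datetime:
    """Parse the 'Z'-suffixed timestamp string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)


def source_host(record: C4Record) -> str:
    """Return the host part of the record's source URL."""
    return urlparse(record["url"]).netloc


sample: C4Record = {
    "text": "Stehe vielleicht kurz vorm Wechsel, ein paar Frage...",
    "timestamp": "2018-12-16T09:17:27Z",
    "url": "http://www.pokertips.org/forums/showthread.php?t=49769&page=3",
}

parsed = parse_timestamp(sample)
host = source_host(sample)
print(parsed.isoformat(), host)
```

Helpers like these can be useful for filtering records by date or grouping them by source domain; if real records deviate from the sample's timestamp format, `strptime` raises a `ValueError`, which makes the mismatch visible early.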
Data Splits¶
See the TensorFlow Datasets catalog page https://www.tensorflow.org/datasets/catalog/c4 for the full list of splits and the number of examples in each.