AllenAI C4

Attribute             | Value
pretty_name           | Allenai C4
annotations_creators  |
language_creators     |
languages             |
licenses              | ODC-By
multilinguality       |
size_categories       | 1B<n<10B
source_datasets       |
task_categories       |
task_ids              |
paperswithcode_id     |

Dataset Description

Dataset Summary

Raw text data in 101 different languages.

Download and prepare data

The dataset can be loaded directly via the squirrel Catalog API. Install squirrel-dataset-core with pip; installing the package registers this dataset in the catalog. For all available languages, see the TensorFlow Datasets catalog page at https://www.tensorflow.org/datasets/catalog/c4. Use the following code to load the data:

from squirrel.catalog import Catalog
plugin_catalog = Catalog.from_plugins()

# Each of the 101 languages has a train and a valid split.
# Here: iterate over the valid split for the language code "af" (Afrikaans).
it_af_val = plugin_catalog["c4"].get_driver().select("af", "valid").get_iter()
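
To inspect a few records without reading a whole split, the iterator can be truncated lazily. The sketch below is not part of the squirrel API: it assumes only that the object returned by get_iter() behaves like a standard Python iterable of dictionaries shaped as in Data Instances. The helper peek and the generator fake_stream are hypothetical names, and fake_stream stands in for the real iterator so the snippet runs without downloading any data.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def peek(samples: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return at most the first n records, pulling only n items from the source."""
    return list(islice(samples, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the dataset iterator (made-up records)."""
    for i in range(1000):
        yield {
            "text": f"document {i}",
            "timestamp": "2018-12-16T09:17:27Z",
            "url": f"http://example.com/{i}",
        }


first = peek(fake_stream(), n=2)
for record in first:
    print(record["url"], len(record["text"]))
```

With the real dataset, the assumption is that it_af_val can be passed in place of fake_stream().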

Dataset Structure

Data Instances

A sample from the training set is provided below:

{
    'text': 'Stehe vielleicht kurz vorm Wechsel, ein paar Frage...',
    'timestamp': '2018-12-16T09:17:27Z',
    'url': 'http://www.pokertips.org/forums/showthread.php?t=49769&page=3'
}
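
The fields of such a sample are plain strings, so downstream code usually needs to parse them. The following is a minimal sketch using only the Python standard library. The function name parse_record is hypothetical, and the timestamp format is an assumption inferred from the single sample shown above (the trailing "Z" is read as UTC).

```python
from datetime import datetime, timezone
from typing import Any, Dict
from urllib.parse import urlparse

# The sample record from this section.
sample = {
    "text": "Stehe vielleicht kurz vorm Wechsel, ein paar Frage...",
    "timestamp": "2018-12-16T09:17:27Z",
    "url": "http://www.pokertips.org/forums/showthread.php?t=49769&page=3",
}


def parse_record(record: Dict[str, str]) -> Dict[str, Any]:
    """Convert the string fields of one record into richer Python types."""
    # Assumed format: "YYYY-MM-DDTHH:MM:SSZ", with "Z" meaning UTC.
    timestamp = datetime.strptime(
        record["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    # Extract the host name, e.g. for grouping or filtering by source site.
    host = urlparse(record["url"]).hostname
    return {"text": record["text"], "timestamp": timestamp, "host": host}


parsed = parse_record(sample)
print(parsed["host"], parsed["timestamp"].isoformat())
```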

Dataset Schema

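  • text: Contains the raw text without annotations.
  • timestamp: Timestamp of the document as a string (for example '2018-12-16T09:17:27Z').
  • url: URL of the source page as a string.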

Data Splits

See the TensorFlow Datasets catalog page at https://www.tensorflow.org/datasets/catalog/c4 for a list of all splits and the number of examples in each.