CC100

Attribute

Value

pretty_name

CC 100

annotations_creators

language_creators

languages

licenses

multilinguality

size_categories

source_datasets

task_categories

task_ids

paperswithcode_id

cc100

Dataset Description

Dataset Summary

Raw text data for 100+ different languages.

Download and prepare data

The dataset can be loaded directly via the squirrel Catalog API. Make sure that squirrel-dataset-core is installed via pip, which will register this dataset. Visit the homepage https://data.statmt.org/cc-100/ for all the available languages. Use the following code to load the data:

from squirrel.catalog import Catalog
plugin_catalog = Catalog.from_plugins()

it_af = plugin_catalog["cc100"].get_driver().select("gd").get_iter()

Dataset Structure

Data Instances

A sample from the training set is provided below:

{
    'text': 'Èireannaich air an sguad ainmeachadh...',
}

Dataset Schema

  • text: Contains the raw text without annotations.

Data Splits

Visit the homepage for a list of all splits.