Add a Dataset¶
Squirrel-datasets supports you with two tasks:
Preprocessing data and registering the preprocessed dataset in the data mesh
Loading data from the data mesh
Preprocessing¶
For the first task, i.e. preprocessing, we recommend using Apache Spark. A common scenario is that you would
like to work with data stored in Google Cloud Storage and run your batch processing job on a Kubernetes cluster. We use
PySpark for defining the preprocessing logic in Python. You can find a tutorial on how to use Spark for preprocessing under
examples/08.Spark_Preprocessing.ipynb.
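In practice this step is a PySpark job (see the notebook above). As a minimal stand-in, the shape of such a preprocessing step — read raw records, transform them, write the processed result — can be sketched with the standard library; all file names and fields here are illustrative, not part of squirrel-datasets:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def preprocess_record(record: dict) -> dict:
    """Illustrative transform: normalize a label field."""
    record["label"] = record["label"].strip().lower()
    return record


def run_preprocessing(src: Path, dst: Path) -> None:
    """Read raw JSONL, apply the transform, write processed JSONL."""
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            fout.write(json.dumps(preprocess_record(json.loads(line))) + "\n")


# Tiny end-to-end run on synthetic data.
with TemporaryDirectory() as tmp:
    raw, out = Path(tmp, "raw.jsonl"), Path(tmp, "processed.jsonl")
    raw.write_text('{"id": 1, "label": " Cat "}\n{"id": 2, "label": "DOG"}\n')
    run_preprocessing(raw, out)
    labels = [json.loads(l)["label"] for l in out.read_text().splitlines()]
    print(labels)  # -> ['cat', 'dog']
```

A real Spark job would express the same transform over distributed partitions instead of a single local file, but the structure — a pure per-record transform plus a read/write wrapper — carries over.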
Data Loading¶
For the second task, i.e. data loading, we use the high-level API from squirrel. The corresponding data loading logic
is defined through a Driver class.
Add a New Dataset¶
Having understood the two main tasks discussed above and how we handle them, here is what it looks like when you
want to add a new dataset to squirrel-datasets: define your preprocessing logic; define your loading logic;
register the dataset in a catalog plugin.
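Assuming the conventions described in the steps below, the resulting layout for a hypothetical dataset might look like this (only preprocessing.py and driver.py are named in this guide; the rest is illustrative):

```
squirrel_datasets_core/
└── datasets/
    └── example_dataset/
        ├── __init__.py
        ├── preprocessing.py   # preprocessing logic, if needed
        └── driver.py          # custom driver, if no built-in driver fits
```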
Define your preprocessing logic.
Create a new directory under squirrel_datasets_core/datasets named after your dataset, e.g. “example_dataset”. If needed, write your preprocessing scripts in a new preprocessing.py file inside it.
Define your loading logic.
After the preprocessing step, you want to make sure that your preprocessed dataset is valid and readable. For that, you need to define the loading logic. The driver defines how the dataset is read from the serialized files into memory.
In squirrel there are already many built-in drivers for reading all kinds of datasets, such as CsvDriver, JsonlDriver, and MessagepackDriver. For details, please refer to squirrel.driver. Select one of these drivers if it is suitable for your dataset’s format and compression method.
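To give an idea of what a driver like CsvDriver provides — an iterator over samples parsed from the serialized file — here is a rough standard-library equivalent of that behavior. This is a sketch of the concept, not squirrel's actual implementation:

```python
import csv
import io
from typing import Iterator


def iter_csv_samples(text: str) -> Iterator[dict]:
    """Yield each CSV row as a dict, roughly the sample stream
    a CSV driver exposes to downstream pipeline steps."""
    yield from csv.DictReader(io.StringIO(text))


rows = list(iter_csv_samples("id,label\n1,cat\n2,dog\n"))
print(rows)  # -> [{'id': '1', 'label': 'cat'}, {'id': '2', 'label': 'dog'}]
```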
If no built-in driver is suitable for your dataset, you need to define a custom driver. The custom driver should have the same interface as squirrel.driver.IterDriver. We recommend that you subclass this class and add the loading logic inside. Save this class under squirrel_datasets_core/datasets/example_dataset/driver.py.
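The exact base-class interface is best checked against the squirrel.driver API documentation; to keep the sketch below self-contained, it uses a minimal stand-in base class to show the subclassing pattern. Everything except the idea of subclassing IterDriver is illustrative:

```python
import json
from abc import ABC, abstractmethod
from typing import Iterator


class IterDriverStandIn(ABC):
    """Minimal stand-in for the role of squirrel.driver.IterDriver:
    a driver exposes the dataset as an iterator of samples."""

    name: str  # identifier under which the driver is registered

    @abstractmethod
    def get_iter(self) -> Iterator[dict]:
        ...


class ExampleDatasetDriver(IterDriverStandIn):
    """Hypothetical driver for a dataset serialized as JSON lines."""

    name = "example_dataset"

    def __init__(self, url: str) -> None:
        self.url = url  # path to the serialized dataset

    def get_iter(self) -> Iterator[dict]:
        # Loading logic: stream one parsed sample per line.
        with open(self.url) as f:
            for line in f:
                yield json.loads(line)


# Smoke test on a tiny temporary file.
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"id": 1}\n{"id": 2}\n')
    path = f.name
samples = list(ExampleDatasetDriver(path).get_iter())
os.unlink(path)
print(samples)  # -> [{'id': 1}, {'id': 2}]
```

In the real driver.py you would subclass squirrel.driver.IterDriver itself instead of the stand-in, keeping get_iter as the place where your loading logic lives.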
Note
It is not always the case that data loading occurs after the preprocessing steps. For image datasets, for example, Spark is not always the right tool; you may want to load and process the data without it, in which case you need to define the loading logic for your raw data. You may then swap the above steps or combine them more flexibly. See
squirrel_datasets_core.datasets.imagenet
for an example.