Huggingface, Deeplake, Hub, Torchvision

Ever wondered how you can tap into popular dataset libraries like Huggingface, Activeloop Deeplake, Activeloop Hub, and Torchvision with Squirrel? Squirrel provides lightweight wrappers around these libraries’ APIs, so you can quickly and easily load data from them. The benefit is that you get Squirrel’s stream manipulation functionality on top. Say you want to pre-process a Huggingface dataset with Squirrel’s multi-processing async_map: that is easily achievable with the HuggingfaceDriver.

To use the drivers, install squirrel-datasets-core with the corresponding optional dependency:

pip install "squirrel-datasets-core[huggingface]"
pip install "squirrel-datasets-core[deeplake]"
pip install "squirrel-datasets-core[hub]"
pip install "squirrel-datasets-core[torchvision]"

The examples below show how to instantiate the four drivers and what they output. Note that we simply “forward” the output of these libraries, so the output format differs between backends. In the code below, we take the first item of each pipeline with .take(1) and map a print function over it, which yields something different per backend: images coming from the Huggingface servers are PIL images, while Hub and Deeplake return them in their custom Tensor format. Write pre-processing functions that suit your use case accordingly.

from squirrel_datasets_core.driver.deeplake import DeeplakeDriver
from squirrel_datasets_core.driver.hub import HubDriver
from squirrel_datasets_core.driver.huggingface import HuggingfaceDriver
from squirrel_datasets_core.driver.torchvision import TorchvisionDriver

DeeplakeDriver("hub://activeloop/cifar100-train").get_iter().take(1).map(print).join()
# prints
# {
#     "images": Tensor(key="images", index=Index([0])),
#     "labels": Tensor(key="labels", index=Index([0])),
#     "coarse_labels": Tensor(key="coarse_labels", index=Index([0])),
# }

HubDriver("hub://activeloop/cifar100-train").get_iter().take(1).map(print).join()
# prints
# {
#     "images": Tensor(key="images", index=Index([0])),
#     "labels": Tensor(key="labels", index=Index([0])),
#     "coarse_labels": Tensor(key="coarse_labels", index=Index([0])),
# }

HuggingfaceDriver("cifar100").get_iter("train").take(1).map(print).join()
# prints
# {
#     "img": <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32 at 0x1424D6310>,
#     "fine_label": 19,
#     "coarse_label": 11,
# }

TorchvisionDriver("cifar100", download=True).get_iter().take(1).map(print).join()
# prints
# (
#     <PIL.Image.Image image mode=RGB size=32x32 at 0x143BD77C0>,
#     6,
# )

What does this look like in a realistic scenario? Let’s say you want to train a classifier on the CIFAR-100 dataset and need a torch DataLoader to feed the model. Simply create the HuggingfaceDriver as shown below and use it as a data source. A nice side effect of using the HuggingfaceDriver is that you don’t need to download the data locally: it can be streamed directly from the Huggingface servers. Beware that your machine’s internet connection may become a bottleneck here. Also note that you can pass arguments and keyword arguments to the respective drivers to influence their internals. For example, for Huggingface you can set HuggingfaceDriver(url, streaming=False) to download the data locally before starting to iterate.

import typing as t

import torch
import torchvision.transforms.functional as F
from squirrel_datasets_core.driver.huggingface import HuggingfaceDriver
from torch.utils.data import DataLoader
from torch.utils.data._utils.collate import default_collate

BATCH_SIZE = 16


def hf_pre_proc(item: t.Dict[str, t.Any]) -> t.Dict[str, torch.Tensor]:
    """Converts the data coming from Huggingface to `torch.Tensor`s, such that the torch DataLoader
        can read it. We return only the `fine_label`s and not the `coarse_label`s of the CIFAR100
        dataset.

    Args:
        item (t.Dict[str, t.Any]): Sample coming from the Huggingface servers.

    Returns:
        t.Dict[str, torch.Tensor]: Sample containing the data as tensors.
    """
    return {
        "img": F.pil_to_tensor(item["img"]) / 255,
        "label": torch.tensor(item["fine_label"]),
    }


train_driver = (
    HuggingfaceDriver("cifar100")  # can be any of above drivers, just adapt hf_pre_proc
    .get_iter("train")
    .split_by_worker_pytorch()
    .map(hf_pre_proc)
    .batched(BATCH_SIZE, default_collate)
    .to_torch_iterable()
)
train_loader = DataLoader(train_driver, batch_size=None, num_workers=2)

# your train loop
for batch in train_loader:
    assert type(batch["img"]) == torch.Tensor
    assert len(batch["img"]) == BATCH_SIZE
    # forward pass ...

Please take note of the original dataset license from the dataset provider.