Add Datasets

Here we define the datasets we want to use for training. Declaring a dataset independently of a training has the advantage that the same data can easily be reused across different trainings, even with different models.

This page describes the datasets option. All available options can also be found in the Nix AI Search.

Defining a Dataset

The following example declares the MNIST dataset from Hugging Face under the name myMnistDataset.

```
outputs = { nix-ai, ... }: nix-ai.lib.mkFlake {
  ...
    datasets = {
      myMnistDataset = {
        src.dataset = "ylecun/mnist";
        src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
      };

      myOtherDataset = ...
    };
  ...
```

There are three ways to obtain a dataset:

Load from Hugging Face
```
src.dataset = "ylecun/mnist";
```

Load from a web URL

```
src.url = "https://example.com/datasets/custom.zip";
```

Use a dataset that resides inside your git repository
```
src.path = ./datasets/custom.zip;
```
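
As a rough sketch, the three source types could sit side by side in one datasets attribute set. The names fromHub, fromWeb, and fromRepo are hypothetical, the URL and path are the placeholders used above, and the empty hash on the web download is a stand-in that has to be replaced with the real value. Whether a local path accepts or requires src.hash is not covered here, so it is left out.

```nix
datasets = {
  # Hugging Face dataset, pinned with the hash from the MNIST example above.
  fromHub = {
    src.dataset = "ylecun/mnist";
    src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
  };

  # Web download (placeholder URL). The empty hash is a stand-in only;
  # replace it with the real hash before relying on this entry.
  fromWeb = {
    src.url = "https://example.com/datasets/custom.zip";
    src.hash = "";
  };

  # File tracked inside the git repository (placeholder path).
  fromRepo = {
    src.path = ./datasets/custom.zip;
  };
};
```

Keeping each source in its own named entry lets a training refer to a dataset by name without caring where the data comes from.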

Dataset Hash

Every dataset that is downloaded from an external source needs a hash. The hash verifies that we get exactly the same dataset every time, rather than an altered or updated version slipping in unnoticed. This is especially important because a pinned dataset can be cached more efficiently.

A hash looks like this:

```
src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
```

We can obtain this hash by first running the training with an empty hash (src.hash = "";) and then reading the correct hash from the resulting error message.
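
A minimal sketch of that workflow, reusing the MNIST example from above. The shape of the error message described in the comments is an assumption based on how Nix usually reports hash mismatches for fetched sources; the exact wording printed by nix-ai may differ.

```nix
datasets = {
  myMnistDataset = {
    src.dataset = "ylecun/mnist";

    # Step 1: leave the hash empty and run the training once.
    src.hash = "";

    # Step 2 (assumed message shape): the run fails with a hash mismatch
    # and reports the hash of the data that was actually downloaded,
    # for example:
    #   got: sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=
    #
    # Step 3: replace the empty string with the reported value and rerun:
    #   src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
  };
};
```

Before pinning a reported hash, it is worth checking that the download came from the source you intended, since from then on that hash defines which data counts as correct.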
