Add Datasets
Here we define the datasets we want to use for training. Declaring a dataset independently of the training has the advantage that the same data can easily be reused for different trainings, even with different models.
This page describes the dataset option. You may also find all options in the Nix AI Search.
Defining a Dataset
The following example declares the MNIST dataset under the name myMnistDataset.
outputs = { nix-ai, ... }: nix-ai.lib.mkFlake {
  ...
  datasets = {
    myMnistDataset = {
      src.dataset = "ylecun/mnist";
      src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
    };
    myOtherDataset = ...
  };
  ...
There are three ways to obtain a dataset:
- Load from Hugging Face:
  src.dataset = "ylecun/mnist";
- Load from a web URL:
  src.url = "https://example.com/datasets/custom.zip";
- Use a dataset that resides inside your git repository:
  src.path = ./datasets/custom.zip;
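To see the three source types side by side, here is a minimal sketch of a datasets block with one entry per source. The names myUrlDataset and myLocalDataset are hypothetical and chosen only for illustration. The sketch assumes that a dataset loaded from the repository needs no hash, since it is not downloaded. The hash for the URL dataset is left empty because it is not known until the first run.

```nix
datasets = {
  # Loaded from Hugging Face, pinned with the hash from the example above.
  myMnistDataset = {
    src.dataset = "ylecun/mnist";
    src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
  };

  # Hypothetical entry: downloaded from a web URL.
  # The empty hash is a placeholder until the real one is known.
  myUrlDataset = {
    src.url = "https://example.com/datasets/custom.zip";
    src.hash = "";
  };

  # Hypothetical entry: a file inside the git repository.
  # Assumption: no hash is needed because nothing is downloaded.
  myLocalDataset = {
    src.path = ./datasets/custom.zip;
  };
};
```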
Dataset Hash
Every downloaded dataset that originates from another source needs a hash. The hash lets us verify that we are using the same dataset every time, rather than a version that was altered or updated without our noticing. This is especially important because a fixed hash also lets us cache our datasets more efficiently.
A hash looks like this:
src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
We can obtain this hash by first running our training with an empty hash (src.hash = "";) and then copying the hash reported in the resulting error message.
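The two steps could look like the following sketch. It reuses the MNIST entry from above. The exact wording of the error message is not shown here because it depends on the tooling.

```nix
datasets = {
  myMnistDataset = {
    src.dataset = "ylecun/mnist";

    # Step 1: leave the hash empty and run the training once.
    # The run is expected to fail with an error message.
    src.hash = "";

    # Step 2: replace the empty string with the hash taken from
    # the error message, for example:
    # src.hash = "sha256-iG7RkNne5gPhDmohPw0KgtLhMkUJoh4m+sp5uVRaylg=";
  };
};
```

Once the hash is filled in, later runs can verify the download against it and reuse the cached copy.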
