Advanced Dataset Preparation

Sometimes a dataset needs to be prepared before training, for example by converting it to a pickle file or by mixing several datasets.

This page describes the package option. You may also find all options in the Nix AI Search.

The `prepare` attribute

We can attach a preprocessor script to a dataset by setting that dataset's prepare attribute.

```
outputs = { nix-ai, ... }: nix-ai.lib.mkFlake {
  ...
  datasets = {
    "snli" = {
      src = {
        dataset = "stanfordnlp/snli";
        hash = "sha256-+dJMSBXJgakSlqafDy8Ja3bI/y+LwtBAdixGw66cdQM=";
      };

      prepare = {
        GPU = "any";
        directoryPath = ./datasets/snli;
        drop = [ "corpus.pkl" ];
        commands = ''
          python clean.py
        '';
      };
    };
  };
  ...
```

Many sub-attributes of prepare are similar to a training configuration. Let us break them down:

GPU: All preprocessing can be done with a GPU. Here we use any GPU available to us by setting GPU = "any";. If you do not want to use a GPU, remove this line.

directoryPath: This is the directory that contains your preparation scripts. The directoryPath should never be ./ and should never be in quotes, because a quoted value is a Nix string rather than a path:

directoryPath = ./datasets/snli;

commands: The commands that run our preprocessing script. Here we invoke the clean.py script, which was copied from our datasets/snli directory:
```
commands = ''
  python clean.py
'';
```
drop: Our preprocessor script may produce several outputs. drop lists the ones we want to keep; here that is the corpus.pkl file:

drop = [ "corpus.pkl" ];

Note: In this instance, 'drop' does not mean deleting files, but dropping them into the output directory.

Dataset preprocessor script

In this example, a dataset is downloaded from Hugging Face and filtered, and a tokenizer is trained on it so that all model trainings use the same tokens. The dataset and the tokenizer are exported together as a pickle file. The script that achieves this lies in datasets/snli/clean.py:

```
import os
import pickle

import datasets

# read the dataset path from the "dataset" environment variable
path = os.getenv("dataset")
dataset = datasets.load_dataset(path, split="train")

...

# filter the Hugging Face dataset for premise-hypothesis pairs and keep them
dataset = hf_to_pairs(dataset)

# train the tokenizer on the dataset
tok = BPETokenizer()
tok.train([train_dataset(dataset)], vocab_size=1024)

# pre-tokenize the dataset
train_data = data_process(dataset)

# dump the dataset and the tokenizer into a pickle file for later use
pkl_file = "corpus.pkl"
with open(pkl_file, "wb") as f:
    pickle.dump((train_data, tok), f)
```
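
The pickle file written by clean.py holds a (train_data, tok) tuple, so whatever reads it later has to unpack the two values in that order. The following is a minimal sketch of reading the file back, not part of Nix AI: the helper name load_corpus is hypothetical, and where corpus.pkl ends up for a training run is an assumption you need to adapt to your setup.

```python
import pickle
from pathlib import Path


def load_corpus(pkl_path):
    """Load the (train_data, tokenizer) tuple written by a script like clean.py.

    `load_corpus` is a hypothetical helper; `pkl_path` is wherever
    corpus.pkl is available to the calling code (an assumption).
    """
    with Path(pkl_path).open("rb") as f:
        # same order as in pickle.dump((train_data, tok), f)
        train_data, tok = pickle.load(f)
    return train_data, tok
```

Calling train_data, tok = load_corpus("corpus.pkl") then returns both objects. Note that unpickling the tokenizer only works if the module defining its class (BPETokenizer in the script above) can be imported by the code that loads the file.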
