
Advanced Dataset Preparation

Sometimes a dataset needs to be prepared before training, for example by converting it to a pickle file or by mixing several datasets.

This page describes the package option. You may also find all options in the Nix AI Search.

The prepare attribute

We can attach a preprocessor script to a dataset by setting its prepare attribute.

outputs = { nix-ai, ... }: nix-ai.lib.mkFlake {
  ...
  datasets = {
    "snli" = {
      src = {
        dataset = "stanfordnlp/snli";
        hash = "sha256-+dJMSBXJgakSlqafDy8Ja3bI/y+LwtBAdixGw66cdQM=";
      };

      prepare = {
        GPU = "any";
        directoryPath = ./datasets/snli;
        drop = [ "corpus.pkl" ];
        commands = ''
          python clean.py
        '';
      };
    };
  };
  ...

Many sub-attributes of prepare are similar to a training configuration. Let us break them down:

  • GPU: Preprocessing can optionally run on a GPU. Here we use any GPU available to us:

    GPU = "any";

    If you do not want to use a GPU, remove this line.

  • directoryPath: The directory that contains your preparation scripts. The directoryPath should never be ./ and should never be in quotes, because it is a Nix path and not a string:

    directoryPath = ./datasets/snli;

  • commands: How do we invoke our processing script? Here we invoke the clean.py script, which was copied from our datasets/snli directory:

    commands = ''
    python clean.py
    '';

  • drop: Our preprocessor script may produce several outputs. drop lists the ones we want to keep; here we keep the corpus.pkl file:

    drop = [ "corpus.pkl" ];

    Note: Here, 'drop' does not mean deleting files, but dropping them into the output directory.
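The meaning of drop can be pictured with a small Python sketch. This is only a conceptual model and not how nix-ai implements it: the function keep_dropped_files, the work and out directories, and the error raised for a missing file are all hypothetical. It shows that files named in drop are copied into the output directory, while everything else the script wrote is left behind.

```python
import shutil
import tempfile
from pathlib import Path


def keep_dropped_files(work_dir: Path, out_dir: Path, drop: list[str]) -> list[Path]:
    """Hypothetical helper: copy only the files named in `drop` into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    for name in drop:
        source = work_dir / name
        if not source.is_file():
            # Assumption: a missing declared output is treated as an error.
            raise FileNotFoundError(f"expected output {name!r} was not produced")
        kept.append(Path(shutil.copy2(source, out_dir / name)))
    return kept


# Simulate a preparation run that writes two files; only one is listed in drop.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    out = Path(tmp) / "out"
    work.mkdir()
    (work / "corpus.pkl").write_bytes(b"placeholder")
    (work / "scratch.txt").write_text("intermediate data")

    keep_dropped_files(work, out, ["corpus.pkl"])
    kept_names = sorted(p.name for p in out.iterdir())
```

In this model, scratch.txt never reaches the output directory because it is not listed in drop.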

Dataset preprocessor script

In this example, a dataset is downloaded from Hugging Face and filtered, and a tokenizer is trained on it so that all model trainings use the same tokens. The dataset and the tokenizer are exported together as a single pickle file. The script that achieves this lies in datasets/snli/clean.py:

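import os
import pickle

import datasets

# read the dataset location from the "dataset" environment variable
path = os.getenv("dataset")
dataset = datasets.load_dataset(path, split="train")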

...

# filter the huggingface dataset for premise-hypothesis pairs and keep them
dataset = hf_to_pairs(dataset)
# train the tokenizer on the dataset
tok = BPETokenizer()
tok.train([train_dataset(dataset)], vocab_size=1024)

# pre tokenize the dataset
train_data = data_process(dataset)

# create a pickle file and dump the tokenizer and dataset for later use
pkl_file = "corpus.pkl"
with open(pkl_file, "wb") as f:
    pickle.dump((train_data, tok), f)
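A training script can later read the file back by unpacking the tuple in the same order in which it was dumped. The following is a minimal sketch of that round trip, not part of clean.py: save_corpus and load_corpus are hypothetical helper names, and a plain list and dict stand in for the real tokenized dataset and tokenizer.

```python
import pickle
import tempfile
from pathlib import Path


def save_corpus(path, train_data, tok):
    """Dump dataset and tokenizer as one tuple, mirroring the script above."""
    with open(path, "wb") as f:
        pickle.dump((train_data, tok), f)


def load_corpus(path):
    """Load the tuple back; the order must match the order used when dumping."""
    with open(path, "rb") as f:
        train_data, tok = pickle.load(f)
    return train_data, tok


# Stand-in values (assumptions): the real script stores tokenized data and a tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    pkl_file = Path(tmp) / "corpus.pkl"
    save_corpus(pkl_file, [[1, 2, 3], [4, 5]], {"vocab_size": 1024})
    loaded_data, loaded_tok = load_corpus(pkl_file)
```

When the pickle contains a real tokenizer object, the class that defines it must be importable in the script that loads the file. Since unpickling can execute arbitrary code, only load pickle files that you produced yourself or otherwise trust.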