Advanced Dataset Preparation
Sometimes we want to prepare a dataset, for example by converting it to a pickle file or by mixing different datasets.
This page describes the prepare option. You can also find all options in the Nix AI Search.
The prepare attribute
We can attach a preprocessor script to a dataset by setting the dataset's prepare attribute.
outputs = { nix-ai, ... }: nix-ai.lib.mkFlake {
  ...
  datasets = {
    "snli" = {
      src = {
        dataset = "stanfordnlp/snli";
        hash = "sha256-+dJMSBXJgakSlqafDy8Ja3bI/y+LwtBAdixGw66cdQM=";
      };
      prepare = {
        GPU = "any";
        directoryPath = ./datasets/snli;
        drop = [ "corpus.pkl" ];
        commands = ''
          python clean.py
        '';
      };
    };
  };
  ...
Many sub-attributes of prepare are similar to a training configuration. Let us break them down:
- GPU: Preprocessing can be done with a GPU. Here we use any GPU available to us by setting GPU = "any";. If you do not want to use a GPU, remove this line.
- directoryPath: This is the directory that contains your preparation scripts. The directoryPath should never be ./ and should never be in quotes: directoryPath = ./datasets/snli;
- commands: This is how we invoke our processing script. Here we invoke the clean.py script, which was copied from our datasets/snli directory:
  commands = ''
    python clean.py
  '';
- drop: Our preprocessor script produces several outputs. We want to keep the corpus.pkl file here: drop = [ "corpus.pkl" ];
  Note: In this instance, 'drop' does not mean deleting files, but dropping them into the output directory.
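To make the meaning of drop concrete, here is a minimal Python sketch of the behavior described in the note above. The helper collect_outputs and the directory names are hypothetical and exist only for illustration; this is not how the framework itself is implemented.

```python
import shutil
from pathlib import Path


def collect_outputs(work_dir, out_dir, drop):
    """Copy every file named in `drop` from work_dir into out_dir.

    Hypothetical helper: it only illustrates that `drop` means
    "drop into the output directory", not "delete".
    """
    work_dir, out_dir = Path(work_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    for name in drop:
        source = work_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"expected output {name!r} was not produced")
        # The file is copied, so the original in work_dir is left untouched.
        shutil.copy2(source, out_dir / name)
        kept.append(name)
    return kept
```

Under this reading, any file the script writes but that is not listed in drop simply does not appear in the output directory.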
Dataset preprocessor script
In this example, a dataset is downloaded from Hugging Face and filtered, and a tokenizer is trained on it so that every model training uses the same tokens. The dataset and the tokenizer are exported together as a pickle file. The script that achieves this lies in datasets/snli/clean.py:
import os
import pickle

import datasets

# read the dataset location from the "dataset" environment variable
path = os.getenv("dataset")
dataset = datasets.load_dataset(path, split="train")
...
# filter the Hugging Face dataset for premise-hypothesis pairs and keep them
dataset = hf_to_pairs(dataset)
# train the tokenizer on the dataset
tok = BPETokenizer()
tok.train([train_dataset(dataset)], vocab_size=1024)
# pre-tokenize the dataset
train_data = data_process(dataset)
# create a pickle file and dump the tokenizer and dataset for later use
pkl_file = "corpus.pkl"
with open(pkl_file, "wb") as f:
    pickle.dump((train_data, tok), f)
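For completeness, the pickle file can be read back later with the standard pickle module. The following is a minimal sketch; the helper name load_corpus and its directory argument are assumptions made for illustration and are not part of the original script.

```python
import pickle
from pathlib import Path


def load_corpus(directory, filename="corpus.pkl"):
    """Return the (train_data, tok) tuple that clean.py dumped.

    Hypothetical helper: `directory` is wherever the prepared
    dataset ends up; how that location is passed to a training
    script is not shown here.
    """
    with open(Path(directory) / filename, "rb") as f:
        train_data, tok = pickle.load(f)
    return train_data, tok
```

Note that unpickling the tokenizer only works if its class (BPETokenizer in the script above) can be imported in the process that loads the file, and that pickle files should only be loaded from sources you trust.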