Types of Experiments

Abed currently handles three types of experiments. The type of experiment is selected with the TYPE setting. The chosen type affects several other settings in the settings file (see The Abed Settings).

Model assessment

TYPE = 'ASSESS'

With this setting, Abed runs a grid search over the specified parameters for each method on every specified dataset. It is useful for simple experiments with a single dataset per command, or for experiments where you want to perform cross validation on a dataset. Example settings are:

TYPE = 'ASSESS'

DATASETS = ['dataset_1', 'dataset_2']
METHODS = ['Lasso', 'Ridge']

costs = [pow(2, x) for x in range(-8, 9, 2)]
PARAMS = {
    'Lasso': {
        'cost': costs
        },
    'Ridge': {
        'cost': costs
        }
    }

COMMANDS = {
    'Lasso': ("python {execdir}/lasso.py {datadir}/{dataset}.txt "
        "{cost}"),
    'Ridge': ("python {execdir}/ridge.py {datadir}/{dataset}.txt "
        "{cost}")
    }

where lasso.py and ridge.py are python scripts that parse the command line arguments, load the dataset specified on the command line, and run Lasso or Ridge regression with the specified cost parameter, respectively. See also Designing Executable Scripts.
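To make the grid search concrete, the following is a minimal sketch of how the settings above could be expanded into individual commands, one per combination of dataset, method, and parameter value. It is an illustration only and not Abed's actual implementation: the function expand_commands and the two directory paths are hypothetical names introduced for this example.

```python
from itertools import product

# Settings copied from the example above.
DATASETS = ['dataset_1', 'dataset_2']
METHODS = ['Lasso', 'Ridge']
costs = [pow(2, x) for x in range(-8, 9, 2)]
PARAMS = {'Lasso': {'cost': costs}, 'Ridge': {'cost': costs}}
COMMANDS = {
    'Lasso': "python {execdir}/lasso.py {datadir}/{dataset}.txt {cost}",
    'Ridge': "python {execdir}/ridge.py {datadir}/{dataset}.txt {cost}",
}


def expand_commands(execdir, datadir):
    """Hypothetical helper: build one command per grid point."""
    commands = []
    for dataset, method in product(DATASETS, METHODS):
        names = sorted(PARAMS[method])
        # Cartesian product over all parameter value lists of this method.
        for values in product(*(PARAMS[method][n] for n in names)):
            settings = dict(zip(names, values))
            commands.append(COMMANDS[method].format(
                execdir=execdir, datadir=datadir, dataset=dataset,
                **settings))
    return commands


# The paths are placeholders chosen for this sketch.
cmds = expand_commands('/path/to/exec', '/path/to/data')
print(len(cmds))  # 2 datasets x 2 methods x 9 cost values = 36
print(cmds[0])
```

Each resulting string is a complete shell command, which is why the scripts only need to read their inputs from the command line.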

Note that the executables do not have to be Python scripts. The command is executed by the system, so it can be an R program run with Rscript, a compiled executable, or anything else.

Also of note is the addition of ".txt" to the {dataset} variable in the command. This implies that we expect the datasets to be stored in files dataset_1.txt and dataset_2.txt in the dataset folder (see DATADIR).

Nested Cross-Validation

TYPE = 'CV_TT'

With this setting, a training dataset and a test dataset are expected for each command. This is useful when you want to train a model on one dataset and test it on another, or when you want to run nested cross validation. Example settings are as follows:

TYPE = 'CV_TT'

DATASETS = [('dataset_1_train', 'dataset_1_test'),
        ('dataset_2_train', 'dataset_2_test')]
METHODS = ['Lasso', 'Ridge']

costs = [pow(2, x) for x in range(-8, 9, 2)]
PARAMS = {
    'Lasso': {
        'cost': costs
        },
    'Ridge': {
        'cost': costs
        }
    }

COMMANDS = {
    'Lasso': ("python {execdir}/lasso.py {datadir}/{train_dataset}.txt "
        "{datadir}/{test_dataset}.txt {cost}"),
    'Ridge': ("python {execdir}/ridge.py {datadir}/{train_dataset}.txt "
        "{datadir}/{test_dataset}.txt {cost}")
    }

Now it is expected that the executables lasso.py and ridge.py accept two command line arguments for the datasets. Note that the datasets are provided as tuples of training and test datasets.
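As a rough sketch of what such an executable might look like on the input side, the snippet below parses the three positional arguments used in the commands above with argparse. The argument names train_file, test_file, and cost are assumptions made for this example; Abed only requires that the script understands the command line it is given.

```python
import argparse


def parse_args(argv=None):
    """Parse the train file, test file, and cost from the command line."""
    parser = argparse.ArgumentParser(
        description="Train on one dataset and test on another.")
    parser.add_argument('train_file', help="path to the training dataset")
    parser.add_argument('test_file', help="path to the test dataset")
    parser.add_argument('cost', type=float, help="regularization cost")
    return parser.parse_args(argv)


# In a real script you would call parse_args() without arguments so that
# sys.argv is used; an explicit list is passed here for illustration.
args = parse_args(['dataset_1_train.txt', 'dataset_1_test.txt', '0.25'])
print(args.train_file, args.test_file, args.cost)
```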

This option was designed with nested cross validation in mind. One would create K splits of a dataset on disk, each consisting of a separate training and test dataset. Then, each executable performs, for instance, 10-fold cross validation on each of the K training sets, each time predicting the corresponding test dataset. Results on both the training and test datasets would be printed to the output. Later, the label used for the training data can be set using the YTRAIN_LABEL setting. When generating the results, Abed will find out which parameter setting performs best on the training dataset, and show the performance on the test dataset. See Generate result tables for cv_tt mode for more information.
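One way to create the K splits on disk is sketched below. This is only an illustration under stated assumptions: the helpers k_fold_splits and write_splits, the file naming scheme (name_train_i.txt and name_test_i.txt), and the one-row-per-line file format are all hypothetical and not prescribed by Abed. The returned list of tuples has the shape expected by the DATASETS setting.

```python
import os
import tempfile


def k_fold_splits(rows, k):
    """Yield (train_rows, test_rows) pairs for each of the k folds."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


def write_splits(rows, k, outdir, name):
    """Write k train/test files and return the names as DATASETS tuples."""
    datasets = []
    for i, (train, test) in enumerate(k_fold_splits(rows, k)):
        train_name = "%s_train_%d" % (name, i)
        test_name = "%s_test_%d" % (name, i)
        for fname, part in ((train_name, train), (test_name, test)):
            with open(os.path.join(outdir, fname + ".txt"), "w") as fh:
                fh.write("\n".join(part) + "\n")
        datasets.append((train_name, test_name))
    return datasets


# Placeholder rows and a temporary directory, used only for this sketch.
rows = ["row %d" % i for i in range(10)]
outdir = tempfile.mkdtemp()
datasets = write_splits(rows, 5, outdir, "dataset_1")
print(datasets[0])
```

In practice you would shuffle the rows before splitting and write the files to the dataset folder rather than a temporary directory.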

Raw command file

TYPE = 'RAW'

This setting can be used for experiments that do not fully fit in either of the above frameworks. It allows you to provide a file with commands through the setting RAW_CMD_FILE. The raw command file should contain the commands you wish to execute, one per line (empty lines are allowed). The variables {execdir} and {datadir} can be used as with the other experiment types; no other variables are substituted. A command file could look like this:

python {execdir}/lasso.py {datadir}/dataset_1.txt 1.0
python {execdir}/lasso.py {datadir}/dataset_1.txt 5.0
python {execdir}/lasso.py {datadir}/dataset_1.txt 10.0
python {execdir}/lasso.py {datadir}/dataset_1.txt 50.0
python {execdir}/lasso.py {datadir}/dataset_1.txt 100.0

python {execdir}/ridge.py {datadir}/dataset_1.txt 1.0
python {execdir}/ridge.py {datadir}/dataset_1.txt 5.0
python {execdir}/ridge.py {datadir}/dataset_1.txt 10.0
python {execdir}/ridge.py {datadir}/dataset_1.txt 50.0
python {execdir}/ridge.py {datadir}/dataset_1.txt 100.0

Note that the DATASETS and METHODS settings are not used with this experiment type. The command file should also be added to the git repository, as otherwise it will not be uploaded to the cluster.
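A command file like the one shown above does not have to be written by hand. The sketch below generates the same lines with a small script; the helper raw_command_lines is a hypothetical name for this example, and the {execdir} and {datadir} placeholders are deliberately left unexpanded so that Abed can fill them in.

```python
def raw_command_lines(methods, costs, dataset='dataset_1'):
    """Return command lines, with an empty line between methods."""
    lines = []
    for method in methods:
        for cost in costs:
            # %-formatting leaves the {execdir}/{datadir} braces untouched.
            lines.append("python {execdir}/%s.py {datadir}/%s.txt %s"
                         % (method, dataset, cost))
        lines.append("")
    return lines[:-1]  # drop the trailing empty line


lines = raw_command_lines(['lasso', 'ridge'], [1.0, 5.0, 10.0, 50.0, 100.0])
print("\n".join(lines))
```

Writing the joined text to a file and pointing RAW_CMD_FILE at it would give the command file from the example.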