The Abed Settings file

As a user, you define your experiment for Abed in the settings file. Because there are quite a few settings, the file is divided into sections of related settings. The documentation below follows the same sections.

General settings

PROJECT_NAME

Default: ''

The name of the project to use. This setting is used in the default of the setting REMOTE_DIR, and can therefore not be empty when accepting the default for that setting. It can contain a forward slash (/), when necessary. The project name further only occurs in the HTML output generated by Abed.

TASK_FILE

Default: abed.constants.TASKS_FILENAME

This is the filename of the task file. It shouldn’t generally need to be changed from the default value.

AUTO_FILE

Default: abed.constants.AUTO_FILENAME

This is the filename of the auto file. It shouldn’t generally need to be changed from the default value.

RESULT_DIR

Default: '/path/to/local/results'

Path to the local results directory. This is where the results of all computations will eventually end up. Because the results can be large, it is wise to choose this location carefully. For instance, when working in a Dropbox folder, it may not be feasible to store all results in Dropbox. To ensure this is considered, the default path is not directly usable and should be changed by the user.

STAGE_DIR

Default: '/path/to/local/stagedir'

Path to the local stage directory. This is the directory where the results are temporarily stored before being moved to the result directory. The same considerations apply as for the RESULT_DIR setting. The default should be adjusted by the user.

MAX_FILES

Default: 1000

The maximum number of files stored in a result directory on the remote. When results are generated on the remote, they are placed in subdirectories of at most this many files. Each subdirectory will later be zipped before it can be downloaded from the remote, so it can be useful to not make the zip files too large. With this setting a limit can be placed on the number of files per subdirectory of the remote result directory, and therefore on the size of the zip files that need to be downloaded. The default should be fine for most users.
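The effect of this limit can be illustrated with a minimal sketch. This is not Abed's own code; the function name split_into_subdirs and the file names are made up for the example, which only shows how a list of result files falls into groups of at most MAX_FILES.

```python
MAX_FILES = 1000

def split_into_subdirs(filenames, max_files=MAX_FILES):
    """Group filenames into consecutive chunks of at most max_files."""
    return [filenames[i:i + max_files]
            for i in range(0, len(filenames), max_files)]

# 2500 hypothetical result files need three subdirectories.
files = ['result_%i.txt' % i for i in range(2500)]
chunks = split_into_subdirs(files)
print([len(c) for c in chunks])
```

Each chunk would correspond to one subdirectory, and therefore to one zip file to download.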

ZIP_DIR

Default: './zips'

Path to the local zip directory. This is where the compressed zip files with results are stored. These files will be unpacked into the stage dir and moved to the result directory (see RESULT_DIR). The zip files are kept by Abed for posterity, but can be removed when the user sees no need for them anymore (the raw results will be kept in the result directory).

LOG_DIR

Default: './logs'

Local path where the PBS logs will be kept.

OUTPUT_DIR

Default: './output'

Local path where the output files are placed that are generated by Abed.

AUTO_SLEEP

Default: 120

Sleep time in seconds between consecutive checks on the remote when running in auto mode. In auto mode Abed checks the remote repeatedly to see if a submitted task is finished, so it can pull the results and submit a new task. The time between checks is defined here. See also Automatic Job Management.

HTML_PORT

Default: 8000

Port on localhost on which the HTML pages will be served. This is used by the function abed.html.view.view_html(). It is automatically incremented when the port is already in use, so it generally does not have to be changed by the user.

COMPRESSION

Default: 'bzip2'

Compression algorithm to use when compressing finished result directories. Abed can compress result files automatically when all computations for a dataset are finished. This is the algorithm that is used by Abed to compress the tar file that results. Allowed choices are: bzip2, gzip, and lzma. See also the documentation in Result Compression.
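To give an idea of what the three choices mean in practice, the sketch below maps them to the write modes of Python's standard tarfile module. It is an illustration only: the helper compress_dir and the mapping TAR_MODES are hypothetical names, and Abed's own compression code may work differently.

```python
import os
import tarfile
import tempfile

# Map the allowed COMPRESSION values to tarfile write modes.
TAR_MODES = {'bzip2': 'w:bz2', 'gzip': 'w:gz', 'lzma': 'w:xz'}

def compress_dir(path, archive, compression='bzip2'):
    """Pack the directory at `path` into a compressed tar archive."""
    if compression not in TAR_MODES:
        raise ValueError('Unknown compression: %r' % compression)
    with tarfile.open(archive, TAR_MODES[compression]) as tar:
        tar.add(path, arcname=os.path.basename(path))
    return archive

# Usage with a throwaway directory holding one fake result file.
workdir = tempfile.mkdtemp()
result_dir = os.path.join(workdir, 'dataset_1')
os.mkdir(result_dir)
with open(os.path.join(result_dir, 'result.txt'), 'w') as fh:
    fh.write('time\n0.84\n')
archive = compress_dir(result_dir, os.path.join(workdir, 'dataset_1.tar.bz2'))
```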

Server parameters and settings

REMOTE_USER

Default: 'username'

Username on the remote server. This is assumed to be the username to use when logging in on the remote server, and under which the jobs will be submitted. It should be changed by the user.

REMOTE_HOST

Default: 'address.of.host'

The address of the remote server. For instance, for the Dutch LISA compute cluster, it is lisa.surfsara.nl. It should be changed by the user.

REMOTE_DIR

Default: '/home/%s/projects/%s' % (REMOTE_USER, PROJECT_NAME)

Path on the remote server in which to place all project files. The default assumes that the remote server is a Unix server with the standard directory layout. It uses the REMOTE_USER setting and the PROJECT_NAME setting to construct the remote path, so it does not necessarily have to be changed by the user.
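As a small illustration of how this default is built, the following sketch evaluates the same format expression with placeholder values. The user name 'jdoe' and the project name 'myproject' are made up for the example.

```python
# Hypothetical values, chosen only for illustration.
REMOTE_USER = 'jdoe'
PROJECT_NAME = 'myproject'

# Same expression as the documented default for REMOTE_DIR.
REMOTE_DIR = '/home/%s/projects/%s' % (REMOTE_USER, PROJECT_NAME)
print(REMOTE_DIR)

# With an empty project name the path ends at the projects directory,
# which is why PROJECT_NAME should not be empty when using the default.
EMPTY_NAME_DIR = '/home/%s/projects/%s' % (REMOTE_USER, '')
print(EMPTY_NAME_DIR)
```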

REMOTE_PORT

Default: 22

Remote communication port to use. Usually communication is done over the SSH port, so that is the default. To learn more about how communication is done with the remote server, see Communication through Fabric and Utilities for working with Fabric.

REMOTE_SCRATCH

Default: None

Typically, it is more efficient to place results generated by a job on the compute cluster on a disk which is close to the node which executes the job, as opposed to in the user’s home directory. Such a directory is called the scratch directory, and it is typically erased after a job is finished. Abed places the results it generates on this scratch directory, and periodically copies them to the home directory of the user, to lower the burden on the network of the cluster. On the Dutch LISA cluster, the path to the scratch directory is given by an environment variable, but on other systems it may be a fixed location. This setting exists for the latter case. For the former case, see REMOTE_SCRATCH_ENV. This setting is used by abed.run_utils.get_scratchdir().

REMOTE_SCRATCH_ENV

Default: 'TMPDIR'

See also the description under the setting REMOTE_SCRATCH. This setting exists in case the location of the scratch directory is given by an environment variable. The name of the environment variable can be set here. The default corresponds to the name used on the Dutch LISA compute cluster. This setting is used by abed.run_utils.get_scratchdir().
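The interplay between REMOTE_SCRATCH and REMOTE_SCRATCH_ENV could look roughly like the sketch below. This is not the implementation of abed.run_utils.get_scratchdir(); the function resolve_scratchdir is hypothetical, and the order of precedence (fixed path first, environment variable second) is an assumption made for the example.

```python
import os

def resolve_scratchdir(remote_scratch=None, remote_scratch_env='TMPDIR',
                       environ=None):
    """Return a fixed scratch path if given, else look up the env variable."""
    environ = os.environ if environ is None else environ
    # Assumption: a fixed location takes precedence over the variable.
    if remote_scratch is not None:
        return remote_scratch
    # Returns None when the variable is not set.
    return environ.get(remote_scratch_env)

# A fake environment is passed in so the example does not depend on
# the machine it runs on.
fake_env = {'TMPDIR': '/scratch/job_123'}
print(resolve_scratchdir(None, 'TMPDIR', fake_env))
print(resolve_scratchdir('/local/scratch', 'TMPDIR', fake_env))
```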

Settings for Master/Worker program

MW_SENDATONCE

Default: 100

When running tasks on the compute node, Abed works through a Master/Worker system, where the master thread continually sends out work to the worker threads. To reduce communication load, it sends out a number of tasks at once. This setting defines how many tasks are sent. Note that as soon as there are fewer tasks left than n_workers * MW_SENDATONCE, Abed reduces the number of tasks to send to 1. This ensures that the situation can’t occur where one worker gets all the remaining tasks whereas other workers get none.
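The rule described above can be written down as a short sketch. The function name tasks_to_send is made up for the example and this is not Abed's master code; it only restates the threshold n_workers * MW_SENDATONCE.

```python
def tasks_to_send(n_remaining, n_workers, sendatonce=100):
    """Batch size following the rule described for MW_SENDATONCE."""
    if n_remaining < n_workers * sendatonce:
        # Few tasks left: hand them out one at a time so that no
        # worker is left without work.
        return 1
    return sendatonce

# With 15 workers the threshold is 15 * 100 = 1500 remaining tasks.
print(tasks_to_send(5000, 15))
print(tasks_to_send(1200, 15))
```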

MW_COPY_WORKER

Default: False

Abed can reserve a thread on the compute node to periodically copy results from the scratch directory on the node to the home directory of the user. However, this places additional load on the internal bandwidth of the compute cluster, so it is turned off by default. In any case, the results will be copied from the scratch directory as compressed archives after the wall time has ended.

MW_COPY_SLEEP

Default: 120

If MW_COPY_WORKER is True, Abed reserves one thread on the compute node for periodically copying the result files from the scratch directory to the home directory (see also REMOTE_SCRATCH). This setting defines the time in seconds between consecutive copies of the results.

MW_NUM_WORKERS

Default: None

The number of worker processes to use. This can be useful in applications where you don’t want to use all available processes, for instance because you’re using parallelism in your application. Allowed values are a number or None. If None is given, the maximum number of worker processes is started. Note that the master process and the copy worker do not count as worker processes (see also MW_COPY_WORKER).

Experiment type

These settings define the type of experiment that Abed will run. For a more in-depth overview, see Types of Experiments. Note that not all settings apply to all types of experiments. Therefore, the initial settings file generated by Abed requires the user to uncomment the block of settings relating to the specific type of experiment they want to use.

TYPE

Default: Undefined.

This setting defines the type of experiment that will be run. Valid options are 'ASSESS', 'CV_TT', and 'RAW'. See Types of Experiments for a more in-depth discussion of the different types.

CV_BASESEED

Default: 123456

Only used when the experiment type is 'CV_TT'. This defines the base seed to use for the generation of the cross-validation seeds. This setting is used by the function abed.tasks.init_tasks_cv_tt().

YTRAIN_LABEL

Default: 'y_train'

Only used when the experiment type is 'CV_TT'. This defines the label for the part in the result file which corresponds to the training set.

RAW_CMD_FILE

Default: '/path/to/file.txt'

Only used when the experiment type is 'RAW'. This is the path to the file with raw tasks. See also Types of Experiments.

Build settings

NEEDS_BUILD

Default: False

Whether or not compilation is necessary on the remote.

BUILD_DIR

Default: 'build'

Directory where the build command needs to be executed. It is assumed that this is a subdirectory of the current directory on the remote. The path defined here should therefore be the path in the git archive. With the default setting, the local Abed directory would look like:

|--- abed_conf.py
|--- abed_tasks.txt
|--- abed_auto.txt
|--- build
|--- datasets
|--- execs

BUILD_CMD

Default: 'make all'

The command to run on the remote. This is run in the directory given by the BUILD_DIR setting.

Experiment parameters and settings

DATADIR

Default: abed.constants.DATASET_DIRNAME

The path where the datasets are stored. Note that this does not necessarily have to be in the Abed working directory (where you typed abed init). However, on the remote the datasets will be placed in a directory called datasets in the remote working directory. By using this setting it is possible to place your datasets in a directory outside of the Abed working directory. This can be very useful when you’re running multiple experiments that use the same datasets.

EXECDIR

Default: abed.constants.EXECS_DIRNAME

The path where the executables are stored. These executables are used in the commands that Abed runs (see the setting COMMANDS). It is advisable to keep the default setting as this makes it easier to place executables under version control.

DATASETS

Default: ['dataset_1', 'dataset_2']

This setting defines the names of the datasets that will be used in the experiments. Abed expects a different type in the list depending on the TYPE that is used:

  • When TYPE is 'ASSESS', the expected format is the same as the default, simply a list of names of the datasets, as strings.

  • When TYPE is 'CV_TT', the expected format is a list of tuples, where each tuple is a pair of strings. The first string gives the name of the training dataset, and the second string the name of the test dataset. For instance:

    DATASETS = [('dataset_1_train', 'dataset_1_test'), ('dataset_2_train',
    'dataset_2_test')]
    
  • When TYPE is 'RAW', this setting is not used.

For more information on the different experiment types and their requirements, see Types of Experiments.

DATASET_NAMES

Default: {k:str(i) for i, k in enumerate(DATASETS)}

Optional. This setting gives a mapping of datasets to names. This can be useful when you wish to use different names for the datasets in the output than in the DATASETS setting. If this setting is not present in the settings file, the ID of a dataset will be generated with the function abed.datasets.dataset_name().

As an example, consider dataset names following the pattern 'dataset_1', 'dataset_2', etc. It may then be nice to use names such as '001', '002', etc. in the result tables. This can be achieved by setting:

DATASET_NAMES = {k:'%03i' % int(k.split('_')[-1]) for k in DATASETS}

Note that this setting relates closely to the setting DATA_DESCRIPTION_CSV.

METHODS

Default: ['method_1', 'method_2']

Here you define the names of the methods that you will use. These names must be the same as the ones used in the PARAMS and the COMMANDS settings. Since these names will also be used in directory names, it is advisable not to use spaces or other illegal characters in them.

PARAMS

Default:

{
    'method_1': {
        'param_1': [val_1, val_2],
        'param_2': [val_3, val_4],
        'param_3': [val_5, val_6]
        },
    'method_2': {
        'param_1': [val_1, val_2, val_3],
        },
 }

As described in How Abed works, Abed runs a grid search where the commands defined in COMMANDS are run for the respective method on each dataset, for all possible combinations of the parameter values. The values that are used are defined in this setting. A list of values is expected for each parameter, even if only one value is used. Note that the names used for the parameters must match those used in the commands; it is up to the user to ensure that they are the same. This setting is not used with the 'RAW' experiment type.
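To make the size of the grid concrete, the sketch below enumerates all combinations for one method with itertools.product. It is only an illustration of the grid search idea: the helper expand_grid and the parameter values are made up, and Abed's own task generation may order or encode the tasks differently.

```python
from itertools import product

# Hypothetical parameter values for illustration.
PARAMS = {
    'method_1': {'param_1': [1, 2], 'param_2': [0.1, 0.5]},
    'method_2': {'param_1': [1, 2, 3]},
}
DATASETS = ['dataset_1', 'dataset_2']

def expand_grid(method, datasets, params):
    """Yield one (dataset, parameter dict) pair per grid point."""
    names = sorted(params[method])
    for dataset in datasets:
        for values in product(*(params[method][n] for n in names)):
            yield dataset, dict(zip(names, values))

grid = list(expand_grid('method_1', DATASETS, PARAMS))
# 2 datasets * 2 values of param_1 * 2 values of param_2
print(len(grid))
```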

COMMANDS

Default:

{
    'method_1': ("{execdir}/method_1 {datadir}/{dataset} {param_1} "
        "{param_2} {param_3}"),
    'method_2': "{execdir}/method_2 {datadir}/{dataset} {param_1}"
}

Abed works by calling external commands for each method. The advantage of running external commands is that Abed can be used regardless of the language in which the methods are implemented. This setting defines the commands that Abed needs to run for each method. The variables {execdir} and {datadir} are special variables, which are formatted by Abed automatically. The {param_*} variables correspond to the names defined in PARAMS. Finally, the {dataset} variable will be formatted by Abed based on the names of the datasets defined in the DATASETS setting. Note that it is up to the user to ensure the right file extension is supplied here. This means that if the name of a dataset defined in DATASETS is, for instance, 'iris', but the filename on disk is 'iris.txt', the command should be adjusted to contain {datadir}/{dataset}.txt.

There are slight differences between the way the commands are used depending on the type of experiment that is run (see TYPE). Thus,

  • When TYPE is 'ASSESS', the expected form for the dataset part of the command is {dataset} (as in the default).

  • When TYPE is 'CV_TT', both a training and a test dataset should be included in the command, with the variables {train_dataset} and {test_dataset}, respectively. Thus, for this format a command could look like:

    COMMANDS = {'method_1': ("{execdir}/method_1 {datadir}/{train_dataset} "
        "{datadir}/{test_dataset} {param_1} {param_2} {param_3}")}
    
  • When TYPE is 'RAW', this setting is not used.
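As an illustration of how the placeholders in a command are filled in, the sketch below uses Python's str.format with made-up values for a single task. That Abed uses str.format internally is an assumption; the directory names and parameter values are hypothetical as well. The command includes the .txt extension, following the 'iris' example above.

```python
COMMANDS = {
    'method_1': ("{execdir}/method_1 {datadir}/{dataset}.txt {param_1} "
                 "{param_2} {param_3}"),
}

# Hypothetical values for one task of an 'ASSESS' experiment.
cmd = COMMANDS['method_1'].format(
    execdir='execs', datadir='datasets', dataset='iris',
    param_1=1, param_2=0.5, param_3='linear')
print(cmd)
```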

METRICS

Default:

{
    'NAME_1': {
        'metric': metric_function_1,
        'best': max
        },
    'NAME_2': {
        'metric': metric_function_2,
        'best': min
        }
}

This setting defines the metrics that are applied to the output of a single command. The user is free to define any function here, although Abed currently expects a function that takes two lists as input. It is therefore recommended to either use functions from sklearn.metrics, or define functions with a similar signature. See Defining custom metric functions for instructions on how to include custom metrics.

Note that in these settings, a name can be defined by the user, as well as which direction is considered better in the metric function. This is done by defining the 'best' field, which can be either the max function, or the min function. These directions will be used when Abed ranks the results of a method on a given dataset with a given set of parameters.
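A minimal sketch of a custom metric with this two-list signature is given below. The function accuracy and the name 'ACC' are chosen for the example, and the argument order (true labels first, predictions second) follows the sklearn.metrics convention, which is assumed here rather than confirmed.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('Input lists must have the same length')
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

METRICS = {
    'ACC': {
        'metric': accuracy,
        'best': max     # higher accuracy is better
    },
}

print(METRICS['ACC']['metric']([1, 0, 1, 1], [1, 1, 1, 0]))
```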

SCALARS

Default:

{
    'time': {
        'best': min
        },
}

To compare results from a command on a single variable, this setting can be used. This can be useful when one wants to compare the computation time of a command for instance. The external executable could print for instance:

time
0.8473294179

With the default setting for the SCALARS field, Abed would read this value as a scalar result for the command.
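To show the idea, the sketch below picks up the number printed on the line after a label. The function read_scalar is hypothetical and the parsing rule is an assumption based only on the two-line example above; the way Abed actually reads its result files is not described in this section.

```python
def read_scalar(output, label):
    """Return the float on the line that follows `label`, or None."""
    lines = [line.strip() for line in output.splitlines()]
    for i, line in enumerate(lines[:-1]):
        if line == label:
            return float(lines[i + 1])
    return None

output = "time\n0.8473294179\n"
print(read_scalar(output, 'time'))
```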

RESULT_PRECISION

Default: 4

Results are considered equal if they are the same number within this precision. Thus, with the default setting, the numbers 1.12345 and 1.12354 would be considered equal, and would therefore get the same rank. If no results should ever be considered equal, increase this setting to a large enough number.

DATA_DESCRIPTION_CSV

Default: None

When generating result tables, it is possible to add additional columns to the table from an external CSV file. The CSV file is required to have the following format:

ID,col1,col2,col3
1,a,10,3
2,b,20,2
3,c,30,1

where the first column is considered the column with IDs of the datasets. The easiest way to do this is to combine this with the DATASET_NAMES setting, which is a dict mapping elements of the DATASETS list to IDs. IDs of datasets must be strings. The first row of the CSV file will be used as headers in the table.
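The link between the CSV file and DATASET_NAMES can be sketched as follows, using Python's standard csv module. The helper load_descriptions and the dataset names are made up for the example; it only shows that string IDs from DATASET_NAMES can be used to look up rows of the CSV file.

```python
import csv
import io

CSV_TEXT = """ID,col1,col2,col3
1,a,10,3
2,b,20,2
3,c,30,1
"""

def load_descriptions(handle):
    """Return the header row and a dict mapping dataset ID to extra columns."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = {row[0]: row[1:] for row in reader if row}
    return header, rows

header, rows = load_descriptions(io.StringIO(CSV_TEXT))

# IDs are strings, matching the values of a DATASET_NAMES mapping.
DATASETS = ['dataset_1', 'dataset_2', 'dataset_3']
DATASET_NAMES = {k: str(int(k.split('_')[-1])) for k in DATASETS}
print(rows[DATASET_NAMES['dataset_2']])
```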

REFERENCE_METHOD

Default: None

Abed automatically runs statistical tests to see if a chosen reference method is statistically different from other methods. This reference method can be set here, and must be a method from the METHODS setting. If you do not wish to run these statistical tests, use the default value of None. See also the documentation in Understanding Statistical Test Results for more information on how to use and interpret the test results (tldr: carefully!).

SIGNIFICANCE_LEVEL

Default: 0.05

This sets the significance level used in the statistical tests. See also the documentation in Understanding Statistical Test Results and the setting REFERENCE_METHOD.

PBS settings

The settings below all relate to running the simulations on a compute cluster. Currently only PBS Torque type clusters are supported. In the future, these settings will likely be generalized to support other compute cluster setups as well.

PBS_NODES

Default: 1

The number of compute nodes to use on the cluster.

PBS_WALLTIME

Default: 360

Wall-clock time in minutes for the computations. This is the time that will be reserved from the queueing system. Note that the actual computation time is dependent also on PBS_TIME_REDUCE.

PBS_CPUTYPE

Default: None

Optional. The type of CPU to use on the cluster. Some clusters allow you to specify which type of CPU will be used by the job. This can be very important for jobs where time comparisons are performed, as it is vital to use the same type of CPU there. If set, this setting must be a string. For example, one can specify 'cpu4' for a specific type of CPU on LISA. This setting may not be available on all PBS systems.

PBS_CORETYPE

Default: None

Optional. The type of node to use on the cluster, as specified by the number of cores of the node. This setting is similar to the PBS_CPUTYPE setting. For example, one can specify 'cores16' for a 16-core node. This setting may not be available on all PBS systems.

PBS_LINES_BEFORE

Default: []

Optional. Additional lines to add to the PBS file. These lines will be added before the email line, and directly after the lines creating the result directories.

PBS_LINES_AFTER

Default: []

Optional. Additional lines to add to the PBS file. These lines will be added just after the compression of the result files, and just before the final email line.

PBS_PPN

Default: None

Optional. The number of processors per node to use. If you know beforehand how many cores there are on a node, this setting allows you to limit the number of processors that are actually used for computations. Especially when running computation time comparisons, it is recommended to reserve one core for system processes.

PBS_MODULES

Default: ['mpicopy', 'python/2.7.9']

Optional. On some PBS systems, additional modules may be loaded with the command module load. This configuration defines the modules that are loaded.

Note that some modules may be necessary for Abed to function correctly. For instance, the mpicopy command is used for copying files to compute nodes during a job, and on some systems this may require loading the mpicopy module. See also the setting PBS_MPICOPY.

PBS_EXPORTS

Default: ['PATH=$PATH:/home/%s/.local/bin/abed' % REMOTE_USER]

Optional. The lines in this list are interpreted as arguments for the export command. This can be useful for setting PATH variables, or defining other environment settings.
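As a rough illustration of how PBS_MODULES and PBS_EXPORTS translate into shell lines, consider the sketch below. The exact layout of the PBS file that Abed generates is not shown in this section, so the ordering of the lines is an assumption, and the user name 'jdoe' is made up.

```python
REMOTE_USER = 'jdoe'  # hypothetical user name
PBS_MODULES = ['mpicopy', 'python/2.7.9']
PBS_EXPORTS = ['PATH=$PATH:/home/%s/.local/bin/abed' % REMOTE_USER]

# One "module load" line per module, one "export" line per export.
lines = ['module load %s' % m for m in PBS_MODULES]
lines += ['export %s' % e for e in PBS_EXPORTS]
print('\n'.join(lines))
```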

PBS_MPICOPY

Default: ['{data_dir}', EXECDIR, TASK_FILE]

Optional. Abed was initially designed for the Dutch National LISA Compute Cluster. On this cluster, it is more efficient to store results from computations on a so-called scratch directory, which is a disk attached locally on the compute node. To copy files to this scratch directory, the LISA staff designed the mpicopy command. This setting can be used to define the files and directories that will be copied to the scratch directory on the node. For more information on the mpicopy command, see here.

Depending on this command is not a very portable solution; ideas for improvement are very welcome.

PBS_TIME_REDUCE

Default: 600

Abed generates a result file for every task. Since this can be quite a lot of files to download from the server after the job is done, Abed creates compressed archives of results. These archives are generated using the pbzip2 command, which compresses files in parallel. Hence, part of the time of the job is used for this result compression. The time allotted for this is defined with this setting, in seconds. If you expect only a few result files, you can choose to reduce the value of this setting.
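Since PBS_WALLTIME is given in minutes and PBS_TIME_REDUCE in seconds, a quick calculation helps to see how much time is left for the actual computations. The sketch assumes that the compression time is simply subtracted from the reserved wall time.

```python
PBS_WALLTIME = 360      # minutes, the documented default
PBS_TIME_REDUCE = 600   # seconds, the documented default

# Assumption: the compression time is taken off the reserved wall time.
compute_seconds = PBS_WALLTIME * 60 - PBS_TIME_REDUCE
print(compute_seconds)
```

With the defaults this leaves 21000 seconds, or 350 minutes, for running tasks.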

Note: it is currently unknown if the pbzip2 command is widely available. If dependency on this command is a problem, please let us know.