The Abed Settings file
The settings file is where you, as a user, define your experiment for Abed. Because there are quite a few settings, the file is divided into sections of related settings. These sections are reflected in the documentation below.
General settings
PROJECT_NAME
Default: ''
The name of the project to use. This setting is used in the default of the setting REMOTE_DIR, and can therefore not be empty when accepting the default for that setting. It can contain a forward slash (/) when necessary. Apart from that, the project name only occurs in the HTML output generated by Abed.
TASK_FILE
Default: abed.constants.TASKS_FILENAME
This is the filename of the task file. It shouldn’t generally need to be changed from the default value.
AUTO_FILE
Default: abed.constants.AUTO_FILENAME
This is the filename of the auto file. It shouldn’t generally need to be changed from the default value.
RESULT_DIR
Default: '/path/to/local/results'
Path to the local results directory. This is where the results of all computations will eventually end up. Because these results can be large, it is wise to consider the location of this path carefully. For instance, when working in a Dropbox folder, it may not be feasible to store all results in Dropbox. To ensure this is considered, the default path is not directly usable and should be changed by the user.
STAGE_DIR
Default: '/path/to/local/stagedir'
Path to the local stage directory. This is the directory where the results are temporarily stored before being moved to the result directory. The same considerations apply as are specified in the setting RESULT_DIR. The default should be adjusted by the user.
MAX_FILES
Default: 1000
The maximum number of files stored in a result directory on the remote. When results are generated on the remote, they are placed in subdirectories of at most this many files. Each subdirectory will later be zipped before it can be downloaded from the remote, so it can be useful to not make the zip files too large. With this setting a limit can be placed on the number of files per subdirectory of the remote result directory, and therefore on the size of the zip files that need to be downloaded. The default should be fine for most users.
ZIP_DIR
Default: './zips'
Path to the local zip directory. This is where the compressed zip files with results are stored. These files will be unpacked into the stage dir and moved to the result directory (see RESULT_DIR). The zip files are kept by Abed for posterity, but can be removed when the user sees no need for them anymore (the raw results will be kept in the result directory).
OUTPUT_DIR
Default: './output'
Local path where the output files are placed that are generated by Abed.
AUTO_SLEEP
Default: 120
Sleep time in seconds between consecutive checks on the remote when running in auto mode. In auto mode Abed checks the remote repeatedly to see if a submitted task is finished, so it can pull the results and submit a new task. The time between checks is defined here. See also Automatic Job Management.
HTML_PORT
Default: 8000
Port on localhost on which the HTML pages will be served. This is used by the function abed.html.view.view_html(). It will be automatically incremented when the port is in use. It should not generally have to be changed by the user.
COMPRESSION
Default: 'bzip2'
Compression algorithm to use when compressing finished result directories. Abed can compress result files automatically when all computations for a dataset are finished. This is the algorithm that is used by Abed to compress the tar file that results. Allowed choices are: bzip2, gzip, and lzma. See also the documentation in Result Compression.
Server parameters and settings
REMOTE_USER
Default: 'username'
Username on the remote server. This is assumed to be the username to use when logging in on the remote server, and under which the jobs will be submitted. It should be changed by the user.
REMOTE_HOST
Default: 'address.of.host'
The address of the remote server. For instance, for the Dutch LISA compute cluster, it is lisa.surfsara.nl. It should be changed by the user.
REMOTE_DIR
Default: '/home/%s/projects/%s' % (REMOTE_USER, PROJECT_NAME)
Path on the remote server to place all project files in. The default assumes that the remote server is a Unix server with the standard directory layout. It uses the REMOTE_USER setting and the PROJECT_NAME setting to construct the remote path. Therefore, it shouldn’t necessarily have to be changed by the user.
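As a small illustration of how this default is constructed, the snippet below evaluates the same expression with made-up values. The user name 'jdoe' and the project name 'my_experiment' are placeholders chosen for this example, not Abed defaults.

```python
# Placeholder values, chosen only for illustration.
REMOTE_USER = 'jdoe'
PROJECT_NAME = 'my_experiment'

# Same expression as the documented default of REMOTE_DIR.
REMOTE_DIR = '/home/%s/projects/%s' % (REMOTE_USER, PROJECT_NAME)
print(REMOTE_DIR)  # /home/jdoe/projects/my_experiment
```

With an empty PROJECT_NAME the same expression yields '/home/jdoe/projects/', which is why the project name cannot be empty when the default is kept.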
REMOTE_PORT
Default: 22
Remote communication port to use. Usually communication is done over the SSH port, so that is the default. To learn more about how communication is done with the remote server, see Communication through Fabric and Utilities for working with Fabric.
REMOTE_SCRATCH
Default: None
Typically, it is more efficient to place results generated by a job on the compute cluster on a disk which is close to the node which executes the job, as opposed to in the user’s home directory. Such a directory is called the scratch directory, and it is typically erased after a job is finished. Abed places the results it generates on this scratch directory, and periodically copies them to the home directory of the user, to lower the burden on the network of the cluster. On the Dutch LISA cluster, the path to the scratch directory is given by an environment variable, but on other systems it may be a fixed location. This setting exists for the latter case. For the former case, see REMOTE_SCRATCH_ENV. This setting is used by abed.run_utils.get_scratchdir().
REMOTE_SCRATCH_ENV
Default: 'TMPDIR'
See also the description under the setting REMOTE_SCRATCH. This setting exists in case the location of the scratch directory is given by an environment variable. The name of the environment variable can be set here. The default corresponds to the name used on the Dutch LISA compute cluster. This setting is used by abed.run_utils.get_scratchdir().
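To make the interplay between REMOTE_SCRATCH and REMOTE_SCRATCH_ENV concrete, here is a minimal sketch of how a scratch directory could be resolved from the two settings. It is not the code of abed.run_utils.get_scratchdir(): the helper name resolve_scratchdir is hypothetical, and the precedence (a fixed path wins over the environment variable) is an assumption made for this example.

```python
import os


def resolve_scratchdir(remote_scratch=None, remote_scratch_env='TMPDIR',
                       environ=None):
    """Hypothetical helper, NOT Abed's get_scratchdir().

    Assumption: a fixed path (REMOTE_SCRATCH) takes precedence over the
    environment variable named by REMOTE_SCRATCH_ENV.
    """
    environ = os.environ if environ is None else environ
    if remote_scratch is not None:
        # Fixed scratch location configured by the user.
        return remote_scratch
    # Otherwise look up the variable; returns None when it is not set.
    return environ.get(remote_scratch_env)


# Example with an explicit environment mapping instead of os.environ.
print(resolve_scratchdir(environ={'TMPDIR': '/scratch/job_42'}))
```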
Settings for Master/Worker program
MW_SENDATONCE
Default: 100
When running tasks on the compute node, Abed works through a Master/Worker system, where the master thread continually sends out work to the worker threads. To reduce communication load, it sends out a number of tasks at once. This setting defines how many tasks are sent. Note that as soon as there are fewer tasks left than n_workers * MW_SENDATONCE, Abed reduces the number of tasks to send to 1. This ensures that the situation can’t occur where one worker gets all the remaining tasks whereas other workers get none.
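The batching rule described above can be sketched as follows. This only mirrors the rule as stated in the text; the function name tasks_per_message is hypothetical and the code is not taken from Abed.

```python
def tasks_per_message(n_remaining, n_workers, send_at_once=100):
    """Illustrative only: number of tasks sent to a worker in one message.

    Follows the documented rule: once fewer than n_workers * send_at_once
    tasks remain, tasks are handed out one at a time.
    """
    if n_remaining < n_workers * send_at_once:
        return 1
    return send_at_once


# With 15 workers and the default of 100, the threshold is 1500 tasks.
print(tasks_per_message(5000, 15))  # 100
print(tasks_per_message(1499, 15))  # 1
```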
MW_COPY_WORKER
Default: False
Abed can reserve a thread on the compute node to periodically copy results from the scratch directory on the node to the home directory of the user. However, this places additional load on the internal bandwidth of the compute cluster, so it is turned off by default. The results will in any case be copied as compressed archives from the scratch directory after the wall time has ended.
MW_COPY_SLEEP
Default: 120
If MW_COPY_WORKER is True, Abed reserves one thread on the compute node for periodically copying the result files from the scratch directory to the home directory (see also REMOTE_SCRATCH). This setting defines the time in seconds between consecutive copies of the results.
MW_NUM_WORKERS
Default: None
The number of worker processes to use. This can be useful in applications where you don’t want to use all available processes, for instance because you’re using parallelism in your application. Allowed values are a number or None. If None is given, the maximum number of worker processes is started. Note that the master process and the copy worker do not count as worker processes (see also MW_COPY_WORKER).
Experiment type
These settings define the type of experiment that Abed will run. For a more in-depth overview, see Types of Experiments. Note that not all settings apply to all types of experiments. Therefore, the initial settings file generated by Abed requires the user to uncomment the block of settings relating to the specific type of experiment they want to use.
TYPE
Default: Undefined.
This setting defines the type of experiment that will be run. Valid options are 'ASSESS', 'CV_TT', and 'RAW'. See Types of Experiments for a more in-depth discussion of the different types.
CV_BASESEED
Default: 123456
Only used when the experiment type is 'CV_TT'. This defines the base seed to use for the generation of the cross-validation seeds. This setting is used by the function abed.tasks.init_tasks_cv_tt().
YTRAIN_LABEL
Default: 'y_train'
Only used when the experiment type is 'CV_TT'. This defines the label for the part in the result file which corresponds to the training set.
RAW_CMD_FILE
Default: '/path/to/file.txt'
Only used when the experiment type is 'RAW'. This is the path to the file with raw tasks. See also Types of Experiments.
Build settings
BUILD_DIR
Default: 'build'
Directory where the build command needs to be executed. It is assumed that this is a subdirectory of the current directory on the remote. The path defined here should therefore be the path in the git archive. With the default setting, the local Abed directory would look like:
|--- abed_conf.py
|--- abed_tasks.txt
|--- abed_auto.txt
|--- build
|--- datasets
|--- execs
Experiment parameters and settings
DATADIR
Default: abed.constants.DATASET_DIRNAME
The path where the datasets are stored. Note that this does not necessarily have to be in the Abed working directory (where you typed abed init). However, on the remote the datasets will be placed in a directory called datasets in the remote working directory. By using this setting it is possible to place your datasets in a directory outside of the Abed working directory. This can be very useful when you’re running multiple experiments that use the same datasets.
EXECDIR
Default: abed.constants.EXECS_DIRNAME
The path where the executables are stored. These executables are used in the commands that Abed runs (see the setting COMMANDS). It is advisable to keep the default setting as this makes it easier to place executables under version control.
DATASETS
Default: ['dataset_1', 'dataset_2']
This setting defines the names of the datasets that will be used in the experiments. Abed expects a different type in the list depending on the TYPE that is used:
When TYPE is 'ASSESS', the expected format is the same as the default: simply a list of the names of the datasets, as strings.
When TYPE is 'CV_TT', the expected format is a list of tuples, where each tuple is a pair of strings. The first string gives the name of the training dataset, and the second string the name of the test dataset. For instance:
DATASETS = [('dataset_1_train', 'dataset_1_test'), ('dataset_2_train', 'dataset_2_test')]
When TYPE is 'RAW', this setting is not used.
For more information on the different experiment types and their requirements, see Types of Experiments.
DATASET_NAMES
Default: {k:str(i) for i, k in enumerate(DATASETS)}
Optional. This setting gives a mapping of datasets to names. This can be useful when you wish to use different names for the datasets in the output than in the DATASETS setting. If this setting is not present in the settings file, the ID of a dataset will be generated with the function abed.datasets.dataset_name().
As an example, consider dataset names following the pattern 'dataset_1', 'dataset_2', etc. It may then be nice to use names such as '001', '002', etc. in the result tables. This can be achieved by setting:
DATASET_NAMES = {k:'%03i' % int(k.split('_')[-1]) for k in DATASETS}
Note that this setting relates closely to the setting DATA_DESCRIPTION_CSV.
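To see what the two mappings look like in practice, the following sketch evaluates both the documented default expression and the zero-padded variant for the default DATASETS list. The variable names default_names and padded_names are introduced only for this example.

```python
DATASETS = ['dataset_1', 'dataset_2']

# Documented default expression: the position in the list, as a string.
default_names = {k: str(i) for i, k in enumerate(DATASETS)}

# Variant from the text: zero-padded numeric suffix of each dataset name.
padded_names = {k: '%03i' % int(k.split('_')[-1]) for k in DATASETS}

print(default_names)  # {'dataset_1': '0', 'dataset_2': '1'}
print(padded_names)   # {'dataset_1': '001', 'dataset_2': '002'}
```

Note that the zero-padded variant relies on every entry of DATASETS being a string ending in an underscore followed by a number; it would need to be adapted for the tuple entries used with the 'CV_TT' experiment type.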
METHODS
Default: ['method_1', 'method_2']
Here you define the names of the methods that you will use. These names must be the same as the ones used in the PARAMS and the COMMANDS settings. Since these names will also be used in directory names, it is advisable to not use spaces or other illegal characters in these names.
PARAMS
Default:
{
    'method_1': {
        'param_1': [val_1, val_2],
        'param_2': [val_3, val_4],
        'param_3': [val_5, val_6]
    },
    'method_2': {
        'param_1': [val_1, val_2, val_3],
    },
}
As described in How Abed works, Abed runs a grid search where the commands defined in COMMANDS are run for the respective method for each dataset, and for all possible combinations of the parameter values. The values that are used are defined in this setting. A list of values is expected for each parameter, even if only one value is used. Note that the names used for the parameters must match those used in the commands; it is up to the user to ensure that these are the same. This setting is not used with the 'RAW' experiment type.
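The grid that results from such a setting can be illustrated with itertools.product. This is a sketch of the idea of "all possible combinations", not Abed's own task generation; the helper expand_grid and the concrete parameter values are made up for this example.

```python
import itertools

# Example values; the real values depend on your methods.
PARAMS = {
    'method_1': {
        'param_1': [0.1, 1.0],
        'param_2': [10, 100],
        'param_3': ['a', 'b'],
    },
    'method_2': {
        'param_1': [1, 2, 3],
    },
}


def expand_grid(method_params):
    """Yield one dict per combination of parameter values (illustrative)."""
    names = sorted(method_params)
    value_lists = [method_params[name] for name in names]
    for values in itertools.product(*value_lists):
        yield dict(zip(names, values))


for method in sorted(PARAMS):
    grid = list(expand_grid(PARAMS[method]))
    print(method, len(grid))
# method_1 8
# method_2 3
```

Each of these combinations is run once per dataset, so the number of tasks grows quickly with the number of parameters and values.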
COMMANDS
Default:
{
    'method_1': ("{execdir}/method_1 {datadir}/{dataset} {param_1} "
                 "{param_2} {param_3}"),
    'method_2': "{execdir}/method_2 {datadir}/{dataset} {param_1}"
}
Abed works by calling external commands for each method. The advantage of running external commands is that Abed can be used regardless of the language that the methods are implemented in. This setting defines the commands that Abed needs to run for each method. The variables {execdir} and {datadir} are special variables, which are formatted by Abed automatically. The {param_*} variables correspond to the names defined in PARAMS. Finally, the {dataset} variable will be formatted by Abed based on the names of the datasets defined in the DATASETS setting. Note that it is up to the user to ensure the right file extension is supplied here. This means that if the name of a dataset defined in DATASETS is for instance 'iris', but the filename on the disk is 'iris.txt', the command should be adjusted with the part {datadir}/{dataset}.txt.
There are slight differences between the way the commands are used depending on the type of experiment that is run (see TYPE). Thus,
When TYPE is 'ASSESS', the expected form for the dataset part of the command is {dataset} (as in the default).
When TYPE is 'CV_TT', both a training and a test dataset should be included in the command, with the variables {train_dataset} and {test_dataset}, respectively. Thus, for this format a command could look like:
COMMANDS = {
    'method_1': ("{execdir}/method_1 {datadir}/{train_dataset} "
                 "{datadir}/{test_dataset} {param_1} {param_2} {param_3}")
}
When TYPE is 'RAW', this setting is not used.
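Because the command templates use brace placeholders, their expansion can be previewed with Python's str.format. The snippet below is only such a preview: the directory names 'execs' and 'datasets', the dataset 'iris', and the parameter values are assumptions for this example, and Abed's internal formatting may differ in detail.

```python
COMMANDS = {
    # The '.txt' extension is added here because the file on disk is
    # assumed to be 'iris.txt' while the dataset name is 'iris'.
    'method_1': ("{execdir}/method_1 {datadir}/{dataset}.txt {param_1} "
                 "{param_2} {param_3}"),
}

# Example values for the placeholders (assumptions for this sketch).
filled = COMMANDS['method_1'].format(
    execdir='execs',
    datadir='datasets',
    dataset='iris',
    param_1=0.1,
    param_2=10,
    param_3='a',
)
print(filled)  # execs/method_1 datasets/iris.txt 0.1 10 a
```

Previewing a command this way is a quick check that every placeholder in the template has a matching name in PARAMS.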
METRICS
Default:
{
    'NAME_1': {
        'metric': metric_function_1,
        'best': max
    },
    'NAME_2': {
        'metric': metric_function_2,
        'best': min
    }
}
This setting defines the metrics that are applied to the output of a single command. The user is free to define any function here, although Abed currently expects a function that takes two lists as input. It is therefore recommended to either use functions from sklearn.metrics, or define functions with a similar signature. See Defining custom metric functions for instructions on how to include custom metrics.
Note that in these settings, a name can be defined by the user, as well as which direction is considered better in the metric function. This is done by defining the 'best' field, which can be either the max function, or the min function. These directions will be used when Abed ranks the results of a method on a given dataset with a given set of parameters.
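As an illustration of a metric with the two-list signature mentioned above, the sketch below defines a plain-Python accuracy function and registers it under the name 'ACC'. The function, the name 'ACC', and the argument order (true labels first, predictions second, following the sklearn.metrics convention) are assumptions made for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two lists agree (illustrative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty lists")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


METRICS = {
    'ACC': {
        'metric': accuracy,
        'best': max,   # higher accuracy is better
    },
}

print(METRICS['ACC']['metric']([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```

For an error measure, where lower values are better, the 'best' field would be min instead.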
SCALARS
Default:
{
    'time': {
        'best': min
    },
}
To compare results from a command on a single variable, this setting can be used. This can be useful when one wants to compare, for instance, the computation time of a command. The external executable could then print:
time
0.8473294179
With the default setting for the SCALARS field, Abed would read this value as a scalar result for the command.
RESULT_PRECISION
Default: 4
Results are considered equal if they are the same number within this precision. Thus, with the default setting, the numbers 1.12345 and 1.12354 would be considered equal, and would therefore get the same rank. If no results should ever be considered equal, increase this setting to a large enough number.
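One way to picture this comparison is to round both numbers to RESULT_PRECISION decimals before comparing them. Whether Abed rounds, truncates, or formats the numbers is not stated here, so the rounding in this sketch is an assumption, and the helper same_result is hypothetical.

```python
RESULT_PRECISION = 4


def same_result(a, b, precision=RESULT_PRECISION):
    """Illustrative comparison: equal after rounding to `precision` decimals.

    Assumption: rounding is used; Abed's actual comparison may differ.
    """
    return round(a, precision) == round(b, precision)


print(same_result(0.81231, 0.81234))               # True
print(same_result(0.8123, 0.8129))                 # False
print(same_result(0.81231, 0.81234, precision=5))  # False
```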
DATA_DESCRIPTION_CSV
Default: None
When generating result tables, it is possible to add additional columns to the table with an external CSV file. It is required that the CSV file is of the format:
ID,col1,col2,col3
1,a,10,3
2,b,20,2
3,c,30,1
where the first column is considered the column with IDs of the datasets. The easiest way to do this is to combine this with the DATASET_NAMES setting, which is a dict mapping elements of the DATASETS list to IDs. IDs of datasets must be strings. The first row of the CSV file will be used as headers in the table.
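Before generating tables, it can help to check that every ID in DATASET_NAMES actually occurs in the first column of the CSV file. The sketch below does this for an in-memory copy of the example file; the helper missing_ids and the DATASET_NAMES mapping are made up for this example and are not part of Abed.

```python
import csv
import io

DATASETS = ['dataset_1', 'dataset_2', 'dataset_3']
# Map each dataset to the (string) ID used in the first CSV column.
DATASET_NAMES = {k: k.split('_')[-1] for k in DATASETS}

CSV_TEXT = """ID,col1,col2,col3
1,a,10,3
2,b,20,2
3,c,30,1
"""


def missing_ids(csv_text, dataset_names):
    """Return the dataset IDs that have no row in the CSV (illustrative)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    ids_in_csv = {row[0] for row in rows[1:]}  # skip the header row
    return sorted(set(dataset_names.values()) - ids_in_csv)


print(missing_ids(CSV_TEXT, DATASET_NAMES))  # []
```

In a real project the text would be read from the file that DATA_DESCRIPTION_CSV points to rather than from a string.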
REFERENCE_METHOD
Default: None
Abed automatically runs statistical tests to see if a chosen reference method is statistically different from other methods. This reference method can be set here, and must be a method from the METHODS setting. If you do not wish to run these statistical tests, use the default value of None. See also the documentation in Understanding Statistical Test Results for more information on how to use and interpret the test results (tldr: carefully!).
SIGNIFICANCE_LEVEL
Default: 0.05
This sets the significance level used in the statistical tests. See also the documentation in Understanding Statistical Test Results and the setting REFERENCE_METHOD.
PBS settings
The settings below all relate to running the simulations on a compute cluster. Currently only PBS Torque type clusters are supported. In the future, these settings will likely be generalized to support other compute cluster setups as well.
PBS_WALLTIME
Default: 360
Wall-clock time in minutes for the computations. This is the time that will be reserved from the queueing system. Note that the actual computation time also depends on PBS_TIME_REDUCE.
PBS_CPUTYPE
Default: None
Optional. The type of CPU to use on the cluster. Some clusters allow you to specify which type of CPU will be used by the job. This can be very important for jobs where time comparisons are performed, as there it is vital to use the same type of CPU. If set, this setting must be a string. For example, one can specify 'cpu4' for a specific type of CPU on LISA. This setting may not be available on all PBS systems.
PBS_CORETYPE
Default: None
Optional. The type of node to use on the cluster, as specified by the number of cores of the node. This setting is similar to the PBS_CPUTYPE setting. For example, one can specify 'cores16' for a 16-core node. This setting may not be available on all PBS systems.
PBS_LINES_BEFORE
Default: []
Optional. Additional lines to add to the PBS file. These lines will be added before the email line, and directly after the lines creating the result directories.
PBS_LINES_AFTER
Default: []
Optional. Additional lines to add to the PBS file. These lines will be added just after the compression of the result files, and just before the final email line.
PBS_PPN
Default: None
Optional. The number of processors per node to use. If you know beforehand how many cores there are on a node, this setting allows you to limit the number of processors that are actually used for computations. Especially when running computation time comparisons, it is recommended to reserve one core for system processes.
PBS_MODULES
Default: ['mpicopy', 'python/2.7.9']
Optional. On some PBS systems, additional modules may be loaded with the command module load. This setting defines the modules that are loaded.
Note that some modules may be necessary for Abed to function correctly. For instance, the mpicopy command is used for copying files to compute nodes during a job, and on some systems this may require loading the mpicopy module. See also the setting PBS_MPICOPY.
PBS_EXPORTS
Default: ['PATH=$PATH:/home/%s/.local/bin/abed' % REMOTE_USER]
Optional. The lines in this list are interpreted as arguments for the export command. This can be useful for setting PATH variables, or defining other environment settings.
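To give an impression of how PBS_MODULES and PBS_EXPORTS translate into shell commands, the sketch below turns both lists into the corresponding module load and export lines. The exact lines and their order in the PBS file that Abed generates are not documented here, so this is an assumption; the helper shell_lines and the user name 'jdoe' are hypothetical.

```python
REMOTE_USER = 'jdoe'  # placeholder user name for this example

PBS_MODULES = ['mpicopy', 'python/2.7.9']
PBS_EXPORTS = ['PATH=$PATH:/home/%s/.local/bin/abed' % REMOTE_USER]


def shell_lines(modules, exports):
    """Build 'module load' and 'export' lines (illustrative, not Abed code)."""
    lines = ['module load %s' % module for module in modules]
    lines += ['export %s' % export for export in exports]
    return lines


for line in shell_lines(PBS_MODULES, PBS_EXPORTS):
    print(line)
# module load mpicopy
# module load python/2.7.9
# export PATH=$PATH:/home/jdoe/.local/bin/abed
```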
PBS_MPICOPY
Default: ['{data_dir}', EXECDIR, TASK_FILE]
Optional. Abed was initially designed for the Dutch National LISA Compute Cluster. On this cluster, it is more efficient to store results from computations on a so-called scratch directory, which is a disk attached locally on the compute node. To copy files to this scratch directory, the LISA staff designed the mpicopy command. This setting can be used to define the files and directories that will be copied to the scratch directory on the node. For more information on the mpicopy command, see here.
Dependency on this command is not a very portable solution; ideas for improvement are very welcome.
PBS_TIME_REDUCE
Default: 600
Abed generates a result file for every task. Since this can be quite a lot of files to download from the server after the job is done, Abed creates compressed archives of results. These archives are generated using the pbzip2 command, which compresses files in parallel. Hence, part of the time of the job is used for this result compression. The time allotted for this is defined with this setting, in seconds. If you expect only a few result files, you can choose to reduce the value of this setting.
Note: it is currently unknown if the pbzip2 command is widely available. If dependency on this command is a problem, please let us know.
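The relation between PBS_WALLTIME (in minutes) and PBS_TIME_REDUCE (in seconds) can be made concrete with a little arithmetic. The sketch assumes that the compression time is simply subtracted from the reserved wall time; the helper compute_seconds is hypothetical and Abed's own bookkeeping may differ.

```python
PBS_WALLTIME = 360      # minutes reserved from the queueing system
PBS_TIME_REDUCE = 600   # seconds set aside for result compression


def compute_seconds(walltime_min, time_reduce_sec):
    """Seconds left for computations (assumes a plain subtraction)."""
    remaining = walltime_min * 60 - time_reduce_sec
    if remaining <= 0:
        raise ValueError("PBS_TIME_REDUCE leaves no time for computations")
    return remaining


print(compute_seconds(PBS_WALLTIME, PBS_TIME_REDUCE))  # 21000
```

Under this assumption the defaults leave 21000 seconds, or 350 minutes, of the 360 reserved minutes for the computations themselves.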