====================== The Abed Settings file ====================== As a user, the settings file is the place where you'll define your experiment for Abed. Since there are quite a few settings that can be configured, the settings file contains different sections with related settings. These sections are reflected in the documentation below. .. contents:: :local: :depth: 1 General settings ================ .. setting:: PROJECT_NAME ``PROJECT_NAME`` ---------------- Default: ``''`` The name of the project to use. This setting is used in the default of the setting :setting:`REMOTE_DIR`, and can therefore not be empty when accepting the default for that setting. It can contain a forward slash (``/``) when necessary. The project name otherwise only occurs in the HTML output generated by Abed. .. setting:: TASK_FILE ``TASK_FILE`` ------------- Default: :py:const:`abed.constants.TASKS_FILENAME` This is the filename of the task file. It generally shouldn't need to be changed from the default value. .. setting:: AUTO_FILE ``AUTO_FILE`` ------------- Default: :py:const:`abed.constants.AUTO_FILENAME` This is the filename of the auto file. It generally shouldn't need to be changed from the default value. .. setting:: RESULT_DIR ``RESULT_DIR`` -------------- Default: ``'/path/to/local/results'`` Path to the local results directory. This is where the results of all computations will ultimately be stored. Because this directory can become quite large, it is wise to consider the location of this path carefully. For instance, when working in a Dropbox folder, it may not be feasible to store all results in Dropbox. To ensure this is considered, the default path is not directly usable and should be changed by the user. .. setting:: STAGE_DIR ``STAGE_DIR`` ------------- Default: ``'/path/to/local/stagedir'`` Path to the local stage directory. This is the directory where the results are temporarily stored before being moved to the result directory.
The same considerations apply as are specified for the setting :setting:`RESULT_DIR`. The default should be adjusted by the user. .. setting:: MAX_FILES ``MAX_FILES`` ------------- Default: ``1000`` The maximum number of files stored in a result directory on the remote. When results are generated on the remote, they are placed in subdirectories of at most this many files. Each subdirectory will later be zipped before it can be downloaded from the remote, so it can be useful not to make the zip files too large. With this setting a limit can be placed on the number of files per subdirectory of the remote result directory, and therefore on the size of the zip files that need to be downloaded. The default should be fine for most users. .. setting:: ZIP_DIR ``ZIP_DIR`` ----------- Default: ``'./zips'`` Path to the local zip directory. This is where the compressed zip files with results are stored. These files will be unpacked into the stage directory and moved to the result directory (see :setting:`RESULT_DIR`). The zip files are kept by Abed for posterity, but can be removed when the user sees no need for them anymore (the raw results will be kept in the result directory). .. setting:: LOG_DIR ``LOG_DIR`` ----------- Default: ``'./logs'`` Local path where the PBS logs will be kept. .. setting:: OUTPUT_DIR ``OUTPUT_DIR`` -------------- Default: ``'./output'`` Local path where the output files generated by Abed are placed. .. setting:: AUTO_SLEEP ``AUTO_SLEEP`` -------------- Default: ``120`` Sleep time in seconds between consecutive checks on the remote when running in auto mode. In auto mode Abed checks the remote repeatedly to see if a submitted task is finished, so it can pull the results and submit a new task. The time between checks is defined here. See also :doc:`../api/core/abed.auto`. .. setting:: HTML_PORT ``HTML_PORT`` -------------- Default: ``8000`` Port on localhost where the HTML pages will be served.
This is used by the function :py:func:`abed.html.view.view_html`. It will be automatically incremented when the port is in use. It should not generally have to be changed by the user. .. setting:: COMPRESSION ``COMPRESSION`` --------------- Default: ``'bzip2'`` Compression algorithm to use when compressing finished result directories. Abed can compress result files automatically when all computations for a dataset are finished. This is the algorithm that Abed uses to compress the resulting tar file. Allowed choices are: ``bzip2``, ``gzip``, and ``lzma``. See also the documentation in :doc:`../api/core/abed.compress`. Server parameters and settings ============================== .. setting:: REMOTE_USER ``REMOTE_USER`` --------------- Default: ``'username'`` Username on the remote server. This is assumed to be the username to use when logging in on the remote server, and under which the jobs will be submitted. It should be changed by the user. .. setting:: REMOTE_HOST ``REMOTE_HOST`` --------------- Default: ``'address.of.host'`` The address of the remote server. For instance, for the Dutch LISA compute cluster, it is ``lisa.surfsara.nl``. It should be changed by the user. .. setting:: REMOTE_DIR ``REMOTE_DIR`` -------------- Default: ``'/home/%s/projects/%s' % (REMOTE_USER, PROJECT_NAME)`` Path on the remote server to place all project files in. The default assumes that the remote server is a Unix server with the standard directory layout. It uses the :setting:`REMOTE_USER` setting and the :setting:`PROJECT_NAME` setting to construct the remote path. Therefore, it generally shouldn't need to be changed by the user. .. setting:: REMOTE_PORT ``REMOTE_PORT`` --------------- Default: ``22`` Remote communication port to use. Usually communication is done over the SSH port, so that is the default. To learn more about how communication is done with the remote server, see :doc:`../api/core/abed.fab` and :doc:`../api/core/abed.fab_util`. ..
setting:: REMOTE_SCRATCH ``REMOTE_SCRATCH`` ------------------ Default: ``None`` Typically, it is more efficient to place results generated by a job on the compute cluster on a disk which is close to the node which executes the job, as opposed to in the user's home directory. Such a directory is called the *scratch* directory, and it is typically erased after a job is finished. Abed places the results it generates on this scratch directory, and periodically copies them to the home directory of the user, to lower the burden on the network of the cluster. On the Dutch LISA cluster, the path to the scratch directory is given by an environment variable, but on other systems it may be a fixed location. This setting exists for the latter case. For the former case, see :setting:`REMOTE_SCRATCH_ENV`. This setting is used by :py:func:`abed.run_utils.get_scratchdir`. .. setting:: REMOTE_SCRATCH_ENV ``REMOTE_SCRATCH_ENV`` ---------------------- Default: ``'TMPDIR'`` See also the description under the setting :setting:`REMOTE_SCRATCH`. This setting exists in case the location of the *scratch* directory is given by an environment variable. The name of the environment variable can be set here. The default corresponds to the name used on the Dutch LISA compute cluster. This setting is used by :py:func:`abed.run_utils.get_scratchdir`. Settings for Master/Worker program ================================== .. setting:: MW_SENDATONCE ``MW_SENDATONCE`` ----------------- Default: ``100`` When running tasks on the compute node, Abed works through a Master/Worker system, where the master thread continually sends out work to the worker threads. To reduce communication load, it sends out a number of tasks at once. This setting defines how many tasks are sent. Note that as soon as there are fewer tasks left than ``n_workers * MW_SENDATONCE``, Abed reduces the number of tasks to send to 1. 
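As a minimal sketch, the batching rule just described could be written as follows (the function name ``tasks_per_send`` and its signature are illustrative, not Abed's actual API):

```python
def tasks_per_send(n_remaining, n_workers, mw_sendatonce=100):
    """Illustrative sketch of the batching rule described above.

    Hypothetical helper; not Abed's actual implementation.
    """
    # Once fewer than n_workers * MW_SENDATONCE tasks remain, hand out
    # tasks one at a time so no single worker hoards the remainder.
    if n_remaining < n_workers * mw_sendatonce:
        return 1
    return mw_sendatonce
```

For example, with the default of 100 and 16 workers, batches of 100 tasks are sent until fewer than 1600 tasks remain; after that, tasks are sent one at a time.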
This ensures that the situation can't occur where one worker gets all the remaining tasks while other workers get none. .. setting:: MW_COPY_WORKER ``MW_COPY_WORKER`` ------------------ Default: ``False`` Abed can reserve a thread on the compute node to periodically copy results from the scratch directory on the node to the home directory of the user. However, this places additional load on the internal bandwidth of the compute cluster, so it is turned off by default. The results will in any case be copied from the scratch directory as compressed archives after the wall time has ended. .. setting:: MW_COPY_SLEEP ``MW_COPY_SLEEP`` ----------------- Default: ``120`` If :setting:`MW_COPY_WORKER` is ``True``, Abed reserves one thread on the compute node for periodically copying the result files from the scratch directory to the home directory (see also :setting:`REMOTE_SCRATCH`). This setting defines the time in seconds between consecutive copies of the results. .. setting:: MW_NUM_WORKERS ``MW_NUM_WORKERS`` ------------------ Default: ``None`` The number of worker processes to use. This can be useful in applications where you *don't* want to use all available processes, for instance because you're using parallelism in your application. Allowed values are a number or ``None``. If ``None`` is given, the maximum number of worker processes is started. Note that the master process and the copy worker do not count as worker processes (see also :setting:`MW_COPY_WORKER`). Experiment type =============== These settings define the type of experiment that Abed will run. For a more in-depth overview, see :doc:`experiments`. Note that not all settings apply to all types of experiments. Therefore, the initial settings file generated by Abed requires the user to uncomment the block of settings relating to the specific type of experiment they want to use. .. setting:: TYPE ``TYPE`` -------- Default: Undefined. This setting defines the type of experiment that will be run.
Valid options are ``'ASSESS'``, ``'CV_TT'``, and ``'RAW'``. See :doc:`experiments` for a more in-depth discussion of the different types. .. setting:: CV_BASESEED ``CV_BASESEED`` --------------- Default: ``123456`` Only used when the experiment type is ``'CV_TT'``. This defines the base seed to use for the generation of the cross-validation seeds. This setting is used by the function :py:func:`abed.tasks.init_tasks_cv_tt`. .. setting:: YTRAIN_LABEL ``YTRAIN_LABEL`` ---------------- Default: ``'y_train'`` Only used when the experiment type is ``'CV_TT'``. This defines the label for the part in the result file which corresponds to the training set. .. setting:: RAW_CMD_FILE ``RAW_CMD_FILE`` ---------------- Default: ``'/path/to/file.txt'`` Only used when the experiment type is ``'RAW'``. This is the path to the file with raw tasks. See also :doc:`experiments`. Build settings ============== .. setting:: NEEDS_BUILD ``NEEDS_BUILD`` --------------- Default: ``False`` Whether or not compilation is necessary on the remote. .. setting:: BUILD_DIR ``BUILD_DIR`` ------------- Default: ``'build'`` Directory where the build command needs to be executed. It is assumed that this is a subdirectory of the *current* directory on the remote. The path defined here should therefore be the path in the git archive. With the default setting, the local Abed directory would look like:: |--- abed_conf.py |--- abed_tasks.txt |--- abed_auto.txt |--- build |--- datasets |--- execs .. setting:: BUILD_CMD ``BUILD_CMD`` ------------- Default: ``'make all'`` The command to run on the remote. This is run in the directory given by the :setting:`BUILD_DIR` setting. Experiment parameters and settings ================================== .. setting:: DATADIR ``DATADIR`` ----------- Default: :py:const:`abed.constants.DATASET_DIRNAME` The path where the datasets are stored. Note that this does not necessarily have to be in the Abed working directory (where you typed ``abed init``). 
However, on the remote the datasets will be placed in a directory called ``datasets`` in the remote working directory. By using this setting it is possible to place your datasets in a directory outside of the Abed working directory. This can be very useful when you're running multiple experiments that use the same datasets. .. setting:: EXECDIR ``EXECDIR`` ----------- Default: :py:const:`abed.constants.EXECS_DIRNAME` The path where the executables are stored. These executables are used in the commands that Abed runs (see the setting :setting:`COMMANDS`). It is advisable to keep the default setting, as this makes it easier to place executables under version control. .. setting:: DATASETS ``DATASETS`` ------------ Default: ``['dataset_1', 'dataset_2']`` This setting defines the names of the datasets that will be used in the experiments. Abed expects a different type in the list depending on the :setting:`TYPE` that is used: * When :setting:`TYPE` is ``'ASSESS'``, the expected format is the same as the default, simply a list of names of the datasets, as strings. * When :setting:`TYPE` is ``'CV_TT'``, the expected format is a list of tuples, where each tuple is a pair of strings. The first string gives the name of the training dataset, and the second string the name of the test dataset. For instance:: DATASETS = [('dataset_1_train', 'dataset_1_test'), ('dataset_2_train', 'dataset_2_test')] * When :setting:`TYPE` is ``'RAW'``, this setting is not used. See :doc:`experiments` for more information on the different experiment types and their requirements. .. setting:: DATASET_NAMES ``DATASET_NAMES`` ----------------- Default: ``{k:str(i) for i, k in enumerate(DATASETS)}`` Optional. This setting gives a mapping of datasets to names. This can be useful when you wish to use different names for the datasets in the output than in the :setting:`DATASETS` setting.
If this setting is not present in the settings file, the ID of a dataset will be generated with the function :py:func:`abed.datasets.dataset_name`. As an example, consider dataset names following the pattern ``'dataset_1'``, ``'dataset_2'``, etc. It may then be nice to use names such as ``'001'``, ``'002'``, etc. in the result tables. This can be achieved by setting:: DATASET_NAMES = {k:'%03i' % int(k.split('_')[-1]) for k in DATASETS} Note that this setting relates closely to the setting :setting:`DATA_DESCRIPTION_CSV`. .. setting:: METHODS ``METHODS`` ----------- Default: ``['method_1', 'method_2']`` Here you define the names of the methods that you will use. These names must be the same as the ones used in the :setting:`PARAMS` and the :setting:`COMMANDS` settings. Since these names will also be used in directory names, it is advisable not to use spaces or other illegal characters in them. .. setting:: PARAMS ``PARAMS`` ---------- Default:: { 'method_1': { 'param_1': [val_1, val_2], 'param_2': [val_3, val_4], 'param_3': [val_5, val_6] }, 'method_2': { 'param_1': [val_1, val_2, val_3], }, } As described in :doc:`workings`, Abed runs a grid search where the commands defined in :setting:`COMMANDS` are run for the respective method for each dataset, and for all possible combinations of the parameter values. The values that are used are defined in this setting. A list of values is expected for each parameter, even if only one value is used. Note that the names used for the parameters must match those used in the commands; the user must ensure that these are the same. This setting is not used with the ``'RAW'`` experiment type. .. setting:: COMMANDS ``COMMANDS`` ------------ Default:: { 'method_1': ("{execdir}/method_1 {datadir}/{dataset} {param_1} " "{param_2} {param_3}"), 'method_2': "{execdir}/method_2 {datadir}/{dataset} {param_1}" } Abed works by calling external commands for each method.
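To make the template mechanics concrete, here is a minimal sketch of how such a command template might be filled in, using plain Python ``str.format`` (the concrete values are made up for illustration; the actual substitution is performed internally by Abed):

```python
# Hypothetical illustration of expanding a command template.
template = "{execdir}/method_1 {datadir}/{dataset} {param_1} {param_2}"
command = template.format(
    execdir="execs",      # from the EXECDIR setting
    datadir="datasets",   # from the DATADIR setting
    dataset="dataset_1",  # one entry of the DATASETS setting
    param_1=0.5,          # one point of the PARAMS grid
    param_2=10,
)
# command == "execs/method_1 datasets/dataset_1 0.5 10"
```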
The advantage of running external commands is that Abed can be used regardless of the language that the methods are implemented in. This setting defines the commands that Abed needs to run for each method. The variables ``{execdir}`` and ``{datadir}`` are special variables, which are formatted by Abed automatically. The ``{param_*}`` variables correspond to the names defined in :setting:`PARAMS`. Finally, the ``{dataset}`` variable will be formatted by Abed based on the names of the datasets defined in the :setting:`DATASETS` setting. Note that it is up to the user to ensure the right file extension is supplied here. This means that if the name of a dataset defined in :setting:`DATASETS` is for instance ``'iris'``, but the filename on disk is ``'iris.txt'``, the command should be adjusted to contain the part ``{datadir}/{dataset}.txt``. There are slight differences between the way the commands are used depending on the type of experiment that is run (see :setting:`TYPE`). Specifically: * When :setting:`TYPE` is ``'ASSESS'``, the expected form for the dataset part of the command is ``{dataset}`` (as in the default). * When :setting:`TYPE` is ``'CV_TT'``, both a training and a test dataset should be included in the command, with the variables ``{train_dataset}`` and ``{test_dataset}``, respectively. Thus, for this format a command could look like:: COMMANDS = {'method_1': ("{execdir}/method_1 {datadir}/{train_dataset} " "{datadir}/{test_dataset} {param_1} {param_2} {param_3}")} * When :setting:`TYPE` is ``'RAW'``, this setting is not used. .. setting:: METRICS ``METRICS`` ----------- Default:: { 'NAME_1': { 'metric': metric_function_1, 'best': max }, 'NAME_2': { 'metric': metric_function_2, 'best': min } } This setting defines the metrics that are applied to the output of a single command. The user is free to define any function here, although Abed currently expects a function that takes two lists as input.
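A custom metric matching that expected signature could be as simple as the following sketch (the function name ``error_rate`` is illustrative; any function taking two lists and returning a scalar would fit):

```python
def error_rate(y_true, y_pred):
    # Takes two equal-length lists, as Abed expects, and returns a scalar:
    # the fraction of entries that differ. Illustrative only.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Paired with 'best': min in the METRICS setting, lower values rank better.
```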
It is therefore recommended to either use functions from ``sklearn.metrics``, or to define functions with a similar signature. See :doc:`metric_functions` for instructions on how to include custom metrics. Note that in these settings, a name can be defined by the user, as well as which direction is considered *better* for the metric function. This is done by defining the ``'best'`` field, which can be either the ``max`` function or the ``min`` function. These directions will be used when Abed ranks the results of a method on a given dataset with a given set of parameters. .. setting:: SCALARS ``SCALARS`` ----------- Default:: { 'time': { 'best': min }, } This setting can be used to compare results from a command on a single variable. This can be useful when one wants to compare, for instance, the computation time of a command. The external executable could for instance print:: time 0.8473294179 With the default setting for the :setting:`SCALARS` field, Abed would read this value as a scalar result for the command. .. setting:: RESULT_PRECISION ``RESULT_PRECISION`` -------------------- Default: ``4`` Results are considered equal if they are the same number within this precision. Thus, with the default setting, the numbers 1.12345 and 1.12354 would be considered *equal*, and would therefore get the same rank. If no results should ever be considered equal, increase this setting to a large enough number. .. setting:: DATA_DESCRIPTION_CSV ``DATA_DESCRIPTION_CSV`` ------------------------ Default: ``None`` When generating result tables, it is possible to add additional columns to the table with an external CSV file. It is required that the CSV file is of the format:: ID,col1,col2,col3 1,a,10,3 2,b,20,2 3,c,30,1 where the first column is considered the column with the IDs of the datasets. The easiest way to do this is to combine this with the :setting:`DATASET_NAMES` setting, which is a ``dict`` mapping elements of the :setting:`DATASETS` list to IDs.
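For instance, the following sketch (dataset names are made up) maps datasets to zero-padded IDs that can match the ``ID`` column of such a CSV file:

```python
DATASETS = ['dataset_1', 'dataset_2', 'dataset_3']

# Hypothetical example: IDs '001', '002', ... intended to match the first
# column of the DATA_DESCRIPTION_CSV file.
DATASET_NAMES = {k: '%03i' % int(k.split('_')[-1]) for k in DATASETS}
# DATASET_NAMES == {'dataset_1': '001', 'dataset_2': '002', 'dataset_3': '003'}
```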
IDs of datasets must be strings. The first row of the CSV file will be used as headers in the table. .. setting:: REFERENCE_METHOD ``REFERENCE_METHOD`` -------------------- Default: ``None`` Abed automatically runs statistical tests to see if a chosen reference method is statistically different from other methods. This reference method can be set here, and must be a method from the :setting:`METHODS` setting. If you do not wish to run these statistical tests, use the default value of ``None``. See also the documentation in :doc:`statistical_tests` for more information on how to use and interpret the test results (tldr: carefully!). .. setting:: SIGNIFICANCE_LEVEL ``SIGNIFICANCE_LEVEL`` ---------------------- Default: ``0.05`` This sets the significance level used in the statistical tests. See also the documentation in :doc:`statistical_tests` and the setting :setting:`REFERENCE_METHOD`. PBS settings ============ The settings below all relate to running the simulations on a compute cluster. Currently only PBS Torque type clusters are supported. In the future, these settings will likely be generalized to support other compute cluster setups as well. .. setting:: PBS_NODES ``PBS_NODES`` ------------- Default: ``1`` The number of compute nodes to use on the cluster. .. setting:: PBS_WALLTIME ``PBS_WALLTIME`` ---------------- Default: ``360`` Wall-clock time in minutes for the computations. This is the time that will be reserved from the queueing system. Note that the actual computation time also depends on :setting:`PBS_TIME_REDUCE`. .. setting:: PBS_CPUTYPE ``PBS_CPUTYPE`` --------------- Default: ``None`` Optional. The type of cpu to use on the cluster. Some clusters allow you to specify which type of cpu will be used by the job. This can be very important for jobs where time comparisons are performed, since in that case it is vital to use the same type of cpu. If set, this setting must be a string.
For example, one can specify ``'cpu4'`` for a specific type of CPU on Lisa. This setting may not be available on all PBS systems. .. setting:: PBS_CORETYPE ``PBS_CORETYPE`` ---------------- Default: ``None`` Optional. The type of node to use on the cluster, as specified by the number of cores of the node. This setting is similar to the :setting:`PBS_CPUTYPE` setting. For example, one can specify ``'cores16'`` for a 16-core node. This setting may not be available on all PBS systems. .. setting:: PBS_LINES_BEFORE ``PBS_LINES_BEFORE`` -------------------- Default: ``[]`` Optional. Additional lines to add to the PBS file. These lines will be added before the email line, and directly after the lines creating the result directories. .. setting:: PBS_LINES_AFTER ``PBS_LINES_AFTER`` ------------------- Default: ``[]`` Optional. Additional lines to add to the PBS file. These lines will be added just after the compression of the result files, and just before the final email line. .. setting:: PBS_PPN ``PBS_PPN`` ----------- Default: ``None`` Optional. The number of processors per node to use. If you know beforehand how many cores there are on a node, this setting allows you to limit the number of processors that are actually used for computations. Especially when running computation time comparisons, it is recommended to reserve one core for system processes. .. setting:: PBS_MODULES ``PBS_MODULES`` --------------- Default: ``['mpicopy', 'python/2.7.9']`` Optional. On some PBS systems, additional modules may be loaded with the command ``module load``. This setting defines the modules that are loaded. Note that some modules may be necessary for Abed to function correctly. For instance, the ``mpicopy`` command is used for copying files to compute nodes during a job, and on some systems this may require loading the ``mpicopy`` module. See also the setting :setting:`PBS_MPICOPY`. ..
setting:: PBS_EXPORTS ``PBS_EXPORTS`` --------------- Default: ``['PATH=$PATH:/home/%s/.local/bin/abed' % REMOTE_USER]`` Optional. The lines in this list are interpreted as arguments for the ``export`` command. This can be useful for setting PATH variables, or for defining other environment settings. .. setting:: PBS_MPICOPY ``PBS_MPICOPY`` --------------- Default: ``['{data_dir}', EXECDIR, TASK_FILE]`` Optional. Abed was initially designed for the Dutch National LISA Compute Cluster. On this cluster, it is more efficient to store results from computations in a so-called *scratch* directory, which is a disk attached locally to the compute node. To copy files to this scratch directory, the LISA staff designed the ``mpicopy`` command. This setting can be used to define the files and directories that will be copied to the scratch directory on the node. For more information, see the documentation of the ``mpicopy`` command. Depending on this command is not a very portable solution; ideas for improvement are very welcome. .. setting:: PBS_TIME_REDUCE ``PBS_TIME_REDUCE`` ------------------- Default: ``600`` Abed generates a result file for every task. Since this can be quite a lot of files to download from the server after the job is done, Abed creates compressed archives of the results. These archives are generated using the ``pbzip2`` command, which compresses files in parallel. Hence, part of the time of the job is used for this result compression. The time allotted for this is defined with this setting, in seconds. If you expect only a few result files, you can choose to reduce the value of this setting. Note: it is currently unknown whether the ``pbzip2`` command is widely available. If dependency on this command is a problem, please let us know.
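Since :setting:`PBS_WALLTIME` is given in minutes and :setting:`PBS_TIME_REDUCE` in seconds, the rough time budget of a job can be sketched as follows (an approximation for planning purposes; the exact accounting inside Abed may differ):

```python
PBS_WALLTIME = 360      # minutes reserved from the queueing system
PBS_TIME_REDUCE = 600   # seconds set aside for compressing results

# Approximate number of seconds left for actual computation.
compute_seconds = PBS_WALLTIME * 60 - PBS_TIME_REDUCE
# compute_seconds == 21000, i.e. 350 minutes of compute time
```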