====================== The Abed Settings file ====================== As a user, the settings file is the place where you'll define your experiment for Abed. Since there are quite a few settings that can be configured, the settings file contains different sections with related settings. These sections are reflected in the documentation below. .. contents:: :local: :depth: 1 General settings ================ .. setting:: PROJECT_NAME ``PROJECT_NAME`` ---------------- Default: ``''`` The name of the project to use. This setting is used in the default of the setting :setting:`REMOTE_DIR`, and can therefore not be empty when accepting the default for that setting. It can contain a forward slash (``/``) when necessary. The project name otherwise only occurs in the HTML output generated by Abed. .. setting:: TASK_FILE ``TASK_FILE`` ------------- Default: :py:const:`abed.constants.TASKS_FILENAME` This is the filename of the task file. It generally shouldn't need to be changed from the default value. .. setting:: AUTO_FILE ``AUTO_FILE`` ------------- Default: :py:const:`abed.constants.AUTO_FILENAME` This is the filename of the auto file. It generally shouldn't need to be changed from the default value. .. setting:: RESULT_DIR ``RESULT_DIR`` -------------- Default: ``'/path/to/local/results'`` Path to the local results directory. This is where the results of all computations will ultimately be stored. Because this directory can become quite large, it is wise to consider the location of this path carefully. For instance, when working in a Dropbox folder, it may not be feasible to store all results in Dropbox. To ensure this is considered, the default path is not directly usable and should be changed by the user. .. setting:: STAGE_DIR ``STAGE_DIR`` ------------- Default: ``'/path/to/local/stagedir'`` Path to the local stage directory. This is the directory where the results are temporarily stored before being moved to the result directory.
The same considerations apply as are specified for the setting :setting:`RESULT_DIR`. The default should be adjusted by the user. .. setting:: MAX_FILES ``MAX_FILES`` ------------- Default: ``1000`` The maximum number of files stored in a result directory on the remote. When results are generated on the remote, they are placed in subdirectories of at most this many files. Each subdirectory will later be zipped before it can be downloaded from the remote, so it can be useful not to make the zip files too large. With this setting a limit can be placed on the number of files per subdirectory of the remote result directory, and therefore on the size of the zip files that need to be downloaded. The default should be fine for most users. .. setting:: ZIP_DIR ``ZIP_DIR`` ----------- Default: ``'./zips'`` Path to the local zip directory. This is where the compressed zip files with results are stored. These files will be unpacked into the stage directory and moved to the result directory (see :setting:`RESULT_DIR`). The zip files are kept by Abed for posterity, but can be removed when the user sees no need for them anymore (the raw results will be kept in the result directory). .. setting:: LOG_DIR ``LOG_DIR`` ----------- Default: ``'./logs'`` Local path where the PBS logs will be kept. .. setting:: OUTPUT_DIR ``OUTPUT_DIR`` -------------- Default: ``'./output'`` Local path where the output files generated by Abed are placed. .. setting:: AUTO_SLEEP ``AUTO_SLEEP`` -------------- Default: ``120`` Sleep time in seconds between consecutive checks on the remote when running in auto mode. In auto mode Abed checks the remote repeatedly to see if a submitted task is finished, so it can pull the results and submit a new task. The time between checks is defined here. See also :doc:`../api/core/abed.auto`. .. setting:: HTML_PORT ``HTML_PORT`` -------------- Default: ``8000`` Port on localhost where the HTML pages will be served.
This is used by the function :py:func:`abed.html.view.view_html`. It will be automatically incremented when the port is in use. It should not generally have to be changed by the user. .. setting:: COMPRESSION ``COMPRESSION`` --------------- Default: ``'bzip2'`` Compression algorithm to use when compressing finished result directories. Abed can compress result files automatically when all computations for a dataset are finished. This is the algorithm that Abed uses to compress the resulting tar file. Allowed choices are: ``bzip2``, ``gzip``, and ``lzma``. See also the documentation in :doc:`../api/core/abed.compress`. Server parameters and settings ============================== .. setting:: REMOTE_USER ``REMOTE_USER`` --------------- Default: ``'username'`` Username on the remote server. This is assumed to be the username to use when logging in on the remote server, and under which the jobs will be submitted. It should be changed by the user. .. setting:: REMOTE_HOST ``REMOTE_HOST`` --------------- Default: ``'address.of.host'`` The address of the remote server. For instance, for the Dutch LISA compute cluster, it is ``lisa.surfsara.nl``. It should be changed by the user. .. setting:: REMOTE_DIR ``REMOTE_DIR`` -------------- Default: ``'/home/%s/projects/%s' % (REMOTE_USER, PROJECT_NAME)`` Path on the remote server to place all project files in. The default assumes that the remote server is a Unix server with the standard directory layout. It uses the :setting:`REMOTE_USER` setting and the :setting:`PROJECT_NAME` setting to construct the remote path. Therefore, it generally shouldn't need to be changed by the user. .. setting:: REMOTE_PORT ``REMOTE_PORT`` --------------- Default: ``22`` Remote communication port to use. Usually communication is done over the SSH port, so that is the default. To learn more about how communication is done with the remote server, see :doc:`../api/core/abed.fab` and :doc:`../api/core/abed.fab_util`. ..
setting:: REMOTE_SCRATCH ``REMOTE_SCRATCH`` ------------------ Default: ``None`` Typically, it is more efficient to place results generated by a job on the compute cluster on a disk which is close to the node which executes the job, as opposed to in the user's home directory. Such a directory is called the *scratch* directory, and it is typically erased after a job is finished. Abed places the results it generates on this scratch directory, and periodically copies them to the home directory of the user, to lower the burden on the network of the cluster. On the Dutch LISA cluster, the path to the scratch directory is given by an environment variable, but on other systems it may be a fixed location. This setting exists for the latter case. For the former case, see :setting:`REMOTE_SCRATCH_ENV`. This setting is used by :py:func:`abed.run_utils.get_scratchdir`. .. setting:: REMOTE_SCRATCH_ENV ``REMOTE_SCRATCH_ENV`` ---------------------- Default: ``'TMPDIR'`` See also the description under the setting :setting:`REMOTE_SCRATCH`. This setting exists in case the location of the *scratch* directory is given by an environment variable. The name of the environment variable can be set here. The default corresponds to the name used on the Dutch LISA compute cluster. This setting is used by :py:func:`abed.run_utils.get_scratchdir`. Settings for Master/Worker program ================================== .. setting:: MW_SENDATONCE ``MW_SENDATONCE`` ----------------- Default: ``100`` When running tasks on the compute node, Abed works through a Master/Worker system, where the master thread continually sends out work to the worker threads. To reduce communication load, it sends out a number of tasks at once. This setting defines how many tasks are sent. Note that as soon as there are fewer tasks left than ``n_workers * MW_SENDATONCE``, Abed reduces the number of tasks to send to 1. 
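As a minimal sketch, the batching rule just described could be written as follows (the function name ``tasks_per_send`` and its signature are illustrative, not Abed's actual API):

```python
def tasks_per_send(n_remaining, n_workers, mw_sendatonce=100):
    """Illustrative sketch of the batching rule described above.

    Hypothetical helper; not Abed's actual implementation.
    """
    # Once fewer than n_workers * MW_SENDATONCE tasks remain, hand out
    # tasks one at a time so no single worker hoards the remainder.
    if n_remaining < n_workers * mw_sendatonce:
        return 1
    return mw_sendatonce
```

For example, with the default of 100 and 16 workers, batches of 100 tasks are sent until fewer than 1600 tasks remain; after that, tasks are sent one at a time.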
This ensures that the situation can't occur where one worker gets all the remaining tasks while other workers get none. .. setting:: MW_COPY_WORKER ``MW_COPY_WORKER`` ------------------ Default: ``False`` Abed can reserve a thread on the compute node to periodically copy results from the scratch directory on the node to the home directory of the user. However, this places additional load on the internal bandwidth of the compute cluster, so it is turned off by default. The results will in any case be copied from the scratch directory as compressed archives after the wall time has ended. .. setting:: MW_COPY_SLEEP ``MW_COPY_SLEEP`` ----------------- Default: ``120`` If :setting:`MW_COPY_WORKER` is ``True``, Abed reserves one thread on the compute node for periodically copying the result files from the scratch directory to the home directory (see also :setting:`REMOTE_SCRATCH`). This setting defines the time in seconds between consecutive copies of the results. .. setting:: MW_NUM_WORKERS ``MW_NUM_WORKERS`` ------------------ Default: ``None`` The number of worker processes to use. This can be useful in applications where you *don't* want to use all available processes, for instance because you're using parallelism in your application. Allowed values are a number or ``None``. If ``None`` is given, the maximum number of worker processes is started. Note that the master process and the copy worker do not count as worker processes (see also :setting:`MW_COPY_WORKER`). Experiment type =============== These settings define the type of experiment that Abed will run. For a more in-depth overview, see :doc:`experiments`. Note that not all settings apply to all types of experiments. Therefore, the initial settings file generated by Abed requires the user to uncomment the block of settings relating to the specific type of experiment they want to use. .. setting:: TYPE ``TYPE`` -------- Default: Undefined. This setting defines the type of experiment that will be run.
Valid options are ``'ASSESS'``, ``'CV_TT'``, and ``'RAW'``. See :doc:`experiments` for a more in-depth discussion of the different types. .. setting:: CV_BASESEED ``CV_BASESEED`` --------------- Default: ``123456`` Only used when the experiment type is ``'CV_TT'``. This defines the base seed to use for the generation of the cross-validation seeds. This setting is used by the function :py:func:`abed.tasks.init_tasks_cv_tt`. .. setting:: YTRAIN_LABEL ``YTRAIN_LABEL`` ---------------- Default: ``'y_train'`` Only used when the experiment type is ``'CV_TT'``. This defines the label for the part in the result file which corresponds to the training set. .. setting:: RAW_CMD_FILE ``RAW_CMD_FILE`` ---------------- Default: ``'/path/to/file.txt'`` Only used when the experiment type is ``'RAW'``. This is the path to the file with raw tasks. See also :doc:`experiments`. Build settings ============== .. setting:: NEEDS_BUILD ``NEEDS_BUILD`` --------------- Default: ``False`` Whether or not compilation is necessary on the remote. .. setting:: BUILD_DIR ``BUILD_DIR`` ------------- Default: ``'build'`` Directory where the build command needs to be executed. It is assumed that this is a subdirectory of the *current* directory on the remote. The path defined here should therefore be the path in the git archive. With the default setting, the local Abed directory would look like:: |--- abed_conf.py |--- abed_tasks.txt |--- abed_auto.txt |--- build |--- datasets |--- execs .. setting:: BUILD_CMD ``BUILD_CMD`` ------------- Default: ``'make all'`` The command to run on the remote. This is run in the directory given by the :setting:`BUILD_DIR` setting. Experiment parameters and settings ================================== .. setting:: DATADIR ``DATADIR`` ----------- Default: :py:const:`abed.constants.DATASET_DIRNAME` The path where the datasets are stored. Note that this does not necessarily have to be in the Abed working directory (where you typed ``abed init``). 
However, on the remote the datasets will be placed in a directory called ``datasets`` in the remote working directory. By using this setting it is possible to place your datasets in a directory outside of the Abed working directory. This can be very useful when you're running multiple experiments that use the same datasets. .. setting:: EXECDIR ``EXECDIR`` ----------- Default: :py:const:`abed.constants.EXECS_DIRNAME` The path where the executables are stored. These executables are used in the commands that Abed runs (see the setting :setting:`COMMANDS`). It is advisable to keep the default setting, as this makes it easier to place executables under version control. .. setting:: DATASETS ``DATASETS`` ------------ Default: ``['dataset_1', 'dataset_2']`` This setting defines the names of the datasets that will be used in the experiments. Abed expects a different type in the list depending on the :setting:`TYPE` that is used: * When :setting:`TYPE` is ``'ASSESS'``, the expected format is the same as the default, simply a list of names of the datasets, as strings. * When :setting:`TYPE` is ``'CV_TT'``, the expected format is a list of tuples, where each tuple is a pair of strings. The first string gives the name of the training dataset, and the second string the name of the test dataset. For instance:: DATASETS = [('dataset_1_train', 'dataset_1_test'), ('dataset_2_train', 'dataset_2_test')] * When :setting:`TYPE` is ``'RAW'``, this setting is not used. See :doc:`experiments` for more information on the different experiment types and their requirements. .. setting:: DATASET_NAMES ``DATASET_NAMES`` ----------------- Default: ``{k:str(i) for i, k in enumerate(DATASETS)}`` Optional. This setting gives a mapping of datasets to names. This can be useful when you wish to use different names for the datasets in the output than in the :setting:`DATASETS` setting.
If this setting is not present in the settings file, the ID of a dataset will be generated with the function :py:func:`abed.datasets.dataset_name`. As an example, consider dataset names following the pattern ``'dataset_1'``, ``'dataset_2'``, etc. It may then be nice to use names such as ``'001'``, ``'002'``, etc. in the result tables. This can be achieved by setting:: DATASET_NAMES = {k:'%03i' % int(k.split('_')[-1]) for k in DATASETS} Note that this setting relates closely to the setting :setting:`DATA_DESCRIPTION_CSV`. .. setting:: METHODS ``METHODS`` ----------- Default: ``['method_1', 'method_2']`` Here you define the names of the methods that you will use. These names must be the same as the ones used in the :setting:`PARAMS` and the :setting:`COMMANDS` settings. Since these names will also be used in directory names, it is advisable not to use spaces or other illegal characters in them. .. setting:: PARAMS ``PARAMS`` ---------- Default:: { 'method_1': { 'param_1': [val_1, val_2], 'param_2': [val_3, val_4], 'param_3': [val_5, val_6] }, 'method_2': { 'param_1': [val_1, val_2, val_3], }, } As described in :doc:`workings`, Abed runs a grid search where the commands defined in :setting:`COMMANDS` are run for the respective method for each dataset, and for all possible combinations of the parameter values. The values that are used are defined in this setting. A list of values is expected for each parameter, even if only one value is used. Note that the names used for the parameters must match those used in the commands; the user must ensure that these are the same. This setting is not used with the ``'RAW'`` experiment type. .. setting:: COMMANDS ``COMMANDS`` ------------ Default:: { 'method_1': ("{execdir}/method_1 {datadir}/{dataset} {param_1} " "{param_2} {param_3}"), 'method_2': "{execdir}/method_2 {datadir}/{dataset} {param_1}" } Abed works by calling external commands for each method.
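To make the template mechanics concrete, here is a minimal sketch of how such a command template might be filled in, using plain Python ``str.format`` (the concrete values are made up for illustration; the actual substitution is performed internally by Abed):

```python
# Hypothetical illustration of expanding a command template.
template = "{execdir}/method_1 {datadir}/{dataset} {param_1} {param_2}"
command = template.format(
    execdir="execs",      # from the EXECDIR setting
    datadir="datasets",   # from the DATADIR setting
    dataset="dataset_1",  # one entry of the DATASETS setting
    param_1=0.5,          # one point of the PARAMS grid
    param_2=10,
)
# command == "execs/method_1 datasets/dataset_1 0.5 10"
```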
The advantage of running external commands is that Abed can be used regardless of the language that the methods are implemented in. This setting defines the commands that Abed needs to run for each method. The variables ``{execdir}`` and ``{datadir}`` are special variables, which are formatted by Abed automatically. The ``{param_*}`` variables correspond to the names defined in :setting:`PARAMS`. Finally, the ``{dataset}`` variable will be formatted by Abed based on the names of the datasets defined in the :setting:`DATASETS` setting. Note that it is up to the user to ensure the right file extension is supplied here. This means that if the name of a dataset defined in :setting:`DATASETS` is for instance ``'iris'``, but the filename on disk is ``'iris.txt'``, the command should be adjusted to contain the part ``{datadir}/{dataset}.txt``. There are slight differences between the way the commands are used depending on the type of experiment that is run (see :setting:`TYPE`). Specifically: * When :setting:`TYPE` is ``'ASSESS'``, the expected form for the dataset part of the command is ``{dataset}`` (as in the default). * When :setting:`TYPE` is ``'CV_TT'``, both a training and a test dataset should be included in the command, with the variables ``{train_dataset}`` and ``{test_dataset}``, respectively. Thus, for this format a command could look like:: COMMANDS = {'method_1': ("{execdir}/method_1 {datadir}/{train_dataset} " "{datadir}/{test_dataset} {param_1} {param_2} {param_3}")} * When :setting:`TYPE` is ``'RAW'``, this setting is not used. .. setting:: METRICS ``METRICS`` ----------- Default:: { 'NAME_1': { 'metric': metric_function_1, 'best': max }, 'NAME_2': { 'metric': metric_function_2, 'best': min } } This setting defines the metrics that are applied to the output of a single command. The user is free to define any function here, although Abed currently expects a function that takes two lists as input.
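A custom metric matching that expected signature could be as simple as the following sketch (the function name ``error_rate`` is illustrative; any function taking two lists and returning a scalar would fit):

```python
def error_rate(y_true, y_pred):
    # Takes two equal-length lists, as Abed expects, and returns a scalar:
    # the fraction of entries that differ. Illustrative only.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Paired with 'best': min in the METRICS setting, lower values rank better.
```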
It is therefore recommended to either use functions from ``sklearn.metrics``, or to define functions with a similar signature. See :doc:`metric_functions` for instructions on how to include custom metrics. Note that in these settings, a name can be defined by the user, as well as which direction is considered *better* for the metric function. This is done by defining the ``'best'`` field, which can be either the ``max`` function or the ``min`` function. These directions will be used when Abed ranks the results of a method on a given dataset with a given set of parameters. .. setting:: SCALARS ``SCALARS`` ----------- Default:: { 'time': { 'best': min }, } This setting can be used to compare results from a command on a single variable. This can be useful when one wants to compare, for instance, the computation time of a command. The external executable could for instance print:: time 0.8473294179 With the default setting for the :setting:`SCALARS` field, Abed would read this value as a scalar result for the command. .. setting:: RESULT_PRECISION ``RESULT_PRECISION`` -------------------- Default: ``4`` Results are considered equal if they are the same number within this precision. Thus, with the default setting, the numbers 1.12345 and 1.12354 would be considered *equal*, and would therefore get the same rank. If no results should ever be considered equal, increase this setting to a large enough number. .. setting:: DATA_DESCRIPTION_CSV ``DATA_DESCRIPTION_CSV`` ------------------------ Default: ``None`` When generating result tables, it is possible to add additional columns to the table with an external CSV file. It is required that the CSV file is of the format:: ID,col1,col2,col3 1,a,10,3 2,b,20,2 3,c,30,1 where the first column is considered the column with the IDs of the datasets. The easiest way to do this is to combine this with the :setting:`DATASET_NAMES` setting, which is a ``dict`` mapping elements of the :setting:`DATASETS` list to IDs.
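For instance, the following sketch (dataset names are made up) maps datasets to zero-padded IDs that can match the ``ID`` column of such a CSV file:

```python
DATASETS = ['dataset_1', 'dataset_2', 'dataset_3']

# Hypothetical example: IDs '001', '002', ... intended to match the first
# column of the DATA_DESCRIPTION_CSV file.
DATASET_NAMES = {k: '%03i' % int(k.split('_')[-1]) for k in DATASETS}
# DATASET_NAMES == {'dataset_1': '001', 'dataset_2': '002', 'dataset_3': '003'}
```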
IDs of datasets must be strings. The first row of the CSV file will be used as headers in the table. .. setting:: REFERENCE_METHOD ``REFERENCE_METHOD`` -------------------- Default: ``None`` Abed automatically runs statistical tests to see if a chosen reference method is statistically different from other methods. This reference method can be set here, and must be a method from the :setting:`METHODS` setting. If you do not wish to run these statistical tests, use the default value of ``None``. See also the documentation in :doc:`statistical_tests` for more information on how to use and interpret the test results (tldr: carefully!). .. setting:: SIGNIFICANCE_LEVEL ``SIGNIFICANCE_LEVEL`` ---------------------- Default: ``0.05`` This sets the significance level used in the statistical tests. See also the documentation in :doc:`statistical_tests` and the setting :setting:`REFERENCE_METHOD`. PBS settings ============ The settings below all relate to running the simulations on a compute cluster. Currently only PBS Torque type clusters are supported. In the future, these settings will likely be generalized to support other compute cluster setups as well. .. setting:: PBS_NODES ``PBS_NODES`` ------------- Default: ``1`` The number of compute nodes to use on the cluster. .. setting:: PBS_WALLTIME ``PBS_WALLTIME`` ---------------- Default: ``360`` Wall-clock time in minutes for the computations. This is the time that will be reserved from the queueing system. Note that the actual computation time also depends on :setting:`PBS_TIME_REDUCE`. .. setting:: PBS_CPUTYPE ``PBS_CPUTYPE`` --------------- Default: ``None`` Optional. The type of cpu to use on the cluster. Some clusters allow you to specify which type of cpu will be used by the job. This can be very important for jobs where time comparisons are performed, since in that case it is vital to use the same type of cpu. If set, this setting must be a string.
For example, one can specify ``'cpu4'`` for a specific type of CPU on Lisa. This setting may not be available on all PBS systems. .. setting:: PBS_CORETYPE ``PBS_CORETYPE`` ---------------- Default: ``None`` Optional. The type of node to use on the cluster, as specified by the number of cores of the node. This setting is similar to the :setting:`PBS_CPUTYPE` setting. For example, one can specify ``'cores16'`` for a 16-core node. This setting may not be available on all PBS systems. .. setting:: PBS_LINES_BEFORE ``PBS_LINES_BEFORE`` -------------------- Default: ``[]`` Optional. Additional lines to add to the PBS file. These lines will be added before the email line, and directly after the lines creating the result directories. .. setting:: PBS_LINES_AFTER ``PBS_LINES_AFTER`` ------------------- Default: ``[]`` Optional. Additional lines to add to the PBS file. These lines will be added just after the compression of the result files, and just before the final email line. .. setting:: PBS_PPN ``PBS_PPN`` ----------- Default: ``None`` Optional. The number of processors per node to use. If you know beforehand how many cores there are on a node, this setting allows you to limit the number of processors that are actually used for computations. Especially when running computation time comparisons, it is recommended to reserve one core for system processes. .. setting:: PBS_MODULES ``PBS_MODULES`` --------------- Default: ``['mpicopy', 'python/2.7.9']`` Optional. On some PBS systems, additional modules may be loaded with the command ``module load``. This setting defines the modules that are loaded. Note that some modules may be necessary for Abed to function correctly. For instance, the ``mpicopy`` command is used for copying files to compute nodes during a job, and on some systems this may require loading the ``mpicopy`` module. See also the setting :setting:`PBS_MPICOPY`. ..
setting:: PBS_EXPORTS ``PBS_EXPORTS`` --------------- Default: ``['PATH=$PATH:/home/%s/.local/bin/abed' % REMOTE_USER]`` Optional. The lines in this list are interpreted as arguments for the ``export`` command. This can be useful for setting PATH variables, or for defining other environment settings. .. setting:: PBS_MPICOPY ``PBS_MPICOPY`` --------------- Default: ``['{data_dir}', EXECDIR, TASK_FILE]`` Optional. Abed was initially designed for the Dutch National LISA Compute Cluster. On this cluster, it is more efficient to store results from computations in a so-called *scratch* directory, which is a disk attached locally to the compute node. To copy files to this scratch directory, the LISA staff designed the ``mpicopy`` command. This setting can be used to define the files and directories that will be copied to the scratch directory on the node. For more information, see the documentation of the ``mpicopy`` command. Depending on this command is not a very portable solution; ideas for improvement are very welcome. .. setting:: PBS_TIME_REDUCE ``PBS_TIME_REDUCE`` ------------------- Default: ``600`` Abed generates a result file for every task. Since this can be quite a lot of files to download from the server after the job is done, Abed creates compressed archives of the results. These archives are generated using the ``pbzip2`` command, which compresses files in parallel. Hence, part of the time of the job is used for this result compression. The time allotted for this is defined with this setting, in seconds. If you expect only a few result files, you can choose to reduce the value of this setting. Note: it is currently unknown whether the ``pbzip2`` command is widely available. If dependency on this command is a problem, please let us know.
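Since :setting:`PBS_WALLTIME` is given in minutes and :setting:`PBS_TIME_REDUCE` in seconds, the rough time budget of a job can be sketched as follows (an approximation for planning purposes; the exact accounting inside Abed may differ):

```python
PBS_WALLTIME = 360      # minutes reserved from the queueing system
PBS_TIME_REDUCE = 600   # seconds set aside for compressing results

# Approximate number of seconds left for actual computation.
compute_seconds = PBS_WALLTIME * 60 - PBS_TIME_REDUCE
# compute_seconds == 21000, i.e. 350 minutes of compute time
```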