The HTCondor Team at the University of Wisconsin-Madison has released two new versions of its workload management system, HTCondor: version 8.6.4 in the stable branch and version 8.7.2 in the development branch. HTCondor focuses on managing compute-intensive jobs and can distribute them across the nodes connected to it. A user submits a job to HTCondor, which then handles it according to the configured policies and the availability of connected resources, and finally returns the results to the user. HTCondor can, for example, drive a dedicated Beowulf cluster, but also ordinary desktops that happen to be idle. During SC16, Google, Fermilab and the HTCondor Team demonstrated a 160k-core cloud-based elastic compute cluster. The changes in these releases are listed below:
- Python bindings are now available on MacOSX.
- Allow Python modules to be used as condor_collector plugins. This undocumented feature is to be used by expert developers only.
- Fixed a bug with PASSWORD authentication that would sporadically cause it to fail to exchange keys, due to whether or not the first round-trip of communications blocked on reading from the network.
- Pslot preemption now properly handles machine custom resources, such as GPUs.
- Fixed a bug that prevented HTCondor from correctly setting virtual memory cgroup limits when soft physical memory limits were being used.
- Fixed a bug that prevented parallel universe jobs from running that used $$() expansion in submit files.
- Added a new knob, STARTD_RECOMPUTE_DISK_FREE, which defaults to true and tells the startd to periodically recompute and advertise free disk space. Admins can set this to false for partitionable slots whose execute directory is used by HTCondor alone.
- Fixed a bug that could cause condor_submit to fail to submit a job with a proxy file to a condor_schedd older than 8.5.8, due to the absence of an X.509 CA certificates directory.
- Restored a check in condor_submit about whether the job's X.509 proxy has sufficient lifetime remaining.
- Fixed a bug in condor_dagman where the DAG status file showed an incorrect status code if submit attempts failed for the final node.
- Bosco now properly identifies CentOS 7 as a supported platform.
- Fixed a bug when Bosco is used to submit jobs to multiple remote clusters. When arguments to remote_gahp are provided in the GridResource attribute, jobs could be submitted to the wrong cluster.
- To speed up the installation process on Enterprise Linux 7, the SELinux profile is now reloaded only once, when setting the HTCondor daemons to run in permissive mode.
- Update the systemd configuration on Enterprise Linux 7 to start the condor_master after time synchronization is achieved. This prevents unnecessary daemon restarts due to sudden time shifts.
- The condor_shadow will now ignore updates of JobStartDate from the condor_starter since the condor_schedd already sets this attribute correctly and the condor_starter incorrectly tries to set it even if the job has already run once. A consequence of this fix is that the value of JobStartDate that the condor_startd uses for policy expressions will be different than the value that the condor_schedd uses. Resolving this problem will potentially break existing policy expressions in the condor_startd, so it will not be changed in the 8.6 series, but will be fixed in the 8.7 series.
- Fixed a bug where per-instance job attributes like RemoteHost would show up in the history file for completed jobs. This bug occurred if a job happened to complete while the condor_schedd was in the process of a graceful shutdown.
- The condor_convert_history command is present again in this release.
- The parameter SETTABLE_ATTRS_ADMINISTRATOR now correctly appears in condor_config_val.
- Our current implementation of late materialization is incompatible with condor_dagman and will cause unexpected behavior, including failing without warning. This is a top-priority issue which we aim to resolve in an upcoming release.
- Improved the performance of the condor_schedd by setting the default for the knob SUBMIT_SKIP_FILECHECKS to true. This prevents the condor_schedd from checking the readability of all input files, and skips the creation of the output files on the submit side at submit time. Output files are now created either at transfer time, when file transfer is on, or by the job itself, if a shared filesystem is used. As a result of this change, it is possible that a job will run to completion and only then be put on hold because the output file on the submit machine cannot be written.
- Changed condor_submit to not create empty stdout and stderr files before submitting jobs by default. This caused confusion for users, and slowed down the submission process. The older behavior, where condor_submit would fail if it could not create these files, is available when the parameter SUBMIT_SKIP_FILECHECKS is set to false. The default is now true.
- condor_q will now show expanded totals when querying a condor_schedd that is version 8.7.1 or later. The totals for the current user and for all users are provided by the condor_schedd. To get the old totals display set the configuration parameter CONDOR_Q_SHOW_OLD_SUMMARY to true.
- The condor_annex tool now logs to the user configuration directory. Added an audit log of condor_annex commands and their results.
- Changed condor_off so that the -annex flag implies the -master flag, since this is more likely to be the right thing.
- Added -status flag to condor_annex, which reports on instances which are running but not in the pool.
- If invoked with an annex name and duration (but not an instance or slot count), condor_annex will now adjust the duration of the named annex.
- Job input files which are downloaded from http:// web addresses now have mechanisms to recover from transfer failures. This should increase the reliability of using web-based input files, especially under slow and/or unstable network conditions.
- Reduced load on the condor_collector by optimizing queries performed when an HTCondor daemon needs to look up the address of another daemon.
- Reduced load on the condor_collector by optimizing queries performed when using condor_q with several different command-line options such as -submitter and -global.
- Added the condor_top tool, an automated version of the now-defunct condor_top.pl which uses the python bindings to monitor the status of daemons.
- Added a new option -cron to condor_gpu_discovery that allows it to be used directly as an executable of a condor_startd cron job.
- The configuration variable MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER was set to default to 100. It formerly had no default value.
- Added a parameter DEDICATED_SCHEDULER_USE_SERIAL_CLAIMS which defaults to false. When true, it allows the dedicated scheduler to use claimed/idle slots that the serial scheduler has claimed.
- The condor_advertise tool now assumes an update command if one is not specified on the command-line and attempts to determine the exact command by inspecting the first ad to be advertised.
- Improved support for running several condor_negotiators in a single pool. NEGOTIATOR_NAME now works like MASTER_NAME. condor_userprio has a -name option to select a specific condor_negotiator. Accounting ads from multiple condor_negotiators can co-exist in the condor_collector. (Ticket #5717)
- Package EC2 Annex components in the condor-annex-ec2 sub RPM.
- Added configuration parameter ALTERNATE_JOB_SPOOL, an expression evaluated against the job ad, which specifies an alternate spool directory to use for files related to that job.
- With an empty configuration file, HTCondor would behave as if ALLOW_ADMINISTRATOR were *. Changed the default to $(CONDOR_HOST), which is much less insecure.
- Fixed a bug in the condor_schedd where it did not account for the initial state of late materialize jobs when calculating the running totals of jobs by state. This bug resulted in condor_q displaying incorrect totals when CONDOR_Q_SHOW_OLD_SUMMARY was set to false.
- Fixed a bug where the condor_schedd would incorrectly try to check the validity of output files and directories for late materialize jobs. The condor_schedd will now always skip file checks for late materialize jobs.
- Changed the output of the condor_status command so that the Load Average field now displays the load average of just the condor job running in that slot. Previously, load originating outside of condor was proportionately distributed into the condor slots, resulting in much confusion.
- Illegal chars ('+', '.') are now prohibited in DAGMan node names.
- Improve audit log messages by including the connection ID and properly filtering out shadow and gridmanager modifications to the job queue log.
- condor_root_switchboard has been removed from the release, since PrivSep is no longer supported.
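
Since '+' and '.' are now prohibited in DAGMan node names, existing DAG files may need a check before upgrading. The following is a minimal sketch in Python, not part of HTCondor: it assumes that nodes are declared on lines of the form `JOB <name> <submit file>` or `FINAL <name> <submit file>`, and the function name `illegal_node_names` and the example DAG contents are hypothetical.

```python
# Characters the release notes list as illegal in DAGMan node names.
ILLEGAL_CHARS = set("+.")


def illegal_node_names(dag_text):
    """Return (node_name, bad_chars) pairs for node declarations with illegal names.

    Assumption: a node is declared on a line whose first word is JOB or FINAL
    and whose second word is the node name. Other DAG commands are ignored.
    """
    problems = []
    for line in dag_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and lines that do not declare a node.
        if len(fields) < 2 or fields[0].upper() not in ("JOB", "FINAL"):
            continue
        name = fields[1]
        bad = sorted(ILLEGAL_CHARS.intersection(name))
        if bad:
            problems.append((name, bad))
    return problems


# Hypothetical DAG file contents, for illustration only.
example_dag = """\
# hypothetical DAG file
JOB prepare prepare.sub
JOB run.1 run.sub
JOB merge+plot merge.sub
PARENT prepare CHILD run.1
"""

for name, bad in illegal_node_names(example_dag):
    print(f"{name}: contains {''.join(bad)}")
```

Only the node name is inspected, so a submit file name such as `run.sub` is not flagged. Whether DAGMan itself rejects further characters or checks names in other commands is not covered by this sketch.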