The Condor Team at the University of Wisconsin-Madison has released a new development version of its workload management system, Condor. The version number has reached 7.5.5, and the package is released under the Apache 2.0 license. Condor focuses on managing compute-intensive jobs and can distribute them across multiple connected nodes. A user submits a job to Condor, which then handles it according to the configured policies and the availability of the connected resources, and finally returns the results to the user. Condor can, for example, drive a dedicated Beowulf cluster, but ordinary desktops that are momentarily idle can also be put to work. The announcement, with the list of changes in this release, reads as follows:
Condor 7.5.5 released!
The Condor Team is pleased to announce the release of Condor 7.5.5. This is a development release of Condor. This release is primarily to improve scalability and performance. Additionally, of note to those people who build Condor from source, we have modernized our build system to use cmake instead of imake. A large number of bugs have been fixed - more details are in the Version History. Condor 7.5.5 binaries and source code are available from our downloads page.
- This version of Condor uses a different layout in the spool directory for storing files belonging to jobs that are in the queue. Conversion of the spool directory is automatic when upgrading, but be aware that downgrading to a previous version of Condor requires extra effort. The procedure for downgrading is either to drain all jobs with spooled files from the queue, or to manually convert the spool back to the older format. To manually convert back to the older format, stop Condor and back up the spool directory in case of problems. Then move all subdirectories matching the form $(SPOOL)/<#>/<#>/cluster<#>.proc<#>.subproc<#> into $(SPOOL). Also do this for any files of the form $(SPOOL)/<#>/cluster<#>.ickpt.subproc<#>. Edit $(SPOOL)/job_queue.log with a text editor, and change all references to the moved directories and files from their previous paths to their new paths directly under $(SPOOL). Then, remove $(SPOOL)/spool_version. Finally, start up Condor.
- For those who compile Condor from the source code rather than using packages of pre-built executables, be aware that in this release Condor is built using cmake instead of imake. See the README.building file for the new instructions on how to build Condor.
- This release note serves to remind users that as of Condor version 7.5.1, the RPMs come with native packaging. Therefore, items are in different locations, as given by FHS locations, such as /usr/bin, /usr/sbin, /etc, and /var/log. Please see section 3.2.6 for installation documentation.
- Quill is now available only within the source code distribution of Condor. It is no longer included in the builds of Condor provided by UW, but it is available as a feature that can be enabled by those who compile Condor from the source code. Find the code within the condor_contrib directory, in the directories condor_tt and condor_dbmsd.
- The AIX 5.2 packages in this release have been found to be incompatible with AIX 5.3.
- We are planning to drop support for AIX. Please contact us if this is a problem for you.
- The directory structure within the Unix tar file package of Condor has changed. Previously, the tar file contained a top level directory named condor-. The top level directory is now the same as the tar file name, but without the .tar.gz extension.
- On Unix platforms, the following executables used to be located in both the sbin and bin directories, but are now only located in the bin directory: condor, condor_checkpoint, condor_reschedule, and condor_vacate.
- The size of the Condor installation has increased by as much as 60% compared to Condor version 7.5.4. We hope to eliminate most of this increase in Condor version 7.5.6.
- Previously, packages containing debug symbols were available for many Unix platforms. In this release, the debug packages contain full, 'unstripped' executables instead of just the debug symbols.
- The contents of the Windows zip and MSI packages of Condor have changed. The lib and libexec folders no longer exist, and all contents previously within them are now in bin. condor_setup and condor_set_acls have been moved from the top level directory into bin.
- The Windows MSI installer for Condor version 7.5.5 requires that all previous MSI installations of Condor be uninstalled. Before uninstalling previous versions, make backup copies of configuration files. Any settings that need to be preserved must be reapplied to the configuration of the new installation.
- The following list itemizes changes included in this Condor version 7.5.5 release that belong to Condor version 7.4.5, a stable series version that has not yet been released at the time of this development release.
- condor_dagman now prints a message in the dagman.out file whenever it truncates a node job user log file. condor_dagman now prints additional diagnostic information in the case of certain log file errors.
- Fixed a bug in which a network disconnect between the submit machine and execute machine during the transfer of output files caused the condor_starter daemon to immediately give up, rather than waiting for the condor_shadow to reconnect. This problem was introduced in Condor version 7.4.4.
- Fixed a bug in which if condor_ssh_to_job attempted to connect to a job while the job's input files were being transferred, this caused the file transfer to fail, which resulted in the job returning to the idle state in the queue.
- In privsep mode, the transfer of output failed if a job's execute directory contained symbolic links to non-existent paths.
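The file-moving step of the manual spool downgrade described in the release notes above can be sketched in Python. This is only an illustrative sketch, not a tool shipped with Condor: the function names `plan_spool_flatten` and `flatten_spool` are hypothetical, and it assumes that every `<#>` in the documented patterns stands for a run of decimal digits. It does not edit job_queue.log or remove spool_version, and it should only be run with Condor stopped and the spool directory backed up.

```python
from __future__ import annotations

import re
import shutil
from pathlib import Path

# Entry names taken from the release note; "<#>" is assumed to mean digits.
JOB_DIR = re.compile(r"^cluster\d+\.proc\d+\.subproc\d+$")
ICKPT_FILE = re.compile(r"^cluster\d+\.ickpt\.subproc\d+$")


def plan_spool_flatten(spool: Path) -> list[tuple[Path, Path]]:
    """Return (source, destination) pairs for entries to move into spool."""
    moves = []
    # Directories of the form $(SPOOL)/<#>/<#>/cluster<#>.proc<#>.subproc<#>
    for path in spool.glob("*/*/*"):
        parts = path.relative_to(spool).parts
        if (path.is_dir() and parts[0].isdigit() and parts[1].isdigit()
                and JOB_DIR.match(path.name)):
            moves.append((path, spool / path.name))
    # Files of the form $(SPOOL)/<#>/cluster<#>.ickpt.subproc<#>
    for path in spool.glob("*/*"):
        parts = path.relative_to(spool).parts
        if (path.is_file() and parts[0].isdigit()
                and ICKPT_FILE.match(path.name)):
            moves.append((path, spool / path.name))
    return sorted(moves)


def flatten_spool(spool: Path, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Move the planned entries into spool; only report them if dry_run."""
    moves = plan_spool_flatten(spool)
    if not dry_run:
        for src, dst in moves:
            if dst.exists():
                # Refuse to overwrite anything already at the top level.
                raise FileExistsError(dst)
            shutil.move(str(src), str(dst))
    return moves
```

Calling `flatten_spool(Path("/path/to/spool"))` with the default `dry_run=True` only lists the planned moves, which makes it possible to review them before anything is changed. The path here is a placeholder.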
Configuration Variable and ClassAd Attribute Additions and Changes:
- Negotiation is now handled asynchronously in the condor_schedd daemon. This means that the condor_schedd remains responsive during negotiation and is less prone to falling behind on communication with condor_shadow processes.
- Improved monitoring and avoidance of a lock convoy problem observed when there were more than 30,000 condor_shadow processes. At this scale, locking the condor_shadow daemon's log on each write to the log file has been observed on Linux platforms to sometimes result in a situation where the system does very little productive work, and is instead consumed by rapid context switching between the condor_shadow daemons that are waiting for the lock.
- On Linux platforms, if the condor_schedd daemon's spool directory is on an ext3 file system, Condor can now scale to a larger number of spooled jobs. Previously, Condor created two subdirectories within the spool directory for each spooled job and for each running job. The ext3 file system only supports 31,997 subdirectories. This effectively limited the number of spooled jobs to less than 16,000. Now, Condor creates a hierarchy of subdirectories within the spool directory, to increase the limit on the number of spooled jobs in ext3 to 320,000,000, which is likely to be larger than other limits on the size of the job queue, such as memory.
- The condor_shadow daemon uses less memory than it has since Condor version 7.5.0. Memory usage should now be similar to the 7.4 series.
- The condor_dagman and condor_submit_dag command-line flag -DumpRescue causes the dump of an incomplete Rescue DAG, when the parsing of the DAG input file fails. This may help in figuring out what went wrong. See section 2.10.7 for complete details on Rescue DAGs.
- condor_dagman now has the capability to create the jobstate.log file needed for the Pegasus workflow manager. See section 2.10.11 for details.
- condor_wait can now work on jobs with logs that are only readable by the user running condor_wait. Previously, write access to the job's user log was required.
- Added a new value for the job ClassAd attribute JobStatus. The TRANSFERRING_OUTPUT status is used when transferring a job's output files after the job has finished running. Jobs with this status will have their JobStatus attribute set to 6. The standard condor_q display will show the job's status as >.
- The new configuration variable LOCK_DEBUG_LOG_TO_APPEND controls whether a daemon's debug lock is used when appending to the log. With the default value of False, the debug lock is only used when rotating the log file. When True, the debug lock is used when writing to the log as well as when rotating the log file. See section 3.3.4 for the complete definition.
- The new configuration variable LOCAL_CONFIG_DIR_EXCLUDE_REGEXP may be set to a regular expression that specifies file names to ignore when looking for configuration files within the directories specified via LOCAL_CONFIG_DIR. See section 3.3.3 for the complete definition.
- In previous versions of Condor, the condor_starter could not write the .machine.ad and .job.ad files to the execute directory when PrivSep was enabled. This has now been fixed, and these files are correctly emitted in all cases.
- Since Condor version 7.5.2, condor_q has been slower than in earlier 7.5 and 7.4 releases, especially when retrieving large numbers of jobs. Viewing 100K jobs took about four times longer. This release restores performance to about the same level as before Condor version 7.5.2.
- A bug introduced in Condor version 7.5.4 prevented parallel universe jobs with multiple queue statements in the submit description file from working with condor_dagman. This is now fixed.
- Improved the way Condor daemons send heartbeat messages to their parent process. This resolves a problem observed on busy submit machines using the condor_shared_port daemon. The condor_master daemon sometimes incorrectly determined that the condor_schedd was hung, and would kill and restart it.
- When the configuration variable NOT_RESPONDING_WANT_CORE is True, the condor_master daemon now follows up with a SIGKILL, if the child process does not exit within ten minutes of receiving SIGABRT. This addresses observed cases in which the child process hangs while writing a core file.
- Host name-based authorization failed in Condor version 7.5.4 under Mac OS X 10.4, because lookups of the host name for incoming connections failed.
- A bug introduced in Condor version 7.5.0 caused the attributes MyType and TargetType in offline ClassAds to get set to "(unknown type)" when the offline ClassAd was matched to a job.
- condor_dagman now excepts in the case of certain log file errors, when continuing would be likely to put condor_dagman into an incorrect internal state.
- Fixed a bug that caused DAG node jobs to have their coredumpsize limit set according to the CREATE_CORE_FILES configuration variable, rather than the user's coredumpsize limit.
- Fixed a case introduced in Condor version 7.5.4 on Windows platforms, in which the following spurious log message was produced: SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
- Since Condor version 7.4.1, Condor-C jobs submitted without file transfer enabled could fail with the following error in the condor_starter log: FileTransfer: DownloadFiles called on server side
- Fixed a memory leak caused by use of the ClassAd eval() function. This problem was introduced in Condor version 7.5.2.
- Fixed a bug that could cause the condor_negotiator daemon to crash when groups are configured with GROUP_QUOTA_DYNAMIC_<group_name>, or when GROUP_QUOTA_ is not defined to be something greater than 0.
- Fixed a bug that caused random characters to appear for the value of AuthMethods when logging with D_FULLDEBUG and D_SECURITY enabled. This problem was introduced in Condor version 7.5.3.
- Fixed a memory leak in the condor_schedd introduced in Condor version 7.5.4.
- Fixed a problem introduced in Condor version 7.5.4 that could cause the condor_schedd daemon to enter an infinite loop while in the process of shutting down. For the problem to happen, it was necessary for flocking to have been enabled.
- Configuration variable SCHEDD_QUERY_WORKERS was effectively ignored when condor_q authenticated itself to the condor_schedd. The query was always processed in the main condor_schedd process rather than in a sub-process. This problem has existed since before Condor version 7.0.0.
- Fixed a problem affecting jobs that store their output in the condor_schedd's spool directory. The problem affected jobs that include an empty directory in their list of output files to transfer. This problem was introduced in Condor version 7.5.4, when support for the transfer of directories was added.
- Fixed a problem affecting the condor_master daemon since Condor version 7.5.3. The condor_master daemon would crash if it was instructed to shut down a daemon that was not currently running, but which was waiting to be restarted.
- Fixed a bug in condor_submit that prevented the submission of multiple vm universe jobs in a single submit file.
- Fixed a bug in the condor_schedd that could cause it to temporarily undercount the number of running local and scheduler universe jobs. In Condor version 7.5.4, this undercounting could cause the daemon to crash.
- Fixed a bug that could cause the condor_gridmanager to crash if a GAHP server did not behave as expected on start up.
- Improved the hold reason reported in several failure cases for CREAM grid jobs.
- The KFlops attribute reported by condor_status -run -total could erroneously be reported as negative. This has been fixed.
- Since Condor version 7.5.4, the refreshing of the proxy for the job in the remote queue did not work in Condor-C. Therefore, if the original job proxy expired, the job was halted and put on hold, even if the proxy had been renewed on the submit machine.
Known Bugs:
- In Condor version 7.5.5, when a running job is put on hold, the job is removed from the job queue.
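Scripts that interpret the JobStatus attribute may need to account for the new TRANSFERRING_OUTPUT value mentioned in the list above. The following Python sketch shows one way to map the numeric value to a name and a condor_q status character. Only the value 6 and the ">" character come from this announcement. The values 1 through 5 and their single-letter codes are recalled from the Condor manual and should be checked against it, and the `JobStatus` enum, the `CONDOR_Q_SYMBOL` table, and the `describe` helper are hypothetical names used for illustration.

```python
from enum import IntEnum


class JobStatus(IntEnum):
    # Values 1-5 are recalled from the Condor manual (assumption).
    IDLE = 1
    RUNNING = 2
    REMOVED = 3
    COMPLETED = 4
    HELD = 5
    # New in Condor 7.5.5 according to the release notes.
    TRANSFERRING_OUTPUT = 6


# Status characters as shown by condor_q; only ">" is stated in the
# announcement, the letters are recalled from the manual (assumption).
CONDOR_Q_SYMBOL = {
    JobStatus.IDLE: "I",
    JobStatus.RUNNING: "R",
    JobStatus.REMOVED: "X",
    JobStatus.COMPLETED: "C",
    JobStatus.HELD: "H",
    JobStatus.TRANSFERRING_OUTPUT: ">",
}


def describe(job_ad: dict) -> str:
    """Return the status character and name for a job ClassAd-like dict."""
    status = JobStatus(job_ad["JobStatus"])
    return f"{CONDOR_Q_SYMBOL[status]} {status.name}"
```

A tool written before this release that treats every unknown JobStatus as an error would reject jobs in the new state, so handling the value 6 explicitly, as `describe({"JobStatus": 6})` does here, is the safer choice.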