Source: Condor

The Condor Team at the University of Wisconsin-Madison has released a new development version of its workload management system, Condor. The version number has reached 7.3.2, and the package is released under the Apache 2.0 license. Condor focuses on managing compute-intensive jobs and can distribute them across multiple connected nodes. The user submits a job to Condor, which then handles it based on the configured policies and the availability of the connected resources, and finally returns the results to the user. Condor can, for example, drive a dedicated Beowulf cluster, but ordinary desktops that are normally used by people can also be put to work while they are idle. When a user returns to the desktop, the current job is automatically migrated to another node. The announcement, together with the list of changes in this release, reads as follows:

Condor 7.3.2 released!

The Condor Team is pleased to announce the release of Condor 7.3.2. This release improves the behavior of the checkpoint server in mixed 32/64 bit architecture pools, provides a new tool called condor_ssh_to_job to allow interactive debugging of running jobs, some performance enhancements, lazy log file processing for DAGMan jobs, plugins for utilizing a host's power management capabilities, and many other new features in addition to many bug fixes.

Release Notes:
  • The format of the output from condor_status with the -grid option has been changed to provide more useful information.
  • Removed the newline appended to the end of condor_status -format output. Therefore, code which parses the output of this command should now be careful when trimming the last line.
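Scripts that parse condor_status -format output can be written so that they do not depend on whether the last record ends in a newline. The following is a minimal sketch in Python, assuming the output has already been captured as a string and that each record consists of a machine name and a state separated by whitespace. The function name and the record layout are assumptions made for illustration and do not come from the release notes.

```python
from typing import List, Tuple


def parse_status_lines(output: str) -> List[Tuple[str, str]]:
    """Split captured status output into (name, state) pairs.

    Gives the same result whether or not the final record is
    followed by a newline.
    """
    records = []
    # splitlines() never yields a trailing empty entry, so a missing
    # final newline does not change the number of records.
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines rather than failing on them
        # Assumed layout: "<name> <state>", split on the first whitespace run.
        name, state = line.split(None, 1)
        records.append((name, state))
    return records
```

Using splitlines() instead of trimming a fixed number of characters from the end means the same parser works with output from both older and newer versions.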
New Features:
  • condor_fetchlog may now fetch the history files of a condor_schedd daemon. And, the history file kept by the condor_schedd daemon may now be rotated daily or monthly.
  • The condor_ckpt_server will automatically clean up stale checkpoint files. The configuration variables which control this behavior are described below.
  • The condor_ckpt_server (either the 32-bit or 64-bit) executable will now communicate correctly between 32-bit and 64-bit submit nodes. If by some chance bit width issues arise in the checkpoint protocol (for example, with file sizes), clear error messages are logged in the checkpoint server logs.
  • The new condor_ssh_to_job tool allows interactive debugging of running jobs. See the manual page for details.
  • The condor_status command is now substantially faster, especially with the -format option.
  • Grid universe grid type gt5 has been added for submission to the new Globus GRAM5 service. When a GRAM service is identified as gt5, jobmanager throttling and the Grid Monitor are not used. See section 5.3.2 for details.
  • Grid universe grid type cream has been added for submission to the CREAM job service of gLite. See section 5.3.8 for details.
  • When low on file descriptors for creating new network sockets, the condor_schedd daemon now avoids the unlimited stacking up of messages that it sends periodically to the condor_negotiator and condor_startd.
  • The performance and failure handling of the Grid Monitor have been improved.
  • For grid type nordugrid in the grid universe, job status information is now obtained using Nordugrid ARC's LDAP server, which should greatly improve performance. Also, Condor can now tell when these jobs are running.
  • The new -valgrind option to condor_submit_dag causes condor_submit_dag to generate a submit description file that uses valgrind on condor_dagman, instead of the condor_dagman binary as its executable.
  • condor_dagman now lazily evaluates and opens node job log files. Instead of parsing all submit description files and immediately opening their specified log files at start up, condor_dagman now parses the submit description files just before each job is submitted, and has each log file open only when relevant jobs are in the queue or executing POST scripts. In addition, condor_dagman now automatically generates a default user log file for any node job that does not specify one.
  • Both the support and documentation for the MPI universe have been removed. MPI applications are supported through the use of the parallel universe.
  • When the condor_startd daemon's test of virtual machine software fails (for machines configured as capable of running virtual machines), the condor_startd will periodically retry the test until it succeeds.
  • The nordugrid_gahp now limits the number of connections made to each NorduGrid ARC server and reuses connections when possible.
  • Added the ClassAd function eval(), which takes a string argument and evaluates the contents of the string as a ClassAd expression. A policy example where this is useful is described in section 3.5.9 on job suspension.
  • The new condor_q option -attributes limits the attributes which are displayed when using the -xml or -long options. Limiting the number of attributes also increases the efficiency of the query.
  • Condor's power management capabilities are now implemented as a plug-in. In particular, the condor_startd now runs an external program, as specified by the configuration variable HIBERNATION_PLUGIN, to perform the detection of available low power states and the switching to these low power states.
  • The new Condor daemon condor_rooster has been added to wake up hibernating machines when the expression defined by the configuration variable UNHIBERNATE becomes True. The configuration variables relating to condor_rooster are described in section 3.3.35.
  • Added the ability to extract information from the user event log reader's state buffer to the user log reader. This is implemented through a new ReadUserLogStateAccess C++ class as defined in read_user_log.h.
  • Changes to the value of the configuration variable CERTIFICATE_MAPFILE or the contents of the file to which it refers no longer require a full restart of Condor. Instead, the command condor_reconfig will cause the changes to be utilized.
  • The condor_master daemon will now print the path and arguments to any daemons it starts if D_FULLDEBUG is enabled. Previously, there was no way to get it to display the arguments with which it was starting a daemon.
  • The condor_had daemon now has the ability to control daemons other than the condor_negotiator. This is controlled via the HAD_CONTROLLEE macro.
  • Condor now recognizes VOMS extensions in X.509 proxies. The VOMS attributes are encoded in the job ClassAd attribute X509UserProxySubject.
  • The condor_startd can now clean up stranded virtual machines, following a crash of Condor or its host operating system.
  • Following a crash, the condor_gridmanager no longer restarts all of the jobmanagers for gt2 jobs. This should improve recovery time.
  • Condor works better with the ClassAds categorized as generic in the condor_collector daemon. Various daemons that register themselves with generic ClassAds can now have tools which use the -subsystem option manipulate their ClassAds properly.
  • Condor now provides a mechanism to enforce strict resource limiting for some universes of running jobs.
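The lazy log file handling described for condor_dagman in the list above can be pictured as a file that is opened on first use and closed again once no job needs it. The sketch below is a simplified Python illustration of that idea and not DAGMan's actual code. The class name LazyLogFile and its methods are hypothetical.

```python
class LazyLogFile:
    """Keep a log file open only while at least one job is using it."""

    def __init__(self, path):
        self.path = path
        self._handle = None     # nothing is opened at construction time
        self._active_jobs = 0   # jobs currently queued or running

    @property
    def is_open(self):
        return self._handle is not None

    def job_submitted(self):
        # Open the file the first time a job that logs to it is submitted.
        if self._handle is None:
            self._handle = open(self.path, "a+")
        self._active_jobs += 1

    def job_finished(self):
        if self._active_jobs == 0:
            raise RuntimeError("no active jobs for this log file")
        self._active_jobs -= 1
        # Release the file descriptor once the last job is done.
        if self._active_jobs == 0:
            self._handle.close()
            self._handle = None
```

With many node jobs, this pattern keeps the number of simultaneously open file descriptors proportional to the jobs in flight rather than to the size of the whole DAG.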
Configuration Variable Additions and Changes:
  • The new configuration variable EMAIL_SIGNATURE specifies a custom signature to be appended to e-mail sent by the Condor system. If defined, then this custom signature replaces the default one specified internally. There is no default value for this variable.
  • The new configuration variable CKPT_SERVER_CLIENT_TIMEOUT informs the condor_schedd how long in seconds it is willing to wait to try and talk to a condor_ckpt_server process before declaring a condor_ckpt_server down. See section 3.3.11 for the complete description.
  • The new configuration variable CKPT_SERVER_CLIENT_TIMEOUT_RETRY informs the condor_schedd, once a condor_ckpt_server has been marked as down, how many seconds must pass before the condor_schedd will try to communicate with the condor_ckpt_server again. See section 3.3.11 for the complete description.
  • The new configuration variable CKPT_SERVER_REMOVE_STALE_CKPT_INTERVAL informs the condor_ckpt_server to begin removal of stale checkpoints at the specified interval in seconds. See section 3.3.8 for the complete description.
  • The new configuration variable CKPT_SERVER_STALE_CKPT_AGE_CUTOFF informs the condor_ckpt_server how old a checkpoint file's access time must be in order to be considered stale. This time is compared against the current notion of now when the checkpoint server checks the checkpoint image file. See section 3.3.8 for the complete description.
  • The new configuration variable SlotWeight may be used to give a slot greater weight when calculating usage, computing fair shares, and enforcing group quotas. See 3.3.10 for the complete description.
  • The new configuration variable MAX_PERIODIC_EXPR_INTERVAL implements a ceiling on the time between evaluation of periodic expressions, due to the adaptive timing implied by the configuration variable PERIODIC_EXPR_TIMESLICE. See 3.3.11 for the complete description.
  • The new configuration variable GRIDMANAGER_SELECTION_EXPR can be used to control how many condor_gridmanager processes will be spawned to manage grid universe jobs. As a part of this change, removed the configuration variable and supporting code for GRIDMANAGER_PER_JOB since the new configuration variable supersedes it. See 3.3.11 for the complete description.
  • The configuration variable GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE and the corresponding throttle GRIDMANAGER_MAX_PENDING_SUBMITS have been removed.
  • The new configuration variable GRID_MONITOR_DISABLE_TIME controls how long the condor_gridmanager will wait after encountering an error before attempting to restart a Grid Monitor job. See 3.3.23 for the complete description.
  • The new pre-defined configuration macro DETECTED_MEMORY indicates the amount of physical memory (RAM) detected by Condor. The value is given in Mbytes.
  • The new pre-defined configuration macro DETECTED_CORES indicates the number of CPU cores detected by Condor.
  • The new configuration variable DELEGATE_FULL_JOB_GSI_CREDENTIALS controls whether a full or limited X.509 proxy is delegated for grid type gt2 grid universe jobs. See 3.3.26 for the complete description.
  • The new configuration variable UNHIBERNATE is used by the condor_startd to advertise in its ClassAd a boolean expression specifying when the machine should be woken up, for example by condor_rooster. See 3.3.10 for the complete description.
  • The new configuration variable HIBERNATION_PLUGIN specifies the path to the plug-in which the condor_startd uses both to detect the low power state capabilities of a machine and to switch the machine to a low power state. See 3.3.10 for the complete description.
  • The new configuration variable HIBERNATION_PLUGIN_ARGS specifies additional command line arguments which the condor_startd will pass to the plug-in when invoking it to switch the machine to a low power state. See 3.3.10 for the complete description.
  • The new configuration variable HIBERNATION_OVERRIDE_WOL can be used to direct the condor_startd to ignore Wake On LAN (WOL) capabilities of the machine's network interface, and to switch to a low power state even if the interface does not support WOL, or if WOL is disabled on it. See 3.3.10 for the complete description.
  • The new configuration variable DAGMAN_USER_LOG_SCAN_INTERVAL controls how long condor_dagman waits between checking job log files for status updates. See 3.3.25 for the complete description.
  • The new configuration variable DAGMAN_DEFAULT_NODE_LOG sets the default log file name for the new condor_dagman default node log file feature. See 3.3.25 for the complete description.
  • Removed the configuration variable DAGMAN_DELETE_OLD_LOGS; new log file reading code makes it obsolete.
  • The new configuration variable HAD_CONTROLLEE is used to specify the name of the daemon which the condor_had controls. This name should match the daemon name in the condor_master's DAEMON_LIST.
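The stale checkpoint rule described for CKPT_SERVER_STALE_CKPT_AGE_CUTOFF above compares a file's access time with the current time. The Python sketch below illustrates that comparison under stated assumptions: the function names are hypothetical, every regular file in the directory is treated as a checkpoint, and a file counts as stale when its age is strictly greater than the cutoff. It is not the checkpoint server's implementation.

```python
import os
import time


def find_stale_checkpoints(directory, age_cutoff_seconds, now=None):
    """Return sorted paths of files whose access time is older than the cutoff."""
    if now is None:
        now = time.time()
    stale = []
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue  # only regular files are considered
        age = now - entry.stat().st_atime
        if age > age_cutoff_seconds:  # assumption: strictly older than cutoff
            stale.append(entry.path)
    return sorted(stale)


def remove_stale_checkpoints(directory, age_cutoff_seconds, now=None):
    """Delete stale files and return the paths that were removed."""
    removed = find_stale_checkpoints(directory, age_cutoff_seconds, now)
    for path in removed:
        os.remove(path)
    return removed
```

A periodic caller would invoke remove_stale_checkpoints at the interval given by CKPT_SERVER_REMOVE_STALE_CKPT_INTERVAL. Note that file systems mounted with access-time updates disabled would make this kind of check unreliable.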
Bugs Fixed:
  • Fixed a bug in ClassAd functions where arguments which should have been correctly coerced into strings instead evaluated to ERROR.
  • Fixed a confusing diagnostic message with the JobRouter, which happened when a job was removed within 5 minutes of being submitted.
  • Fixed a bug in which the use of dynamic slots (see section 3.13.7) caused the machine ClassAd attribute SLOT_STARTD_ATTRS to disappear from the ClassAd for some slots.
  • Fixed a Windows platform bug in which the window belonging to a Condor job does not receive a paint message.
  • Fixed a bug causing condor_q -analyze to crash when there was no condor_schedd daemon ClassAd file.
  • Fixed a condor_procd crash caused when the environment of a monitored process exceeded 1MByte in /proc.
  • Fixed a Windows platform bug which could cause the condor_credd to crash if a requested credential is not in the password store.
  • Fixed a bug that was causing the job event log rotation lock to be created with incorrect permissions.
  • Fixed a bug in the rotation of the job event log which could cause it never to be rotated in the Windows port of Condor.
  • Fixed a potential race condition in the job event log initialization.
  • Fixed a race condition which could cause a crash of the condor_collector and condor_schedd on shutdown.
  • Fixed a bug in which the condor_master would sometimes die and produce a dprintf_failure.MASTER file when either restarting due to new binary timestamps or when started initially.
  • Fixed a memory leak related to SOAP configuration variables that occurred when Condor was reconfigured.
  • Fixed a bug in which the submit description file command cron_day_of_week was erroneously ignored.
  • Fixed a bug in which the configuration variables MAX_JOB_QUEUE_LOG_ROTATIONS and GRIDMANAGER_SELECTION_EXPR would not work properly at start up; they only worked after a condor_reconfig.
  • Fixed a bug in which SOAP operations were being incorrectly authorized with the peer IP <0.0.0.0>.
  • Fixed a Windows platform bug in which not all Condor daemons were trusted by the Windows Firewall (previously known as Internet Connection Firewall or ICF).
  • Fixed a shutdown race condition in the condor_master with respect to high availability daemons.
  • Fixed a bug in which a Condor daemon incorrectly determined it had run out of socket descriptors.
  • Fixed a bug where the condor_schedd would block for very long periods of time while trying to connect to a down checkpoint server. Now the condor_schedd will do a blocking connect with a timeout to the checkpoint server for a configurable number of seconds. If the connect fails, the condor_schedd will put a moratorium on connecting to the checkpoint server until the configurable moratorium period passes. The configuration file variables that describe this behavior are described above.
  • Changed the check that condor_dagman does for other condor_dagman instances running the same DAG, if it finds a lock file at startup. Now, if condor_dagman is not sure whether the other DAGMan is alive, it continues, rather than exiting.
  • Fixed a major file descriptor leak in the Stork daemon.
  • Fixed a bug in which successful Stork transfers were marked as failed.
  • Fixed an uncommon memory leak in the user event log file reading code when reading badly formatted events.
  • Fixed a bug in which multiple machine ClassAds in the condor_collector with the same Name, but different StartdIPAddr attribute values, would cause the condor_negotiator to exit with an error. This is unusual and should not happen in a typical Condor installation. The most likely cause is using condor_advertise to advertise custom ClassAds for grid matchmaking.
  • Fixed a bug that caused condor_dagman to core dump if all submit attempts failed on a DAG node having a POST script. This bug has existed since Condor version 7.1.4.
  • Fixed a memory leak in the condor_schedd, which occurred when the configuration variable NEGOTIATOR_MATCH_EXPRS was used.
  • Fixed a bug in the Windows platform code that treats scripts as executables. Unknown file extensions were treated as an error, rather than as a Windows executable.
  • The condor_job_router now correctly sets the ClassAd attribute EnteredCurrentStatus to the current time when creating a new routed job. Previously, it copied this attribute from the original job.
  • The condor_job_router emits a more friendly log message when it observes that the routed copy of the job was removed.
  • A fix has been made for a problem seen in 7.3.1 in which Condor daemons using CCB to connect to other Condor daemons would sometimes consume large amounts of CPU time for no good reason.
  • Fixed a rare failure case bug in which attempts to connect via CCB could stay in a pending state indefinitely.
  • A Unix only bug caused Condor daemons to fail to start if MAX_FILE_DESCRIPTORS was configured higher than the current hard limit inherited by Condor. If Condor is running as root, this is no longer the case.
  • The condor_gridmanager now advertises grid ClassAds properly when there are multiple condor_collector daemons.
  • When using condor_q -xml and -format together to limit the number of ClassAd attributes returned in the query, the XML container tag was not generated. This is fixed, but now the preferred way to limit the returned attributes is to use condor_q option -attributes.
  • Fixed a bug in which the Unix condor_master failed when trying to restart itself, if the configuration variable MASTER_LOCK was defined, or if the condor_master was invoked with the -t option. This bug has existed since the 7.0 series, and likely has existed much longer than that.
  • Fixed a significant memory leak in the gahp_server. This leak was only present in previous Condor 7.3.x releases.
  • Fixed a bug that can cause a removed job that is held and then released to return to idle status.
  • The Globus jar files distributed with the x86-64 RHEL 5 RPMs were damaged, causing gt4 grid type jobs to fail. This has been fixed.
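The checkpoint server fix in the list above combines a connect timeout with a moratorium on further attempts after a failure. The following Python sketch shows that pattern in general terms. It is not the condor_schedd's code: the class name, the injectable connector and clock parameters, and the example host and port are all hypothetical. The connect_timeout and retry_interval arguments play the roles of CKPT_SERVER_CLIENT_TIMEOUT and CKPT_SERVER_CLIENT_TIMEOUT_RETRY.

```python
import socket
import time


class CheckpointServerClient:
    """Connect with a timeout and back off for a fixed period after a failure."""

    def __init__(self, address, connect_timeout, retry_interval,
                 connector=socket.create_connection, clock=time.monotonic):
        self.address = address                  # (host, port) tuple
        self.connect_timeout = connect_timeout  # seconds to wait for a connect
        self.retry_interval = retry_interval    # moratorium length in seconds
        self._connector = connector
        self._clock = clock
        self._down_until = None                 # end of the current moratorium

    def connect(self):
        """Return a connected socket, or None if the server is considered down."""
        now = self._clock()
        if self._down_until is not None and now < self._down_until:
            return None  # still inside the moratorium: do not even try
        try:
            sock = self._connector(self.address, timeout=self.connect_timeout)
        except OSError:
            # Mark the server as down and start a new moratorium period.
            self._down_until = self._clock() + self.retry_interval
            return None
        self._down_until = None
        return sock
```

A caller might create CheckpointServerClient(("ckpt.example.org", 12345), connect_timeout=20, retry_interval=60) and treat a None result as "skip the checkpoint server for now", so that a dead server costs at most one timeout per moratorium period.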
Version number: 7.3.2
Release status: Final
Operating systems: Windows 7, Windows 2000, Linux, BSD, Windows XP, macOS, Solaris, UNIX, Windows Server 2003, Windows Vista, Windows Server 2008
Website: Condor
Download: http://www.cs.wisc.edu/condor/downloads-v2/download.pl
License type: Conditions (GNU/BSD/etc.)