Simultaneously, this document poses requirements for the batch system in a general form, i.e., in order to achieve the goals below, it is necessary for our abstraction of the batch system to conform to some minimal requirements.
Rationale:
Feasibility in the model BS. Most batch systems support the concept
of priority or otherwise allow the administrator to affect a job's
pending time more or less directly. For example, the LSF system allows
queue-level priorities, so SMBSM could submit jobs to different queues.
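As an illustration of the queue-based approach, the following is a minimal sketch of how SMBSM might map an internal urgency score to a queue and build the corresponding submission command. The queue names, the urgency score, and both function names are hypothetical and do not come from SAM or LSF; the only LSF feature assumed is the `-q` option of `bsub`, which selects the target queue. The command is built but not executed.

```python
from typing import List

# Hypothetical queue names, ordered from lowest to highest priority.
# A real site would substitute the queues its administrator has configured.
QUEUES = ["sam_low", "sam_normal", "sam_high"]


def pick_queue(urgency: float) -> str:
    """Map an urgency score in [0, 1] to one of the configured queues."""
    clamped = min(max(urgency, 0.0), 1.0)
    # Scale to a queue index; an urgency of exactly 1.0 maps to the last queue.
    index = min(int(clamped * len(QUEUES)), len(QUEUES) - 1)
    return QUEUES[index]


def build_submit_command(script: str, urgency: float) -> List[str]:
    """Build (but do not run) a bsub command line targeting the chosen queue."""
    return ["bsub", "-q", pick_queue(urgency), script]
```

Keeping the mapping in one small function means that a batch system with per-job priorities rather than queue-level ones would only require a different command builder.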
Goal 2. SAM will attempt to group pending or newly arriving jobs with other running or pending jobs. (The jobs are not related from the end user's standpoint.)
Rationale: Resource types can be categorized into those allowing multiple concurrent users and those whose users are mutually exclusive. Examples of mutex-usable resources are CPU and virtual memory (with the small exception of different jobs sharing code in text segments). Examples of concurrently usable resources are network bandwidth and MSS bandwidth, where, e.g., delivery of the same data unit benefits multiple projects. Thus, grouping jobs that can use the same resource together will in general increase productivity (primary goal 3). In practice, SMBSM may try to dispatch several projects at about the same time if their datasets overlap significantly and if running the projects successively would cause unnecessary disk cache thrashing.
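The grouping criterion can be sketched as follows. This is only an illustration under stated assumptions: each project's dataset is represented as a set of file names, "overlap significantly" is approximated by the Jaccard index (shared files divided by all distinct files), and the 0.5 threshold and both function names are hypothetical. The document does not prescribe a particular overlap measure.

```python
from typing import Dict, List, Set


def overlap(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two file sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def group_projects(datasets: Dict[str, Set[str]],
                   threshold: float = 0.5) -> List[List[str]]:
    """Greedily group projects whose datasets overlap with a group's first member."""
    groups: List[List[str]] = []
    for name in sorted(datasets):  # sorted for a deterministic result
        for group in groups:
            # Compare against the group's first member (its "seed") only.
            if overlap(datasets[name], datasets[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([name])
    return groups
```

Each resulting group is a candidate for dispatch at about the same time. The greedy comparison against a single seed is a simplification, and a real planner would presumably also weigh cache capacity and file sizes.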
Feasibility in the model BS. May not be a direct feature. For
example, LSF supports different conditions that must be met in order
for a job to run, including another job having started. This LSF feature
would allow strict grouping of jobs to be enforced, whereas SMBSM would
probably prefer to suggest the grouping as a hint. It is possible,
however, that some grouping may be achieved by selectively increasing
the priorities (or reducing the wait time) of the several jobs to
be grouped; see the previous goal.
Goal 3. SAM will create additional conditions for a job to run.
Rationale: Typically, it is undesirable for a SAM consumer to start if the files cannot be delivered to its project. This is because the large application binary will waste virtual memory for a potentially indefinite time and occupy the "job slot" for no reason, thus preventing other jobs, which may already have their files staged, from running.
Feasibility in the model BS. Depending on the exact model, SMBSM will use an appropriate mechanism; for example, the following is available from LSF:
Goal 4. SAM will try to preempt the running jobs
whose data delivery has dramatically degraded.
Rationale: However well SAM has done its resource planning, it is possible that the data for a project has stopped coming. There may be a network outage or an ATL robot malfunction. Even if the system works properly, a higher-priority access mode may begin that effectively reduces the bandwidth available to other access modes, including the one for the project at hand. At this point, the rationale for goal 3 fully applies.
Feasibility in the model BS. Probably no special feature is required from the BS. Even if the BS provides preemption, SAM will likely prefer not to terminate the job abruptly by, e.g., sending it a SIGTERM.
Since this condition is specific to the "getNextFile" operation, SAM will
probably signal the consumer process that there are no more files available
"any time soon" and have SMBSM resubmit the job and the project back
to the BS (perhaps at a later time). It may be possible to do so transparently
to the user. The result would be that, with the exception of short, "reasonable"
delays, the heavyweight consumer application is in the running state (and
thus using the job slot etc.) only when there are data available!
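The cooperative alternative to abrupt termination can be sketched as a consumer loop. This is a hedged illustration, not SAM's interface: `NoFilesSoon`, `Outcome`, `run_consumer`, and the callable standing in for "getNextFile" are all hypothetical names, and the signal that no files are available "any time soon" is modeled here as an exception.

```python
import enum
from typing import Callable, Optional


class Outcome(enum.Enum):
    FINISHED = "finished"  # the project delivered all of its files
    RESUBMIT = "resubmit"  # delivery stalled; the job should return to the BS


class NoFilesSoon(Exception):
    """Hypothetical signal: no more files are available "any time soon"."""


def run_consumer(get_next_file: Callable[[], Optional[str]],
                 process: Callable[[str], None]) -> Outcome:
    """Process files until the project ends or delivery stalls.

    get_next_file is assumed to return a file path, return None when the
    project has no files left, or raise NoFilesSoon when delivery has degraded.
    """
    while True:
        try:
            path = get_next_file()
        except NoFilesSoon:
            # Leave cleanly so that the job slot and virtual memory are freed;
            # the caller is expected to resubmit the job and the project.
            return Outcome.RESUBMIT
        if path is None:
            return Outcome.FINISHED
        process(path)
```

On a `RESUBMIT` outcome the job releases its slot between files, so no partially processed file is lost, and the decision of when to resubmit stays with SMBSM.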