0. THE PURPOSE OF THIS DOCUMENT

This document assumes that the goals of resource management have been
understood. We present our initial views on how we model resource
management (RM) and what components comprise the RM architecture in SAM.
We aim at (a) establishing a common understanding of the model and
(b) outlining directions for our studies by presenting different
alternatives and raising issues on which the design will depend.

1. THE MODEL

1.1 Types of resources

We roughly divide the available resources into "local" and "global",
depending on whether they are assigned to the station or are a property
of the site (or, even further, global to the entire SAM world).

Local resources are controlled by a station and/or the associated batch
system:

a) Cache disk
b) CPU
c) etc.

Global resources are shared by different stations:

a) MSS resources, accessible by multiple stations of the site (and
   beyond?):
   i) Robot arm mounting tapes
   ii) Tape drives in the MSS
   iii) Network connections to the MSS
b) Network connections among stations. If the station is a cluster, we
   view the whole station as a pointlike "node" on the distributed cache
   network.

1.2 Other Resource Management Entities

SAM resource management implies integration with the other resource
managers available:

a) The MSS is modeled after Enstore. The MSS provides a narrow interface
   to request individual files, with priorities set on a per-request
   basis. It is NOT possible to reorder or modify requests that have
   already been submitted. The MSS provides information about which
   volume contains a given file (e.g., at the time of initial file
   storing).
b) The traditional batch system (BS) will most likely be modeled after
   LSF. The batch system schedules and distributes (in the case of a
   cluster) user jobs based on resource availability and their
   allocation by certain categories.
   A study of the LSF capabilities is in progress in order to model the
   batch system in one of two ways:
   i) In the simplest case, the batch system can manage only a few
      "classical" resources such as CPU, memory, and scratch disk, but
      NOT, e.g., SAM cache disk or MSS resources. SAM would then have to
      use priorities (assigned on a per-job or a per-queue basis) to
      control the dispatch of user jobs in order to achieve the goals
      (see the GOALS document). The advantage of such a model is that it
      is the most general and will likely work with any "classical"
      queue-oriented batch system. The major disadvantage is that, by
      manipulating user jobs (either before or after submission), the
      SAM system effectively replaces the BS scheduler.
   ii) In the more sophisticated case, SAM leaves the scheduling of jobs
      to the BS. However, the SAM resources and policies are taken into
      account by formulating, on behalf of the user jobs, the resource
      requirements for each job so as to incorporate all the
      SAM-controlled resources. Such a batch system must support either
      administrator-defined resource types or abstract, symbolically
      named resources. This model is in the spirit of Condor's
      match-making service, where different entities "advertise"
      available abstract resources and different user jobs require
      certain abstract resources. It is not at all clear whether LSF can
      support abstract resources.

2. THE ARCHITECTURE

SAM resource management will essentially take place at two levels.

2.1 At the station level, we will develop an SM-BS integration module
which will ensure cooperation of the SM (which manages disk cache and
runs projects) with the BS (which manages CPU and other machine
resources and runs user jobs). The strategy of this cooperation depends
on the adopted model of the BS (see above), but in any case its function
is essentially to relate the SAM project to the BS job.
In simpler terms, we will relate the existing "sam start project"
command to the submission of a job into a BS queue (the job will likely
need to be instrumented so as to provide additional "hooks" into the
SM). The goal of the SM-BS integration is to implement the experiment
policies on resource allocation by groups (primary GOAL 2) and to
co-allocate resources (e.g., disk cache together with CPU) in order to
maximize the job throughput (primary GOAL 3).

2.2 We will need to perform certain functions at the global level. Note
that the "global" scope is in turn split into at least two levels:
site-wide and truly global. The site-wide Resource Manager will
essentially coordinate access to the MSS and manage LAN bandwidth. The
global, or inter-site, RM will manage the (shared) WAN bandwidth. It
appears desirable to design a single "global" Resource Manager that is
scalable across the hierarchy, i.e., to develop a single algorithm that
deals with the various slightly different resources (from the same
"global" category) at different levels.

The envisioned functions of the global Resource Manager are:

a) control file transfer requests by (i) prioritizing them and
   (ii) regulating their flow, so as to enforce the experiment policies
   on global resource allocation by access mode (primary GOAL 1).
b) (optional) load balancing among stations by determining, on a
   per-project basis, the optimal station to run on. This function
   serves primary GOAL 3. It is not at present clear to what extent this
   function is necessary in the first release of the resource
   management. To be exact, we have three options, in order of
   increasing sophistication:
   (i) leave the decision to the user (do nothing on SAM's part)
   (ii) determine the station semi-statically, based on the access mode
        and the group of the user, and the policies stored in the
        database.
   (iii) determine the station based on the current job loads of the
        group at the available stations and on resource availability at
        (and for) these stations.

   Note that it is desirable to leave all the load balancing to the BS.
   However, there may be multiple batch system installations in the
   system. Furthermore, the BS may not support administrator-defined
   resources (see the models above).
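
To make the three station-selection options concrete, the following is a
minimal sketch in Python. It is not part of any SAM design. The Station
record, the policy table keyed by (group, access mode), the use of free
cache disk as the only resource check, and all station, group and mode
names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Station:
    """Hypothetical snapshot of one station's state (not a SAM structure)."""
    name: str
    jobs_by_group: Dict[str, int] = field(default_factory=dict)  # running jobs per group
    free_cache_gb: float = 0.0                                   # unallocated cache disk


# Assumed shape of the stored policies: (group, access mode) -> station name.
Policy = Dict[Tuple[str, str], str]


def select_by_user(requested: str) -> str:
    """Option (i): SAM does nothing and honors the user's choice."""
    return requested


def select_semi_static(policy: Policy, group: str, access_mode: str) -> Optional[str]:
    """Option (ii): look the station up in the stored policies.

    Returns None when no policy entry matches.
    """
    return policy.get((group, access_mode))


def select_dynamic(stations: List[Station], group: str,
                   cache_needed_gb: float) -> Optional[str]:
    """Option (iii): use current load and resource availability.

    Among stations with enough free cache, pick the one where the group
    currently runs the fewest jobs; ties go to the larger free cache.
    Returns None when no station has enough free cache.
    """
    eligible = [s for s in stations if s.free_cache_gb >= cache_needed_gb]
    if not eligible:
        return None
    best = min(eligible,
               key=lambda s: (s.jobs_by_group.get(group, 0), -s.free_cache_gb))
    return best.name


if __name__ == "__main__":
    # All names and numbers below are invented for the example.
    stations = [
        Station("station_a", {"group_x": 12}, free_cache_gb=500.0),
        Station("station_b", {"group_x": 3}, free_cache_gb=200.0),
        Station("station_c", {"group_x": 1}, free_cache_gb=20.0),
    ]
    policy = {("group_x", "mode_1"): "station_a"}
    print(select_semi_static(policy, "group_x", "mode_1"))
    print(select_dynamic(stations, "group_x", cache_needed_gb=100.0))
```

In this sketch, option (ii) needs only the policy table, whereas option
(iii) needs an up-to-date view of every candidate station. That
difference in required information is one way to weigh the options for
the first release.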