SAM Station Design

1 Introduction

This design document continues the effort to understand, in order to implement, the functionality of the station. We develop the ideas born in discussions with (or expressed by) several people, especially VW, LL, DP, and RW. All these people should assume credit for this document (unless their ideas have been changed or misinterpreted beyond all recognition). Some of the previous ideas were captured in Lee's document. The new ideas are mostly contained in the section on cache management. This revision incorporates (wherever possible) comments from RW, Ruth, and VW.

In the remainder of the introduction, we elaborate on our design goals and then present our rationale based on the analysis of the resource management issues. We will also clarify some of the terms we use most frequently. In Section 2, we present the actual disk management ideas. In Section 3, we address the issues of administering the SAM station. Section 4 describes what work has to be done on the Database side of our project. Section 5 concludes with the anticipated impact on users, the ultimate judges of our work.
 

1.1 Goals


As the Run II data handling system, SAM ought to manage (and optimize whenever possible) hardware resources in order to provide the most efficient data access for the users. In order to do so, SAM controls (monitors and regulates) data access by different physics groups and individuals. We envision a hierarchy in resource management and data access coordination: the global resource manager called the optimizer is responsible for the global resources (primarily related to the ATL, such as tape mount rate and tape drives); the station coordinates allocation of resources to projects; and projects coordinate data access and resource usage by their consumers (which are also end users of the resources).

We assume that the network bandwidth as seen by the station is abundant (thanks to the efforts of our partner projects) so that we can exclude it from the station design. We will defer CPU management until the batch system is fully understood. Thus, the station primarily manages the disk (and the files cached on it). The goal of this design is to suggest a picture for efficient and convenient access to disk files through SAM. We will not concentrate on management of disk bandwidth, i.e., we will largely ignore disk contention, assuming that the disk requests will be randomized well enough to provide natural balancing of disk I/O across the many disks.
Note: it is hard to define "efficient" until we fully understand the meaning of the throughput that we aim to maximize. It is almost certain, however, that the Station's Cache Manager will try to minimize the number of file transfers to/from HSM because these are expensive in any reasonably defined cost metric.
 

1.2 Rationale


The station will have to cooperate with non-SAM resource management systems, primarily with the batch system. SAM will not assume the role of a batch system job scheduler; rather, it will assist the batch system in resource co-allocation in the following way. The traditional batch system usually handles only exclusive-access resources such as CPU, physical memory, scratch disk, etc. In addition, the batch system may not properly manage global resources such as those related to the ATL, which we currently believe will be the most scarce among all the resources.

SAM will aim at remedying potential deficiencies of the batch system. For example, SAM may help determine a job's priority by comparing the job's SAM resource requirements against resource availability. SAM will treat a disk-cached file as a resource. (It may not be immediately obvious that a cached file is a resource because, unlike better-known resources such as CPU, this resource is sharable. In computer science, however, sharable resources are a canonical resource type.) Clearly, the availability of a job's files in the disk cache greatly affects its expected turnaround and therefore the extent to which it may be desirable to schedule the job sooner.

In building the SAM station, we defer most of the resource management issues until (a) the batch system is completely understood and (b) some global optimization is begun. Development of disk management in the station will, however, be the first step towards the station-batch interface. We assume that, given the ability to allocate and schedule disks, the knowledge of the data file sets requested by projects (both running and queued), combined with a user-supplied CPU per event estimate, will provide the necessary basis to build the station-batch interface.

In summary, the rationale for the station design in the present form, i.e., largely restricted to disk management, is as follows. We strongly believe that intelligent disk (cache) management is (a) a well-defined task of the station, to be integrated seamlessly into the bigger picture, (b) a natural step towards efficient overall resource management, rather than a diversion from the ongoing overall analysis, and (c) a necessity at the present stage of the SAM evolution as a project.
 

1.3 Definitions


Specifically in the context of the station design, we will use the following terms:

2 Disk Management


A Station is said to manage a disk if and only if:

While these points may seem obvious to some readers, they may not be obvious to everyone and thus require an explanation. We have already seen how violation of the second item brings SAM into an inconsistent state. As for the first item, consider an example where a non-SAM entity writes a file onto a SAM-managed disk. If the space has not been allocated, the station may dedicate some or all of the physical blocks to another purpose, leading to unpredictable consequences.

The station will either manage the cache automatically or provide administrative tools for direct disk manipulation by the human administrator. The former encompasses what Lee's document refers to as Short/Long Term Caches and Buffers and is described in the following subsection. The latter is primarily based on file locking/unlocking and on the explicit allocate() operation, both described at the end of this section.
 

2.1 Cache Management: the Life Cycle of a Disk File


The primary contribution of the present document is given by the following discussion. The proposed design differs significantly from earlier ideas.

The distinction between the Cache and the Buffer is too fine and becomes cumbersome when enforced by the design. In many cases, it is simply not possible to predict whether a file will be reused in the near future or not. Treating a part of the disk as a buffer simply means applying a particular (FIFO) cache replacement algorithm to it. We are not presenting any particular cache replacement algorithm; moreover, we assume that multiple algorithms will be possible (and dynamically set) for various parts of the total disk on the station. Thus, we erase the boundaries between Buffer, Short Term Cache, and Long Term Cache while understanding that different parts of the station may be configured to act effectively as one of these. In short, we treat all the station's disk as THE CACHE.

The Station's Cache Manager (CM) is responsible for coordination of projects requesting files and proper cooperation with the global resource manager (i.e., the optimizer). The cache management algorithm will essentially generalize that in the project master's replenisher: while the replenisher serves only its (directly attached) project master, the station's disk manager serves any number of projects, possibly with overlapping file requests. In other words, the replenisher can be instantiated either within the process space of the project or, in the canonical case, in the process space of the station master.

For backward compatibility with projects that must (or wish to) run without the station master, the cache manager will implement all the interfaces of the replenisher. Thus, every project master will communicate with the same interface, implemented either as a directly attached replenisher or in the station, with the decision being made at project startup time.

When a project is started, its snapshot files are added to the "requested file" set in the Cache Manager. The CM then requests authorization from the optimizer for all the newly requested files, i.e., those that weren't already known before this project started. (The CM "knows" a file if it is already cached or requested to be cached.) At all times, each file in the "requested" set is associated with at least one project that expressed interest in it.
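As an illustration only, the bookkeeping described above might be sketched in Python as follows. The class and method names (RequestedFiles, knows, start_project) are hypothetical and not part of any SAM interface. As a simplification, the sketch records project interest only for files that are not yet cached.

```python
from collections import defaultdict


class RequestedFiles:
    """Hypothetical sketch of the CM's "requested file" set."""

    def __init__(self, cached=()):
        self.cached = set(cached)            # files already on disk
        self.requested = defaultdict(set)    # file name -> interested projects

    def knows(self, file_name):
        # The CM "knows" a file if it is cached or already requested.
        return file_name in self.cached or file_name in self.requested

    def start_project(self, project, snapshot_files):
        """Register a project's snapshot files.

        Returns the newly requested files, i.e. those for which the CM
        would have to ask the optimizer for authorization.
        """
        newly_requested = []
        for name in snapshot_files:
            if not self.knows(name):
                newly_requested.append(name)
            if name not in self.cached:
                # Every requested file keeps at least one interested project.
                self.requested[name].add(project)
        return newly_requested
```

A second project asking for a file that is already requested adds itself to that file's project set but triggers no new authorization request.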

When the authorization for a file arrives, the file is added to the "can go" file list. This is the list of files, hopefully grouped by volume (if the optimizer has done a good job), whose HSM->disk retrieval can begin as soon as there is enough cache space. Specifically, if the disk requirements for the next delivery group (see below) can be met by erasing some of the disposable files (called "can free" in the replenisher), the CM instructs the stager(s) to erase the disposable files and initiate the deliveries for the group. A delivery group is a sublist of the "can go" list that is a unit of ENCP work; naturally, it is a set of files from one physical volume (tape). If tape mounts are the most scarce resource, a group includes all the files from the tape that are needed by all the known projects. If disk space becomes limited as well, the group size may decrease to a single file (as in the initial implementation of the replenisher).
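The space check for a delivery group can be sketched as below. This is a minimal sketch under our own assumptions: the function name, the tuple layout of the "can go" list, and the order in which disposable files are offered for erasure are all hypothetical, and the group is simply taken to be all "can go" files from the first volume on the list.

```python
from collections import defaultdict


def next_delivery_group(can_go, free_bytes, disposable):
    """Pick the next delivery group, or return None if it cannot fit.

    can_go:     list of (file_name, volume, size) tuples already authorized
    free_bytes: currently unused cache space
    disposable: dict of file_name -> size for files released by all projects
    Returns (volume, files_to_deliver, files_to_erase) or None.
    """
    if not can_go:
        return None
    # Group files by tape volume, keeping the order in which volumes appear.
    by_volume = defaultdict(list)
    for name, volume, size in can_go:
        by_volume[volume].append((name, size))
    volume, group = next(iter(by_volume.items()))

    # Space still missing once the free space has been used up.
    needed = sum(size for _, size in group) - free_bytes
    to_erase = []
    for name, size in disposable.items():
        if needed <= 0:
            break
        to_erase.append(name)
        needed -= size
    if needed > 0:
        return None  # even erasing every disposable file is not enough
    return volume, [name for name, _ in group], to_erase
```

When the function returns None, a real CM could fall back to a smaller group (down to a single file) before giving up; that fallback is omitted here.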

When a stager notifies the CM of a successful file retrieval completion, the file becomes a "cached file" and is served to the projects associated with the file. The newly cached file is marked as being in use. Its new location is added to the database. Each project then serves the file to its consumers in the usual way; when all the consumers are done, the project releases the file by calling the CM. It is important for the CM to be able to limit the time a project takes to process a file, much like projects themselves have time limits for their consumers to process a file.

Finally, when all the projects release a file, the file is added to the "disposable" list (see above) and the CM reviews its chances to deliver the next group, at which point the file may be erased. Exactly which disposable files are selected to be erased is irrelevant for this document; what is important is that the CM possesses enough information about file accesses (both past and near future) in order to execute some intelligent generalization of LRU or another cache algorithm (see the section on persistent variables). When a file is erased from disk, its associated location is erased from the database.
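The life cycle described in this subsection can be summarized as a small state machine. The following Python sketch is illustrative only: the state names follow the text, while the class and method names are our own assumptions and database updates are left out.

```python
from enum import Enum


class FileState(Enum):
    REQUESTED = "requested"
    CAN_GO = "can go"
    CACHED_IN_USE = "cached, in use"
    DISPOSABLE = "disposable"
    ERASED = "erased"


class CachedFileRecord:
    """Hypothetical per-file record kept by the CM."""

    def __init__(self, name, projects):
        self.name = name
        self.projects = set(projects)    # projects still interested in the file
        self.state = FileState.REQUESTED

    def authorize(self):
        # The optimizer has authorized the retrieval.
        self._move(FileState.REQUESTED, FileState.CAN_GO)

    def delivered(self):
        # A stager reported successful retrieval; the file is now in use.
        self._move(FileState.CAN_GO, FileState.CACHED_IN_USE)

    def release(self, project):
        # The file becomes disposable only when the last project releases it.
        self.projects.discard(project)
        if self.state is FileState.CACHED_IN_USE and not self.projects:
            self.state = FileState.DISPOSABLE

    def erase(self):
        # Only disposable files may be erased from disk.
        self._move(FileState.DISPOSABLE, FileState.ERASED)

    def _move(self, expected, new):
        if self.state is not expected:
            raise ValueError(
                f"{self.name}: cannot go from {self.state.value} to {new.value}"
            )
        self.state = new
```

Rejecting out-of-order transitions, such as erasing a file that is still in use, is a design choice of this sketch rather than a stated requirement.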

If we want multiple stations to access each other's caches, the CM's decision on when to erase a file may become quite complicated. We assume that the global resource manager will coordinate inter-station file exchange; for now, we can either (1) disallow a station accessing a file from another station, or (2) allow remote cache access but then be prepared for the possibility of delivery errors and ignore them.

2.2 Locking of Files on Disk


The station Cache Manager is required to support the notion of a locked (AKA pinned) file, i.e., a file that has been marked as "unerasable" until further notice. We will assume that any cached file (whether in use or disposable) may be locked on disk by a user with sufficient privileges. Clearly, uncontrolled use of this facility will incapacitate the CM by eventually locking all the files, thus leaving effectively no free space on disk and precluding any intelligent cache algorithm from execution. Therefore, the locking of files is primarily intended for specific kinds of data (such as Thumbnail or calibration) and for use by group administrators only.

Locked files (and their occupied space) are effectively excluded from the disk management algorithms above. It is critical, however, that similarly to any other disk files, locked files are subject to full access history monitoring. This access history will be provided to the administrators for their viewing pleasure (well, actually to facilitate decisions to change the contents of the locked area).
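To make the exclusion concrete, here is a minimal sketch of victim selection that skips locked and in-use files. Plain LRU is used purely as an example policy, since this document does not prescribe one; the function name and the dictionary keys are hypothetical.

```python
def choose_victims(files, bytes_needed):
    """Choose files to erase, least recently accessed first.

    files: list of dicts with keys "name", "size", "last_access",
           "locked" and "in_use" (a hypothetical record layout).
    Returns the names to erase, or None if not enough space can be freed.
    """
    # Locked files and files in use are never candidates for erasure,
    # although their access history is still recorded like any other file's.
    candidates = [f for f in files if not f["locked"] and not f["in_use"]]
    candidates.sort(key=lambda f: f["last_access"])

    victims, freed = [], 0
    for f in candidates:
        if freed >= bytes_needed:
            break
        victims.append(f["name"])
        freed += f["size"]
    return victims if freed >= bytes_needed else None
```

Because locked files are filtered out before the policy runs, heavy use of locking shrinks the candidate list and makes a None result more likely, which is the incapacitation risk noted above.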
 

2.3 Output Buffer Allocation


It is important that SAM be responsible for controlling the output buffer allocation. Although we will most likely choose to set aside an output buffer area, ultimately we must treat both input and output areas as parts of THE DISK for the following reasons. First, we must ensure a proper rate of output buffer flushing and reasonable availability of output buffer area for user jobs, as it affects the overall progress of projects (and we are concerned with the consumption rate in this design). Second, an "output" file may become an "input" file soon enough that the distinction between input area and output area becomes quite artificial. We therefore envision the aforementioned allocate() method in the station interface. (In the SAM stub in the analysis framework we already have the place to call it; the framework will do so right before opening an output file.)
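The signature of allocate() is not yet fixed. Purely as an illustration, and assuming it takes a file name and a size and reports success or failure, the bookkeeping behind it might look like the sketch below; the DiskAllocator class and its release() method are hypothetical.

```python
class DiskAllocator:
    """Hypothetical space accounting behind a station allocate() call."""

    def __init__(self, capacity):
        self.capacity = capacity     # total managed space
        self.allocations = {}        # file name -> reserved size

    @property
    def free(self):
        return self.capacity - sum(self.allocations.values())

    def allocate(self, file_name, size):
        """Reserve space for an output file before it is opened."""
        if size <= 0:
            raise ValueError("size must be positive")
        if file_name in self.allocations:
            raise ValueError(f"{file_name} already has an allocation")
        if size > self.free:
            return False             # caller must wait or fail the job
        self.allocations[file_name] = size
        return True

    def release(self, file_name):
        """Give the space back, e.g. after the file has been flushed and erased."""
        self.allocations.pop(file_name, None)
```

In this picture the framework would call allocate() right before opening an output file and refuse to write if the call returns False.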

3 Station Configuration and Administrator Roles


Station configuration is the set of parameters to be  controlled by system and group administrators. The number of parameters  should be neither too small (lest administrators think that SAM is too simplistic or that they don't have enough control) nor too large (lest administrators get too confused). These parameters fall into approximately three categories:

Note: allocations of some global resources, such as tape drives or tape mounts per hour, to a station will likely not be a part of that station configuration; rather, those will define the configuration of the global resource manager (optimizer).

Example activities of administrators changing these parameters include:

4 Persistent Objects: Support Required from the Database and the Db Server

The station master is a permanent "stateful" server; therefore, it must store its state persistently in order to recover from software failures and system reboots. Upon startup, the station master reads its state from the database using the interface with the server. That interface is of course driven by what constitutes the state of the station.

In this section, we present the required DB support for the proposed design. It is not the purpose of this document to decide the exact table organization in the database; together with other project developers, we possess great expertise to do so. Instead, we intend to define what variables must be made persistent.

The quasi-permanent configuration-related variables are based on the following entities and relationships:

The more dynamic objects that are created by the station itself will require the following entities to be added to the database:

We hereby suggest that the remaining information could then be derived from these tables upon station startup. For example, the access history for a particular file is based on the already existing analysis_projects table and analyzed_files table.

The Db server interfaces should be such that they allow storage and retrieval of the above station variables. In addition, interfaces to record significant events, which already include project begin/end, should be extended  so as to incorporate file delivery/erasure.
 

5 Impact on the Users


In this section we attempt to predict the change in "look and feel" of SAM, i.e., give the flavor of new commands and outline benefits for the end users (aside from performance increase due to extensive caching of files). With the introduction of the SAM station, and from that time on, a clear distinction will be made between administrators and end users. Almost all of the new commands/tools will be for use by administrators for configuring and restarting the station.

Typical command lines for configuration will feel like:

sam add disk --disk=/sam/cache1 --size=1000000 --station=central-analysis
sam increase/set allocation --group=mcc99 --disk=/sam/cache1 --size=200000

Typical administrative command to lock a file on disk:

sam lock --file=sim.pmc02_01.pythia.zhbbmet_mb1.1av_200evts.292_1753

(This command may involve physical moving of the file.)

As for the end users, the major benefit will be in relieving them from explicit buffer allocation/cleanup for their projects. The sam start project command (or its successor) will be a request to the station, rather than an action of physically starting the project master; therefore, the command may fail if the station rejects the job. Furthermore, as we work towards the integration with the batch system, we will more frequently speak of a user job and less frequently of a project. A single consumer project is a part of the user job which essentially entails (1) starting of a project, (2) running of an analysis program, and (3) stopping a project. Our tendency is toward a single command such as one of the following:

sam run XXX.py <params>
sam submit XXX.py <params>

Users will have to deal with SAM-imposed resource restrictions, such as disk/ATL usage. We are excited to see how we can, by (seemingly) creating problems for every particular individual, brighten the life of the Collaboration as a whole!
 
 


Igor Terekhov (for the SAM team)

 

 
 
 

=============================================================================

Project   : SAM
Package : sam_doc
$Id: station.html,v 1.10 2000/03/24 19:58:27 vranicar Exp $

This work is part of a development project, called SAM, which consists of a
number of coordinated packages each named sam_xxxx .

Notice of authorship, copyright status,  and terms and conditions, should
the software eventually become available for use outside Fermilab, can be
found in the README and LICENCE files in the top level directory of the main
sam package.

==============================================================================