In the remainder of the introduction, we elaborate on our design goals
and then present our rationale based on the analysis of the resource management
issues. We will also clarify some of the terms we use most frequently.
In Section 2, we present the actual disk management ideas. In Section 3,
we address the issues of administering a SAM station. Section 4 describes
what work has to be done on the Database side of our project. Section 5
concludes with the anticipated impact on users, the ultimate judges of
our work.
As the Run II data handling system, SAM ought to manage (and optimize
whenever possible) hardware resources in order to provide the most efficient
data access for the users. In order to do so, SAM controls (monitors and
regulates) data access by different physics groups and individuals. We
envision a hierarchy in resource management and data access coordination:
the global resource manager called the optimizer is responsible
for the global resources (primarily related to the ATL, such as tape mount
rate and tape drives); the station coordinates allocation of resources
to projects; and projects coordinate data access and resource
usage by their consumers (which are also end users of the resources).
We assume that the network bandwidth as seen by the station is abundant
(thanks to the efforts of our partner projects) so that we can exclude it
from the station design. We will defer CPU management until the batch system
is fully understood. Thus, the station primarily manages the disk
(and the files cached on it). The goal of this design is to suggest a picture
for efficient and convenient access to disk files through SAM. We will
not concentrate on management of disk bandwidth, i.e., we will largely
ignore disk contention, assuming that the disk requests will be randomized
well enough so as to provide natural balancing of disk I/O across the many
disks.
The station will have to cooperate with non-SAM resource management
systems, primarily with the batch system. SAM will not assume the
role of a batch system job scheduler; rather, it will assist the batch
system in resource co-allocation in the following way. The traditional
batch system usually handles only exclusive-access resources such as CPU,
physical memory, scratch disk, etc. In addition, the batch system may
not properly manage global resources such as those related to the ATL,
which we currently believe will be the most scarce among all the
resources.
SAM will aim at remedying potential deficiencies of the batch system. For example, SAM may help determine a job's priority by comparing the job's SAM resource requirements with resource availability. SAM will treat a disk-cached file as a resource. (It may not be immediately obvious that a cached file is a resource because, unlike better-known resources such as CPU, this resource is sharable. In Computer Science, however, sharable resources are a canonical type of resource.) Clearly, the availability of a job's files in the disk cache greatly affects its expected turnaround and therefore the extent to which it may be desirable to schedule the job sooner.
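As an illustration of how cached-file availability could feed into job priority, here is a minimal sketch. It assumes a very simple heuristic (the fraction of a job's requested files already in cache); the function names `cached_fraction` and `rank_jobs` are hypothetical and are not part of SAM or of any batch system.

```python
def cached_fraction(requested_files, cached_files):
    """Fraction of a job's requested files that are already on disk cache."""
    requested = set(requested_files)
    if not requested:
        return 1.0  # a job that needs no files is never waiting for the cache
    return len(requested & set(cached_files)) / len(requested)


def rank_jobs(jobs, cached_files):
    """Order job names so that jobs with more of their files cached come first.

    `jobs` maps a job name to the list of files it requests (hypothetical layout).
    """
    return sorted(
        jobs,
        key=lambda name: cached_fraction(jobs[name], cached_files),
        reverse=True,
    )
```

A real station would presumably combine such a figure with the other resource requirements of the job before advising the batch system.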
In building the SAM station, we defer most of the resource management issues until (a) the batch system is completely understood and (b) some global optimization has begun. Development of disk management in the station will, however, be the first step towards the station-batch interface. We assume that, given the ability to allocate and schedule disks, the knowledge of the data file sets requested by projects (both running and queued), combined with a user-supplied estimate of CPU per event, will provide the necessary basis to build the station-batch interface.
In summary, the rationale for the station design in the present form,
i.e., largely restricted to disk management, is as follows. We strongly
believe that an intelligent disk (cache) management is (a) a well-defined
task of the station, to be integrated seamlessly into the bigger picture,
(b) a natural step towards efficient overall resource management, rather
than a diversion from the ongoing overall analysis, and (c) a necessity
at the present stage of the SAM evolution as a project.
Specifically in the context of the station design, we will use the
following terms:
A station is said to manage a disk if and only if:
The station will either manage the cache automatically or provide administrative
tools for direct disk manipulation by the human administrator. The former
encompasses what Lee's document refers to as Short/Long Term Caches and
Buffers and is described in the following subsection. The latter
is primarily based on file locking/unlocking, described at the end of this
Section (as well as on the explicit allocate() operation; see the Interfaces
Section).
The primary contribution of the present document is given by the
following discussion. The proposed design differs significantly from earlier
ideas.
The distinction between the Cache and the Buffer is too fine and becomes cumbersome when enforced by the design. In many cases, it is simply not possible to predict whether a file will be reused in the near future or not. Treating a part of the disk as a buffer simply means applying a particular (FIFO) cache replacement algorithm to it. We are not presenting any particular cache replacement algorithm; moreover, we assume that multiple algorithms will be possible (and dynamically set) for various parts of the total disk on the station. Thus, we erase the boundaries between Buffer, Short Term Cache, and Long Term Cache, while understanding that different parts of the station's disk may be configured to effectively be one of these. In effect, we treat all the station's disk as THE CACHE.
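The point that a "buffer" is just a cache area with a FIFO replacement policy can be sketched as follows. This is an illustrative assumption, not the station's design: the class `CacheArea`, its methods, and the two policy names are hypothetical, and only FIFO and LRU are shown.

```python
from collections import OrderedDict


class CacheArea:
    """One part of the station disk with its own replacement policy."""

    def __init__(self, policy="lru"):
        if policy not in ("fifo", "lru"):
            raise ValueError("unknown policy: %s" % policy)
        self.policy = policy          # may be changed later (dynamically set)
        self._files = OrderedDict()   # next file to erase comes first

    def add(self, name):
        """Record a newly delivered file (no effect if already present)."""
        self._files.setdefault(name, True)

    def access(self, name):
        """Record an access; only LRU lets an access postpone erasure."""
        if self.policy == "lru" and name in self._files:
            self._files.move_to_end(name)

    def victim(self):
        """Name of the file that would be erased next, or None if empty."""
        return next(iter(self._files), None)
```

With `policy="fifo"` the area behaves like the Buffer of earlier proposals; with `policy="lru"` the same code behaves like a cache.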
The Station's Cache Manager (CM) is responsible for coordination of
projects requesting files and proper cooperation with the global resource
manager (i.e., the optimizer). (I was in the meetings, so I believe I know
the point you are making in the following paragraph, but it still
wasn't clear to me what is meant. Maybe replace 'directly attached
replenisher' with something like 'the replenisher can be instantiated
either within the process space of the project or, in the canonical
case, in the process space of the station master.') The cache management
algorithm will essentially generalize that of the project master's replenisher:
while the replenisher serves only its (directly attached) project master, the
station's disk manager serves any number of projects, possibly with overlapping
file requests.
For backward compatibility with projects that must (or wish to) run without the station master, the cache manager will implement all the interfaces of the replenisher. Thus, every project master will communicate with the same interface, implemented either as a directly attached replenisher or in the station, with the decision being made at project startup time.
When a project is started, its snapshot files are added to the
"requested file" set in the Cache Manager. The CM then requests
authorization from the optimizer for all the newly requested files
(i.e., those that were not already known to be cached or
requested before this project started). At all times, each file in the
"requested" set is associated with at least one project that expressed
interest in it.
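The bookkeeping at project start can be sketched as below. This is a minimal illustration under assumptions: the class `RequestedFiles` and its attributes are hypothetical, and the call to the optimizer is replaced by simply returning the list of files that would need authorization.

```python
class RequestedFiles:
    """Tracks which projects have expressed interest in which files."""

    def __init__(self, cached=()):
        self.cached = set(cached)     # files already on the station's disk
        self.projects_by_file = {}    # file -> set of interested projects

    def start_project(self, project, snapshot_files):
        """Register a project's snapshot.

        Returns the newly requested files, i.e. those neither cached nor
        requested before this project started; only these would be sent
        to the optimizer for authorization.
        """
        newly_requested = []
        for name in snapshot_files:
            known = name in self.cached or name in self.projects_by_file
            self.projects_by_file.setdefault(name, set()).add(project)
            if not known:
                newly_requested.append(name)
        return newly_requested
```

Every file in `projects_by_file` always has at least one associated project, which mirrors the invariant stated above.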
When the authorization for a file arrives, the file is added to the
"can go" file list. This is the list of files, hopefully grouped by
volume (if the optimizer has done a good job), whose HSM->disk retrieval
can begin as soon as there is enough cache space. (Is
the optimizer the only one responsible for grouping? What about 1)
The optimizer approves three files for transfer from volume 1 and a
couple of minutes later another request for the same volume is received?
Should the optimizer be limber enough to pass a new four-file group to
the project? 2) The optimizer receives two requests for the same
volume, but for different stations. Should it give hints to the
stations that those two transfers should be launched aggressively, thus
allowing for 'coincidental' optimization should both stations be able
to handle the file now?) Specifically, if the disk
requirements for the next delivery group (see below) can be met by
erasing some of the disposable files (called "can free"
in the replenisher), CM instructs the stager(s) to erase the
disposable files and initiate the deliveries for the group. A delivery
group is a sublist of the "can go" list that is a unit of ENCP work;
naturally, it is a set of files from one physical volume (tape). If
tape mounts are the scarcest resource, a group includes
all the files from the tape that are needed by all the known
projects. If disk space becomes limited as well, the group size may
decrease to a single file (as in the initial implementation of the
replenisher).
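The selection of a delivery group described above can be sketched as follows. The sketch assumes the tape-mount-limited case, where a group is all "can go" files from one volume; it does not implement shrinking the group down to a single file. The function name and all parameters are hypothetical.

```python
from collections import defaultdict


def next_delivery_group(can_go, volume_of, size_of, free_space, disposable):
    """Pick one volume's files and the disposable files to erase for them.

    Returns (group, to_erase), or (None, []) if no group fits yet.
    `volume_of` and `size_of` map file names to tape volume and size.
    """
    by_volume = defaultdict(list)
    for name in can_go:
        by_volume[volume_of[name]].append(name)

    reclaimable = sum(size_of[name] for name in disposable)
    for group in by_volume.values():
        needed = sum(size_of[name] for name in group)
        if needed > free_space + reclaimable:
            continue  # even erasing every disposable file would not suffice
        to_erase, space = [], free_space
        for name in disposable:   # erase only as many files as necessary
            if space >= needed:
                break
            to_erase.append(name)
            space += size_of[name]
        return group, to_erase
    return None, []
```

Here the disposable files are erased in list order; which files to erase first is a policy decision left open by this document.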
When a stager notifies the CM of a successful file retrieval completion, the file becomes a "cached file" and is served to the projects associated with the file. The newly cached file is marked as being in use. Its new location is added to the database. Each project then serves the file to its consumers in the usual way; when all the consumers are done, the project releases the file by calling the CM. It is important for the CM to be able to limit the time a project takes to process a file, much like projects themselves have time limits for their consumers to process a file. (Is this true, or is it important to be able to know when the file is no longer being used, with a timeout being one last-ditch method to determine this? One can imagine other mechanisms to determine this that would allow for better information to be obtained earlier.)
Finally, when all the projects release a file, the file is added to the "disposable" list (see above) and the CM reviews its chances to deliver the next group, at which point the file may be erased (What if another station has already been authorized to retrieve that same file? It would be a shame to have station A flush the cache just moments before station B starts its RCP). Exactly which disposable files are selected to be erased is irrelevant for this document; what is important is that the CM possesses enough information about file accesses (both past and near future) in order to execute some intelligent generalization of LRU or another cache algorithm (see the section on persistent variables). When a file is erased from disk, its associated location is erased from the database.
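The transition from "in use" to "disposable" can be sketched as below. This is only an illustration: the class `FileUsage` and its methods are hypothetical, and database updates, time limits, and the choice of which disposable file to erase are deliberately left out.

```python
class FileUsage:
    """Tracks which projects still hold each cached file."""

    def __init__(self):
        self.holders = {}      # cached file -> set of projects using it
        self.disposable = []   # files released by every project, oldest first

    def file_cached(self, name, projects):
        """Called when a stager reports a successful retrieval."""
        self.holders[name] = set(projects)   # the file is now in use

    def release(self, name, project):
        """Called by a project when all its consumers are done with the file.

        Returns True if the file has just become (or already was) disposable.
        """
        users = self.holders[name]
        users.discard(project)
        if not users and name not in self.disposable:
            self.disposable.append(name)
        return not users
```

A file on the `disposable` list is only a candidate for erasure; it stays on disk until the space is actually needed for a delivery group.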
The station Cache Manager is required to support the
notion of a locked (AKA pinned) file, i.e., a file that
has been marked as "unerasable" until further notice. We will assume
that any cached file (whether in use or disposable) may be locked on
disk by a user with sufficient privileges. Clearly, uncontrolled use
of this facility will incapacitate the CM by eventually locking all
the files, thus leaving effectively no free space on disk and
precluding any intelligent cache algorithm from execution. Therefore,
the locking of files is primarily intended for specific kinds of data
(such as Thumbnail or calibration) and for use by group administrators
only. (I haven't really thought this through
fully yet, but I wonder if we are missing an abstraction here. Pinned
files are quite different from 'cache' in all senses of the two words.
It might be better to implement them as such instead of trying to mash
them together.)
Locked files (and their occupied space) are effectively excluded from the disk management algorithms above. It is critical, however, that similarly to any other disk files, locked files are subject to full access history monitoring. This access history will be provided to the administrators for their viewing pleasure (well, actually to facilitate decisions to change the contents of the locked area).
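The two properties of locked files stated above (excluded from erasure, yet fully monitored) can be sketched as follows. The class `AccessLog`, its methods, and the idea of reporting locked files that were never accessed are illustrative assumptions, not part of the proposed interfaces.

```python
import time


class AccessLog:
    """Records accesses for all files and knows which files are locked."""

    def __init__(self):
        self.locked = set()
        self.history = {}   # file -> list of access timestamps

    def lock(self, name):
        self.locked.add(name)

    def unlock(self, name):
        self.locked.discard(name)

    def record_access(self, name, when=None):
        """Log an access; locked and unlocked files are treated alike."""
        stamp = time.time() if when is None else when
        self.history.setdefault(name, []).append(stamp)

    def erasable(self, disposable):
        """Disposable files the cache algorithm is allowed to erase."""
        return [name for name in disposable if name not in self.locked]

    def idle_locked(self):
        """Locked files with no recorded access, for administrators to review."""
        return sorted(n for n in self.locked if not self.history.get(n))
```

A report such as `idle_locked()` is one way the access history could help administrators decide what to remove from the locked area.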
Station configuration is the set of parameters to be controlled
by system and group administrators. The number of parameters should
be neither too small (lest administrators think that SAM is too simplistic
or that they don't have enough control) nor too large (lest administrators
get too confused). These parameters fall into approximately three categories:
Example activities of administrators changing these parameters include:
In this section, we present the required DB support for the proposed design. It is not the purpose of this document to decide the exact table organization in the database; we rely on the great expertise of other project developers to do so. Instead, we intend to define which variables must be made persistent.
The quasi-permanent configuration-related variables are based on the following entities and relationships:
The DB server interfaces should be such that they allow storage and
retrieval of the above station variables. In addition, interfaces to record
significant events, which already include project begin/end, should be
extended so as to incorporate file delivery/erasure.
In this section we attempt to predict the change in "look and feel"
of SAM, i.e., give the flavor of new commands and outline benefits for
the end users (aside from performance increase due to extensive caching
of files). With the introduction of the SAM station, and from that time
on, a clear distinction will be made between administrators and end users.
Almost all of the new commands/tools will be for use by administrators
for configuring and restarting the station.
Typical command lines for configuration will look like:
sam add disk --disk=/sam/cache1 --size=1000000 --station=d0mino
sam increase allocation --group=mcc99 --disk=/sam/cache1 --size=200000
(Is 'sam set allocation' more natural?)
A typical administrative command to lock a file on disk:
sam lock --file=sim.pmc02_01.pythia.zhbbmet_mb1.1av_200evts.292_1753
(This command may involve physically moving the file.)
As for the end users, the major benefit will be in relieving them from explicit buffer allocation/cleanup for their projects. The sam start project command (or its successor) will be a request to the station, rather than an action of physically starting the project master; therefore, the command may fail if the station rejects the job. Furthermore, as we work towards the integration with the batch system, we will more frequently speak of a user job and less frequently of a project. A single consumer project is a part of the user job which essentially entails (1) starting of a project, (2) running of an analysis program, and (3) stopping a project. Our tendency is toward a single command such as one of the following:
sam run XXX.py <params>
sam submit XXX.py <params>
Users will have to deal with SAM-imposed resource restrictions, such
as disk/ATL usage. We are excited to see how we can, by (seemingly) creating
problems for every particular individual, brighten the life of the Collaboration
as a whole!
=============================================================================
Project : SAM
Package : sam_doc
$Id: rich-station.html,v 1.1 1999/11/17 00:16:24 terekhov Exp $
This work is part of a development project, called SAM, which consists of a
number of coordinated packages each named sam_xxxx.
Notice of authorship, copyright status, and terms and conditions, should
the software eventually become available for use outside Fermilab, can be
found in the README and LICENCE files in the top level directory of the main
sam package.
==============================================================================