In the remainder of the introduction, we elaborate on our design goals
and then present our rationale based on the analysis of the resource management
issues. We will also clarify some of the terms we use most frequently.
In Section 2, we present the actual disk management ideas. In Section 3,
we address the issues of administering a SAM station. Section 4 describes
what work has to be done on the Database side of our project. Section 5
concludes with the anticipated impact on users, the ultimate judges of
our work.
As the Run II data handling system, SAM ought to manage (and optimize
whenever possible) hardware resources in order to provide the most efficient
data access for the users. In order to do so, SAM controls (monitors and
regulates) data access by different physics groups and individuals. We
envision a hierarchy in resource management and data access coordination:
the global resource manager called the optimizer is responsible
for the global resources (primarily related to the ATL, such as tape mount
rate and tape drives); the station coordinates allocation of resources
to projects; and projects coordinate data access and resource
usage by their consumers (which are also end users of the resources).
We assume that the network bandwidth as seen by the station is abundant
(thanks to the efforts of our partner projects) so that we can exclude it
from the station design. We will defer CPU management until the batch system
is fully understood. Thus, the station primarily manages the disk
(and the files cached on it). The goal of this design is to suggest a picture
for efficient and convenient access to disk files through SAM. We will
not concentrate on management of disk bandwidth, i.e., we will largely
ignore disk contention, assuming that the disk requests will be randomized
well enough so as to provide natural balancing of disk I/O across the many
disks.
Note: it is hard to define "efficient" until we fully understand
the meaning of the throughput that we aim to maximize. It is almost
certain however that the Station's Cache Manager will try to minimize
the number of file transfers to/from HSM because these are expensive
in any reasonably defined cost metric.
The station will have to cooperate with non-SAM resource management
systems, primarily with the batch system. SAM will not assume the
role of a batch system job scheduler; rather, it will assist the batch
system in resource co-allocation in the following way. The traditional
batch system usually handles only exclusive-access resources such as CPU,
physical memory, scratch disk, etc. In addition, the batch system may
not properly manage global resources such as those related to the ATL,
which we currently believe will be the most scarce among all the
resources.
SAM will aim at remedying potential deficiencies of the batch system. For example, SAM may help determine a job's priority by comparing the job's SAM resource requirements with resource availability. SAM will treat a disk-cached file as a resource. (It may not be immediately obvious that a cached file is a resource because, unlike better-known resources such as CPU, this resource is sharable. In computer science, however, sharable resources are a canonical resource type.) Clearly, the availability of a job's files in the disk cache greatly affects its expected turnaround and therefore the extent to which it may be desirable to schedule the job sooner.
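As a rough illustration of how cached files could feed into job priority, the sketch below ranks jobs by the fraction of their requested files already on disk. The function names and the ranking rule are assumptions made for this example only; the design does not prescribe a particular priority formula.

```python
def cached_fraction(job_files, cached_files):
    """Return the fraction of a job's requested files already in the disk cache."""
    requested = set(job_files)
    if not requested:
        return 1.0  # nothing to fetch, so nothing to wait for
    return len(requested & set(cached_files)) / len(requested)


def rank_jobs(jobs, cached_files):
    """Order job names so that jobs with more of their files cached come first.

    `jobs` maps a (hypothetical) job name to the list of files it requests.
    """
    return sorted(
        jobs,
        key=lambda name: cached_fraction(jobs[name], cached_files),
        reverse=True,
    )
```

A real station would presumably combine such a hint with ATL-related costs (tape mounts, drives) rather than use it alone.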
In building the SAM station, we defer most of the resource management issues until (a) the batch system is completely understood and (b) some global optimization is begun. Development of disk management in the station will, however, be the first step towards the station-batch interface. We assume that, given the ability to allocate and schedule disks, the knowledge of the data file sets requested by projects (both running and queued), combined with a user-supplied CPU-per-event estimate, will provide the necessary basis to build the station-batch interface.
In summary, the rationale for the station design in the present form,
i.e., largely restricted to disk management, is as follows. We strongly
believe that intelligent disk (cache) management is (a) a well-defined
task of the station, to be integrated seamlessly into the bigger picture,
(b) a natural step towards efficient overall resource management, rather
than a diversion from the ongoing overall analysis, and (c) a necessity
at the present stage of the SAM evolution as a project.
Specifically in the context of the station design, we will use the
following terms:
A Station is said to manage a disk if and only if:
The station will either manage the cache automatically or provide administrative
tools for direct disk manipulation by the human administrator. The former
encompasses what Lee's document refers to as Short/Long Term Caches and
Buffers and is described in the following subsection. The latter
is primarily based on file locking/unlocking, described at the end of
this section, as well as on the explicit allocate() operation (also
described at the end of the section).
The primary contribution of the present document is given by the
following discussion. The proposed design differs significantly from earlier
ideas.
The distinction between the Cache and the Buffer is too fine and becomes cumbersome when enforced by the design. In many cases, it is simply not possible to predict whether a file will be reused in the near future or not. Treating a part of the disk as a buffer simply means choosing a particular (FIFO) cache replacement algorithm. We are not presenting any particular cache replacement algorithm; moreover, we assume that multiple algorithms will be possible (and dynamically set) for various parts of the total disk on the station. We therefore erase the boundaries between Buffer, Short Term Cache, and Long Term Cache, while understanding that different parts of the station may be configured to act effectively as one of them. In short, we treat all of the station's disk as THE CACHE.
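The point that a "buffer" is just a cache with FIFO replacement can be made concrete with a small sketch. The class below is hypothetical (the name DiskArea, the policy strings, and the count-based capacity are assumptions for illustration); it shows two parts of the disk differing only in their configured policy.

```python
from collections import OrderedDict


class DiskArea:
    """One part of the station disk with a configurable replacement policy."""

    def __init__(self, capacity, policy="lru"):
        if policy not in ("fifo", "lru"):
            raise ValueError("unknown policy: " + policy)
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity    # maximum number of files held (simplification)
        self.policy = policy
        self.files = OrderedDict()  # next eviction candidate is first

    def access(self, name):
        """Record a read of a file; return True if the file was on disk."""
        if name not in self.files:
            return False
        if self.policy == "lru":
            self.files.move_to_end(name)  # reuse postpones eviction
        return True

    def add(self, name):
        """Store a file, evicting from the front if the area is full.

        Returns the list of evicted file names.
        """
        evicted = []
        if name in self.files:
            return evicted
        while len(self.files) >= self.capacity:
            victim, _ = self.files.popitem(last=False)
            evicted.append(victim)
        self.files[name] = True
        return evicted
```

With policy "fifo" a reread of a file does not protect it, which is buffer-like behavior; with "lru" the same area behaves as a cache.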
The Station's Cache Manager (CM) is responsible for coordination of projects requesting files and proper cooperation with the global resource manager (i.e., the optimizer). The cache management algorithm will essentially generalize that in the project master's replenisher: while the replenisher serves only its (directly attached) project master, the station's disk manager serves any number of projects, possibly with overlapping file requests. In other words, the replenisher can be instantiated either within the process space of the project or, in the canonical case, in the process space of the station master.
For backward compatibility with projects that must (or wish to) run without the station master, the cache manager will implement all the interfaces of the replenisher. Thus, every project master will communicate with the same interface implemented either as directly attached replenisher or in the station, with the decision being made at project startup time.
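A minimal sketch of this arrangement follows, assuming a two-method interface. The names FileSource, request_files, release_file, and make_file_source are invented for the example; the actual replenisher interface is not specified in this document.

```python
from abc import ABC, abstractmethod


class FileSource(ABC):
    """Interface seen by every project master (method names are assumptions)."""

    @abstractmethod
    def request_files(self, project, files):
        """Register the files a project wants delivered."""

    @abstractmethod
    def release_file(self, project, name):
        """Tell the source that a project is done with a file."""


class AttachedReplenisher(FileSource):
    """Runs inside the project's process and serves exactly one project."""

    def __init__(self, project):
        self.project = project
        self.requested = set()

    def request_files(self, project, files):
        if project != self.project:
            raise ValueError("attached replenisher serves one project only")
        self.requested.update(files)

    def release_file(self, project, name):
        self.requested.discard(name)


class StationCacheManager(FileSource):
    """Runs inside the station master and serves any number of projects."""

    def __init__(self):
        self.interested = {}  # file name -> set of interested projects

    def request_files(self, project, files):
        for name in files:
            self.interested.setdefault(name, set()).add(project)

    def release_file(self, project, name):
        holders = self.interested.get(name, set())
        holders.discard(project)
        if not holders:
            self.interested.pop(name, None)


def make_file_source(project, station=None):
    """Choose the implementation at project startup time."""
    return station if station is not None else AttachedReplenisher(project)
```

The project master code is the same in both cases; only the object returned at startup differs.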
When a project is started, its snapshot files are added to the "requested file" set in the Cache Manager. The CM then requests authorization from the optimizer for all the newly requested files, i.e., those that weren't already known before this project started. (The CM "knows" a file if it is already cached or requested to be cached.) At all times, each file in the "requested" set is associated with at least one project that expressed interest in it.
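The bookkeeping described above can be sketched as follows. The class and attribute names are hypothetical, and the call to the optimizer is reduced to returning the list of files that would need authorization.

```python
class RequestedFiles:
    """Tracks which projects asked for which not-yet-cached files."""

    def __init__(self, cached=()):
        self.cached = set(cached)   # files already on disk
        self.projects_by_file = {}  # requested file -> interested projects

    def knows(self, name):
        """A file is known if it is cached or already requested."""
        return name in self.cached or name in self.projects_by_file

    def start_project(self, project, snapshot_files):
        """Register a project's snapshot.

        Returns the newly requested files, i.e. those for which the CM
        would ask the optimizer for authorization.
        """
        new_files = []
        for name in snapshot_files:
            if name in self.cached:
                continue  # already on disk, no retrieval needed
            if name not in self.projects_by_file:
                self.projects_by_file[name] = set()
                new_files.append(name)
            self.projects_by_file[name].add(project)
        return new_files
```

Because an entry is created only together with its first interested project, every file in the requested set always has at least one associated project.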
When the authorization for a file arrives, the file is added to the "can go" file list. This is the list of files, hopefully grouped by volume (if the optimizer has done a good job), whose HSM->disk retrieval can begin as soon as there is enough cache space. Specifically, if the disk requirements for the next delivery group (see below) can be met by erasing some of the disposable files (called "can free" in the replenisher), the CM instructs the stager(s) to erase the disposable files and initiate the deliveries for the group. A delivery group is a sublist of the "can go" list that is a unit of ENCP work; naturally, it is a set of files from one physical volume (tape). If tape mounts are the most scarce resource, a group includes all the files from the tape that are needed by all the known projects. If disk space becomes limited as well, the group size may decrease to a single file (as in the initial implementation of the replenisher).
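The two steps above, forming delivery groups and checking whether a group fits after erasing disposable files, might look roughly like this. The function names, the dictionaries volume_of and size_of, and the take-in-list-order choice of files to erase are all assumptions for illustration; the actual selection policy is left open by the design.

```python
from collections import defaultdict


def delivery_groups(can_go, volume_of):
    """Group authorized files by tape volume, preserving "can go" order."""
    groups = defaultdict(list)
    for name in can_go:
        groups[volume_of[name]].append(name)
    return dict(groups)


def files_to_erase(group, size_of, free_space, disposable):
    """Pick disposable files to erase so that the whole group fits.

    Returns a (possibly empty) list of files to erase, or None if the
    group cannot fit even after erasing every disposable file.
    """
    needed = sum(size_of[name] for name in group) - free_space
    chosen = []
    for name in disposable:
        if needed <= 0:
            break
        chosen.append(name)
        needed -= size_of[name]
    return chosen if needed <= 0 else None
```

When files_to_erase returns None, a station short on disk could retry with a smaller group, down to a single file.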
When a stager notifies the CM of a successful file retrieval completion, the file becomes a "cached file" and is served to the projects associated with the file. The newly cached file is marked as being in use. Its new location is added to the database. Each project then serves the file to its consumers in the usual way; when all the consumers are done, the project releases the file by calling the CM. It is important for the CM to be able to limit the time a project takes to process a file, much like projects themselves have time limits for their consumers to process a file.
Finally, when all the projects release a file, the file is added to the "disposable" list (see above) and the CM reviews its chances to deliver the next group, at which point the file may be erased. Exactly which disposable files are selected to be erased is irrelevant for this document; what is important is that the CM possesses enough information about file accesses (both past and near future) in order to execute some intelligent generalization of LRU or another cache algorithm (see the section on persistent variables). When a file is erased from disk, its associated location is erased from the database.
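The life of a cached file, from delivery through release by every project to becoming disposable, can be summarized in a small sketch. The class name, the state strings, and the overdue() check with a time limit in seconds are assumptions for this example.

```python
class CachedFile:
    """State of one file after its retrieval from HSM has completed."""

    def __init__(self, name, projects, delivered_at):
        self.name = name
        self.in_use_by = set(projects)    # projects still processing the file
        self.delivered_at = delivered_at  # delivery time, in seconds
        self.state = "in use" if self.in_use_by else "disposable"

    def release(self, project):
        """A project reports that all of its consumers are done."""
        self.in_use_by.discard(project)
        if not self.in_use_by:
            self.state = "disposable"  # now a candidate for erasure
        return self.state

    def overdue(self, now, limit):
        """True if some project still holds the file past the time limit."""
        return bool(self.in_use_by) and (now - self.delivered_at) > limit
```

A disposable file stays on disk and can still satisfy later requests; it is erased only when the CM needs space for a delivery group.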
If we want multiple stations to access each other's caches, the decision by CM on when to erase a file may become quite complicated. We assume that the global resource manager will coordinate inter-station file exchange; for now, we can either (1) disallow a station accessing a file from another station, or (2) allow remote cache access but then be prepared for the possibility of delivery errors and ignore them.
The station Cache Manager is required to support the
notion of a locked (AKA pinned) file, i.e., a file that has
been marked as "unerasable" until further notice. We will assume that any
cached file (whether in use or disposable) may be locked on disk by a user
with sufficient privileges. Clearly, uncontrolled use of this facility
will incapacitate the CM by eventually locking all the files, thus leaving
effectively no free space on disk and precluding any intelligent cache
algorithm from execution. Therefore, the locking of files is primarily
intended for specific kinds of data (such as Thumbnail or calibration)
and for use by group administrators only.
Locked files (and their occupied space) are effectively excluded from
the disk management algorithms above. It is critical, however, that,
like any other disk files, locked files are subject to full access history
monitoring. This access history will be provided to the administrators
for their viewing pleasure (well, actually to facilitate decisions to change
the contents of the locked area).
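A minimal sketch of these two properties, exclusion from erasure and continued access monitoring, is given below. The class LockTable, its method names, and the boolean privilege flag are hypothetical and stand in for whatever authorization scheme the station adopts.

```python
from collections import Counter


class LockTable:
    """Tracks locked (pinned) files and the access history of all files."""

    def __init__(self):
        self.locked = set()
        self.access_count = Counter()

    def lock(self, name, is_group_admin):
        """Pin a file on disk; only privileged users may do so."""
        if not is_group_admin:
            raise PermissionError("only group administrators may lock files")
        self.locked.add(name)

    def unlock(self, name):
        self.locked.discard(name)

    def record_access(self, name):
        """Access history is kept for locked and unlocked files alike."""
        self.access_count[name] += 1

    def erasable(self, disposable):
        """Filter the disposable list down to files that may be erased."""
        return [name for name in disposable if name not in self.locked]
```

An administrator could inspect access_count for locked files to decide which of them no longer deserve to stay pinned.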
It is important that SAM be responsible for controlling
the output buffer allocation. Although we will most likely choose to set
aside an output buffer area, ultimately we must treat both input and output
areas as parts of THE DISK for the following reasons. First, we must ensure
a proper rate of output buffer flushing and reasonable availability of output
buffer area for user jobs as it affects the overall progress of projects
(and we are concerned with the consumption rate in this design). Second,
an "output" file may become an "input" file soon enough that the distinction
between input area and output area becomes quite artificial. We therefore
envision the aforementioned allocate() method in the
station interface. (In the SAM stub in the analysis framework we already
have the place to call it; the framework will do so right before opening
an output file.)
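The text names allocate() but does not give its signature, so the sketch below assumes one (a file name and a size) purely for illustration. It shows the intended calling pattern: space is reserved before an output file is opened and given back once the file has been flushed.

```python
class OutputSpace:
    """Tracks space reserved for output files (hypothetical helper)."""

    def __init__(self, total):
        self.total = total
        self.allocated = {}  # output file name -> reserved size

    def allocate(self, name, size):
        """Reserve space for an output file; return True on success.

        The arguments are assumptions; the design only names the method.
        """
        used = sum(self.allocated.values())
        if name in self.allocated or used + size > self.total:
            return False
        self.allocated[name] = size
        return True

    def flushed(self, name):
        """Release the reservation once the file has been flushed.

        A fuller design might instead keep the file as a cached input file.
        """
        self.allocated.pop(name, None)
```

In the analysis framework, the SAM stub would call allocate() right before opening the output file and refuse to open it if the call fails.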
Station configuration is the set of parameters to be controlled
by system and group administrators. The number of parameters should
be neither too small (lest administrators think that SAM is too simplistic
or that they don't have enough control) nor too large (lest administrators
get too confused). These parameters fall into approximately three categories:
Example activities of administrators changing these parameters include:
In this section, we present the required DB support for the proposed design. It is not the purpose of this document to decide the exact table organization in the database; other project developers possess great expertise to do so. Instead, we intend to define what variables must be made persistent.
The quasi-permanent configuration-related variables are based on the following entities and relationships:
The DB server interfaces should be such that they allow storage and
retrieval of the above station variables. In addition, interfaces to record
significant events, which already include project begin/end, should be
extended so as to incorporate file delivery/erasure.
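As an illustration of the extended event interface, the sketch below records the four kinds of events mentioned above in memory. The names EventKind, Event, and EventLog are invented for the example, and a real implementation would write to the database rather than to a list.

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    PROJECT_BEGIN = "project begin"
    PROJECT_END = "project end"
    FILE_DELIVERY = "file delivery"
    FILE_ERASURE = "file erasure"


@dataclass(frozen=True)
class Event:
    kind: EventKind
    subject: str      # project or file name
    timestamp: float  # seconds since some agreed epoch


class EventLog:
    """In-memory stand-in for the DB server's event interface."""

    def __init__(self):
        self._events = []

    def record(self, kind, subject, timestamp):
        event = Event(kind, subject, timestamp)
        self._events.append(event)
        return event

    def history(self, subject):
        """Return all events for one project or file, in recording order."""
        return [e for e in self._events if e.subject == subject]
```

The per-file history produced this way is the kind of access information a cache replacement algorithm or an administrator could consult.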
In this section we attempt to predict the change in "look and feel"
of SAM, i.e., give the flavor of new commands and outline benefits for
the end users (aside from performance increase due to extensive caching
of files). With the introduction of the SAM station, and from that time
on, a clear distinction will be made between administrators and end users.
Almost all of the new commands/tools will be for use by administrators
for configuring and restarting the station.
Typical command lines for configuration will feel like:
sam add disk --disk=/sam/cache1 --size=1000000 --station=central-analysis
sam increase/set allocation --group=mcc99 --disk=/sam/cache1 --size=200000
Typical administrative command to lock a file on disk:
sam lock --file=sim.pmc02_01.pythia.zhbbmet_mb1.1av_200evts.292_1753
(This command may involve physical moving of the file.)
As for the end users, the major benefit will be in relieving them from explicit buffer allocation/cleanup for their projects. The sam start project command (or its successor) will be a request to the station, rather than an action of physically starting the project master; therefore, the command may fail if the station rejects the job. Furthermore, as we work towards the integration with the batch system, we will more frequently speak of a user job and less frequently of a project. A single consumer project is a part of the user job which essentially entails (1) starting of a project, (2) running of an analysis program, and (3) stopping a project. Our tendency is toward a single command such as one of the following:
sam run XXX.py <params>
sam submit XXX.py <params>
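The three steps that such a single command would wrap can be sketched as below. The function run_user_job and its three callables are hypothetical placeholders; they do not correspond to existing SAM commands or APIs.

```python
def run_user_job(start_project, run_analysis, stop_project):
    """Run one user job: start a project, run the analysis, stop the project.

    The three callables are supplied by the caller and are placeholders for
    whatever the station and the analysis framework actually provide.
    """
    project = start_project()  # may raise if the station rejects the job
    try:
        return run_analysis(project)
    finally:
        stop_project(project)  # always release station resources
```

If the station rejects the request, start_project raises and nothing else runs; if the analysis program fails, the project is still stopped.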
Users will have to deal with SAM-imposed resource restrictions, such
as disk/ATL usage. We are excited to see how we can, by (seemingly) creating
problems for every particular individual, enlighten the life of the Collaboration
as a whole!
=============================================================================
Project : SAM
Package : sam_doc
$Id: station.html,v 1.10 2000/03/24 19:58:27 vranicar Exp $
This work is part of a development project, called SAM, which consists of a
number of coordinated packages each named sam_xxxx.
Notice of authorship, copyright status, and terms and conditions, should
the software eventually become available for use outside Fermilab, can be
found in the README and LICENCE files in the top level directory of the main
sam package.
==============================================================================