From lueking@d0sgifm03 Wed Oct 27 21:27:46 1999
Date: Tue, 14 Sep 1999 11:48:07 -0500 (CDT)
From: Lee Lueking 630.840.8236
Reply-To: lueking@fnal.gov
To: terekhov@fnal.gov
Subject: station design notes

I. Introduction

Our goal is to manage, and optimize when possible, the use of the following resources:

1. Network: network throughput.
2. Tape and robot: robot arm and tower, tape mounts, and tape file seek.
3. Disk: buffer for input and output, and cache for various purposes.
4. Other resources: cpu, memory, et cetera.

Item 2 is a global resource shared by all stations. Items 3 and 4 are resources specific to a particular station. Item 1 can be either local or global, as it is the connection between the stations and the global services.

Management and priorities need to be established at three levels:

1. Station: functional set of hardware, e.g. on-line, farm, CAS.
2. Working group: physics or other group of people, e.g. Top, Higgs, calibration, particle ID.
3. Individual person: specific individual.

Station is the first level, and some stations will have absolute priority, for example the on-line system. Below the station level, working group and individual priorities will need to be adjusted. We imagine there will be fewer than 10 groups and each will be given equal priority. Individuals will be part of one or more groups and will be parceled out group resources by the group head, or resource manager. There will likely be a need for umbrella groups under which freight train projects, with many cooperating consumers, can work with adequate resources.

We anticipate that the priorities will be set by a set of policies, which are enforced by adjusting a set of parameters controlling the various resources. These policies should be dynamic and adjust to differing work loads on the system at various times, such as day, night, weekday, weekend.
Also, the policies should allow the service levels to degrade for certain users when the system detects a disproportionate share of resources being used for them.

II. Network

It is believed that the network throughput should not be a bottleneck, since the network is over-designed by a large factor. Multiple network pipes on CAS will be effectively utilized by the enstore software (specifically encp) with a round robin algorithm, which works best if there are no other applications sharing the pipes with enstore. We believe that we can easily solve any potential bottlenecks in network throughput by buying additional, relatively inexpensive, network hardware. We plan to provide at least 150 MBps for CAS.

III. Robot and Tape

The central tape plant will be the resource which is shared by all stations. It may be possible to increase the number of drives in the central storage, which is inexpensive, but we feel that the robot (which includes 2 arms and possible tower contention) will be the critical resource and must be managed carefully.

IV. Disk

One of the key functions of the station must be management of the individual station disks. This management is quite different depending on whether the station has distributed storage (not shared among boxes), like a farm, or a multi-processor box with common storage (shared disks), like the CAS. The farm situation is easily understood, and therefore for the station design we concentrate our attention on the CAS configuration.

There are several considerations (parameters) which determine the needs and performance of the disks.

Nwidth - Stripe width; the number of disks in a stripe set.
Nldisk - Number of logical disk stripe sets.
Nread  - Number of readers; the number of concurrent applications reading the disk.
Nwrite - Number of writers; the number of concurrent applications writing to the disk.
Rread  - Rate of reading; MBps for each reader.
Rwrite - Rate of writing; MBps for each writer.
Sdisk  - Size of each disk; size in GB.
Sfile  - Size of largest file; size in GB.

We imagine that different types of usage will have different needs. One can imagine several categories of disk usage which might be under the control of the SAM station:

1. Input buffer - A buffered file is used once and removed to make space for the next file.
   Recovery policy: re-stage if an un-released file is lost.

2. Output buffer - Space for output from a producer.
   Recovery policy: none.

3. Short term cache (STC) - Buffer space which is converted to cache by keeping data around long enough that it is used more than once. This can occur if many users request the same data, or one consumer uses the same file multiple times. There are three possible ways of disposing of this data:
   a. delete when "released" and buffer space is needed.
   b. save due to multiple hits over a short period of time.
   c. move to long term cache based on multiple hit count.
   Recovery policy: none, because these files will automatically be recovered when requested.

4. Long term cache (LTC) - Data which, based on usage patterns or group requests, is kept on disk over longer periods of time. Each group has an allocated amount of LTC. Data is removed from cache using a policy based on LRU, or another policy requested by the group.
   Recovery policy: auto recovery at low priority.

5. Pinned cache - Data which is maintained on disk at all times, as determined by all groups. Specifically, this refers to Thumbnail and possibly some calibration data.
   Recovery policy: auto recovery at low priority.

The "recovery policy" indicates the procedure when data is lost, say because a disk breaks.

The specifications for these disk areas can be different. One possibility we discussed for the buffer area is to use a large number of unstriped disks. This works well for many consumers reading the same pool of files, with a few assumptions. Following are parameters for each station.
Nconsumer - Total number of consumers for all projects.
Nproject  - Number of projects.
Ncpu      - Number of cpus.
Tcread    - Time for a consumer to read a typical file.
Tcwrite   - Time for a consumer to write a file (assume Tcread = Tcwrite).
Tstore    - Time to store a typical output file to enstore.
Tewrite   - Time for enstore to write a file to buffer.

If we assume that Tcread >> Tewrite, and Nconsumer <= Nldisk, then the project manager(s) could direct readers to disks either randomly or in some round robin style, in such a manner that readers would not contend with each other for Rread bandwidth. In fact, we can further assume that the number of concurrent consumer applications must be less than Ncpu, which could put a constraint on the number of disks needed for buffer. However, if jobs are heavy on input (I) then Tcread might be smaller than Tewrite, so the overhead of Tewrite becomes larger and it becomes more likely that a disk must support a simultaneous reader and writer. There is also a slight space premium due to edge effects of fitting Nfiles x Sfile onto Sdisk with no striping or logical volume management. We estimate the space wasted to be less than 5% for Sfile = 2 GB and Sdisk = 18 GB.

Once a file is moved into short term cache, assuming the disk is shared with a buffer area, it becomes more likely that this disk will be needed by multiple readers.

Disk used for long term cache will probably not have a large number of concurrent readers, and the usage patterns will be unpredictable. It may be a more effective use of the storage to use small stripe sets (say Nwidth = 2 or 4) or logically connected volumes. (This is not determined yet.)

Pinned cache will eventually consist of the largest disk area under SAM. Our assumption is that many (10-20) simultaneous consumers will be traversing this data constantly. Typically, users would like to go from beginning to end, and it does make analysis at this level easier.
Presumably, if the data were laid out on many unstriped (Nwidth = 1) disks, the project manager could shepherd each reader through the data so that it would be out of step with other readers of the same disk. However, these jobs tend to be extremely Rread intensive and striping is probably warranted. This will need some empirical research using ROOT (assuming we use ROOT for thumbnails).

V. CPU, memory and Batch system

(yet to be discussed)
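
Note on section IV: the round robin placement of readers onto unstriped buffer disks could be sketched as below. This is only an illustration of the idea, under the stated assumption Nconsumer <= Nldisk. The function names (assign_readers, readers_contend) and the consumer labels are hypothetical and are not part of SAM or enstore.

```python
from itertools import cycle


def assign_readers(consumers, n_ldisk):
    """Assign each consumer to a logical disk in round robin order.

    Returns a dict mapping consumer label -> disk index (0-based).
    Hypothetical helper: how a real project manager would pick disks
    is not specified in these notes.
    """
    if n_ldisk < 1:
        raise ValueError("need at least one logical disk")
    disks = cycle(range(n_ldisk))
    return {consumer: next(disks) for consumer in consumers}


def readers_contend(assignment):
    """Return True if any disk has been given more than one reader."""
    used = list(assignment.values())
    return len(used) != len(set(used))


if __name__ == "__main__":
    # Nconsumer = 3, Nldisk = 4: every reader gets a disk to itself.
    plan = assign_readers(["c1", "c2", "c3"], n_ldisk=4)
    print(plan, "contention:", readers_contend(plan))

    # Nconsumer = 5, Nldisk = 4: the assumption is violated and the
    # fifth reader wraps around onto a disk that is already in use.
    plan = assign_readers(["c1", "c2", "c3", "c4", "c5"], n_ldisk=4)
    print(plan, "contention:", readers_contend(plan))
```

With Nconsumer <= Nldisk no two readers share a disk, so they do not compete for Rread bandwidth. Once Nconsumer exceeds Nldisk the assignment wraps around and sharing is unavoidable.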