Policy for Adding User Data Files to SAM

 

I. Introduction


Adding user data to the SAM system is straightforward, but it requires easily enforced rules and well-understood procedures if the data is to be useful and manageable and is not to interfere with the official datasets. Several problems are associated with adding user data to the system. Unless the user is trusted, there is no way of knowing that what the user puts in is really what the metadata describes. Users tend to perform processing steps that are not officially sanctioned by the collaboration, such as using untested versions of applications or unofficial input parameters. User data needs to be directed to particular output tapes so that it steps neither on official data nor on other users' data. Users can have voracious appetites for storage, especially if it is easy to use and apparently free. Users tend not to be very cognizant of issues like file size and can fill tapes with thousands of tiny files that are painful for the storage system to store and retrieve. Users may not employ "standardized" file naming conventions, so it is important that the metadata they provide has meaning, and in any case the names must be unique. For the data to be useful to others, it is important to maintain a complete history of the processing chain in SAM, and this may be difficult to enforce. Finally, cleaning up obsolete user data will be very difficult, especially if it is not well organized from the beginning.
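
The file-size and naming concerns above lend themselves to a simple check before anything is stored. The following is a minimal sketch and not part of SAM: the function name `precheck_files`, the 100 MB threshold, and the list of already-declared names passed in by the caller are all assumptions chosen for illustration.

```python
# Hypothetical pre-store check; not a SAM tool.
# The threshold is an assumed value: the policy itself sets no number.
MIN_FILE_SIZE_BYTES = 100 * 1024 * 1024


def precheck_files(candidates, declared_names, min_size=MIN_FILE_SIZE_BYTES):
    """Return (file name, problem) pairs for files that should not be stored.

    candidates     -- iterable of (file name, size in bytes) pairs
    declared_names -- names already known to the catalogue (assumed available)
    """
    problems = []
    seen = set(declared_names)
    for name, size in candidates:
        # Names must be unique across existing and newly submitted files.
        if name in seen:
            problems.append((name, "name is not unique"))
        else:
            seen.add(name)
        # Tiny files are expensive for the tape system to store and retrieve.
        if size < min_size:
            problems.append((name, "file is below the minimum size"))
    return problems


if __name__ == "__main__":
    files = [
        ("top_unofficial_0001.root", 800 * 1024 * 1024),
        ("top_unofficial_0001.root", 750 * 1024 * 1024),  # duplicate name
        ("scratch.txt", 2048),                            # tiny file
    ]
    for name, problem in precheck_files(files, declared_names=[]):
        print(f"{name}: {problem}")
```

A check of this kind could be run by a group liaison before a store request is submitted; where the threshold should sit is a policy decision that this sketch does not settle.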

Procedures for adding official data to SAM have been in place for many years and are well understood. They cover two basic kinds of data: (1) import and (2) project. Import data is data moved into the SAM system that was created outside the control of SAM, i.e. without using SAM analysis projects. An example is a Monte Carlo file created at a remote processing center without a SAM station and without any SAM protocol. Each data file to be imported is accompanied by a description (metadata) file and a parameter file further describing its contents. Data that has been processed within a SAM station, in a project, already has process and project information in the SAM system, so its description files are slightly more streamlined: much of the metadata regarding the application and other vital information has already been entered into SAM.

A special category of file has been created for storing online information such as epics, luminosity, and other files. It has a data type called "online_archive". This is convenient because the user has complete freedom to store data as needed and to tar groups of files or directories together before storing them. However, we fear that a similar "offline_archive" definition might be abused, as every manner of data would be stored, including backups of project areas, temp areas, theses, downloads from the internet, family photo albums, et cetera. Of course, it is possible to abuse any of the user storage mechanisms in this way; it is just not quite as convenient.
 

II. Policy


In order to address these issues, the following policies need to be established:

  1. User data is stored on tape in File Families assigned by physics group.
  2. Each group will appoint a liaison who will be contacted by people within the group when they need to store data.
  3. Groups are given quotas for tape usage. How the tapes are being filled, and by whom, should be monitored within the group. Additional quota is granted by the ORB. Tapes are recycled only on a complete-tape basis, and recycling is requested through the ORB.
  4. The mapping of files to tape File Families is determined by the group name in the description files through the autodestination feature of SAM. A map entry must be added for each group.
  5. If groups desire finer-grained mappings to File Families, they need to arrange this through the ORB.
  6. Users must certify their description files with SAM (and the ORB) before storing data into the system.
  7. Specific parameters which must be approved include (for imported files):
    1. application family,
    2. application version,
    3. application name (indicating "unofficial"),
    4. group,
    5. phase,
    6. physics process,
    7. decay channel,
    8. stream,
    9. data tier (indicating "unofficial"),
    10. filename adhering to convention (loose but must ensure uniqueness),
    11. number of events and first and last event number,
    12. file size,
    13. beginning and ending date and time,
    14. center of mass energy,
    15. others as needed or added to the metadata.
  8. Specific parameters which must be approved include (for inline processed files):
    1. application name (including "unofficial"), family, and version,
    2. filename adhering to convention (loose but must ensure uniqueness),
    3. file size,
    4. number of events and first and last event number,
    5. start and end date and time,
    6. parent file name,
    7. SAM process ID.
  9. Maintain a complete processing chain in SAM if at all possible. This means storing (or at least declaring) the output of each processing step in the system.
  10. The file family width will always be set to 1. If additional write bandwidth is needed, the ORB will be consulted.
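
Policies 4, 6, 7, 8, and 10 can be read together as a certification step followed by a destination lookup. The sketch below illustrates that reading under stated assumptions: the field keys, the `GROUP_FILE_FAMILY` map, the group and File Family names, and the functions `certify` and `destination` are hypothetical. They do not reproduce SAM's description-file format or its autodestination implementation.

```python
# Illustrative only: keys, group names, and File Family names are assumptions,
# not SAM's actual description-file schema or autodestination map.

# Parameters listed in policy 7 (imported files).
IMPORT_FIELDS = {
    "application_family", "application_version", "application_name",
    "group", "phase", "physics_process", "decay_channel", "stream",
    "data_tier", "file_name", "event_count", "first_event", "last_event",
    "file_size", "start_time", "end_time", "cm_energy",
}

# Parameters listed in policy 8 (files processed inside a SAM project).
PROJECT_FIELDS = {
    "application_family", "application_version", "application_name",
    "file_name", "file_size", "event_count", "first_event", "last_event",
    "start_time", "end_time", "parent_file_name", "sam_process_id",
}

# Policy 10: the file family width is always 1.
FILE_FAMILY_WIDTH = 1

# Hypothetical map entries, one per group (policy 4).
GROUP_FILE_FAMILY = {
    "top": "user-top",
    "higgs": "user-higgs",
}


def certify(description, kind):
    """Return a list of problems; an empty list means the description passes."""
    if kind not in ("import", "project"):
        raise ValueError(f"unknown kind: {kind!r}")
    required = IMPORT_FIELDS if kind == "import" else PROJECT_FIELDS
    missing = sorted(required - set(description))
    problems = [f"missing parameter: {key}" for key in missing]
    # Policies 7.3 and 8.1: the application name must be marked "unofficial".
    if "unofficial" not in str(description.get("application_name", "")):
        problems.append('application_name must indicate "unofficial"')
    # Policy 7.9: the data tier of an imported file must be marked "unofficial".
    if kind == "import" and "unofficial" not in str(description.get("data_tier", "")):
        problems.append('data_tier must indicate "unofficial"')
    return problems


def destination(group):
    """Resolve a group name to (File Family, width), in the spirit of policy 4."""
    if group not in GROUP_FILE_FAMILY:
        raise KeyError(f"no map entry for group {group!r}; one must be added first")
    return GROUP_FILE_FAMILY[group], FILE_FAMILY_WIDTH


if __name__ == "__main__":
    desc = {key: "placeholder" for key in PROJECT_FIELDS}
    desc["application_name"] = "reco-unofficial"
    print(certify(desc, "project"))
    print(destination("top"))
```

Keeping the required-parameter sets as plain data means a change approved by the ORB, such as a new metadata field, would only touch the sets and not the checking logic.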

III. Procedure

Refer to the document User File Store Details for more information regarding the procedure.