SAM delivers files and keeps track of which files you have successfully analyzed. It delivers the files on disk to you first and does the tape mounts in a way that makes sure everyone gets fair access to tape. It mitigates the problem of one person hogging the tape drives for days and/or flooding the tape system.
SAM saves information about projects, dataset definitions, and snapshots (three important SAM terms that are defined below) in the database for all to review.
SAM allows you to create your own dataset definitions. You can also create recovery dataset definitions that will give you files that you need to reprocess when your analysis or the data handling system has failed you. SAM eliminates the need for you to keep lots of tcl files around.
SAM is also used by D0. Using common software decreases support requirements.
SAM is designed to work with the GRID.
Data taken since November 2004 are available only through SAM. (Raw data are an exception: as of July 2005, raw data are still in DFC too.)
CAF segments send a request to SAM for a data file as soon as they are ready to process one. SAM does not have a fixed plan for which files go to which CAF segments. Instead, it responds to each request for another file as fast as possible while simultaneously caching more files to respond to future requests. This approach optimizes the speed of file delivery and the usage of resources.
One result of SAM's optimization of file delivery is that the order in which files are delivered is not predictable. Files are not delivered in the same order when jobs are repeated. Even the assignment of files to CAF segments is not predictable. It is up to the user to deal with this.
When running recovery jobs on CAF, the segments are not given the same segment numbers as the original failed segments. Input files are distributed to segments differently. This will force many people to change their method of bookkeeping when they start using SAM. It is up to the user to come up with ways to deal with this.
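One possible way to handle this bookkeeping is to track input file names instead of segment numbers. The sketch below is only an illustration under stated assumptions: it supposes you keep two plain text files with one file name per line, one listing the files you expect to process and one listing the files your job logs show as processed. Both lists and the function name missing_files are hypothetical and are not part of SAM or the CAF.

```shell
#!/bin/sh
# Hypothetical bookkeeping helper, not part of SAM or CAF.
# Usage: missing_files expected.txt processed.txt
# Both arguments are text files with one input file name per line.
# Prints every name that is in the expected list but not in the processed list.
missing_files() {
    mf_exp_sorted=$(mktemp) || return 1
    mf_done_sorted=$(mktemp) || return 1
    # comm requires sorted input; -u also drops duplicate names
    sort -u "$1" > "$mf_exp_sorted"
    sort -u "$2" > "$mf_done_sorted"
    # -23 keeps only the lines unique to the first (expected) list
    comm -23 "$mf_exp_sorted" "$mf_done_sorted"
    rm -f "$mf_exp_sorted" "$mf_done_sorted"
}
```

Because the comparison is by file name, the result does not depend on which segment received which file or on how segments were numbered.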
Creating too many SAM projects at once can overload system resources. DFC does not have projects.
There is a detailed CAF Users Manual with instructions on preparing jobs for the CAF. There is a link to it on the CAF Home Webpage . We do not repeat that information here, but only describe things required to use SAM.
Currently, we support general users running jobs on the CAF's at Fermilab using release 6.1.0. We also support small test jobs running on non-CAF machines. This will be expanded in the future. If you have an immediate need to use other releases or the remote CAF's and do not already know how to do that properly, contact cdfsam-admin@fnal.gov.
Currently, SAM administrators are recommending the following policy regarding access to large datasets from tape. If you are using SAM, you do not need to send requests to prestage data to the data handling group. If you receive an automated warning email about one of your jobs, forward it to cdfsam-admin@fnal.gov and let your job continue until you receive further instructions.
source ~cdfsoft/cdf2.shrc
setup cdfsoft2 6.1.0
# as of 09SEPT2005
setup diskcache_i -q GCC_3_4_3 v2_06_21
setup dcap v2_32_f0408
setup sam -q caf_prd
You are required to use release 6.1.0 and set up the additional product versions as shown above. This web page will be updated as these versions change. "setup sam -q caf_prd" is an optional optimization step that allows your job to use a SAM database server dedicated to CAF jobs, which is separate from the default database servers used for queries and other activities.
module disable ConfigManager
module talk DHInput
maxFiles set 5
include dataset $env(SAM_DATASET)
cache set SAM
exit
The parameter maxFiles is the maximum number of input files one segment will process and can be set to any integer greater than 0.
module talk DHInput
include file file1.root
include file file2.root
include file file3.root
include file file4.root
include file file5.root
include file file6.root
exit
It is necessary to disable the "ConfigManager" to avoid database access problems.
For CAF jobs, the only "include" statement should be the one shown above, "include dataset $env(SAM_DATASET)". The design for CAF jobs is that the CAF submitter creates one project for many segments. This one project passes data files to many segments. If you have other commands of the form "include file", "include fileset", or "include dataset" in your TCL, they will interfere with the proper delivery of data to your segments and create one project per include statement. There is a limit to how many projects the servers can support, and they will rapidly be overloaded if many projects are started by one user.
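A quick check before submitting can catch extra include statements. The following is a minimal sketch, not an official tool: the function names count_includes and check_single_include are hypothetical, and the check assumes each include statement sits on its own line of the TCL file.

```shell
#!/bin/sh
# Hypothetical pre-submission check, not part of SAM or CAF.
# Count the "include file", "include fileset", and "include dataset" lines
# in a TCL file. Lines whose first non-blank character is # are ignored.
count_includes() {
    grep -v '^[[:space:]]*#' "$1" |
        grep -c -E '^[[:space:]]*include[[:space:]]+(file|fileset|dataset)[[:space:]]' || true
}

# Return 0 if the TCL file has exactly one include statement, 1 otherwise.
check_single_include() {
    csi_count=$(count_includes "$1")
    if [ "$csi_count" -ne 1 ]; then
        echo "$1 has $csi_count include statements; a CAF job should have exactly one" >&2
        return 1
    fi
    return 0
}
```

For example, "check_single_include run.tcl && echo ok" prints ok only when run.tcl contains a single include line. The check does not verify that the one include is the recommended "include dataset $env(SAM_DATASET)" form.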
maxFiles should be calculated as follows.
You will not use SAM if you do not include the "cache set SAM" statement. The default is to use the older DFC system.
To use the CAF Gui
setup cdfsoft2 development
CafSubmit --dhaccess=SAM \
--tarFile=/cdf/scratch/YourUserName/tarFileWith_runshAndruntcl.tgz \
--outLocation=YourUserName@fcdflnxY.fnal.gov:/cdf/scratch/YourUserName/\$.tgz \
--procType=medium --group=common --dataset=xpmm0f \
--email=YourUserName@fnal.gov \
--start=1 --end=20 ./run.sh \$
Most of this is the same whether or not you use SAM. There are two differences. First, "--dhaccess=SAM" is required if you want to use SAM. Second, "--dataset=xpmm0f" must be set to the name of the dataset definition.
The dataset definition can be one that already exists. In the future, we expect physics groups and individuals to create standard dataset definitions that are used by many people. In addition, for each "CDF dataset", a corresponding SAM dataset definition has already been created with the same name. We use "CDF dataset" to refer to the datasets that were defined before SAM. For example, the primary output datasets are "CDF datasets". Examples would be xpmm0f, or bhmu0d. There are instructions on how to create your own dataset definition here: How To Create SAM Dataset Definitions and Snapshots.
The number of segments is calculated as follows:
Here we describe running a simple test job with one input file. This can be run on a central Linux machine. It can also be run on a user's desktop machine, if the machine is set up and configured properly. It is highly recommended that users run this kind of test before submitting a large job to the CAF.
What is described here will work well for a single machine, but if you envision running many non-CAF machines to process data using SAM then you need to speak with SAM experts first. Unless configured carefully, running on multiple non-CAF machines will overload SAM resources and cause many problems.
There are three ways to do this: setting a minimal set of environment variables, running a script that emulates the CAF submitter, or using TCL commands. In all three cases, the first thing you want to do is create a one-file dataset definition.
Here is an example of how to create a dataset definition for a single data file for test purposes. How To Create SAM Dataset Definitions and Snapshots explains dataset definition in much more detail. Please, do not try to use this to create a dataset definition for each file in a large dataset with many files.
sam create dataset definition \
--defname="newDatasetDefinitionName" \
--group=test \
--defdesc="description" \
--dim="FILE_NAME bd02d8e6.05bdhmu0"
sam take snapshot --defname="datasetDefinitionName" --group=test
export SAM_STATION=cdf-sam
export SAM_DATASET=datasetDefinitionName
export SAM_PROJECT=yourUserName_`date +%Y%m%d_%H%M%S`
Do NOT set these variables when submitting to CAF. The CAF will do it for you.
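For an interactive test, it can help to confirm that all three variables are set before starting the job. The following is a small sketch only: the function name check_sam_env is hypothetical, and it checks nothing beyond whether SAM_STATION, SAM_DATASET, and SAM_PROJECT are non-empty.

```shell
#!/bin/sh
# Hypothetical helper for interactive test jobs, not part of SAM or CAF.
# Reports each of the three variables that is unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_sam_env() {
    cse_missing=0
    for cse_name in SAM_STATION SAM_DATASET SAM_PROJECT; do
        # Indirect lookup of the variable named in cse_name (names come from the fixed list above)
        eval "cse_value=\${$cse_name:-}"
        if [ -z "$cse_value" ]; then
            echo "missing: $cse_name" >&2
            cse_missing=1
        fi
    done
    return $cse_missing
}
```

A test session might call "check_sam_env || exit 1" just before running the job script. This is meant for non-CAF test jobs only, since on the CAF these variables are set for you.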
There is an alternative to the procedure described above which better emulates what really happens on the CAF. The major difference is this: when you just set the environment variables, the DHInput module automatically creates the SAM project, whereas the script below creates the project explicitly and DHInput just uses the existing project to get the files.
You need to copy the script below and run it. The script mimics the behavior of the CAF submitter. Be careful to not run this script on a CAF node, because the script creates a project. If a large CAF job was submitted and each of its segments created its own project, then resources would be overloaded. Further comments are given in the script. The script can be started with the command:
thisScript.sh run.sh datasetDefinitionName > log 2>&1 &
#!/bin/sh
# checking the usage of the script
# no SAM specifics
script_name=${1}
if [ ! -x "${script_name}" ]
then
echo "script name ${script_name} is not valid"
echo "Usage ${0} script_name dataset_name"
exit 2
fi
dataset_name=${2}
if [ -z "${dataset_name}" ]
then
echo "The dataset name needs to be given!"
echo "Usage ${0} script_name dataset_name"
exit 3
fi
echo ${0} will run ${script_name} for dataset ${dataset_name}
source ~cdfsoft/cdf2.shrc
# setting up the SAM environment variables.
# SAM specific needed for any interactive job
# either on the CAF or on a remote SAM station
set -x
echo "Setting SAM environment variables"
export SAM_STATION=cdf-sam
export SAM_DATASET=${dataset_name}
export SAM_GROUP=test
export SAM_PROJECT=${USER}_${SAM_STATION}_${SAM_DATASET}_`date +%s`_$$
export SAM_USER_NAME=${USER}
export SAM_QUALIFIER=caf_prd
export SAM_VERSION=
set +x
echo doing setup sam ${SAM_VERSION} -q ${SAM_QUALIFIER}
setup sam ${SAM_VERSION} -q ${SAM_QUALIFIER}
set -x
# starting the SAM project,
# this part is mimicking the behaviour of the CAF starting section,
# not needed for submission, e.g. on remote SAM stations.
echo starting project $SAM_PROJECT
sam start project --station=${SAM_STATION} --project=${SAM_PROJECT} \
--defname=${SAM_DATASET} --group=${SAM_GROUP} --retryMaxCount=200
SX=$?
echo return from sam start project $SX
echo Run section 1 on node $HOST by user $USER
set +x
export -n SAM_VERSION
# caf does not pass SAM_VERSION to the sections
# user job gets started
# needed for any job
${script_name} 1
# stopping the SAM project,
# this part is mimicking the behaviour of the CAF end section,
# not needed for submission, e.g. on remote SAM stations
set -x
echo ending project $SAM_PROJECT
sleep 10
sam stop project --force --station=${SAM_STATION} \
--project=${SAM_PROJECT} --retryMaxCount=20
Do not use this approach on CAF. And do not use more than one "include" statement in your TCL. Each include will trigger the creation of a project. Users creating multiple projects will overwhelm the SAM system and possibly even crash it, upsetting everyone from management to the user next door.
The third alternative is to modify the TCL file. In some cases, this is the easiest and simplest approach, but it is also tempting to use it to do business in the old DFC style by using "include file" many times. Do not do this.
In this approach, one modifies the TCL and then does not need to set environment variables or do anything else. Just execute the script you would run on a CAF node. This also works interactively. The DHInput module will automatically create a project and generate its name.
talk DHInput
cache set SAM
include dataset datasetDefinitionName
maxFiles set 10
sam
station set cdf-sam
exit
exit
talk DHInput
cache set SAM
include file cd02d8df.00bddip0
maxFiles set 1
sam
station set cdf-sam
exit
exit
One last time, only use ***ONE*** "include file" here or SAM resources will be overloaded.
The SAM project keeps track of which files are successfully delivered and consumed. When a CAF job completes, it will send you an email which includes a "SAM Project Summary" as well as the report from CAF on the results of your job and its many segments. The section on monitoring explains how to get a "SAM Project Summary" if you are not running on CAF or lost the email. You can review these to determine which of your segments succeeded or failed. You also need to check the output of your job (log files and ntuples) to determine which segments succeeded or failed. Neither SAM nor CAF can be expected to catch all types of failures.
This part of the procedure needs further improvement and some people are working on this. Constructive input from users may help, although there are constraints inherent in SAM which will be impractical to change. Currently, there is a procedure that will allow you to recover failures which are detected by SAM. The script below will make a new dataset definition which will select the files that failed. If you run a new analysis job using this dataset definition, you can rerun your analysis on this data. Note that the segments will not have the same numbers as in the original job, nor will the assignment of files to segments be the same. Users need to deal with this bookkeeping issue on their own. This numbering issue is one of the constraints from SAM that it is impractical to change.
Currently, the only advice for rerunning segments that fail where SAM does not detect the failure is to rerun the entire job from scratch. SAM will not detect the following types of failures:
If the "SAM Project Summary" lists a file as delivered, then SAM delivered the file to the CAF segment, but the segment did not report back to SAM that all the events were successfully processed. This can be caused by a crash, abort, or CAF node restart.
There is one surprising behavior you might see. On rare occasions, a CAF problem causes a CAF segment to be stopped and restarted from scratch. SAM sees the restarted segment as a new, additional segment. The files delivered to and consumed by the original segment may or may not be considered to have failed. A new set of files is delivered to the restarted process, which gets a new, additional CPID. You can detect this situation by counting the CPIDs in the "SAM Project Summary". If there are more CPIDs than the number of segments you started, then some restarts occurred during your job. In this case, you have to carefully count files to make sure all of them make it into the recovery dataset definition, or start over from scratch.
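The CPID count can be compared against the number of segments with a few lines of shell. This is a sketch under a stated assumption: it supposes you have already copied the CPIDs from the "SAM Project Summary" into a text file with one CPID per line. That file, its format, and the function name count_restarts are hypothetical; the sketch does not parse the summary itself.

```shell
#!/bin/sh
# Hypothetical helper, not part of SAM or CAF.
# Usage: count_restarts cpid_list.txt number_of_segments
# cpid_list.txt holds one CPID per line (copied by hand from the project summary).
# Prints how many more distinct CPIDs there are than segments (0 if none).
count_restarts() {
    # Count distinct, non-empty lines
    cr_cpids=$(sort -u "$1" | grep -c . || true)
    cr_extra=$((cr_cpids - $2))
    if [ "$cr_extra" -gt 0 ]; then
        echo "$cr_extra"
    else
        echo 0
    fi
}
```

A non-zero result suggests that some segments were restarted, so the files in the recovery dataset definition should be counted carefully.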
sam generate recovery project \
--project="dataset_to_be_recovered" \
--recoDefname="recovery_dataset"
In the script above, the "sam generate recovery project" command will create a list of all files which were consumed with the same Consumer Process ID as any of the single files that failed. That means that all files within the CAF segments which failed will go into the recovery dataset definition. Usually that is the behavior you want. However, if you have a one-to-one relation between your input files and your output files, you might not need to recover a whole CAF segment, but only the individual files which failed. In this case you need to edit the recovery_by_file.sh script and change one line from:
sam generate recovery project --project=$FAILED_PROJECT_NAME --printQuery > ${TMP_FILE_NAME}
to:
sam generate recovery project --ignoreConsumed --project=$FAILED_PROJECT_NAME --printQuery > ${TMP_FILE_NAME}
setup sam
sam get project summary --project=<project-name>
where you get the project name from SAMTV or the Database Browser. This gives you the ID of each consumer (each segment of your CAF job has a consumer ID, or CPID) and the number of files delivered, consumed, failed, etc. If you use the "-v" qualifier, you get a list of every file and to which consumer it went.