How To Use SAM to Access CDF Data

What is SAM

SAM delivers files and keeps track of which files you have successfully analyzed. It delivers files that are already on disk first and schedules tape mounts so that everyone gets fair access to tape. This mitigates the problem of one person hogging the tape drives for days or flooding the tape system.

SAM saves information about projects, dataset definitions, and snapshots (three important SAM terms that are defined below) in the database for all to review.

SAM allows you to create your own dataset definitions. You can also create recovery dataset definitions that will give you files that you need to reprocess when your analysis or the data handling system has failed you. SAM eliminates the need for you to keep lots of tcl files around.

SAM is also used by D0. Using common software decreases support requirements.

SAM is designed to work with the GRID.

Data taken since November 2004 are available only through SAM. (Raw data are an exception: as of July 2005, raw data are still in DFC too.)

CDF Users Should be Aware of These Differences Between DFC and SAM

CAF segments send a request to SAM for a data file as soon as they are ready to process one. SAM does not have a fixed plan for which files go to which CAF segments. It responds to each request for another file as fast as possible while simultaneously caching more files to respond to future requests. This approach optimizes the speed of file delivery and the usage of resources.

One result of SAM's optimization of file delivery is that the order in which files are delivered is not predictable. Files are not delivered in the same order when jobs are repeated. Even the assignment of files to CAF segments is not predictable. It is up to the user to deal with this.

When running recovery jobs on CAF, the segments are not given the same segment numbers as the original failed segments. Input files are distributed to segments differently. This will force many people to change their method of bookkeeping when they start using SAM. It is up to the user to come up with ways to deal with this.

Creating too many SAM projects at once can overload system resources. DFC does not have projects.

SAM Registration

SAM requires that you register yourself. You only need to do it once. Go here to register: Registration Website. Select the groups "test" and "cdf". It makes no difference whether you select other groups, so ignore them. Many people were registered automatically by the SAM group based on information from the CAF. At the same web site, you can check whether you are already registered or update your registration.

Concepts and Definitions

There are a few important SAM terms that you need to know.

Preparing a SAM Job to Run on a Fermilab CAF

There is a detailed CAF Users Manual with instructions on preparing jobs for the CAF. There is a link to it on the CAF Home Webpage. We do not repeat that information here, but only describe things required to use SAM.

Currently, we support general users running jobs on the CAF's at Fermilab using release 6.1.0. We also support small test jobs running on non-CAF machines. This will be expanded in the future. If you have an immediate need to use other releases or the remote CAF's and do not already know how to do that properly, contact cdfsam-admin@fnal.gov.

Currently, SAM administrators are recommending the following policy regarding access to large datasets from tape. If you are using SAM, you do not need to send requests to prestage data to the data handling group. If you receive an automated warning email about one of your jobs, forward it to cdfsam-admin@fnal.gov and let your job continue until you receive further instructions.

The CAF Shell Script (run.sh)

In the shell script that starts the job on the CAF node, use these lines to set up the environment.
   source ~cdfsoft/cdf2.shrc
   setup cdfsoft2 6.1.0
   # as of 09SEPT2005
   setup diskcache_i -q GCC_3_4_3 v2_06_21
   setup dcap v2_32_f0408
   setup sam -q caf_prd
You are required to use release 6.1.0 and set up the additional product versions as shown above. This web page will be updated as these versions change. "setup sam -q caf_prd" is an optional optimization step that allows your job to use a SAM database server dedicated to CAF jobs, which is separate from the default database servers used for queries and other activities.

The TCL Script

  1. Add this to the tcl script
      module disable ConfigManager
      module talk DHInput
        maxFiles set 5
        include dataset $env(SAM_DATASET)
        cache set SAM
      exit
    
    The parameter maxFiles is the maximum number of input files one segment will process and can be set to any integer greater than 0.
  2. REMOVE from the tcl script anything that looks like this. Leaving it in will overload the SAM servers and cause everyone serious problems. People will come looking for you if you do not remove it.
      module talk DHInput
        include file file1.root
        include file file2.root
        include file file3.root
        include file file4.root
        include file file5.root
        include file file6.root
      exit
    

It is necessary to disable the "ConfigManager" to avoid database access problems.

For CAF jobs, the only "include" statement should be the one shown above, "include dataset $env(SAM_DATASET)". The design for CAF jobs is that the CAF submitter creates one project for many segments. This one project passes data files to many segments. If you have other commands of the form "include file", "include fileset", or "include dataset" in your TCL, they will interfere with the proper delivery of data to your segments and create one project per include statement. There is a limit to how many projects the servers can support, and they will rapidly be overloaded if many projects are started by one user.

maxFiles should be calculated as follows.

    maxFiles = (safety factor) x (queue time limit) / (time to process one file)

The safety factor simply allows for variations in the time to process each file. 3/4 is a reasonable value to use. The queue time limit depends on which CAF queue you choose to use. For example, as this is being written, the medium CAF queue has a 12 hour time limit. The time to process a file will be different for each user. If you do not know, run your job on one typical size file as a test.
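
As a rough illustration of the calculation described above, the sketch below assumes that maxFiles is the safety factor times the queue time limit divided by the time to process one file, which is our reading of the description. The 30 minutes per file is a made-up number; replace it with the time you measure on one typical file.

```shell
#!/bin/sh
# Sketch: estimate maxFiles from the queue time limit and the time per file.
# All numbers below are illustrative assumptions, not recommendations.

queue_limit_min=720      # 12-hour medium queue, expressed in minutes
minutes_per_file=30      # hypothetical: measured by running on one typical file
safety_num=3             # safety factor 3/4, kept as a fraction so that
safety_den=4             # the shell's integer arithmetic can be used

# Assumed formula: safety factor * queue time limit / time per file.
# Integer division rounds down, which errs on the safe side.
max_files=$(( queue_limit_min * safety_num / (safety_den * minutes_per_file) ))

# maxFiles must be an integer greater than 0.
if [ "$max_files" -lt 1 ]; then
    max_files=1
fi

# Print the line in the form used inside "module talk DHInput".
echo "maxFiles set $max_files"
```

With these example numbers the script prints "maxFiles set 18".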

You will not use SAM if you do not include the "cache set SAM" statement. The default is to use the older DFC system.
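
The rules above (exactly one include statement, and "cache set SAM" present) can be checked before submission. The following is a minimal sketch only; check_caf_tcl is a hypothetical helper written for this page and is not part of SAM or the CAF tools. It matches the include statement literally, so unusual spacing in your TCL will be reported as a problem.

```shell
#!/bin/sh
# Sketch: sanity-check a TCL file before submitting it to the CAF.
# check_caf_tcl is a hypothetical helper, not part of SAM or the CAF tools.

check_caf_tcl() {
    tcl_file=$1

    # Count every "include file", "include fileset" and "include dataset" line.
    n_include=$(grep -c -E '^[[:space:]]*include[[:space:]]+(file|fileset|dataset)[[:space:]]' "$tcl_file" || true)

    # Count lines holding the one include statement wanted for CAF jobs.
    n_wanted=$(grep -c -F 'include dataset $env(SAM_DATASET)' "$tcl_file" || true)

    if [ "$n_include" -ne 1 ] || [ "$n_wanted" -ne 1 ]; then
        echo "$tcl_file: expected exactly one include: include dataset \$env(SAM_DATASET)"
        return 1
    fi

    # Without "cache set SAM" the job falls back to the older DFC system.
    if ! grep -q -E '^[[:space:]]*cache[[:space:]]+set[[:space:]]+SAM[[:space:]]*$' "$tcl_file"; then
        echo "$tcl_file: missing the statement: cache set SAM"
        return 1
    fi

    echo "$tcl_file: OK"
    return 0
}
```

For example, "check_caf_tcl run.tcl" (run.tcl being whatever your TCL file is called) returns a non-zero status if the file still contains "include file" lines.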

CafGui and CafSubmit

You must use CafGui or CafSubmit. Do not use the obsolete Cafcom or Cafcom.py scripts. Cafcom will not work correctly with SAM, and it creates one project per CAF segment, which will overload the SAM servers. Cafcom might still work for DFC access at the moment, but it is deprecated and not supported there either.

To use the CAF Gui

After you click submit, you will be given a popup window. If you see that the number of files and the dataset are what you expected, select "yes" and continue.

If you wish to use CafSubmit, you can use this:
setup cdfsoft2 development

CafSubmit --dhaccess=SAM \
--tarFile=/cdf/scratch/YourUserName/tarFileWith_runshAndruntcl.tgz \
--outLocation=YourUserName@fcdflnxY.fnal.gov:/cdf/scratch/YourUserName/\$.tgz \
--procType=medium --group=common --dataset=xpmm0f \
--email=YourUserName@fnal.gov \
--start=1 --end=20 ./run.sh \$
Most of this is the same whether or not you use SAM. There are two differences. First, "--dhaccess=SAM" is required if you want to use SAM. Second, "--dataset=xpmm0f" must be set to the name of the dataset definition.

The dataset definition can be one that already exists. In the future, we expect physics groups and individuals to create standard dataset definitions that are used by many people. In addition, for each "CDF dataset", a corresponding SAM dataset definition has already been created with the same name. We use "CDF dataset" to refer to the datasets that were defined before SAM. For example, the primary output datasets, such as xpmm0f or bhmu0d, are "CDF datasets". There are instructions on how to create your own dataset definition here: How To Create SAM Dataset Definitions and Snapshots.

The number of segments is calculated as follows:

Round any fractions up!
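
A small sketch of the rounding, under the assumption that the number of segments is the number of files in the dataset definition divided by maxFiles. The file count of 250 is a made-up example.

```shell
#!/bin/sh
# Sketch: number of CAF segments, rounding any fraction up.
# Assumption: segments = (files in the dataset definition) / maxFiles.

n_files=250     # hypothetical number of files in the dataset definition
max_files=18    # the maxFiles value set in the TCL

# Adding (max_files - 1) before the integer division rounds the result up.
n_segments=$(( (n_files + max_files - 1) / max_files ))

# Print the values in the form used on the CafSubmit command line.
echo "--start=1 --end=$n_segments"
```

With these example numbers, 250 / 18 is about 13.9, so the script prints "--start=1 --end=14".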

Using Remote CAF's

At the present time, using SAM at remote CAF's is not supported for general users. This is something that will change.

Run a Test Job on One File on a Non-CAF Computer

Here we describe running a simple test job with one input file. This can be run on a central Linux machine. It can also be run on a user's desktop machine, if the machine is set up and configured properly. It is highly recommended that users run this kind of test before submitting a large job to the CAF.

What is described here will work well for a single machine, but if you envision running many non-CAF machines to process data using SAM then you need to speak with SAM experts first. Unless configured carefully, running on multiple non-CAF machines will overload SAM resources and cause many problems.

There are three ways to do this: setting a minimal set of environment variables, running a script that emulates the CAF submitter, or using TCL commands. In all three cases, the first thing to do is create a one-file dataset definition.

Here is an example of how to create a dataset definition for a single data file for test purposes. How To Create SAM Dataset Definitions and Snapshots explains dataset definition in much more detail. Please, do not try to use this to create a dataset definition for each file in a large dataset with many files.

    sam create dataset definition \
        --defname="datasetDefinitionName" \
        --group=test \
        --defdesc="description" \
        --dim="FILE_NAME bd02d8e6.05bdhmu0"

    sam take snapshot --defname="datasetDefinitionName" --group=test

Alternative 1: Use environment variables

Set three environment variables, then start the job by running the same script and TCL that you will run on a CAF node.
    export SAM_STATION=cdf-sam 
    export SAM_DATASET=datasetDefinitionName
    export SAM_PROJECT=yourUserName_`date +%Y%m%d_%H%M%S`
Do NOT set these variables when submitting to CAF; the CAF will do it for you.
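
For a local test, it can help to confirm that all three variables are set before starting the job. The sketch below is one possible way to do that; require_sam_env is a hypothetical helper written for this page, not something SAM provides.

```shell
#!/bin/sh
# Sketch: confirm the three SAM variables are set before a local test run.
# require_sam_env is a hypothetical helper, not part of SAM.

require_sam_env() {
    missing=0
    for name in SAM_STATION SAM_DATASET SAM_PROJECT; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        fi
    done
    # Status 0 if all three are set, 1 otherwise.
    return "$missing"
}
```

It could be used as "require_sam_env && ./run.sh 1" on a non-CAF machine, so that the job does not start with an incomplete environment.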

Alternative 2: Run a Script that Emulates the CAF Submitter

There is an alternative to the procedure described above which better emulates what really happens on the CAF. The major difference is this: when you just set the environment variables, the DHInput module automatically creates the SAM project, whereas the script below creates the project explicitly and DHInput just uses the existing project to get the files.

You need to copy the script below and run it. The script mimics the behavior of the CAF submitter. Be careful to not run this script on a CAF node, because the script creates a project. If a large CAF job was submitted and each of its segments created its own project, then resources would be overloaded. Further comments are given in the script. The script can be started with the command:

    thisScript.sh run.sh datasetDefinitionName > log 2>&1 &

The script itself:

#!/bin/bash

# checking the usage of the script
# no SAM specifics
script_name=${1}
if [ ! -x "${script_name}" ]
then
     echo "script name ${script_name} is not valid"
     echo "Usage ${0} script_name dataset_name"
     exit 2
fi

dataset_name=${2}

if [ -z "${dataset_name}" ]
then
     echo "The dataset name needs to be given!"
     echo "Usage ${0} script_name dataset_name"
     exit 3
fi

echo ${0} will run ${script_name} for dataset ${dataset_name}

source ~cdfsoft/cdf2.shrc

# setting up the SAM environment variables. 
# SAM specific needed for any interactive job
# either on the CAF or on a remote SAM station
set -x
echo "Setting SAM environment variables"
export SAM_STATION=cdf-sam
export SAM_DATASET=${dataset_name}
export SAM_GROUP=test
export SAM_PROJECT=${USER}_${SAM_STATION}_${SAM_DATASET}_`date +%s`_$$
export SAM_USER_NAME=${USER}
export SAM_QUALIFIER=caf_prd
export SAM_VERSION=
set +x

echo doing setup sam ${SAM_VERSION} -q ${SAM_QUALIFIER}
setup sam ${SAM_VERSION} -q ${SAM_QUALIFIER}
set -x

# starting the SAM project,
# this part is mimicking the behaviour of the CAF starting section,
# not needed for submission, e.g. on remote SAM stations.
echo starting project $SAM_PROJECT
sam start project --station=${SAM_STATION} --project=${SAM_PROJECT} \
--defname=${SAM_DATASET} --group=${SAM_GROUP} --retryMaxCount=200
SX=$?
echo return from sam start project $SX
echo Run section 1 on node $HOST by user $USER
set +x

export -n SAM_VERSION
# caf does not pass SAM_VERSION to the sections

# user job gets started
# needed for any job
${script_name} 1

# stopping the SAM project,
# this part is mimicking the behaviour of the CAF end section,
# not needed for submission, e.g. on remote SAM stations
set -x
echo ending project $SAM_PROJECT
sleep 10
sam stop project --force --station=${SAM_STATION} \
--project=${SAM_PROJECT} --retryMaxCount=20

Alternative 3: Modify your TCL

Do not use this approach on CAF. And do not use more than one "include" statement in your TCL. Each include will trigger the creation of a project. Users creating multiple projects will overwhelm the SAM system and possibly even crash it, upsetting everyone from management to the user next door.

The third alternative is to modify the TCL file. In some cases, this is the easiest and simplest approach, but it is also tempting to use it to do business in the old DFC style by using "include file" many times. Do not do this.

In this approach, you modify the TCL and then do not need to set environment variables or do anything else. Just execute the script you would run on a CAF node. This also works interactively. The DHInput module will automatically create a project and generate its name.

    talk DHInput
      cache set SAM
      include dataset datasetDefinitionName
      maxFiles set 10
      sam
        station set cdf-sam
      exit
    exit

You can even skip the dataset definition and run much as before.
    talk DHInput
      cache set SAM
      include file cd02d8df.00bddip0
      maxFiles set 1
      sam
        station set cdf-sam
      exit
    exit
One last time: use only ***ONE*** "include file" here, or SAM resources will be overloaded.

Recovery

The SAM project keeps track of which files are successfully delivered and consumed. When a CAF job completes, it will send you an email which includes a "SAM Project Summary" as well as the report from CAF on the results of your job and its many segments. The section on monitoring explains how to get a "SAM Project Summary" if you are not running on CAF or lost the email. You can review these to determine which of your segments succeeded or failed. You also need to check the output of your job (log files and ntuples) to determine which segments succeeded or failed. Neither SAM nor CAF can be expected to catch all types of failures.

This part of the procedure needs further improvement and some people are working on this. Constructive input from users may help, although there are constraints inherent in SAM which will be impractical to change. Currently, there is a procedure that will allow you to recover failures which are detected by SAM. The script below will make a new dataset definition which will select the files that failed. If you run a new analysis job using this dataset definition, you can rerun your analysis on this data. Note that the segments will not have the same numbers as in the original job, nor will the assignment of files to segments be the same. Users need to deal with this bookkeeping issue on their own. This numbering issue is one of the constraints from SAM that it is impractical to change.

Currently, the only advice for rerunning segments that fail where SAM does not detect the failure is to rerun the entire job from scratch. SAM will not detect the following types of failures:

If the "SAM Project Summary" lists a file as delivered, then SAM delivered the file to the CAF segment, but the segment did not report back to SAM that all the events were successfully processed. This can be caused by a crash, abort, or CAF node restart.

There is one surprising behavior you might see. On rare occasions, a CAF problem causes a CAF segment to be stopped and restarted from scratch. SAM sees the restarted segment as a new additional segment. The files delivered and consumed by the original segment may or may not be considered to have failed. A new set of files is delivered to the restarted process, which gets a new additional CPID. You can detect this situation by counting the CPID's in the "SAM Project Summary". If there are more CPID's than the number of segments you started, then some restarts occurred during your job. In this case, you have to carefully count files to make sure they all make it into the recovery dataset definition, or start over from scratch.

    sam generate recovery project \
        --project="dataset_to_be_recovered" \
        --recoDefname="recovery_dataset"

In the script above, the "sam generate recovery project" command will create a list of all files which were consumed with the same Consumer Process ID as any of the single files that failed. That means that all files within the CAF segments which failed will be included in the recovery dataset definition. Usually that is the behavior you want. However, if you have a one-to-one relation between your input files and your output files, you might not need to recover a whole CAF segment, but only the individual files which failed. In this case you need to edit the recovery_by_file.sh script and change one line from:

  sam generate recovery project --project=$FAILED_PROJECT_NAME --printQuery > ${TMP_FILE_NAME}
to:
  sam generate recovery project --ignoreConsumed --project=$FAILED_PROJECT_NAME --printQuery > ${TMP_FILE_NAME}
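
The CPID count described earlier in this section, used to detect restarted segments, can be sketched as follows. This sketch does not parse the "SAM Project Summary" itself. It assumes you have copied the CPID's from the summary by hand into a text file, one per line. count_restarts is a hypothetical helper written for this page.

```shell
#!/bin/sh
# Sketch: detect restarted segments by counting distinct CPIDs.
# Assumption: the CPIDs were copied by hand from the "SAM Project Summary"
# into a text file, one per line. The summary's real layout is not parsed.

count_restarts() {
    cpid_file=$1      # text file with one CPID per line
    n_segments=$2     # number of segments you started

    # Drop blank lines, count each CPID once, strip padding added by wc.
    n_cpid=$(grep -v '^[[:space:]]*$' "$cpid_file" | sort -u | wc -l | tr -d ' ')

    # More CPIDs than segments means that some segments were restarted.
    if [ "$n_cpid" -gt "$n_segments" ]; then
        echo $(( n_cpid - n_segments ))
    else
        echo 0
    fi
}
```

If "count_restarts cpids.txt 20" prints anything other than 0 for a 20-segment job, count files carefully before relying on the recovery dataset definition. The file name cpids.txt is only an example.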

Monitoring

There are a variety of options for looking at information about your project while it is running: