SAM Command Line Dataset Definition

NOTE 0: The following commands are valid only for version 5 of SAM; they are no longer accurate for versions 6 and 7. See the version 6/7 documentation for the new command syntax.

NOTE 1: The terms Dataset Definitions and Project Definitions are synonymous, as are the terms Datasets and Project Snapshots. These terms are used interchangeably in this document, and may be used interchangeably in the SAM commands.

0.  Quick Start for the Impatient

To define a dataset definition from a set of constraints, type (after the setups):

sam define dataset --group=<group> --defname=<name> {constraints}
or (the original names still work)
sam define project --group=<group> --defname=<name> {constraints}

where the valid constraints are listed below. The most useful is the file name pattern, for example --filename='%ttbar%', where the percent sign (%) is the wildcard. The definition name that you choose must be unique, and the group name must be known to the SAM database. For valid groups, see the Quickie Query List for Work Groups on the SAM Data Browsing pages. If the command succeeds, you may skip the rest of this document and go to running of SAM projects.
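As a rough illustration of how a pattern such as '%ttbar%' selects file names, the sketch below previews matches locally. It assumes that % matches any run of characters (including none) and that matching is case-sensitive; the real matching is done by the SAM database and may differ. The function names and the sample file names are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Translate a '%'-wildcard pattern into a compiled regular expression.

    Assumption: '%' stands for any run of characters, including none.
    """
    # Escape the literal pieces, then join them with '.*' in place of each '%'.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(".*".join(pieces))


def preview_matches(pattern, file_names):
    """Return the names that the pattern would select, in input order."""
    regex = like_to_regex(pattern)
    return [name for name in file_names if regex.fullmatch(name)]


# Hypothetical file names, for illustration only.
names = ["mc_ttbar_001.raw", "ttbar.root", "wjets_002.raw"]
print(preview_matches("%ttbar%", names))  # ['mc_ttbar_001.raw', 'ttbar.root']
```

A pattern without a leading % only matches names that start with the given text, which is why '%ttbar%' is the usual choice for "contains ttbar".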

1.  Introduction

Projects are established and used in processing sets of data without the user needing to track which files are in the list, or which files have been processed. Projects are created by employing either the SAM Project Editor GUI tool or the command line interface. The steps in the life of a project are as follows:
  1. A list of constraints is submitted to the translate constraints method and the project can be "sized" based on the response.  The constraint list can be in the form of selection parameters or a dimension/constraint expression in the new dataset dimension notation.
  2. Once the results look like what you want, its "definition" can be stored using the create definition method.
  3. A dataset definition can be used to generate a dataset of the files satisfying it, by using the create dataset command. This dataset is the point in time "snapshot" of the files that currently meet the criteria of the definition. And, yes, the old create snapshot command is identical.
  4. If a dataset is believed to exist, it can be verified using the verify method. This compares the list of files generated when the dataset was originally created against the results of the current search and warns the user if there are any changes or problems. This encourages the re-use of existing projects, thus assuring reproducible results.
  5. An "analysis" project can be created from the dataset using the create analysis method.
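The first four steps above can be sketched as a sequence of command lines. This is a minimal sketch that only builds the command strings and does not run them; it uses only the options documented on this page. The helper name lifecycle_commands is hypothetical, and step 5 is left out because this page does not give the command for creating an analysis project.

```python
import shlex


def lifecycle_commands(defname, group, constraints):
    """Return SAM v5 command lines for steps 1-4 of the project life cycle.

    constraints maps option names (e.g. "runnum") to values; each entry is
    rendered as --name=value in the order given.
    """
    flags = [f"--{name}={value}" for name, value in constraints.items()]
    steps = [
        # 1. Size the project before committing to a definition.
        ["sam", "translate", "constraints", *flags],
        # 2. Store the definition under a unique name.
        ["sam", "define", "dataset", f"--defname={defname}", f"--group={group}", *flags],
        # 3. Take a snapshot of the files that currently satisfy it.
        ["sam", "create", "dataset", f"--defname={defname}"],
        # 4. Later, compare the snapshot with the current search results.
        ["sam", "verify", "project", "snapshot", f"--defname={defname}"],
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return [shlex.join(step) for step in steps]


for line in lifecycle_commands(
    "ace_project", "groupa", {"runnum": 100930, "datatier": "digitized"}
):
    print(line)
```

Building the commands as argument lists and quoting them at the end keeps values such as '%ttbar%' or a multi-word description safe when pasted into a shell.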

2. Initial Setup


Initialization - To initialize the environment, the user should enter the following, if not already done:

    setup sam

3. SAM Project Commands

sam translate constraints - Translate a set of constraint parameters, or a dimension/constraint expression, into a summary of the files that would result from such a project definition. It returns the number of files found, the average file size and the number of volumes needed to access the data. Note: resulting files are included in the list only if they are immediately available.

usage:
sam translate constraints <constraints_or_dim>
where <constraints_or_dim> is either:
[--runnum=run_number] [--eventnum=event_number] [--datatier=data_tier] [--filename=file_name] [--physicaldatastream=physical_datastream_name] [--logicaldatastream=logical_datastream_name] [--physicaldataset=physical_dataset_name] [--applicationfamily=application_family] [--applicationfamilyversion=application_family_version]
or:
--dim=dimensions_and_constraints
or:
--rpn=dimensions_and_constraints_in_rpn_format (may be useful in odd situations where you're scripting things)
return: 1. files info, 2. volume info
for help using constraints:
sam translate constraints --help will provide the basic help, while
sam translate constraints --dim=help will provide detailed help on the new dimension notation, including a list of available dimensions and how to use them.
sam create dataset definition (or sam define dataset for short, or the old names sam create project definition or sam define project) - Use the same constraints as for translate constraints above to create and store a dataset definition in the database. The note on file availability (see translate constraints) applies.
usage:
sam define dataset --defname=dataset_definition_name --group=work_group_name [--defdesc=dataset_definition_description] <constraints_or_dim>
where <constraints_or_dim> is either:
[--runnum=run_number] [--eventnum=event_number] [--datatier=data_tier] [--filename=file_name] [--physicaldatastream=physical_datastream_name] [--logicaldatastream=logical_datastream_name] [--physicaldataset=physical_dataset_name] [--applicationfamily=application_family] [--applicationfamilyversion=application_family_version]
or:
--dim=dimensions_and_constraints (see the help for translate constraints above for a list of available dimensions)
or:
--rpn=dimensions_and_constraints_in_rpn_format
return: Status of the definition creation.
sam create dataset -or- sam create project snapshot
Use an existing dataset definition to create a dataset (aka snapshot) of currently available files, and their relevant volumes. If you omit the group when creating a dataset, the group will default to that of the dataset definition.
usage:
sam create dataset {--defname=project_definition_name || --defid=project_definition_id} [--group=work_group_name] [--snapdesc=project_snapshot_desc]

return: Status of the dataset/snapshot creation. Note: the group option is needed only if the dataset is created in the context of a group other than the group for which the project had been defined.
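The usage line for sam create dataset requires exactly one of --defname or --defid, with --group and --snapdesc optional. A small sketch of that rule, assuming a hypothetical helper create_dataset_args that only assembles the argument list and never invokes SAM:

```python
def create_dataset_args(defname=None, defid=None, group=None, snapdesc=None):
    """Assemble the argument list for 'sam create dataset'.

    Exactly one of defname or defid must be given, mirroring the
    {--defname=... || --defid=...} part of the usage line.
    """
    if (defname is None) == (defid is None):
        raise ValueError("give exactly one of defname or defid")

    args = ["sam", "create", "dataset"]
    if defname is not None:
        args.append(f"--defname={defname}")
    else:
        args.append(f"--defid={defid}")

    # Optional: only needed when the dataset is created for a group other
    # than the one the definition belongs to.
    if group is not None:
        args.append(f"--group={group}")
    if snapdesc is not None:
        args.append(f"--snapdesc={snapdesc}")
    return args


print(create_dataset_args(defname="ace_project"))
print(create_dataset_args(defid=2159, group="groupa"))
```

The resulting list can be passed as-is to a process launcher such as subprocess.run on a machine where the sam command is set up.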

sam verify snapshot -or- sam verify project snapshot
Verify that the files and volumes for the specified project have not changed since it was created. You may use either the project definition name or id to identify the snapshot. And, you may indicate a specific snapshot version, or omit the version to verify the latest snapshot for the project definition.
usage:
sam verify project snapshot --defname=project_definition_name [--snapvers=project_snapshot_version]
or:
sam verify project snapshot --defid=project_definition_id [--snapvers=project_snapshot_version]

return: The file differences between the original snapshot and the version of the snapshot that would be created if the same definition were applied now. Each file listed starts with either a plus or minus sign. Files new to SAM since the original snapshot was recorded start with a plus sign (+). Files that were in the original snapshot, but are now not in SAM start with a minus sign (-). These missing files might have been deleted, or are otherwise inaccessible.
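The plus/minus listing described above is easy to post-process. The sketch below separates added files from missing ones; it assumes that every difference line begins with + or - (optionally followed by a space) and that no header line begins with either character. The function name split_differences and the added file name in the sample are hypothetical.

```python
def split_differences(output):
    """Split verify output into (added, removed) lists of file names."""
    added, removed = [], []
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("+"):
            # File new to SAM since the original snapshot.
            added.append(line[1:].strip())
        elif line.startswith("-"):
            # File in the original snapshot that is no longer in SAM.
            removed.append(line[1:].strip())
        # Any other line (headers, blank lines) is ignored.
    return added, removed


sample = """\
Project Snapshot file differences:
- ALL_076151_04.IGOR_01
+ HYPOTHETICAL_NEW_FILE_01
"""
added, removed = split_differences(sample)
print("added:", added)      # added: ['HYPOTHETICAL_NEW_FILE_01']
print("removed:", removed)  # removed: ['ALL_076151_04.IGOR_01']
```

Two empty lists mean the snapshot still matches the current search, so results produced from it remain reproducible.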

4. Example of setting up and using a project

We would like to analyze all data from run 100930 with a data tier of "digitized".  First, we test the constraints using the translate constraints command:

sam translate constraints --runnum=100930 --datatier=digitized

File Count:  11
Average File Size:  1910894

The return tells us that there are 11 files which satisfy these constraints, with an average size of 1910894 kBytes. Next, suppose the query we need cannot be expressed with the simple constraint parameters, for instance because it uses dimensions that were not provided in the original list of constraints. For example, we may want all of these files, but only if they are not in a certain physical datastream:
sam translate constraints --dim="(run_number 100930 data_tier digitized) minus physical_datastream_name electron+jet"

File Count:  2
Average File Size:  1760936
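The two summaries printed by translate constraints can be compared programmatically. This sketch parses the "File Count" and "Average File Size" lines exactly as they appear in the output above and multiplies them to estimate the total size, in whatever unit SAM reports. The function names are hypothetical, and the parser ignores any other lines the command may print.

```python
import re

SUMMARY_LINE = re.compile(r"\s*(File Count|Average File Size):\s*(\d+)\s*$")


def parse_summary(output):
    """Extract the integer fields from translate-constraints output."""
    summary = {}
    for line in output.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            summary[match.group(1)] = int(match.group(2))
    return summary


def estimated_total_size(summary):
    """File count times average size, in the unit reported by SAM."""
    return summary["File Count"] * summary["Average File Size"]


simple = parse_summary("File Count:  11\nAverage File Size:  1910894\n")
with_dim = parse_summary("File Count:  2\nAverage File Size:  1760936\n")
print(estimated_total_size(simple))    # 21019834
print(estimated_total_size(with_dim))  # 3521872
```

Sizing a query this way before storing the definition is the purpose of step 1 in the project life cycle.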

Let's create a project definition and proceed through the chain of creating a snapshot and an analysis project. First, the project definition is created. We can create it using the original constraints.
sam define project --defname=ace_project --group=groupa --runnum=100930 --datatier=digitized

Project definition created with Id: 2159

Or, we can create the project definition using the more complex dimensions.
sam define project --defname=ace_project --group=groupa --dim="(run_number 100930 data_tier digitized) minus physical_datastream_name electron+jet"

Project definition created with Id: 2160

Optional: create a snapshot.
sam create project snapshot --defname=ace_project

Snapshot Id: 2181 Version: 1

You may omit this step and create a "new" snapshot on the fly when starting the actual project.

Later, you may find it useful to compare your analysis results with the current physics data, by verifying the original project snapshot.

sam verify project snapshot --defname='ace_project_old'

Defaulting to the latest snapshot version of 2 for the project definition.
Project Snapshot file differences:
- ALL_076151_04.IGOR_01
- ALL_076151_32.IGOR_01
- EXPRESS_076151_03.IGOR_01
- INSPILL_076151_01.IGOR_01
- EXPRESS_076151_02.IGOR_01

This means that since the last time the snapshot was created for the given definition, the set of available files has changed. If you were to create a project snapshot now, you would not get these five files.

Once you have created a dataset of interest, proceed to retrieving the dataset files: running the SAM projects.