SAM Dataset Definition Design

SAM V2.1 introduces the new Dataset Definition design. In this design, Dataset Definition is synonymous with the old term Project Definition, while Dataset is synonymous with the old term Project Snapshot. A Dataset Definition is a set of dimensions and constraints that specify the files of interest. A Dataset is a point in time view of the files that meet the criteria of the Dataset Definition.

SAM Dataset Definitions are redesigned to eliminate 2 of the major limitations of the first implementation. First, the dataset definitions must be indepenedent of the database structures. No raw SQL statements are stored for the definitions. Second, dataset definitions must allow for both the declaration of simple projects and more complex projects. Simple projects must be simple to create, while more complex projects may require more knowledge of the dataset definition constructs. Complex projects must include the basic set logic of union, intersection and minus. In other words, the ability to define a project in terms of "this set of files" or "this set" (union), "this set" and "this set" (intersection), and "this set" minus "this set".

In addition to these changes, the new SAM Dataset Definitions will provide users with more flexible ways to query existing dataset definitions. Tools are provided that allow you to peruse dataset definitions by group, by person, or even by the conditions used to define the dataset definition. Additionally, the new keyword concept allows you to define your own physics oriented keywords for each data file and define and search dataset definitions by these keywords.

To achieve these goals, a few new constructs are introduced in the dataset definition constraint specification language. First, Dimensions are used to define named searchable concepts. These concepts usually map directly back to specific database columns, but may map back to more complex structures. The resolution of the named dimension back to the database structure is of no concern to anyone but the SAM database server developers. Second, Constraints will continue to be specified in the syntax used for the older project definition constraint design, allowing you to simply supply the value and the comparison operator is determined for you. However, constraints may also include the comparison operator. Finally, the complete set of dimensions and constraints is stored in a manner that allows users to peruse existing dataset definitions by the conditions used to define them.

Dataset definitions may be defined using either a new GUI interface, a command line utility, or a Python API. The GUI provides a spreadsheet style oriented interface for dimension/constraint entries, which are then mapped into the RPN notation for constraint storage. The command line entry of dimensions and constraints will require an understanding of the either a Post-Fix or RPN notation.

Detailed descriptions of all Dimensions, Constraints, and the syntax for specifying constriants are included below. Also included are design descriptions of the new GUI and the command line interface.

Dimensions

Dimensions are modelled using the Dimension concept popular in data warehousing circles. The entire SAM meta-data schema can be thought of as one big warehouse, which has many queryable dimensions made available to the users. Each of the available dimensions is specified by the users without respect to any table structure knowledge. The database server developers maintain the dimension to database table mappings, ensuring properly tuned queries are produced by the various allowed combinations of dimensions.

The original implementation of the SAMDbServer included a small allowable subset of dimensions as specified in ProjectUser.idl's Constraint struct. The new implementation provides those and a few more dimensions. It also allows the SAM DbServer developers to simply add more dimensions as needed, with a variable list of dimensions recorded in the new Db2Dataset.idl interface section noted below.

  Original Constraint struct from ProjectUser.idl:
        struct Constraint {
         string runNumber;
         string eventNumber;
         string dataTierType;
         string physicalDatastreamName;
         string logicalDatastreamName;
         string physicalDatasetName;
         string processFamily;
         string processFamilyVersion;
         string fileList;
      };
    
  New Dimensions from Db2Dataset.idl:
      // Get a list of available dimensions for querying.
      typedef sequence<string> Strings;
      Strings getAvailableDimensions();
    
      // Dimensions passed as input.
      typedef sequence<Dimension> Dimensions;
      struct Dimension {
        string dimName;
        string dimOperator;
        string dimValue;
        string dimSetOperator; //rpn style
      };

      // 

The mapping of dimensions to database queries may be done by hand in the SAM DbServer's code, or may involve the use of a new tool. Currently a tool dubbed the Ant Colony, produced by Geppetto's Workshop is being reviewed for this purpose. This tool allows for a data model to be defined pictorially, with dimensions, facts and metrics made available for querying. Resolving the dimensions into SQL queries is done by a dynamic table selection algorithm which reads rules defined for the data mapping to ensure that only valid, tuned SQL queries are produced. The initial version of the Dataset Definition query resolution relies on a home-grown version of an Ant/Dimension query builder. As this code runs up against limitations, or the Ant Colony code provides major new features, the underlying engine may easily be switched out, without affecting the end user interfaces.

Dimension Naming

Some original thoughts on dimension naming rules which may change as we proceed are included here. Well, they don't really fit anywhere else in this document yet, so here they are. The dimension name must be a valid dimension name in your list of allowable dimensions. While it would be possible to allow dimension names to contain spaces, SAM is implementing a limited dimension name convention, allowing no spaces in the dimension names. Dimension names will begin with a character from A-Z, will contain only A-Z,_,$ characters, and will be case insensitive. The initial Dimension names will all be derived from their source column names. This is possible for the initial set of Dimensions, as they do not include any real complex dimensions. As more complex dimensions are introduced, the naming conventions may need to be revisited.

Constraints

Constraints are specified for a dimension in order to limit the resulting rows. The initial SAM constraint language will remain intact, allowing for the use of the special characters that make entry of constraint a littler simpler than their corresponding SQL operators. For example, the following tricks remain the same.

Constraint Examples
ConstraintResolves to
1,14,567 in (1,14,567)
14-21 between 14 and 21
44,50-70 = 44 or between 50 and 70
44,45,1-8 in (44,45) or between 1 and 8
%ztautau% like '%ztautau%'
ztautau.101= 'ztautau.101'
Table 1

But, if you want to be more precise, you can always specify the full comparison operator syntax. For example, you could enter the full "Resolves to" constraint as shown in table 1. Allowable constraint language syntax rules are detailed in table 2. The first column shows the full comparison operator syntax, while the second column shows the short-hand notation that is allowed. Note the details regarding the short-hand in column 3.

Constraint Short Hand
Full Constraint SyntaxShort HandDetails
= 11No quotes needed for numbers.
= 'a'aQuotes needed for full notation but not for shorthand. Shorthand could also be 'a'. Single quotes are optional for shorthand, but required when the shorthand text contains spaces.
like 'a%'a%Presence of a wildcard (%) implies like operator in shorthand. Same quote rules as above.
in (4,5)4,5Comma separated shorthand implies or predicate.
in ('a','b')a,bComma separated shorthand implies or predicate. Same quote rules apply for character data.
between 1 and 81-8Dash in shorthand implies between operator. For between, notice that the low value must preceed the high value. Also, while between is allowed for characters it is only recommended for use with numeric constraints. As such, the dash shorthand only applies to numeric dimensions.
!= 22!22Only a minor shorthand, but may be easier for users.
not in (12,24,48)!(12,24,48)Exclamation point preceding parenthesis singles a not in clause.
> 34n/aNo shorthand available.
>= 67n/aNo shorthand available.
< 51n/aNo shorthand available.
<= 40n/aNo shorthand available.
not like 'qid%'n/aNo shorthand available.
is nulln/aNo shorthand available.
is not nulln/aNo shorthand available.
Table 2

Reverse Polish Notation

The Dimension/Constraint/Operator syntax will be stored in the database in a Reverse Polish Notation (RPN) syntax. End users will be allows to specify dimension/constraint logic in either RPN syntax or in a SQL-like, in-fix notation. The RPN logic for dimension/constraint set definitions is based on the premise of many of today's modern calculators. For more information on RPN calculators, see one of the various sites referenced in table 3.

RPN References
http://www.hpmuseum.org/rpn.htm
http://users.ece.gatech.edu/~mleach/revpol/
http://www.whisqu.se/per/docs/math35.htm
Table 3

Each argument in the RPN syntax is a dimension/constraint combination of the format:

    dimensionName [constraintOperator] constraintValue

The dimension name follows the naming rules outlined above in the section on dimensions. The constraint operator is optional. Both the constraint operator and constraint value must follow the constraint notation rules outlined in the constraints section.

Each of these dimension/constraint arguments are then treated just as a number in the RPN calculator examples, but unlike the calculators, SAM only allows 3 set functions: and, or, minus. Some examples are noted below.

RPN Syntax Examples
Example Constraint 1
file_name %ztautau% file_name %zfigfig% or
resolves to a forward SQL-ish constraint of...
file_name like '%ztautau%' or file_name like '%zfigfig%'
Example Constraint 2
file_name %ztautau% run_number 14,23 and data_tier digitized or
resolves to...
(file_name like '%ztautau%' and run_number in (14,23)) or data_tier = 'digitized'
Example Constraint 3
run_number 1,33,100-199 file_name foo%,bar% and physical_datastream jet and physical_datastream electron+muon+jet data_tier raw and minus
resolves to...
(((run_number in (1,33) or run_number between 100 and 199) and (file_name like 'foo%' or file_name like 'bar%')) and physical_datastream = 'jet') minus (physical_datastream = 'electron+muon+jet' and data_tier = 'raw')
Example Constraint 4
project_name my-favorite-project1 project_name my-favorite-project1 delivered_status !delivered and minus
resolves to...
project_name = 'my-favorite-project1' minus (project_name = 'my-favorite-project1' and delivered_status != 'delivered')

Table 4

Existing Dataset Definitions as Constraints

When defining constraints, you may also specify an existing dataset definition as a constraint. The definition will store a string of "Existing Set" for the dimension name, and the name of the set as the constraint value. This will be interpreted by the dimension resolution algorithm and will use that existing set's definition in place of the so named dimension.

Finding Existing Definitions

Existing dataset definitions can be found in many ways. You can traverse through the definitions that you have created. You can traverse through all definitions created for your group. Or, you can use the new keywords to find all definitions that include a keyword of interest. You can also peruse definitions based on the dimensions used to define them, searching for existing dataset definitions that used the same search criteria that you are using.

In addition, a search utility will be available, allowing you to search for the various conditions, instead of just traversing through voluminous lists of definitions.

GUI Dataset Definition Tool

The GUI for creating dataset definitions consists of two portions, a section for finding dataset definitions and another for defining and viewing them. See, the prototype html version of the GUI. This prototype was prepared in PHP on another machine, with just a sample of the final, raw html pasted in this area for review by the sam-design team.

The left frame depicts a search area which lets you traverse through dataset definitions in a hierarchical fashion. The hierarchy actually navigates you through various common ways that you might want to search dataset definitions: by the group that created them, by the person who create them, by the dimensions they contain, or by the keywords they contain. Dimensions are a new concept introduced in this design. Keywords are a new concept requested by D0 Users. It is also envisioned that the find area on the left will include buttons for searching on various categories.

Then, when you finally find the dataset definition that you want to review, you click it to display it in the Dataset Definition Detail window on the right. The top section of the detail frame is a spreadsheet style interface allowing you to enter the most complex dimension/constraint arguments in the most simple manner. The interface allows you to specify multiple dimensions as rows. The rows are combined using and predicates, while the columns are combined using or predicates. The complex combination of sets is accomplished by denoting a set predicate of the three: and (intersect), or (union), minus. You can combine pre-defined sets by selecting a dimension name of Existing Set and then entering the dataset definition name in the Constraint Value field.

All the the RPN examples noted in the section earlier can easily be mapped to the new GUI interface as shown in the following examples.

GUI Example Constraints
Interface Example for Constraint Example 1
DimensionConstraint Valueor
file_name%ztautau% 
file_name %zfigfig%
Interface Example for Constraint Example 2
DimensionConstraint Valueor
file_names%ztautau% 
run_number14-23 
set: or  
data_tierdigitized 
Interface Example for Constraint Example 3
DimensionConstraint Valueor
run_number1,33,100-199 
file_namefoo%,bar% 
physical_datastreamjet 
set: minus  
physical_datastreamelectron+muon+jet 
data_tierraw 
Interface Example for Constraint Example 4
DimensionConstraint Valueor
project_namemy-favorite-project1 
set: minus  
project_namemy-favorite-project1 
delivered_status!delivered 
Diagram 2

The implementation platform for the GUI is still to be determined. The main consideration is that the tool choosen work across the various environments used by the majority of SAM users. That means a broadly available utility is required, which may limit it to the HTTP model and server side tools such as CGI/Python, Java Server Pages, Java Servlets, PHP, or other. Given that they need to communicate with the CORBA database server, the Java solutions are being considered as the first choice for implementation.

Command Line Interface

A command line will be supported which allows you to submit your definitions in full RPN or InFix notation. Both will be supported. The command line will look a like lot the current "sam create dataset definition", only it will allow a value for one of the 2 new parameters, --rpn or --infix, and will not accept --sqlfrom, --sqlwhere. Any of the other old constraints of --filename, --runnum, --eventnum, etc. will be accepted and simply and'ed together with any other constraint values. The sam command line will map these original constraints to their new dimensions/constraints.

Supporting Database Structure

In the new design, we are simply storing the dimension/constraint pairs in their RPN syntax. This requires only minor changes to the database structure for dataset definitions, with 2 new tables being added, PROJECT_DEFINITION_DIMENSIONS and DIMENSIONS. The original PROJECT_DEFINITIONS table can be used as is, but with the 3 columns PROJ_DEF_TABLES, PROJ_DEF_WHERE, and PROJ_DEF_CONSTRAINT no longer used (note, they will no longer be used later, but are needed now for the conversion of the constraints to the old format to the new - all but PROJ_DEF_CONSTRAINT, which is already obsolete). The new tables required are shown in table 5 below.

The ideal goal is to change the names so that the terms "dataset definition" and "project snapshot" can be replaced by those more familiar to physicists, "dataset definition" and "dataset". As a start towards that movement, the views noted in table 6 below will be included now. Code can now be written that refers to these new names of DATASETS, DATASET_DEFINITIONS, and DATASET_DIMENSIONS.

      PROJECT_DEFINITION_DIMENSIONS
      Name                            Null?    Type           Constraint
      ------------------------------- -------- -------------- ----------
      PROJ_DEF_DIM_ID                 NOT NULL NUMBER(38)     PK
      PROJ_DEF_ID                     NOT NULL NUMBER(38)     FK
      DIMENSION_NAME                  NOT NULL VARCHAR2(50)   FK
      DIMENSION_IS_DATASET            NOT NULL NUMBER(1,0)    0/1 binary
      CREATE_DATE                     NOT NULL DATE
      CREATE_USER                     NOT NULL VARCHAR2(50)
      CONSTRAINT_VALUE                NOT NULL CLOB
      RPN_PREDICATE                            VARCHAR2(50)
      UPDATE_DATE                              DATE
      UPDATE_USER                              VARCHAR2(50)

      DIMENSIONS
      Name                            Null?    Type           Constraint
      ------------------------------- -------- -------------- ----------
      DIMENSION_NAME                  NOT NULL VARCHAR2(50)   PK
      CREATE_DATE                     NOT NULL DATE
      CREATE_USER                     NOT NULL VARCHAR2(50)

      DATA_FILE_KEYWORDS
      Name                            Null?    Type           Constraint
      ------------------------------- -------- -------------- ----------
      FILE_ID                         NOT NULL NUMBER(38)     PK+, FK
      KEYWORD_TYPE_ID                 NOT NULL NUMBER(38)     PK+, FK
      KEYWORD_VALUE                   NOT NULL VARCHAR2(4000)
      CREATE_DATE                     NOT NULL DATE
      CREATE_USER                     NOT NULL VARCHAR2(50)
      UPDATE_DATE                              DATE
      UPDATE_USER                              VARCHAR2(50)

      KEYWORD_TYPES
      Name                            Null?    Type           Constraint
      ------------------------------- -------- -------------- ----------
      KEYWORD_TYPE_ID                 NOT NULL NUMBER(38)     PK
      KEYWORD_TYPE                    NOT NULL VARCHAR2(50)
      CREATE_DATE                     NOT NULL DATE
      CREATE_USER                     NOT NULL VARCHAR2(50)
      UPDATE_DATE                              DATE
      UPDATE_USER                              VARCHAR2(50)

                         New Tables
                          Table 5

      DATASET_DEFINITIONS (one-for-one view of PROJECT_DEFINITIONS, minus columns)
      Name in View                    Different Column in Source Table
      ------------------------------- ----------------------------------
      DATASET_DEF_ID                  PROJ_DEF_ID
      WORK_GRP_ID                     
      PERSON_ID                       
      DATASET_DEF_NAME                PROJ_DEF_NAME
      CREATE_DATE                     
      CREATE_USER                     
      UPDATE_DATE                     
      UPDATE_USER                     
      DATASET_DEF_DESC                PROJ_DEF_DESC

      DATASET_DIMENSIONS (one-for-one view of PROJECT_DEFINITION_DIMENSIONS)
      Name in View                    Different Column in Source Table
      ------------------------------- ----------------------------------
      DATASET_DIM_ID                  PROJ_DEF_DIM_ID
      DATASET_DEF_ID                  PROJ_DEF_ID
      DIMENSION_NAME                  
      DIMENSION_IS_DATASET            
      CONSTRAINT_VALUE                
      CREATE_DATE                     
      CREATE_USER                     
      RPN_PREDICATE                   
      UPDATE_DATE                     
      UPDATE_USER                     

      DATASETS (one-for-one view of PROJECT_SNAPSHOTS)
      Name in View                    Different Column in Source Table
      ------------------------------- ----------------------------------
      DATASET_ID                      PROJ_SNAP_ID
      DATASET_DEF_ID                  PROJ_DEF_ID
      DATASET_VERSION                 PROJ_SNAP_VERSION
      WORK_GRP_ID                     
      PERSON_ID                       
      CREATE_DATE                     
      CREATE_USER                     
      UPDATE_DATE                     
      UPDATE_USER                     
      DATASET_DESC                    PROJ_SNAP_DESC

                         Views Using New Names
                                Table 6

Then, with the above views, if we eventually decided to go to the new names altogether, we need only change the tables so that PROJECT_DEFINITIONS is replaced by DATASET_DEFINITIONS and PROJECT_SNAPSHOTS is replaced by DATATSETS. Also, we would then want to change the following column names in the tables noted below in table 7, to keep consistent naming conventions.

      PROJECT_FILES (Rename the columns noted to the right)
      Name                            Null?    Type
      ------------------------------- -------- ----
      FILE_NUMBER                     NOT NULL NUMBER(38)  
      PROJ_SNAP_ID                    NOT NULL NUMBER(38)  => DATASET_ID
      FILE_ID                         NOT NULL NUMBER(38)
      CREATE_DATE                     NOT NULL DATE
      CREATE_USER                     NOT NULL VARCHAR2(50)
      UPDATE_DATE                              DATE
      UPDATE_USER                              VARCHAR2(50)

      CONSUMED_FILES (Rename the columns noted to the right)
      Name                            Null?    Type
      ------------------------------- -------- ----
      PROJ_SNAP_ID                    NOT NULL NUMBER(38)  => DATASET_ID
      FILE_NUMBER                     NOT NULL NUMBER(38)
      PROCESS_ID                      NOT NULL NUMBER(38)
      CONSUMER_ID                     NOT NULL NUMBER(38)
      CONSUMED_FILE_STATUS            NOT NULL VARCHAR2(25)
      CREATE_DATE                     NOT NULL DATE
      CREATE_USER                     NOT NULL VARCHAR2(50)
      UPDATE_DATE                              DATE
      UPDATE_USER                              VARCHAR2(50)

      ANALYSIS_PROJECTS (Rename the columns noted to the right)
      Name                            Null?    Type
      ------------------------------- -------- ----
      PROJECT_ID                      NOT NULL NUMBER(38)
      STATION_ID                      NOT NULL NUMBER(38)
      WORK_GRP_ID                     NOT NULL NUMBER(38)
      PERSON_ID                       NOT NULL NUMBER(38)
      PROJ_SNAP_ID                    NOT NULL NUMBER(38)  => DATASET_ID
      PROJECT_STATUS                  NOT NULL VARCHAR2(25)
      PROJECT_NAME                    NOT NULL VARCHAR2(1024)
      CREATE_DATE                     NOT NULL DATE
      CREATE_USER                     NOT NULL VARCHAR2(50)
      PROJECT_DESC                             VARCHAR2(4000)
      START_TIME                               DATE
      END_TIME                                 DATE
      UPDATE_DATE                              DATE
      UPDATE_USER                              VARCHAR2(50)
      PROJECT_PROCESS_ID                       NUMBER(38)
      NODE_ID                                  NUMBER(38)

           Modified Tables if Later Change to Only New Names
                            Table 7

Data Conversion

The data conversion requires a conversion from the original storing of raw SQL from/where clauses in the Project_Definitions table to the storing of dimensions/constraints/predicates in the Project_Definition_Dimensions table. This conversion can be done in 2 steps. The first step will utilize the new convertInFixtoRPN method that is introduced in the ProjectUser.idl interface. This conversion method will convert the InFix SQL wheres in Proj_Def_Where to RPN type format and store them as is in the Project_Definition_Dimensions table, with the actual table.column name combinations stored in the Dimension_Name column, and the complete where comparison stored in the Constraint_Value column. Then, a pass must be made to convert all Dimension_Names to the corresponding new Dimension Name that is required. Since we will be using a dynamic table selection mapping, the Proj_Def_Tables column need not be converted.

During development of the new dataset definition interface, the original methods can run as is. They will simply add more data to convert from the original SQL from/where format to the new dimension format.


Matthew Vranicar
Last modified: Wed Aug 30 17:59:05 CDT 2000