SAM V2.1 introduces the new Dataset Definition design. In this design, Dataset Definition is synonymous with the old term Project Definition, while Dataset is synonymous with the old term Project Snapshot. A Dataset Definition is a set of dimensions and constraints that specify the files of interest. A Dataset is a point in time view of the files that meet the criteria of the Dataset Definition.
SAM Dataset Definitions are redesigned to eliminate 2 of the major limitations of the first implementation. First, the dataset definitions must be indepenedent of the database structures. No raw SQL statements are stored for the definitions. Second, dataset definitions must allow for both the declaration of simple projects and more complex projects. Simple projects must be simple to create, while more complex projects may require more knowledge of the dataset definition constructs. Complex projects must include the basic set logic of union, intersection and minus. In other words, the ability to define a project in terms of "this set of files" or "this set" (union), "this set" and "this set" (intersection), and "this set" minus "this set".
In addition to these changes, the new SAM Dataset Definitions will provide users with more flexible ways to query existing dataset definitions. Tools are provided that allow you to peruse dataset definitions by group, by person, or even by the conditions used to define the dataset definition. Additionally, the new keyword concept allows you to define your own physics oriented keywords for each data file and define and search dataset definitions by these keywords.To achieve these goals, a few new constructs are introduced in the dataset definition constraint specification language. First, Dimensions are used to define named searchable concepts. These concepts usually map directly back to specific database columns, but may map back to more complex structures. The resolution of the named dimension back to the database structure is of no concern to anyone but the SAM database server developers. Second, Constraints will continue to be specified in the syntax used for the older project definition constraint design, allowing you to simply supply the value and the comparison operator is determined for you. However, constraints may also include the comparison operator. Finally, the complete set of dimensions and constraints is stored in a manner that allows users to peruse existing dataset definitions by the conditions used to define them.
Dataset definitions may be defined using either a new GUI interface, a command line utility, or a Python API. The GUI provides a spreadsheet style oriented interface for dimension/constraint entries, which are then mapped into the RPN notation for constraint storage. The command line entry of dimensions and constraints will require an understanding of the either a Post-Fix or RPN notation.
Detailed descriptions of all Dimensions, Constraints, and the syntax for specifying constriants are included below. Also included are design descriptions of the new GUI and the command line interface.
Dimensions
Dimensions are modelled using the Dimension concept popular in data warehousing circles. The entire SAM meta-data schema can be thought of as one big warehouse, which has many queryable dimensions made available to the users. Each of the available dimensions is specified by the users without respect to any table structure knowledge. The database server developers maintain the dimension to database table mappings, ensuring properly tuned queries are produced by the various allowed combinations of dimensions.
The original implementation of the SAMDbServer included a small allowable subset of dimensions as specified in ProjectUser.idl's Constraint struct. The new implementation provides those and a few more dimensions. It also allows the SAM DbServer developers to simply add more dimensions as needed, with a variable list of dimensions recorded in the new Db2Dataset.idl interface section noted below.
Original Constraint struct from ProjectUser.idl:
struct Constraint {
string runNumber;
string eventNumber;
string dataTierType;
string physicalDatastreamName;
string logicalDatastreamName;
string physicalDatasetName;
string processFamily;
string processFamilyVersion;
string fileList;
};
New Dimensions from Db2Dataset.idl:
// Get a list of available dimensions for querying.
typedef sequence<string> Strings;
Strings getAvailableDimensions();
// Dimensions passed as input.
typedef sequence<Dimension> Dimensions;
struct Dimension {
string dimName;
string dimOperator;
string dimValue;
string dimSetOperator; //rpn style
};
//
The mapping of dimensions to database queries may be done by hand in the SAM DbServer's code, or may involve the use of a new tool. Currently a tool dubbed the Ant Colony, produced by Geppetto's Workshop is being reviewed for this purpose. This tool allows for a data model to be defined pictorially, with dimensions, facts and metrics made available for querying. Resolving the dimensions into SQL queries is done by a dynamic table selection algorithm which reads rules defined for the data mapping to ensure that only valid, tuned SQL queries are produced. The initial version of the Dataset Definition query resolution relies on a home-grown version of an Ant/Dimension query builder. As this code runs up against limitations, or the Ant Colony code provides major new features, the underlying engine may easily be switched out, without affecting the end user interfaces.
Dimension Naming
Some original thoughts on dimension naming rules which may change as we proceed are included here. Well, they don't really fit anywhere else in this document yet, so here they are. The dimension name must be a valid dimension name in your list of allowable dimensions. While it would be possible to allow dimension names to contain spaces, SAM is implementing a limited dimension name convention, allowing no spaces in the dimension names. Dimension names will begin with a character from A-Z, will contain only A-Z,_,$ characters, and will be case insensitive. The initial Dimension names will all be derived from their source column names. This is possible for the initial set of Dimensions, as they do not include any real complex dimensions. As more complex dimensions are introduced, the naming conventions may need to be revisited.
Constraints
Constraints are specified for a dimension in order to limit the resulting rows. The initial SAM constraint language will remain intact, allowing for the use of the special characters that make entry of constraint a littler simpler than their corresponding SQL operators. For example, the following tricks remain the same.
| Constraint | Resolves to |
|---|---|
| 1,14,567 | in (1,14,567) |
| 14-21 | between 14 and 21 |
| 44,50-70 | = 44 or between 50 and 70 |
| 44,45,1-8 | in (44,45) or between 1 and 8 |
| %ztautau% | like '%ztautau%' |
| ztautau.101 | = 'ztautau.101' |
But, if you want to be more precise, you can always specify the full comparison operator syntax. For example, you could enter the full "Resolves to" constraint as shown in table 1. Allowable constraint language syntax rules are detailed in table 2. The first column shows the full comparison operator syntax, while the second column shows the short-hand notation that is allowed. Note the details regarding the short-hand in column 3.
| Full Constraint Syntax | Short Hand | Details |
|---|---|---|
| = 1 | 1 | No quotes needed for numbers. |
| = 'a' | a | Quotes needed for full notation but not for shorthand. Shorthand could also be 'a'. Single quotes are optional for shorthand, but required when the shorthand text contains spaces. |
| like 'a%' | a% | Presence of a wildcard (%) implies like operator in shorthand. Same quote rules as above. |
| in (4,5) | 4,5 | Comma separated shorthand implies or predicate. |
| in ('a','b') | a,b | Comma separated shorthand implies or predicate. Same quote rules apply for character data. |
| between 1 and 8 | 1-8 | Dash in shorthand implies between operator. For between, notice that the low value must preceed the high value. Also, while between is allowed for characters it is only recommended for use with numeric constraints. As such, the dash shorthand only applies to numeric dimensions. |
| != 22 | !22 | Only a minor shorthand, but may be easier for users. |
| not in (12,24,48) | !(12,24,48) | Exclamation point preceding parenthesis singles a not in clause. |
| > 34 | n/a | No shorthand available. |
| >= 67 | n/a | No shorthand available. |
| < 51 | n/a | No shorthand available. |
| <= 40 | n/a | No shorthand available. |
| not like 'qid%' | n/a | No shorthand available. |
| is null | n/a | No shorthand available. |
| is not null | n/a | No shorthand available. |
Reverse Polish Notation
The Dimension/Constraint/Operator syntax will be stored in the database in a Reverse Polish Notation (RPN) syntax. End users will be allows to specify dimension/constraint logic in either RPN syntax or in a SQL-like, in-fix notation. The RPN logic for dimension/constraint set definitions is based on the premise of many of today's modern calculators. For more information on RPN calculators, see one of the various sites referenced in table 3.
| RPN References |
|---|
| http://www.hpmuseum.org/rpn.htm
http://users.ece.gatech.edu/~mleach/revpol/ http://www.whisqu.se/per/docs/math35.htm |
Each argument in the RPN syntax is a dimension/constraint combination of the format:
dimensionName [constraintOperator] constraintValue
The dimension name follows the naming rules outlined above in the section on dimensions. The constraint operator is optional. Both the constraint operator and constraint value must follow the constraint notation rules outlined in the constraints section.
Each of these dimension/constraint arguments are then treated just as a number in the RPN calculator examples, but unlike the calculators, SAM only allows 3 set functions: and, or, minus. Some examples are noted below.
|
Example Constraint 1 file_name %ztautau% file_name %zfigfig% or resolves to a forward SQL-ish constraint of... file_name like '%ztautau%' or file_name like '%zfigfig%' |
|
Example Constraint 2 file_name %ztautau% run_number 14,23 and data_tier digitized or resolves to... (file_name like '%ztautau%' and run_number in (14,23)) or data_tier = 'digitized' |
|
Example Constraint 3 run_number 1,33,100-199 file_name foo%,bar% and physical_datastream jet and physical_datastream electron+muon+jet data_tier raw and minus resolves to... (((run_number in (1,33) or run_number between 100 and 199) and (file_name like 'foo%' or file_name like 'bar%')) and physical_datastream = 'jet') minus (physical_datastream = 'electron+muon+jet' and data_tier = 'raw') |
|
Example Constraint 4 project_name my-favorite-project1 project_name my-favorite-project1 delivered_status !delivered and minus resolves to... project_name = 'my-favorite-project1' minus (project_name = 'my-favorite-project1' and delivered_status != 'delivered') |
Existing Dataset Definitions as Constraints
When defining constraints, you may also specify an existing dataset definition as a constraint. The definition will store a string of "Existing Set" for the dimension name, and the name of the set as the constraint value. This will be interpreted by the dimension resolution algorithm and will use that existing set's definition in place of the so named dimension.
Finding Existing Definitions
Existing dataset definitions can be found in many ways. You can traverse through the definitions that you have created. You can traverse through all definitions created for your group. Or, you can use the new keywords to find all definitions that include a keyword of interest. You can also peruse definitions based on the dimensions used to define them, searching for existing dataset definitions that used the same search criteria that you are using.
In addition, a search utility will be available, allowing you to search for the various conditions, instead of just traversing through voluminous lists of definitions.
GUI Dataset Definition Tool
The GUI for creating dataset definitions consists of two portions, a section for finding dataset definitions and another for defining and viewing them. See, the prototype html version of the GUI. This prototype was prepared in PHP on another machine, with just a sample of the final, raw html pasted in this area for review by the sam-design team.
The left frame depicts a search area which lets you traverse through dataset definitions in a hierarchical fashion. The hierarchy actually navigates you through various common ways that you might want to search dataset definitions: by the group that created them, by the person who create them, by the dimensions they contain, or by the keywords they contain. Dimensions are a new concept introduced in this design. Keywords are a new concept requested by D0 Users. It is also envisioned that the find area on the left will include buttons for searching on various categories.
Then, when you finally find the dataset definition that you want to review, you click it to display it in the Dataset Definition Detail window on the right. The top section of the detail frame is a spreadsheet style interface allowing you to enter the most complex dimension/constraint arguments in the most simple manner. The interface allows you to specify multiple dimensions as rows. The rows are combined using and predicates, while the columns are combined using or predicates. The complex combination of sets is accomplished by denoting a set predicate of the three: and (intersect), or (union), minus. You can combine pre-defined sets by selecting a dimension name of Existing Set and then entering the dataset definition name in the Constraint Value field.
All the the RPN examples noted in the section earlier can easily be mapped to the new GUI interface as shown in the following examples.
| Interface Example for Constraint Example 1 | Dimension | Constraint Value | or | |||
|---|---|---|---|---|---|---|
| file_name | %ztautau% | |||||
| file_name | %zfigfig% | |||||
| Interface Example for Constraint Example 2 | Dimension | Constraint Value | or | |||
| file_names | %ztautau% | |||||
| run_number | 14-23 | |||||
| set: or | ||||||
| data_tier | digitized | |||||
| Interface Example for Constraint Example 3 | Dimension | Constraint Value | or | |||
| run_number | 1,33,100-199 | |||||
| file_name | foo%,bar% | |||||
| physical_datastream | jet | |||||
| set: minus | ||||||
| physical_datastream | electron+muon+jet | |||||
| data_tier | raw | |||||
| Interface Example for Constraint Example 4 | Dimension | Constraint Value | or | |||
| project_name | my-favorite-project1 | |||||
| set: minus | ||||||
| project_name | my-favorite-project1 | |||||
| delivered_status | !delivered | |||||
The implementation platform for the GUI is still to be determined. The main consideration is that the tool choosen work across the various environments used by the majority of SAM users. That means a broadly available utility is required, which may limit it to the HTTP model and server side tools such as CGI/Python, Java Server Pages, Java Servlets, PHP, or other. Given that they need to communicate with the CORBA database server, the Java solutions are being considered as the first choice for implementation.
Command Line Interface
A command line will be supported which allows you to submit your definitions in full RPN or InFix notation. Both will be supported. The command line will look a like lot the current "sam create dataset definition", only it will allow a value for one of the 2 new parameters, --rpn or --infix, and will not accept --sqlfrom, --sqlwhere. Any of the other old constraints of --filename, --runnum, --eventnum, etc. will be accepted and simply and'ed together with any other constraint values. The sam command line will map these original constraints to their new dimensions/constraints.
Supporting Database Structure
In the new design, we are simply storing the dimension/constraint pairs in their RPN syntax. This requires only minor changes to the database structure for dataset definitions, with 2 new tables being added, PROJECT_DEFINITION_DIMENSIONS and DIMENSIONS. The original PROJECT_DEFINITIONS table can be used as is, but with the 3 columns PROJ_DEF_TABLES, PROJ_DEF_WHERE, and PROJ_DEF_CONSTRAINT no longer used (note, they will no longer be used later, but are needed now for the conversion of the constraints to the old format to the new - all but PROJ_DEF_CONSTRAINT, which is already obsolete). The new tables required are shown in table 5 below.
The ideal goal is to change the names so that the terms "dataset definition" and "project snapshot" can be replaced by those more familiar to physicists, "dataset definition" and "dataset". As a start towards that movement, the views noted in table 6 below will be included now. Code can now be written that refers to these new names of DATASETS, DATASET_DEFINITIONS, and DATASET_DIMENSIONS.
PROJECT_DEFINITION_DIMENSIONS
Name Null? Type Constraint
------------------------------- -------- -------------- ----------
PROJ_DEF_DIM_ID NOT NULL NUMBER(38) PK
PROJ_DEF_ID NOT NULL NUMBER(38) FK
DIMENSION_NAME NOT NULL VARCHAR2(50) FK
DIMENSION_IS_DATASET NOT NULL NUMBER(1,0) 0/1 binary
CREATE_DATE NOT NULL DATE
CREATE_USER NOT NULL VARCHAR2(50)
CONSTRAINT_VALUE NOT NULL CLOB
RPN_PREDICATE VARCHAR2(50)
UPDATE_DATE DATE
UPDATE_USER VARCHAR2(50)
DIMENSIONS
Name Null? Type Constraint
------------------------------- -------- -------------- ----------
DIMENSION_NAME NOT NULL VARCHAR2(50) PK
CREATE_DATE NOT NULL DATE
CREATE_USER NOT NULL VARCHAR2(50)
DATA_FILE_KEYWORDS
Name Null? Type Constraint
------------------------------- -------- -------------- ----------
FILE_ID NOT NULL NUMBER(38) PK+, FK
KEYWORD_TYPE_ID NOT NULL NUMBER(38) PK+, FK
KEYWORD_VALUE NOT NULL VARCHAR2(4000)
CREATE_DATE NOT NULL DATE
CREATE_USER NOT NULL VARCHAR2(50)
UPDATE_DATE DATE
UPDATE_USER VARCHAR2(50)
KEYWORD_TYPES
Name Null? Type Constraint
------------------------------- -------- -------------- ----------
KEYWORD_TYPE_ID NOT NULL NUMBER(38) PK
KEYWORD_TYPE NOT NULL VARCHAR2(50)
CREATE_DATE NOT NULL DATE
CREATE_USER NOT NULL VARCHAR2(50)
UPDATE_DATE DATE
UPDATE_USER VARCHAR2(50)
New Tables
Table 5
DATASET_DEFINITIONS (one-for-one view of PROJECT_DEFINITIONS, minus columns)
Name in View Different Column in Source Table
------------------------------- ----------------------------------
DATASET_DEF_ID PROJ_DEF_ID
WORK_GRP_ID
PERSON_ID
DATASET_DEF_NAME PROJ_DEF_NAME
CREATE_DATE
CREATE_USER
UPDATE_DATE
UPDATE_USER
DATASET_DEF_DESC PROJ_DEF_DESC
DATASET_DIMENSIONS (one-for-one view of PROJECT_DEFINITION_DIMENSIONS)
Name in View Different Column in Source Table
------------------------------- ----------------------------------
DATASET_DIM_ID PROJ_DEF_DIM_ID
DATASET_DEF_ID PROJ_DEF_ID
DIMENSION_NAME
DIMENSION_IS_DATASET
CONSTRAINT_VALUE
CREATE_DATE
CREATE_USER
RPN_PREDICATE
UPDATE_DATE
UPDATE_USER
DATASETS (one-for-one view of PROJECT_SNAPSHOTS)
Name in View Different Column in Source Table
------------------------------- ----------------------------------
DATASET_ID PROJ_SNAP_ID
DATASET_DEF_ID PROJ_DEF_ID
DATASET_VERSION PROJ_SNAP_VERSION
WORK_GRP_ID
PERSON_ID
CREATE_DATE
CREATE_USER
UPDATE_DATE
UPDATE_USER
DATASET_DESC PROJ_SNAP_DESC
Views Using New Names
Table 6
Then, with the above views, if we eventually decided to go to the new names altogether, we need only change the tables so that PROJECT_DEFINITIONS is replaced by DATASET_DEFINITIONS and PROJECT_SNAPSHOTS is replaced by DATATSETS. Also, we would then want to change the following column names in the tables noted below in table 7, to keep consistent naming conventions.
PROJECT_FILES (Rename the columns noted to the right)
Name Null? Type
------------------------------- -------- ----
FILE_NUMBER NOT NULL NUMBER(38)
PROJ_SNAP_ID NOT NULL NUMBER(38) => DATASET_ID
FILE_ID NOT NULL NUMBER(38)
CREATE_DATE NOT NULL DATE
CREATE_USER NOT NULL VARCHAR2(50)
UPDATE_DATE DATE
UPDATE_USER VARCHAR2(50)
CONSUMED_FILES (Rename the columns noted to the right)
Name Null? Type
------------------------------- -------- ----
PROJ_SNAP_ID NOT NULL NUMBER(38) => DATASET_ID
FILE_NUMBER NOT NULL NUMBER(38)
PROCESS_ID NOT NULL NUMBER(38)
CONSUMER_ID NOT NULL NUMBER(38)
CONSUMED_FILE_STATUS NOT NULL VARCHAR2(25)
CREATE_DATE NOT NULL DATE
CREATE_USER NOT NULL VARCHAR2(50)
UPDATE_DATE DATE
UPDATE_USER VARCHAR2(50)
ANALYSIS_PROJECTS (Rename the columns noted to the right)
Name Null? Type
------------------------------- -------- ----
PROJECT_ID NOT NULL NUMBER(38)
STATION_ID NOT NULL NUMBER(38)
WORK_GRP_ID NOT NULL NUMBER(38)
PERSON_ID NOT NULL NUMBER(38)
PROJ_SNAP_ID NOT NULL NUMBER(38) => DATASET_ID
PROJECT_STATUS NOT NULL VARCHAR2(25)
PROJECT_NAME NOT NULL VARCHAR2(1024)
CREATE_DATE NOT NULL DATE
CREATE_USER NOT NULL VARCHAR2(50)
PROJECT_DESC VARCHAR2(4000)
START_TIME DATE
END_TIME DATE
UPDATE_DATE DATE
UPDATE_USER VARCHAR2(50)
PROJECT_PROCESS_ID NUMBER(38)
NODE_ID NUMBER(38)
Modified Tables if Later Change to Only New Names
Table 7
Data Conversion
The data conversion requires a conversion from the original storing of raw SQL from/where clauses in the Project_Definitions table to the storing of dimensions/constraints/predicates in the Project_Definition_Dimensions table. This conversion can be done in 2 steps. The first step will utilize the new convertInFixtoRPN method that is introduced in the ProjectUser.idl interface. This conversion method will convert the InFix SQL wheres in Proj_Def_Where to RPN type format and store them as is in the Project_Definition_Dimensions table, with the actual table.column name combinations stored in the Dimension_Name column, and the complete where comparison stored in the Constraint_Value column. Then, a pass must be made to convert all Dimension_Names to the corresponding new Dimension Name that is required. Since we will be using a dynamic table selection mapping, the Proj_Def_Tables column need not be converted.
During development of the new dataset definition interface, the original methods can run as is. They will simply add more data to convert from the original SQL from/where format to the new dimension format.