MetaData HowTo: Basic Version

When you store a file with SAM, you store some metadata with it so that you can find it later. A filename must be unique across all of SAM, in all places and at all times, so if you choose "test" as your filename you will most likely get an error. Choose a filename that is unique, usually by adding the unix time stamp and your station name or location. Never search for files by filename with wildcards: that is what the metadata are for.
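
As a minimal sketch of the naming advice above, the following helper appends a station name and the unix time stamp to a base name. The function name, the argument names and the exact layout of the result are assumptions made for illustration; SAM itself only requires that the final name be unique.

```python
import time


def unique_filename(base, station, extension="root", now=None):
    """Return '<base>-<station>-<unix time>.<extension>'.

    The layout is a hypothetical convention, not something SAM prescribes.
    'now' lets a caller pass a fixed unix time; by default the clock is read.
    """
    timestamp = int(time.time() if now is None else now)
    return f"{base}-{station}-{timestamp}.{extension}"


# 'mystation' is a placeholder; use your own station name or location.
print(unique_filename("rs-1ev-test", "mystation"))
```

Two files made in the same second on the same station would still collide, so a job or sequence number may be worth adding to the base name.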

The minimalist version of a metadata file contains the program name, version, number of events, time produced, your name, where it was produced, the type of run, the group, stream and some descriptive text that may be the same for your private dataset. You can also put a reference to some web page where additional information on the dataset is kept.
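
Before declaring a file it can help to check that none of the descriptive fields has been forgotten. The sketch below does that for the 'Global' category. The list of required keys is an assumption drawn from the key names used in the example in this section, and the helper name is hypothetical; SAM may enforce a different set.

```python
# Keys assumed to be required in the 'Global' category (an illustration only).
REQUIRED_GLOBAL_KEYS = (
    "ProducedByName",
    "OriginName",
    "FacilityName",
    "ProducedForName",
    "RunType",
    "GroupName",
    "Stream",
    "Description",
)


def missing_global_keys(params):
    """Return the assumed-required 'Global' keys that are absent or empty."""
    global_params = params.get("Global", {})
    return [key for key in REQUIRED_GLOBAL_KEYS if not global_params.get(key)]


params = {
    "Global": {
        "ProducedByName": "mrenna",
        "OriginName": "fermilab",
        "RunType": "Monte Carlo",
        "GroupName": "cdf",
        "Stream": "m",
        "Description": "test mc",
    },
}
print(missing_global_keys(params))
```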

An example is shown here. Although it says Monte Carlo and Generator, this is only a temporary kludge that you must use for all files, so use it even for a real data file or ntuple. The values are left here to illustrate that parameters can be anything. More documentation about metadata can be found at: http://cdfdb-prd.fnal.gov/sam_user_pyapi/examples/

from import_classes import *

appfamily=AppFamily('generator', '1.00', 'generator')
filename = 'rs-1ev-test-031106-1910.root'
t = SAMMCFile(filename,Events(1, 2, 2),
"generated",
appfamily,
"01/21/2003 10:59:09",
"01/21/2003 11:20:08",
18,
{
'Global':
{ 'ProducedByName':'mrenna',
  'OriginName':'fermilab',
  'Phase':'unspecified',
  'FacilityName':'fixed-target-farm',
  'ProducedForName':'mrenna',
  'RunType':'Monte Carlo',
  'GroupName':'cdf',
  'Stream':'m', 
  'Description':'test mc',
},
'CDF':
{ 'DataSet':'stink2',
  'html':'http://cepa.fnal.gov/personal/mrenna/',
},
'Generated' :{
'AppFamily':'generator',
'FirstEvent':'1',
'AppVersion':'1.00',
'LastEvent':'2',
'NumRecords':'2',
'AppName':'generator',
'TotalEvents':'2',
'RunNumber':54321,}
}
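)

A longer version of the same declaration adds placeholder values for the generator-specific parameter categories (Pythia, Herwig, Alpgen, Madgraph):
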
from import_classes import *

appfamily=AppFamily('generator', '1.00', 'generator')
filename = 'rs-1ev-test-031106-1910.root'
t = SAMMCFile(filename,Events(1, 2, 2),
"generated",
appfamily,
"01/21/2003 10:59:09",
"01/21/2003 11:20:08",
18,
{
'Global':
{ 'ProducedByName':'mrenna',
  'OriginName':'fermilab',
  'Phase':'unspecified',
  'FacilityName':'fixed-target-farm',
  'ProducedForName':'mrenna',
  'RunType':'Monte Carlo',
  'GroupName':'cdf',
  'Stream':'m', 
  'Description':'test mc',
},
'CDF':
{ 'DataSet':'stink2',
  'html':'http://cepa.fnal.gov/personal/mrenna/',
},
'Pythia':
{ 'cdfrelease':'testpycdfrelease',
      'collider':'testpycollider', 
    'comments':'testpycomments',
    'decaytable':'testpydecaytable',
    'energy':'testpyenergy', 
    'et_jet_cut':'testpyet_jet_cut', 
    'fact_scale':'testpyfact_scale', 
    'lamqcd5':'testpylamqcd5', 
    'numrecords':'testpynumrecords', 
    'partons':'testpypartons',
    'pdf':'testpypdf', 
    'physicsprocess':'testpyphysicsprocess',
    'picobarns':'testpypicobarns',    
    'qcd_order':'testpyqcd_order',    
    'qcd_power':'testpyqcd_power',    
    'qed_order':'testpyqed_order',    
    'qed_power':'testpyqed_power',    
    'ranseed1':'testpyranseed1',     
    'ranseed2':'testpyranseed2',     
    'renorm_scale':'testpyrenorm_scale', 
    'runnumber':'testpyrunnumber',  
    'useevtgen':'testpyuseevtgen',    
    'useqq':'testpyuseqq',
    'validated':'testpyvalidated',
    'version':'testpyversion',      
    'webpage':'testpywebpage',
},
'Herwig' :
{  'cdfrelease':'testhercdfrelease',
   'collider':'testhercollider',      
    'comments':'testhercomments',      
    'decaytable':'testherdecaytable',    
    'energy':'testherenergy',      
    'et_jet_cut':'testheret_jet_cut',    
    'fact_scale':'testherfact_scale',    
    'lamqcd5':'testherlamqcd5',       
    'numrecords':'testhernumrecords',    
    'partons':'testherpartons',       
    'pdf':'testherpdf',           
    'physicsprocess':'testherphysicsprocess',
    'picobarns':'testherpicobarns',     
    'qcd_order':'testherqcd_order',     
    'qcd_power':'testherqcd_power',     
    'qed_order':'testherqed_order',    
    'qed_power':'testherqed_power',     
    'ranseed1':'testherranseed1',      
    'ranseed2':'testherranseed2',      
    'renorm_scale':'testherrenorm_scale',  
    'runnumber':'testherrunnumber',     
    'validated':'testhervalidated',
    'version':'testherversion',       
    'webpage':'testherwebpage',        
},
'Alpgen' :{
    'collider':'testalpcollider',
    'comments':'testalpcomments',      
    'dr_jj_cut':'testalpdr_jj_cut',     
    'dr_lj_cut':'testalpdr_lj_cut',     
    'energy':'testalpenergy',        
    'et_jet_cut':'testalpet_jet_cut',    
    'et_lep_cut':'testalpet_lep_cut',    
    'fact_scale':'testalpfact_scale',    
    'lamqcd5':'testalplamqcd5',        
    'll_mass_cut':'testalpll_mass_cut',    
    'numrecords':'testalpnumrecords',     
    'partons':'testalppartons',        
    'pdf':'testalppdf',           
    'physicsprocess':'testalpphysicsprocess', 
    'picobarns':'testalppicobarns',      
    'qcd_order':'testalpqcd_order',      
    'qcd_power':'testalpqcd_power',      
    'qed_order':'testalpqed_order',      
    'qed_power':'testalpqed_power',      
    'ranseed1':'testalpranseed1',       
    'ranseed2':'testalpranseed2',      
    'renorm_scale':'testalprenorm_scale',   
    'runnumber':'testalprunnumber',      
    'validated':'testalpvalidated',
    'version':'testalpversion',        
    'webpage':'testalpwebpage',        
    'weight':'testalpweight',      
    },
'Madgraph' :{
    'collider':'testmadcollider',         
    'comments':'testmadcomments',         
    'dr_jj_cut':'testmaddr_jj_cut',        
    'dr_lj_cut':'testmaddr_lj_cut',        
    'energy':'testmadenergy',         
    'et_jet_cut':'testmadet_jet_cut',       
    'et_lep_cut':'testmadet_lep_cut',       
    'fact_scale':'testmadfact_scale',       
    'lamqcd5':'testmadlamqcd5',          
    'll_mass_cut':'testmadll_mass_cut',      
    'numrecords':'testmadnumrecords',       
    'partons':'testmadpartons',          
    'pdf':'testmadpdf',              
    'physicsprocess':'testmadphysicsprocess',   
    'picobarns':'testmadpicobarns',        
    'qcd_order':'testmadqcd_order',        
    'qcd_power':'testmadqcd_power',        
    'qed_order':'testmadqed_order',        
    'qed_power':'testmadqed_power',       
    'ranseed1':'testmadranseed1',        
    'ranseed2':'testmadranseed2',         
    'renorm_scale':'testmadrenorm_scale',     
    'runnumber':'testmadrunnumber',
    'validated':'testmadvalidated',
    'version':'testmadversion',
    'webpage':'testmadwebpage',          
    'weight':'testmadweight',                
},
'Generated' :{
'AppFamily':'generator',
'FirstEvent':'1',
'AppVersion':'1.00',
'LastEvent':'2',
'NumRecords':'2',
'AppName':'generator',
'TotalEvents':'2',
'RunNumber':54321,}
}
)

When you have stored metadata for a file, you can retrieve it with the command:

sam get metadata --file=<myfile>
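
If you want to call this from a script, a thin wrapper is enough. This is a sketch only: the function names are hypothetical, the output is returned as raw text without any parsing, and the wildcard check is added here to follow the advice at the top of this page rather than because sam requires it.

```python
import subprocess


def metadata_command(filename):
    """Build the argument list for 'sam get metadata --file=<filename>'."""
    if not filename or any(ch in filename for ch in "*?"):
        raise ValueError("give one exact filename, not a wildcard pattern")
    return ["sam", "get", "metadata", f"--file={filename}"]


def get_metadata(filename):
    """Run the command and return its text output (needs a working sam client)."""
    result = subprocess.run(
        metadata_command(filename),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if sam reports a failure
    )
    return result.stdout


print(" ".join(metadata_command("rs-1ev-test-031106-1910.root")))
```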



Datasets Explained: SAM Datasets, CDF DataSets, Datasets, Project Snapshots

Unfortunately the word "dataset" is heavily overloaded in the data handling world. Furthermore, in the deep dark history of SAM, a change was made in syntax, so some references use different words that mean the same thing.

First, a cdf dataset corresponds to the more modern (especially in Grid) concept of a "data collection": a group of files that have common properties. In our implementation of SAM in CDF, we have maintained this as a parameter, although more sophisticated ways of handling it are being hammered out.

A SAM dataset definition corresponds to a selection of files meeting some criteria set by a variety of parameters that describe them based on the declarations made in the metadata.

A very simple way through the morass is to use the parameter cdf.dataset in defining a sam dataset and that is the end of the story. Using more sophisticated combinations of parameters requires care that one has specified the collection of files uniquely. Tools exist to allow you to examine files you care to inspect, but this is indeed a complex operation.
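
The idea of a definition as a selection on declared parameters can be pictured with an in-memory stand-in. Nothing here talks to SAM: the file names, the second parameter and the select_files helper are all hypothetical, and only the parameter name cdf.dataset and the dataset names come from this page.

```python
def select_files(files, criteria):
    """Return the names of files whose parameters match every criterion."""
    return sorted(
        name
        for name, params in files.items()
        if all(params.get(key) == value for key, value in criteria.items())
    )


# Hypothetical declared parameters for three files.
files = {
    "file_a": {"cdf.dataset": "jbot0h", "stream": "m"},
    "file_b": {"cdf.dataset": "jbot0h", "stream": "n"},
    "file_c": {"cdf.dataset": "jbot1h", "stream": "m"},
}

# One parameter identifies the collection unambiguously.
print(select_files(files, {"cdf.dataset": "jbot0h"}))

# Combining parameters narrows the selection, which is where care is needed.
print(select_files(files, {"cdf.dataset": "jbot0h", "stream": "m"}))
```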

Once a dataset definition is made, it can be used to specify the files to be delivered to a project. When that delivery has been done, sam keeps a permanent record of the project that was run, so it is always possible to go back and find out which files were used. Within sam this record is called a "dataset" or a "project snapshot".

When a dataset definition is made, it is possible to lock it in immediately with "sam create dataset". Once this is done the definition is frozen. This is useful if you want to make sure that your definition is not modified by someone else!



SAM Datasets: Figuring out what parameters are defined for each file, their values and getting access to metadata

Information on parameters and metadata for datasets is available in Randy's browser under the "sam" reports. It shows up in the pulldown menus as the two reports listed below, each with a description of how to use it and what it does.

  • SAM:File Parameter Names by Dataset

    One is presented with two fields to fill:

    One will want to fill the first with a favorite project definition name. This example uses jbot0h.

    The second is not useful until one has done the query or knows something about one's snapshots. Snapshots and datasets are described in this document as well.

    If one submits one's request after giving a dataset definition, one obtains the following (the example for jbot0h is instructive). For every snapshot, one sees what parameters there are and how many times they have different values for all the files. One does NOT see all the files and the parameter next to each one. That would be information overload at this point. The columns obtained are:

    Here is a more detailed description of each column

  • SAM:File Parameter Values by Dataset

    In this case, one is presented the same fields to fill in as before. This time, for the example, jbot0h is chosen and the latest snapshot is entered (28).

    The resulting report may be found at this URL, where the following columns are returned:

    The only difference from the report above is that the value is given. For the case where occurrences are the same for all files, only the single value appears.

    This effectively gives you a way to recover the value of parameters for a dataset definition and, inasmuch as we match sam dataset definitions in cdf to cdf datasets, this gives the parameters defining a dataset.

    Also, for some parameters that have meaning, Randy has hyperlinked them to more detailed information. Clicking on jbot0h in this example case will give a page of all the files that exist in that dataset, with details on the location of every file in every cache and a basic dump of all metadata for each file. Files are listed by default 125 at a time.
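
To make the difference between the two reports concrete, here is a small in-memory sketch of the kind of summary they give: one counts how many distinct values each parameter takes across the files, the other shows the values themselves. This is an assumed reading of the reports, not the browser's actual query; the file names and the runnumber parameter are hypothetical.

```python
from collections import Counter, defaultdict


def value_counts(file_params):
    """Map each parameter name to a Counter of its values across all files."""
    counts = defaultdict(Counter)
    for params in file_params.values():
        for name, value in params.items():
            counts[name][value] += 1
    return counts


def names_report(file_params):
    """Parameter name -> number of distinct values (values not shown)."""
    return {name: len(c) for name, c in value_counts(file_params).items()}


def values_report(file_params):
    """Parameter name -> {value: number of files carrying that value}."""
    return {name: dict(c) for name, c in value_counts(file_params).items()}


# Hypothetical per-file parameters for one snapshot.
files = {
    "file_a": {"cdf.dataset": "jbot0h", "runnumber": 1001},
    "file_b": {"cdf.dataset": "jbot0h", "runnumber": 1002},
    "file_c": {"cdf.dataset": "jbot0h", "runnumber": 1002},
}

print(names_report(files))
print(values_report(files))
```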



    File Availability Status and Dimension Queries

    What does file availability status mean? It means that there is AT LEAST ONE accessible location for the file:

    If there are one or more locations, and at least one of them is considered "good", then the file is available.

    What was it used for? It was used because:

    Users were confused when they'd do "translate constraints" and see a list of N files, then run their project and be delivered only M files (M<N), because the constraints they provided did not account for "files with locations" (aka 'file_availability_status'='available').

    In the past, this column was set to 'available' if the file ever received a location, but it was not maintained after that point.

    Another file status value is FILE_CONTENT_STATUS, which is inherent to a file (including any and all replicas) and is a global judgement that the integrity of the file itself is good or bad. This is not a reflection of the quality of the physics contents.

    FILE_CONTENT_STATUS is a function of the file itself.

    FILE_AVAILABILITY_STATUS is a function of all valid sam_locations which might contain copies of the file; the file is available if at least one replica is accessible.
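
The availability rule, and the N versus M surprise it was meant to remove, can be sketched as follows. The location records and their "good" flag are hypothetical stand-ins for whatever SAM keeps internally; only the "at least one good location" rule comes from the text above.

```python
def is_available(locations):
    """A file is available if at least one of its locations is good."""
    return any(location["good"] for location in locations)


def deliverable(files):
    """Return the names of the files that could actually be delivered."""
    return sorted(name for name, locations in files.items() if is_available(locations))


# Hypothetical files matched by a set of constraints, with their locations.
files = {
    "file_a": [{"where": "tape", "good": True}],
    "file_b": [{"where": "cache1", "good": False}, {"where": "cache2", "good": True}],
    "file_c": [{"where": "cache1", "good": False}],
    "file_d": [],  # declared, but never given a location
}

matched = len(files)                 # N: what the constraints select
delivered = len(deliverable(files))  # M: what a project can receive
print(matched, delivered)
```

Here the constraints match four files but only two can be delivered, which is the gap that filtering on availability closes.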



    Sam Locate

    You can find every location of a file using the sam locate command. We can use one of the files that came from the sam translate example to show how this works:
    [stdenis@nglas05 ~]$ sam locate sm-store-test1.root
    
    which yields
    ['/pnfs/cdfen/filesets/SM/SMTest,ia3937']
    
    Here is a more interesting one (a common file used in all tests, so it gets around):
    [stdenis@nglas05 ~]$ sam locate gb01defd.0001exo0
    
    yielding
    ['/pnfs/cdfen/filesets/GI/GI05/GI0500/GI0500.0,ia3638', 
    'ncdf68.fnal.gov:/scratch/sam/cache1/boo', 
    'nglas09.fnal.gov:/cdf/scratch/sam/cache/cdfyale/prd/boo', 
    'lf7.ph.gla.ac.uk:/localhome/sam/cache1/boo', 
    'tuhept.phy.tufts.edu:/home/sam/cache1/boo', 
    'cdf3.uchicago.edu:/cdf/data3a/boo', 
    'nglas07.fnal.gov:/cdf/scratch/sam/pro/boo', 
    'nglas08.fnal.gov:/cdf/scratch/sam/prd/boo', 
    'testwulf.hpcc.ttu.edu:/home/sam/cache1/prd/boo',
    'matrix.physics.ox.ac.uk:/eweak/disk1/sam/boo', 
    'cdfg.ph.gla.ac.uk:/data3/sam/prd/boo', 
    'cdf001.ucsd.edu:/cdf/data01/cdf001/cdf-sam/cache/boo', 
    'nglas03.fnal.gov:/data3/sam/prd/boo', 
    'nglas10.fnal.gov:/data/nglas10/a/sam/prd/boo',
    'nglas05.fnal.gov:/data3/sam/pro/boo',
    'nglas04.fnal.gov:/data3/sam/pro/boo',
    'nglas06.fnal.gov:/data1/sam/prd/boo', 
    'fcdfdata016.fnal.gov:/data1/cdf-sam/cache/cdf-scotgrid-2/prd/boo', 
    'pccdf2.ts.infn.it:/cdf3/sam_cache/boo', 
    'dcap://cdfdca-door01:dcap://cdfdca.fnal.gov:25125/pnfs/fnal.gov/usr//cdfen/filesets/GI/GI05/GI0500/GI0500.0',
    'dcap://cdfdca-door03:dcap://cdfdca.fnal.gov:25137/pnfs/fnal.gov/usr//cdfen/filesets/GI/GI05/GI0500/GI0500.0',
    'dcap://cdfdca-door02:dcap://cdfdca.fnal.gov:25136/pnfs/fnal.gov/usr//cdfen/filesets/GI/GI05/GI0500/GI0500.0',
    'dcap://cdfdca-door04:dcap://cdfdca.fnal.gov:25138/pnfs/fnal.gov/usr//cdfen/filesets/GI/GI05/GI0500/GI0500.0',
    'cdfsam.cnaf.infn.it:/cdf/data/data001/SAM-100GB-cache/boo']
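
The output shown above has the shape of a Python list of strings, so a script can read it back with ast.literal_eval. This sketch assumes the command always prints a valid list literal, which is an assumption based only on the two examples here. The three labels used to classify the entries are hypothetical and describe only the shape of each string.

```python
import ast


def parse_locate_output(text):
    """Turn the printed list from 'sam locate' into a list of strings."""
    locations = ast.literal_eval(text.strip())
    if not isinstance(locations, list):
        raise ValueError("expected a list of location strings")
    return locations


def kind(location):
    """Classify a location string by its shape (labels are hypothetical)."""
    if location.startswith("dcap://"):
        return "dcap"
    if location.startswith("/"):
        return "path"
    return "host"  # entries of the form 'hostname:/some/path'


# Three entries copied from the example output above.
sample = """['/pnfs/cdfen/filesets/GI/GI05/GI0500/GI0500.0,ia3638',
    'ncdf68.fnal.gov:/scratch/sam/cache1/boo',
    'dcap://cdfdca-door01:dcap://cdfdca.fnal.gov:25125/pnfs/fnal.gov/usr//cdfen/filesets/GI/GI05/GI0500/GI0500.0']"""

for location in parse_locate_output(sample):
    print(kind(location), location)
```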
    



    Define a Dataset that combines others

    
    sam translate constraints --dim="__SET__ jbot0h or __SET__ jbot1h"
    
    will give the files that are in either dataset. (Warning: the keyword is case sensitive, so OR does not work.) For example,
    sam translate constraints --dim="__SET__ jbot0h"
    
    gives 690 files and the same with jbot1h gives 41, so the "or" gives 731.
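
A script that combines several definitions can build the dimension string instead of typing it. The helper name is hypothetical; the __SET__ syntax and the lowercase "or" are taken from the command above. The second half is plain set arithmetic on made-up file names: the counts add up to exactly 731, which is what a union gives when the two datasets share no files.

```python
def union_dimension(*definition_names):
    """Join dataset definition names with a lowercase 'or' (OR does not work)."""
    if not definition_names:
        raise ValueError("give at least one dataset definition name")
    return " or ".join(f"__SET__ {name}" for name in definition_names)


dims = union_dimension("jbot0h", "jbot1h")
print(f'sam translate constraints --dim="{dims}"')

# Made-up file names standing in for the 690 and 41 files quoted above.
jbot0h_files = {f"jbot0h_{i}" for i in range(690)}
jbot1h_files = {f"jbot1h_{i}" for i in range(41)}
print(len(jbot0h_files | jbot1h_files))
```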
