Last update: Mon 03/18/02 16:54
a_entry a_owner a_short_description_of_the_problem a_status package sam_vsn zdate_added zdate_finished
new station features  Andrew, Sinisa  initial delivery unit, limit on cache each project can use on each node, intrastation transfer preferred locations. Igor asks the question, what problems are we trying to solve? 1. User starts up and abandons a project. a. Project limit exceeded. b. Files are locked. 2. User starting O(10^3) files per project. a. Requires info about files from database. b. Locking of many files and/or prefetch of a lot of files. 3. On farm a project is started, too many files are pre-fetched, and locked files create a potential deadlock condition. 4. One user starts too many projects. a. Project limit is exceeded. b. Only a limited number of user jobs can run simultaneously in the batch queue and it blocks everyone else. 5. Sam start project and consumer are separated by a long time. Need way to use local batch queue to maintain projects until ready to run. Fixes: 1. Stop a project where files are cached but consumers do not show up. Need to check if jobs are scheduled - ????? 2. Limit pre-fetch. a. Do not create all requests at once. New dbserver interface that gets info for fetching N files. b. Do not initiate too many transfers. Deliver all 'worthy' units. Have optimizer create smaller units. Limit is per project per node! The optimizer now deals only with enstore (optimizes tape usage); it needs to deal with stations and stations' state. Can change the optimizer code so it does not have to worry about any given station breaking optimizer rules. 3. Limit locking for any given project. Goal for next 2 weeks: Andrew will finish 1 completely, also 2a. Sinisa will do the db server part. 2b is done, Sinisa will change the optimizer. This will be done by Mar 25, tested, ready to be put on d0mino during the April 2 scheduled downtime.    sam_station, Sam_optimizer  4.0.0.?  03/12/02   
Dzero-sam "initiative"  Lee, Mark Sosebee, DCD, Dzero  For identified d0 sites, benchmark network performance, establish and test working sam stations, systematically move data to these sites and measure performance and bottlenecks. Working with Networking (DCD) group to get started with tools developed by IEPM project (SLAC).        02/11/02   
farm proxy  Sinisa  Need to explore and develop a proxy server (or other solution) to enable running sam on distributed systems on a private network, e.g. behind a switch or firewall. Explore IP tunneling and VPNs w/ networking. Part of support for site autonomy.        02/11/02   
pick non-raw events  Matt  Fix dimensions to enable picking of not-raw events. Need to fix parentage for files. Involves schema changes if we want to use denormalized approach.        02/05/02   
de-centralized name service  Andrew (D0 grid)  Additional infrastructure to de-centralize station operation. This means a station would be more self-sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to the fnal central database system.    sam_nameserver    02/12/02   
               
de-centralized stations  Andrew (d0 grid)  Additional infrastructure to de-centralize station operation. This means a station would be more self-sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to the fnal central database system.    sam_station, Sam_db_server    02/12/02   
d0mino backend  Chris, Jason Allen  bring up sam station to manage new d0mino backend compute servers. Run the station server software on d0mino, requires some changes to project since home areas for jobs in the queue will be on linux and for        03/12/02   
debug cache algo  Sinisa  There are indications that the station caching algorithm is not working as desired. Needs to be debugged and fixed if this is true.        03/19/02   
reduce log file  Sinisa  Need to open a new sam log file every day. Can use sam log class, or configure so this will work.    sam_log    03/12/02   
D0 support  Lauri + others    ongoing  NA  NA     
sam user  Lauri  take over sam_user with Carmenita. Cleanup commands, error messages, adding new commands  ongoing  sam_user  NA     
builds  Lauri  standardizing builds. requires a lot of design which is the bulk of the job. Benefit is anyone can build any piece of sam easily.  evolutionary         
get run  Lauri  sam get run command -(needs further specification)    sam_user, sam_db_server       
archive logs  Sinisa, Lee, Lauri  archiving of log files (waiting on sam on sun). Need stager and encp only, could have station running on other node. Sinisa will try to build stager on SUN, Lee will get encp for sun (seems to be available). Lauri will do final set up. Try on Ora3. Use central analysis station with only stager running on ora1 and ora3.           
sam-at-a-glance  Lauri, Diana  Improve sam-at-a-glance so it runs on ora1 and provides more up-to-date information. May require sam user to run on Sun OS, or convert to use the name service status info (just ping the stations instead of sam dump). Need to add additional information to database: 1. known down, and 2. monitoring level: high, medium, and low availability systems (see Lauri's mail describing this in detail).           
unit tests  Lauri, Chris  Produce unit tests for sam user interface. Tied to the sam parser task.    sam_user  v4.1     
clued0  Chris + Sinisa  Continue testing of distributed sam on Clued0. Include implementing batch system and load testing with additional desktop node included.  ongoing    v4.0.0.3     
file-status  Lauri, Steve, Diana, Matt  Add crummy file status and needed support features. Could use more enduring name, like unofficial or suspect. Matt's second priority. Needs response to Matt's mail from 11/13/01. Held brainstorming session Thurs Jan 17, Diana wrote notes. Storing of 'crummy' half-finished files - proposal on how to use status of file. Investigation of what code would need to change in sam store (or whether it is just a little samadmin command you are allowed to do right after the store has succeeded). Investigate how to deal with --resubmit which wants to overwrite a crummy file - needs to call another samadmin command to first delete the file in pnfs space. Additional thought and discussion indicates that the way we use the current file status is incorrect, and some current statuses should be moved to file@location status. Additional statuses discussed at d0 include: incomplete, obsolete, superseded, user-added, unofficial. May be more or others.  in design    v4.1     
app_family + param type/name/value  Steve, Lauri, Carmenita, Diana  Link application name/version with MC param type/name/value to provide a way to record generalized processing attributes. Need to know the name, and possibly attributes, of the top level RCP.  needs design  sam_user, sam_db_server, sam_db  v4.1     
Documentation    Look through documentation and fix problems. Need sam quick reference page, to replace the quick start guide that is obsolete. sam get metadata, list definition --keywords, sam create dataset --keyword???, sam run project, sam submit may have problems, mc runjob new metadata, auto dest "sam store --descrip=...", add new phase needs to be documented. Need to document metadata for luminosity and archive files, sam batch commands, psusp, files not delivered, python api, new dimensions and examples. Translation of status block. sam toonl should be documented. Sam station starting options through sam_bootstrap startup. New flags need to be documented. Questions about groups need to be answered in documentation.    sam_doc       
omniORB.py  Steve  continue to understand issues of omniORB.py use with sam . Steve provide detailed list of work to be done. Steve will produce list for discussion 12/04/2001. Steve has made some progress and can describe where he feels the problems are 1/28/2002.  Needs to be written up         
autodest  (Carmenita), Heidi  autodestination with processed files needs to be resolved: bug in the server in constructing the path, pulling info from the parent that it should not.  done, needs test on farms  sam_user, sam_db_server  v3.2     
get num copies  Carmenita  get the number of copies for each file from the sam database; need to decide where this is kept in sam.  Need ping from online  sam_user     
file_family  Carmenita  Add code to sam autodest so that the proposed path string uses an optional entry for "file_family=..." appended to the stream field. This has been requested by Gerry for the online direction of files to tape. Still some debate, but will provide flexibility for streaming decisions to be made later.  need ping from online  sam_user     
samadmin  Lauri, Diana  mark entire station as down, also might want node down, station down, fss down.  not critical  sam_admin       
Task list formatter    complete tasklist formatting script    sam_shift_tools       
sam manager  Sinisa  possible sam_manager work that may be needed. Pingable client. Check restart option works with --CPID on command line. Also desire to reuse Gabriele's api for ROOT. Gabriele might be able to do this.  eventually, not high priority  sam_manager  v3.2.1     
x-fer Monitor  Sinisa, John, Diana  Work to upgrade the backend of the SC2001 info gathering scripts to load information into the new oracle tables using dcoracle. Maybe some changes to sam_admin tools for mining log files. May also want to break log files daily to avoid long processing times to extract information. Need to have intra-station transfers included as well as extra-station. John needs to build the oracle tables; some design needed though some preliminary work done.           
Restart    Need to be able to recover projects after station crash. 1. application must be restartable, 2. batch system must coordinate with projects, 3. projects are restarted.           
d0mino-sam  lauri  Add ability for remotely-initiated transfers to use d0mino-sam dedicated interface on d0mino. Do not believe this involves any mods to bbftp.    sam_cp       
data routing,  Sinisa  Igor calls "global data replica work". Need design for ultimate file routing. May include incorporation of FSS into station server which brings other important features like fss cache management and persistency. Refer to Igor's email concerning the topic. Igor sent mail on Mon, 07 Jan 2002 16:46.    sam_station       
db upkeep  diana  continue upkeep and monitoring of d0 db instances           
Q management    batch queue management and restrictions to hold a single user to limited no of jobs  deferred         
Helpdesk Followup  Lauri, Lee  Need to follow up HD tickets assigned to sam and resolve and closeout      ongoing     
TH upgrades  Chris  Improve test Harness to reflect behaviour more consistent with central-analysis. For example, need simulated users to kill their jobs in the middle, and need many 10's of thousands of small files cached and reused many times. This will test the station revival more completely.           
FRH 7.1 on SAM cluster  Chris, Operations group  Need to install RH7.1 on SAM cluster           
SAM CDF  Sinisa (or other)  split sam_config, and sam_boot_strap so can run completely independent db_servers, naming_service, optimizer, and data logger for SAM deployments other than D0.           
Pick events design    design for pick events using existing sam tools, and additional features for pooling requests, caching events, and cataloging.           
               
               
               
Vicky's list    list from Vicky from November. Known issues/operations/testing stuff: a) clueD0 and other linux stations strange things with PM, **done** b) restarts - are they working - tell the users how to do it, c) writing out root-tuples at end of input file - tell users how to do it - Jim K was going to write a mail about this - root-tuple writer package needs to catch framework 'event' that input file has been closed, just like sam_manager catches it, d) remote stations getting files through from tape via their own stager need to test, **done** e) stken need to test, **done** f) routing and use of Gb interfaces - needs more discussion and a written understanding of what we are going to do, g) sam submit - not allowing users to run in Farm-like mode, h) testing from Nikhef - running analysis project on d0mino to use files from SARA robot. Also the inverse - running project there and pulling files from d0mino with bbftp.  working through list
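
Note on the "new station features" entry (fixes 2a and 2b): the two pre-fetch limits described there could be sketched roughly as below. This is only an illustration in Python, not the sam_station or optimizer code. The names batches and PrefetchLimiter, and the idea of counting in-flight transfers in a dictionary, are assumptions made for the example.

```python
from collections import defaultdict
from itertools import islice


def batches(file_names, n):
    """Yield lists of at most n file names (fix 2a: do not create all
    requests at once). Hypothetical helper, not a dbserver interface."""
    it = iter(file_names)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


class PrefetchLimiter:
    """Counts in-flight transfers per (project, node) pair (fix 2b).
    Hypothetical class; the limit is per project per node."""

    def __init__(self, limit):
        self.limit = limit
        self.in_flight = defaultdict(int)

    def try_start(self, project, node):
        # Refuse a new transfer once this project has reached its
        # limit on this node; other nodes and projects are unaffected.
        key = (project, node)
        if self.in_flight[key] >= self.limit:
            return False
        self.in_flight[key] += 1
        return True

    def finish(self, project, node):
        # Release one slot when a transfer completes or is abandoned.
        key = (project, node)
        if self.in_flight[key] > 0:
            self.in_flight[key] -= 1
```

A station would ask for the next batch only when try_start succeeds, and call finish as each transfer ends, so an abandoned project cannot queue thousands of requests or transfers up front.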