| Entry | Owner | Short description of the problem | Status | Package | SAM version | Date added | Date finished |
|---|---|---|---|---|---|---|---|
| new station features  | Andrew, Sinisa  | Initial delivery unit, limit on the cache each project can use on each node, intra-station transfer preferred locations. Igor asks the question: what problems are we trying to solve? 1. User starts up and abandons a project. a. Project limit exceeded. b. Files are locked. 2. User starts O(10^3) files per project. a. Requires info about files from the database. b. Locking of many files and/or prefetch of a lot of files. 3. On the farm a project is started, too many files are prefetched, and locked files create a potential deadlock condition. 4. One user starts too many projects. a. Project limit is exceeded. b. Only a limited number of user jobs can run simultaneously in the batch queue, and it blocks everyone else. 5. Sam start project and consumer are separated by a long time. Need a way to use the local batch queue to maintain projects until ready to run. Fixes: 1. Stop a project where files are cached but consumers do not show up. Need to check if jobs are scheduled - ????? 2. Limit prefetch. a. Do not create all requests at once. New dbserver interface that gets info for fetching N files. b. Do not initiate too many transfers. Deliver all "worthy" units. Have the optimizer create smaller units. Limit is per project per node! The optimizer now deals only with enstore (optimizes tape usage); it needs to deal with stations and the stations' state. Can change the optimizer code so it does not have to worry about any given station breaking optimizer rules. 3. Limit locking for any given project. Goal for the next 2 weeks: Andrew will finish 1 completely, also 2a. Sinisa will do the db server part. 2b is done; Sinisa will change the optimizer. This will be done by Mar 25, tested, and ready to be put on d0mino during the April 2 scheduled downtime.  |   | sam_station, Sam_optimizer  | 4.0.0.?  | 03/12/02  |   |
| Dzero-sam "initiative"  | Lee, Mark Sosebee, DCD, Dzero  | For identified d0 sites, benchmark network performance, establish and test working sam stations, systematically move data to these sites and measure performance and bottlenecks. Working with Networking (DCD) group to get started with tools developed by IEPM project (SLAC).  |   |   |   | 02/11/02  |   |
| farm proxy  | Sinisa  | Need to explore and develop a proxy server (or other solution) to enable running sam on distributed systems on a private network, e.g. behind a switch or firewall. Explore IP tunneling and VPNs w/ networking. Part of support for site autonomy.  |   |   |   | 02/11/02  |   |
| pick non-raw events  | Matt  | Fix dimensions to enable picking of non-raw events. Need to fix parentage for files. Involves schema changes if we want to use a denormalized approach.  |   |   |   | 02/05/02  |   |
| de-centralized name service  | Andrew (D0 grid)  | Additional infrastructure to de-centralize station operation. This means a station would be more self-sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to the FNAL central database system.  |   | sam_nameserver  |   | 02/12/02  |   |
| de-centralized stations  | Andrew (D0 grid)  | Additional infrastructure to de-centralize station operation. This means a station would be more self-sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to the FNAL central database system.  |   | sam_station, Sam_db_server  |   | 02/12/02  |   |
| d0mino backend  | Chris, Jason Allen  | bring up sam station to manage new d0mino backend compute servers. Run the station server software on d0mino, requires some changes to project since home areas for jobs in the queue will be on linux and for  |   |   |   | 03/12/02  |   |
| debug cache algo  | Sinisa  | There are indications that the station caching algorithm is not working as desired. Needs to be debugged and fixed if this is true.  |   |   |   | 03/19/02  |   |
| reduce log file  | Sinisa  | Need to open a new sam log file every day. Can use sam log class, or configure so this will work.  |   | sam_log  |   | 03/12/02  |   |
| D0 support  | Lauri + others  |   | ongoing  | NA  | NA  |   |   |
| sam user  | Lauri  | Take over sam_user with Carmenita. Clean up commands and error messages, add new commands.  | ongoing  | sam_user  | NA  |   |   |
| builds  | Lauri  | standardizing builds. requires a lot of design which is the bulk of the job. Benefit is anyone can build any piece of sam easily.  | evolutionary  |   |   |   |   |
| get run  | Lauri  | sam get run command -(needs further specification)  |   | sam_user, sam_db_server  |   |   |   |
| archive logs  | Sinisa, Lee, Lauri  | archiving of log files (waiting on sam on sun). Need stager and encp only, could have station running on other node. Sinisa will try to build stager on SUN, Lee will get encp for sun (seems to be available). Lauri will do final set up. Try on Ora3. Use central analysis station with only stager running on ora1 and ora3.  |   |   |   |   |   |
| sam-at-a-glance  | Lauri, Diana  | Improve sam-at-a-glance so it runs on ora1 and provides more up-to-date information. May require sam user to run on Sun OS, or convert to use the name service status info (just ping the stations instead of sam dump). Need to add additional information to the database: 1. known down, and 2. monitoring level: high, medium, and low availability systems (see Lauri's mail describing this in detail).  |   |   |   |   |   |
| unit tests  | Lauri, Chris  | Produce unit tests for sam user interface. Tied to the sam parser task.  |   | sam_user  | v4.1  |   |   |
| clued0  | Chris + Sinisa  | Continue testing of distributed sam on Clued0. Includes implementing the batch system and load testing with an additional desktop node included.  | ongoing  |   | v4.0.0.3  |   |   |
| file-status  | Lauri, Steve, Diana, Matt  | Add crummy file status and needed support features. Could use a more enduring name, like unofficial or suspect. Matt's second priority. Needs response to Matt's mail from 11/13/01. Held brainstorming session Thurs Jan 17, Diana wrote notes. Storing of 'crummy' half-finished files - proposal on how to use status of file. Investigation of what code would need to change in sam store (or whether it is just a little samadmin command you are allowed to do right after the store has succeeded). Investigate how to deal with --resubmit, which wants to overwrite a crummy file - needs to call another samadmin command to first delete the file in pnfs space. Additional thought and discussion indicates that the way we use the current file status is incorrect, and some current statuses should be moved to file@location status. Additional statuses discussed at d0 include: incomplete, obsolete, superseded, user-added, unofficial. There may be more or others.  | in design  |   | v4.1  |   |   |
| app_family + param type/name/value  | Steve, Lauri, Carmenita, Diana  | Link application name/version with MC param type/name/value to provide a way to record generalized processing attributes. Need to know the name, and possibly attributes, of the top level RCP.  | needs design  | sam_user, sam_db_server, sam_db  | v4.1  |   |   |
| Documentation  |   | Look through documentation and fix problems. Need sam quick reference page to replace the quick start guide, which is obsolete. sam get metadata, list definition --keywords, sam create dataset --keyword???, sam run project, sam submit may have problems, mc runjob new metadata, auto dest "sam store --descrip=...", add new phase needs to be documented. Need to document metadata for luminosity and archive files, sam batch commands, psusp, files not delivered, python api, new dimensions and examples. Translation of status block. sam toonl should be documented. Sam station starting options through sam_bootstrap startup. New flags need to be documented. Questions about groups need to be answered in documentation.  |   | sam_doc  |   |   |   |
| omniORB.py  | Steve  | Continue to understand issues of omniORB.py use with sam. Steve to provide detailed list of work to be done. Steve will produce list for discussion 12/04/2001. Steve has made some progress and can describe where he feels the problems are 1/28/2002.  | Needs to be written up  |   |   |   |   |
| autodest  | (Carmenita), Heidi  | Autodestination with processed files needs to be resolved: bug in the server in constructing the path, pulling info from the parent that it should not.  | done, needs test on farms  | sam_user, sam_db_server  | v3.2  |   |   |
| get num copies  | Carmenita  | Get the number of copies for each file from the sam database. Need to decide where this is kept in sam.  | Need ping from online  | sam_user  | ?  |   |   |
| file_family  | Carmenita  | Add code to sam autodest so that the proposed path string uses an optional entry for "file_family=..." appended to the stream field. This has been requested by Gerry for the online direction of files to tape. Still some debate, but will provide flexibility for streaming decisions to be made later.  | need ping from online  | sam_user  | ?  |   |   |
| samadmin  | Lauri, Diana  | mark entire station as down, also might want node down, station down, fss down.  | not critical  | sam_admin  |   |   |   |
| Task list formatter  |   | complete tasklist formatting script  |   | sam_shift_tools  |   |   |   |
| sam manager  | Sinisa  | Possible sam_manager work that may be needed. Pingable client. Check restart option works with --CPID on command line. Also desire to reuse Gabriele's api for ROOT. Gabriele might be able to do this.  | eventually, not high priority  | sam_manager  | v3.2.1  |   |   |
| x-fer Monitor  | Sinisa, John, Diana  | Work to upgrade the backend of the SC2001 info gathering scripts to load information into the new oracle tables using dcoracle. Maybe some changes to sam_admin tools for mining log files. May also want to break log files daily to avoid long processing times to extract information. Need to have intra-station transfers included as well as extra-station. John needs to build the oracle tables; some design needed, though some preliminary work done.  |   |   |   |   |   |
| Restart  |   | Need to be able to recover projects after station crash. 1. application must be restartable, 2. batch system must coordinate with projects, 3. projects are restarted.  |   |   |   |   |   |
| d0mino-sam  | lauri  | Add ability for remotely-initiated transfers to use d0mino-sam dedicated interface on d0mino. Do not believe this involves any mods to bbftp.  |   | sam_cp  |   |   |   |
| data routing,  | Sinisa  | Igor calls "global data replica work". Need design for ultimate file routing. May include incorporation of FSS into station server which brings other important features like fss cache management and persistency. Refer to Igor's email concerning the topic. Igor sent mail on Mon, 07 Jan 2002 16:46.  |   | sam_station  |   |   |   |
| db upkeep  | diana  | continue upkeep and monitoring of d0 db instances  |   |   |   |   |   |
| Q management  |   | batch queue management and restrictions to hold a single user to a limited number of jobs  | deferred  |   |   |   |   |
| Helpdesk Followup  | Lauri, Lee  | Need to follow up on HD tickets assigned to sam, resolve them, and close them out  |   |   | ongoing  |   |   |
| TH upgrades  | Chris  | Improve test harness to reflect behaviour more consistent with central-analysis. For example, need simulated users to kill their jobs in the middle, and need many tens of thousands of small files cached and reused many times. This will test the station revival more completely.  |   |   |   |   |   |
| FRH 7.1 on SAM cluster  | Chris, Operations group  | Need to install RH7.1 on SAM cluster  |   |   |   |   |   |
| SAM CDF  | Sinisa (or other)  | split sam_config, and sam_boot_strap so can run completely independent db_servers, naming_service, optimizer, and data logger for SAM deployments other than D0.  |   |   |   |   |   |
| Pick events design  |   | design for pick events using existing sam tools, and additional features for pooling requests, caching events, and cataloging.  |   |   |   |   |   |
| Vicky's list  |   | List from Vicky from November. Known issues/operations/testing stuff: a) clueD0 and other linux stations, strange things with PM, **done**; b) restarts - are they working? Tell the users how to do it; c) writing out root-tuples at end of input file - tell users how to do it - Jim K was going to write a mail about this - root-tuple writer package needs to catch the framework 'event' that the input file has been closed, just like sam_manager catches it; d) remote stations getting files from tape via their own stager, need to test, **done**; e) stken, need to test, **done**; f) routing and use of Gb interfaces - needs more discussion and a written understanding of what we are going to do; g) sam submit - not allowing users to run in Farm-like mode; h) testing from Nikhef - running analysis project on d0mino to use files from SARA robot. Also the inverse - running a project there and pulling files from d0mino with bbftp.  | working through list  |   |   |   |   |
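
Illustration for the "new station features" entry, fixes 2a and 2b: a minimal Python sketch of the two limits described there, creating requests N files at a time instead of all at once, and capping transfers per project per node. This is not sam code. The names `batched` and `PrefetchLimiter`, and the idea of counting in-flight transfers by a (project, node) key, are assumptions made only for this example.

```python
from collections import defaultdict
from itertools import islice


def batched(file_names, n):
    """Yield successive lists of at most n file names (fix 2a: do not
    create all requests at once). Hypothetical helper, not a sam API."""
    it = iter(file_names)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


class PrefetchLimiter:
    """Counts in-flight transfers per (project, node) pair (fix 2b).

    Hypothetical class for illustration; the limit is per project per node.
    """

    def __init__(self, limit):
        self.limit = limit
        self.in_flight = defaultdict(int)

    def try_start(self, project, node):
        """Return True and count the transfer if the pair is under its limit."""
        key = (project, node)
        if self.in_flight[key] >= self.limit:
            return False
        self.in_flight[key] += 1
        return True

    def finish(self, project, node):
        """Release one slot for the pair once a transfer completes."""
        key = (project, node)
        if self.in_flight[key] > 0:
            self.in_flight[key] -= 1
```

With a limit of 2, a third `try_start("projA", "node1")` is refused until `finish("projA", "node1")` is called, while `("projA", "node2")` is counted separately.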