Tasklist for V3.0 - SAM System for run start -- and for V4.0 and beyond..... 02/05/2001

(If you are going to edit this file please be aware of the notes).

 
Category of Task | WBS# | Task | SAM packages to be worked on | SAM Release | Estimate of amount of person weeks of work (from 01/15/2001) | Comments | Who? | Date Done task
Planned Major Features 10.2 Integration of SAM and batch system station
user
V3 4 weeks In progress Igor
  12.1 Resource management of disks station V4 or V5      
  3.12 Design of file merge/split/streams/luminosity/trigger and how they all fit together db V3 5 1 hour meetings in next 3 weeks + 5 hours * 5 people thinking;
(+ 2 weeks implementation see 10.11)
Must get the conceptual basis for file/run/stream/triggers/applications correct before we start the run. Incorporate needs from 22.10 Matt,
Julie,
Lee,
Vicky,
Heidi,
Igor
  10.3 Resource management - Optimizer crude version to order files in groups according to order on tape (also requires data in db to be fixed to store order on tape) station V3 2 weeks List. Also requires small changes to fss and db server to store tape location cookie (and to fix tape type). Depends on script to do initial load of location cookies. Sinisa
(Matt)
  12.2 Resource management for tape and network resource control and allocation station
user
V4 or V5   May require changes to sam submit Sinisa,
Igor
  12.2 Use of job and resource description language for Grid station

user

V5 4 months Integration with work done in context of PPDG Igor, 

TBD

  10.4 Sam_manager - fix to use new fss manager V3 10 days work list Sinisa
  ? Sam_manager - further enhancements including ability to ping client for status manager Few days Depends on D0 interactive framework using CORBA Sinisa
  12.3 Handling of pass-through write cache - using allocation of cache space by station and better routing info in db. station
db
db_server
V5 ~ 2+ weeks Requires supporting db and db server changes Igor,
Steve
  12.4 Pass-through of file through intermediate station cache to external station - on read of file station
db
db_server
V5   Will survive without this only because all raw and reconstructed data will be on d0mino cache and therefore accessible from there. If we can use bbftp and somewhat control the network interface used then that will also alleviate the need for WBS  
  11.1 or 12.4 Pick Events Server station V4 or V5 2 months    
  10.5 Sam_user command for pickevents - to locate file user V3 3 days Wrapper for query only returns file or files that event(s) are in
Sam -pickevents -evnum=
Sam -pickevents -evlist=filename
Lauri
  11.2 or 12.5 Pick Events access (framework and other user access) user
manager
V4 or V5 1 month    
  10.6 Station for Farm system - remove stationless mode station V3.1 2-3 weeks, combined with
10.7
Need to work with Farms group to figure out how to deal with disk cache and combine this work with enhancements needed for linux analysis clusters Igor
  10.7 Initial Station for multi-node linux analysis system station
user
admin
V3.1 2-3 weeks, combined with 10.6 Move data intra-station always and don't attempt much optimization in terms of where a job runs Igor
  11.3 Full station for multi-node linux analysis system - including folding in resource management of taking data to job or job to data and also looking at migration of jobs, and collection of results from parallelized analysis job.  station
user
admin
V4 Many weeks Combine with resource management and more general studies for the future of how all this works Igor
  10.8 Design of syntax for remote storage location and implementation of interface to different file transfer programs. Need to identify what MSS file is in, decide on namespaces, also way of identifying transfer tool to be used. Release for bbftp, other? Document station
db
db_server
V3 2 weeks This work is in progress with Lyon. For use depends on bbftp infrastructure in xxxx.
Not clear if it needs any changes to db or db server - merely conventions on names?
Igor,
Lyon people
  11.4 New Standard Information services

prototype

infoserver V4 1 month Demonstrate initial prototype of new infoserver and framework Sinisa
  12.7 Design and implement distributed information services and integrate into all components of SAM infoserver

all 

V5 3 months Full information about whole system Sinisa/Lauri
    Export of database (meta-data) to remote institutions + re-synch of metadata   V7+   only do this if really needed  
    Prompt reconstruction pipeline - ? open-ended datasets?   V7+      
               

Online: Development + Testing
and Operations
Plan, Feb 01 Implement logging of data from online twice - to mass store and then directly to d0mino private disk area or to 2 tapes in parallel.   V3 2-3 weeks Will use rcp for local transfer plus changes to online to use -copy=2, setup of autodestination map and testing
Lee,ODS D0online
  23.8
and Online
Event catalog - partitioned and filled by online with unique Run/Event number db V3 Few days Need to drop events table, add constraints for uniqueness and partition by something simple - like a manually controlled Event Partition number ??
Online need to exercise this and ensure that the new constraints on uniqueness of run + event number do not break something!
Julie,
Diana,
Lee
  22.10 Upload of Run and Run conditions information from online  user
db_server
V3 3 days Any needed changes to run table must be implemented in 3.12. Must be tested in good and error/delay conditions. Jeremy,
Carmenita
    Tests with database server down 
db_server
V3 2 days Julie
Carmenita,
ODS D0 online
  1.3 Python V2 for Online; user
db_server
V3 ? . ODS D0 Online
    Run with 5 streams   V3 1 day   ODS D0 online
    Run with 15 streams         ODS D0 online
    Run at expected Rate          
               

Small Developments
11.4 Thumbnail data design, file format, access strategy and implementation of any indicated changes to sam_manager or sam databases db
manager
V4 ???    
  11.5
12.6
ROOT objects and file formats - interface to ROOT? ? V4 or V5   D0 version of ROOT that gets its data from sam? What about ROOT trees that span a set of files.  
               
Bugs and potential bugs in SAM 10.1.1 Db server - nail all unknown exceptions db_server V3     Steve
  10.1.2 Sam_bootstrap - claim that stagers don't always restart on Farms  bootstrap V3   Need to reproduce this or catch it somehow before it can be investigated Lauri False claim - sam bootstrap improperly used
  10.1.3 Zombie projects - deal with/eliminate? station V3   Station cleanup code Igor
  10.1.4 Work around the fnorb enum bug if it cannot be fixed by a new version of fnorb user
station
V3   Need to ping fnorb developers again Carmenita
  10.1.5 Ensure all servers handle db server restart gracefully - online claimed fss did not one time test
_harness
V3   Do explicit tests. Set test harness event to kill db server Sinisa
  10.1.6 Error messages from sam_user must be fixed to give a meaningful error always   V3   While working in sam_user - catch as much as poss. Carmenita
  10.1.7 Impatient-end feature no longer working station V3   Reported by Heidi on farms. Igor
  10.1.8 The sam find file command gives the wrong instructions. Not a big deal
but not very user friendly either.

d0bbin> sam find file --filename=%psim01%ttbar
depreciated in sam v2.1, please use 'sam list file' should be sam list
files
d0bbin> sam list file --filename=%psim01%ttbar% 
Cannot list $2. Could list files or definitions. 
d0bbin> sam list files --filename=%psim01%ttbar% 

user V3     Carmenita
  10.1.9 Sam_user - store command looks for parameter file in wrong directory user V3     Carmenita Fixed
               
Dataset/
Project-editor
10.9 Effort to ensure robustness and clarity project_
editor
V3 Ongoing work Fix bugs as they arise - asap Matt
  10.9.1 On the farms we often submit a job and then want to pick up the files that
have arrived since then.

sam create dataset definition --dim="__set__ bigproject and minus __snap__
would be a nice feature.

project_
editor
V3   Requested by Heidi
This will be a very common mode of operation and we must ensure it works and is well documented
Matt
  10.9.2 Manipulation of actual datasets rather than dataset definitions project_
editor
V3   Mail on this from Igor - doesn't this actually work? Matt
  10.9.3 On the farms we often have an undifferentiated project with 800 files, we'd
like to split this up into smaller chunks. General users may wish to do
so as well. The ability to do

sam split project snapshot --num_files=200 which returns a set of snapshots of
given size would be very useful. One could, in principle, run the optimizer on
the request and do the splits in an optimal fashion for that point in time.

project_
editor
V3 ?   Requested by Heidi Matt
  10.9.4 Formal grammar for dataset definition language project_
editor
V3? Few days   Matt
               
Sam_user
enhancements
11.6.1 Samlock/unlock dataset user V4     Steve
  10.10.1 Fully test modes of file_client needed for online where run info and process info are passed in, not run id or process id. user V3     Carmenita
  10.10.2 Samstore - review command and metadata file format in general and deal with all file formats, module for Farms user V3     Carmenita
    Sam store - add a switch to allow metadata only, no file store. Also perhaps designate in metadata file itself? user V3 or V4   Done??  Carmenita
  10.10.3 Review and restructure package as necessary and remove all junk unused files user V3     Carmenita
  11.6.2 Sam command to give statistics on your analysis project - eventually to return luminosity also user V4      
  10.10.4 Use of standard command/parameters XML file to validate sam commands user V3     Vicky
  10.10.5 Much improved test suite needed user V3     Carmenita,
Vicky
  10.10.6 Python module for file_client needs documentation user V3     Carmenita
  10.10.7 Different levels of printout of status block - improved formatting user V3     Carmenita
    Review C++ code and see if it can go away. Compile all python code before distributing product user V3 or V4     Carmenita
               
Database V3.0 10.11.1 Filesplit/merge/trigger/luminosity interfaces Db
Db_server
User
manager
V3 ~ 2 weeks   Julie,
Matt
  11.7.1 Data structures for disk and network pipe resources db V4      
  11.7.2 Data structures for routing of files through station caches db V4      
  10.11.2 Additional attributes on files db
db_server
user
V3   Implement what is needed Julie, Matt
  10.11.3 MC production tables get into use db

db_server

V3   Greg's code + ? Julie,
Matt,
??
  10.11.4 Data structures for batch integration/resource benefits/weights db

db_server

V3     Matt
  10.11.5 Partitioning of several tables - ready for run db V3   Have to give it our best guess to start with Julie,
Diana
  10.11.6 Drop Events table and partition prior to putting event cataloging into production use. db V3     Diana Done - may be done again??
               
Sam_admin tools
and Diagnostics pages
  MC import file store scripts - improve and generalize for use offsite admin ?   Lee thinks these are too specific for d0mino and not generalizable. Now it is more verification. So users will have to write their own sam store scripts  
  10.12.1 Commands to add some of the supporting data - such as new application version, instead of Forms interface admin V3 2 days   Lauri
  10.12.2 Statistics on disk cache usage admin
db_server?
V3 3 days to 3 weeks depending on what we want to see   Lauri
  10.12.3 Tool to update file entries in database with their sortable location cookie on tape (required for Optimizer) admin V3 1 day   Sinisa
  10.12.4 Scripts to find all files that are in SAM and not in Enstore admin V3 2 days We have these scripts in test harness -> admin Sinisa,
Dehong
  10.12.5 Tool and cron job to run to synchronize Tape status between Enstore and SAM and to create reports on actions taken- web page on bad tapes, noaccess tapes. admin V3 1 day   Sinisa,
Lauri
  10.12.6 Mark volumes as NOACCESS or NOTALLOWED or REMOVED (for when WE recycle or lose a volume) admin
db_server
V3 1 week Db server needs to deal with and understand this Lauri,
Steve
    Better way to view log files of a large number of db servers       -> in db server work Steve,
Lauri
    Dump of infoserver statistics into individual web pages and line mode commands? admin ?   If we get time .... Lauri,
Sinisa
    Easier way of locating and viewing the main SAM log and info file - from the web and documentation of where the archived log files are - zipped or whatever.

Automated process to zip and unzip archived log files

admin V4 Few days   Lauri
    Make name server web page show actual alive servers - make sure all servers have base ping method in - cleanup web page periodically. admin V4   Can we clean up naming service itself too?  
  10.12.7 Enstore Statistics admin V3 2 weeks Maintenance when encp/sam/stdout change Sinisa Done
               
Sam_station servers
enhancements
10.8.1 Use Enstore header/text file of errors (and their retry profile) in eworker Station V3     Igor
    All servers to use new exceptions   V4      
    Think about how to implement a backup naming service and register all servers with both   V4      
  10.13 Sam_bootstrap - figure out how to designate different enstore system bootstrap V3     Lauri Done
               
Db server
enhancements
10.14.1 Reconnect to database if lose connection db_server V3 1 day   Steve
    Use new exceptions in all places db_server V4   Can only do this partially in V3 - until all station servers use new exceptions Steve
  10.14.2 Support multiple independent processes with independent conditions and a way to pass the connection params db_server V3.1 2 weeks Figure out cleanup of processes Steve
  10.14.3 Support different db servers for datasets/file queries and for other server transactions, and for online event catalog and make a proper framework for this - involves sam_bootstrap probably db_server V3.1 1 week In general, in conjunction with 10.14.2 display multiple log files for multiple servers Steve/Matt
  10.14.4 Support for `cookie' or equivalent for secure connections using user's own name and pw db_server V3 1 week   Steve
  10.14.4 Add above support to db server gen db_server V3 1 week   Steve/Lauri
  10.14.5 Load testing with db servers db_server V3 1 day   Steve
               
Infrastructure 1.3 Python V2   V3.1   Figure out what it means Looks like we will move forward. What are the tasks and how long will they take? Maciej,Matt,

Steve,Carmenita

  1.3 Bbftp   V3 2 weeks Package servers and clients for linux, irix and osf1. Work with Lyon developers. Ensure that IP address and port are configurable.
Put in kits with test package.
Mike
  1.3 Fnorb - chase bug and get new version ?   V3     Carmenita
  1.3 New version of orbacus ?   V4   We are very behind! We need a cookbook for this - the kits product has only the libraries and executables in - no input files ? Carmenita,
Steve
  1.3 Install updated LSF on d0test and d0mino   V3   DONE by Dave Fagan   Still waiting for V4.2? licence on D0test
  19.6 Get extra GB Ethernets installed on d0test   V3   DONE  
  1.5.5 Write up something about fnidl - what? Why?   V4     Lauri
  1.8.5 Code reviews - of all major servers and sam_user          
  1.5.6 Get $ID in all files; all all     all
  1.5.7 Remove redundant files left over from sam_doc, sam_talks, sam split and in all packages where possible all all     all
  1.9 Figure out how users are to use helpdesk tracking interface. Separate lists of bugs/minor enhancements ? V3   Operational issue + might need some work for usability. Matt,
Vicky,
Lee,
Lauri
               
Test Harness and testing 22.16 Continued testing - and logging of results -work down our list and extend list Test
_harness
V3   Lists of work for testing and for test harness Dehong,
Sinisa
  22.16 Test harness that simulates online/Farm/central_analysis + 2 or more user linux cluster all running with event input rate at > 40 Hz.

Most important is that we can get deterministic behavior and that we have enough statistics and measurements to

a) Know that what we see is what we expect

b) Modify input params and observe changes

Test
_harness
V3 2 weeks We should have this able to run continuously by now - even if not at full rate Dehong,
Sinisa
  22.16 Stress testing of db server(s) Test
_harness
V3   Need to understand effects and limitations - in cases of out-of-control queries. -> db server Steve
    Track down files getting rm'd from pnfs space and if necessary change file protections etc. to track down Test
_harness
V3   Might require encp change? Gerry No evidence so far
               
Sam_manager   Track any changes in servers for exception handling manager V4     Sinisa,
Lauri
    Name expander for file merge + metadata for this case manager V4     Sinisa,
Lauri
  10.4.1 Any issues that arise as a result of rcp database and calibration databases in same executable manager V3     Sinisa
               
Documentation 1.9.1 XML definition of commands and parameters -> docs and web page. Gives sam commands quick look doc V3     Vicky
  1.9.2 Consolidated sam users guide doc V3     Vicky,
Matt,
Lauri
  1.9.3 Enhanced sam shifters documentation doc V3     Lee,
all
  1.9.4 Sam system reference guide doc        
  1.9.5 Sam operations and administration guide doc        
  1.9.6 Installing sam at a remote site - improve guide  doc V3   Largely done Lauri
  1.9.7 `Live' tutorial on web doc V4      
  1.9.8 Flesh out FAQ page doc V4      
  2.2 Update glossary of terms doc V4      
               
Web pages and
web servers
1.2.2 Go over all sam browsing pages, add tapes, other fields, refurbish data
_browsing
V3     Matt
  1.2.3 Re-organize for new documents, add registration, etc.  doc V3     Lauri,
Vicky
  1.2.4 Understand why wbs.py is not working in devel for the umpteenth time!!! doc       Lauri Done
  1.2.6 Commission d0pilio   After V3      
  1.2.7 Web pages to view all active sam stations - using db and infoserver? admin
data
_browsing
V3 or V4      
  1.2.8 Production web server stats don't work   V3     Lauri
Steve
Diana
Done
Robustness, 24X7 3.27 Start thinking about how/if could use data warehouse/ other databases if DB is down as part of 24X7 and failover?   V4      
  3.27 Think about how to deal with backup naming service and register all servants with both. Which servers currently register themselves again with the name service if it goes down? Many packages V4      
  3.27 Change station cache file protections so that files not normally visible to users station V3 or V4      
Operations and Support 21.x Ongoing user support and routine data maintenance tasks       all
  21.x Helpdesk - bug tracking?          
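
Tasks 10.9.1 and 10.9.3 in the table above (picking up files that arrived after a snapshot, and splitting a large project into fixed-size chunks) both reduce to simple operations on a project's file list. The following is a minimal sketch in Python, assuming file lists are plain lists of file names. The function names `files_since_snapshot` and `split_snapshot` are hypothetical illustrations and are not existing sam commands or APIs.

```python
def files_since_snapshot(current_files, snapshot_files):
    """Return files present now that were not in the earlier snapshot.

    Hypothetical helper: mirrors the '__set__ ... minus __snap__' idea
    from task 10.9.1, keeping the order of current_files.
    """
    seen = set(snapshot_files)
    return [name for name in current_files if name not in seen]


def split_snapshot(files, num_files):
    """Split a file list into consecutive chunks of at most num_files.

    Hypothetical helper: mirrors the '--num_files=200' request in
    task 10.9.3. The last chunk may be shorter than num_files.
    """
    if num_files <= 0:
        raise ValueError("num_files must be positive")
    return [files[i:i + num_files] for i in range(0, len(files), num_files)]
```

With an 800-file project and `num_files=200`, `split_snapshot` returns four chunks of 200 files each. It splits in list order only; splitting in an order chosen by the optimizer, as the task suggests, would need the files sorted first.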
               


You may edit this file only in something that preserves the "Simple HTML". Please view the source before and after you edit to check this. Use xemacs on Unix, Frontpage on NT or another simple editor please.

$Author: lauri $
$Date: 2001/03/07 15:25:57 $

Future Work on SAM Manager
==========================

In addition to getting SAM Manager to work with new fss, there are also
many small details that have to be finished and/or corrected. Here is the
list as I remember it right now:
  1. Release new versions of all IDL products sam_manager uses.
  2. Cleanup makefiles. Note that one will have to solve many small
     problems in order to get things to compile again: new IDL products
     now have a different directory structure than before, many IDL
     structs have moved to different files, some files are gone, etc.
  3. Fix sam_manager to use new fss in the same way as before, i.e.
     restore its ability to store files in a synchronous fashion.
  4. Add the ability to store files in an asynchronous fashion. Extra
     RCP parameter will be needed here.
  5. Add the ability to use environment variables SAM_STATION and
     SAM_PROJECT instead of rcp parameters.
  6. Write output file metadata into a file.
  7. Try to use autodest instead of RCP parameter for getting a valid
     location for storing files. I do not know yet whether that can be
     done at this time.
  8. Add ability to handle project restarts. At the moment we have that
     only for the command line interface.
  9. Update and cleanup documentation. There have been several new RCP
     parameters added, probably by Steve, which were never documented.
 10. Look into getting rcpID from framework. This will be used in the
     future for identifying consumers, along with application name and
     version.
 11. Get demo scripts working with Igor's example for submitting a job
     to the batch system.
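
Item 5 in the list above could look roughly like the following. This is a minimal sketch in Python, not the sam_manager implementation: the helper name `resolve_setting` and the RCP keys `station` and `project` are hypothetical, and only the environment variable names SAM_STATION and SAM_PROJECT come from the list. It also assumes that the environment variable, when set, takes precedence over the RCP parameter.

```python
import os


def resolve_setting(env_name, rcp_params, rcp_key, environ=None):
    """Pick a setting from the environment, falling back to an RCP parameter.

    env_name   -- environment variable to consult first (e.g. SAM_STATION)
    rcp_params -- dict standing in for the parsed RCP parameters (assumption)
    rcp_key    -- hypothetical RCP parameter name used as the fallback
    environ    -- mapping to read from; defaults to os.environ
    """
    environ = os.environ if environ is None else environ
    value = environ.get(env_name)
    if value:  # treat an unset or empty variable as "not provided"
        return value
    if rcp_key in rcp_params:
        return rcp_params[rcp_key]
    raise KeyError(f"neither ${env_name} nor RCP parameter {rcp_key!r} is set")
```

A caller would use it once per setting, for example `resolve_setting("SAM_STATION", rcp, "station")`. Passing `environ` explicitly keeps the function easy to test without changing the process environment.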

Future Work on Optimizer
========================

The work on optimizer for SAM V3 involves removing random order for
authorizing files. This can however be accomplished only after we
insert location cookies for all files in the SAM database, as well
as making appropriate changes in the dbserver, fss, station master and
eworker.
One also needs to take care not to break the existing code and to
preserve backwards compatibility. Here is the list of things I
think need to be done:
  1. Introduce a new IDL struct which will contain the location cookie in
     addition to the file tape location, as well as a new dbserver method
     that will accept that struct.
  2. Get the new dbserver method working.
  3. Make appropriate changes in the eworker code and parse encp output
     for location cookie.
  4. Make appropriate changes in the fss code and use new method for
     adding new location and location cookie to the file. Make sure
     that call to "pnfs xref" also retrieves location cookie upon file
     store resubmission.
  5. Develop scripts which insert and/or verify location cookies in the
     database.
  6. Touch the station master code to use the new structs with location
     cookies and pass that to the optimizer.
  7. Finally, use location cookies instead of random numbers for sorting
     files that have to be authorized.
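
The last step above, sorting by location cookie instead of by random number, can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the optimizer's actual code: the `FileLocation` record and its field names are hypothetical, and the cookie is treated as an opaque string whose sort order matches the order of files on a tape. Files whose cookie has not yet been loaded into the database are placed last within their volume.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class FileLocation(NamedTuple):
    name: str               # file name
    volume: str             # tape volume label (hypothetical field)
    cookie: Optional[str]   # sortable location cookie, None if not yet loaded


def order_for_authorization(files):
    """Group files by tape volume, then order each group by location cookie."""
    by_volume = defaultdict(list)
    for f in files:
        by_volume[f.volume].append(f)

    ordered = []
    for volume in sorted(by_volume):
        # (cookie is None) sorts False before True, so files with a cookie
        # come first and files lacking one fall to the end of the group.
        group = sorted(by_volume[volume],
                       key=lambda f: (f.cookie is None, f.cookie or ""))
        ordered.extend(group)
    return ordered
```

Grouping by volume first keeps all requests for one tape together, so each tape is read in a single forward pass. Volumes are visited in label order here only to make the result deterministic.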