Last update: Thu 06/27/02 11:33
Entry  Owner  Short description of the problem  Status  Package  Priority  Date
station issues, debug + test  Sinisa, Andrew, Chris  Fix and test SAM station bugs as recorded in release notes. Outstanding major issues (Sinisa May 21, 2002): 1) Problem with small projects overlapping with big ones started first. The small one won't make any progress until the files are delivered for the big one. 2) --route does not work with --constrain-delivery. 3) Unlocked files change group ownership upon station restart. This has been fixed temporarily by hardcoding dzero group as the default for orphans. 4) end of stream does not work if there is a file delivery error after consumer has been established 5) station should not retrieve locations for all files when a project starts, but do it on a need-to-know basis  ongoing       
debug cache algo  Sinisa  (one of station issues) There are indications that the station caching algorithm is not working as desired. Needs to be debugged and fixed if this is true.      03/19/02 
CRC xfer    Check CRC for file transfers. Especially needed for remote transfers; several corrupted files have been found.    sam_user,sam_station   
MC verify cron    Need a cron job to test files from remote sites for corruption. Especially needed for remote transfers; several corrupted files have been found. Use dump event to read events. Get the CRC code from enstore.       
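A minimal sketch of the kind of check the two CRC items above describe, assuming a plain CRC-32 over the file contents. The source does not say which checksum enstore records, so `zlib.crc32` and the helper names `file_crc32` and `transfer_is_intact` are illustrative assumptions, not existing SAM or enstore code.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file, reading it in chunks to bound memory use."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so the result covers the whole file.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def transfer_is_intact(source_crc, destination_path):
    """Compare the CRC recorded at the source with that of the delivered copy."""
    return file_crc32(destination_path) == source_crc
```

A cron job could run `transfer_is_intact` over recently delivered files and report any mismatch; where the source-side CRC is stored is left open here, as it is in the task description.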
SAM dump formats    Review and update sam dump output formats and info.    sam_station   
Job submit to batch    As per Sinisa's and Chris's notes in mail. Proposal (from Sinisa, May 8, 2002): 1. User submits a job to the station master. 2. Station updates the database, gets a project id from the db server, marks the project as "in the batch queue", and submits the user script to the batch system, but does not start the project. 3. Batch system handles scheduling of the user job. 4. Once the user job starts, the first thing our API does is not establish the consumer but "establish the project": it talks to the station master and asks it to start the project. 5. Station actually starts the project master, marks the project as "running" in the database, and lets the user job know that it can proceed. 6. User job continues as before. Requirement: batch queue management, with restrictions to hold a single user to a limited number of jobs.    sam_station   
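The proposed submit sequence can be sketched as a small state machine. This is only an illustration under stated assumptions: the class `StationMaster`, its methods, and the `max_jobs_per_user` parameter are hypothetical names, the database and batch system are replaced by an in-memory dictionary, and only the two state labels ("in the batch queue", "running") come from the proposal.

```python
from enum import Enum


class ProjectState(Enum):
    IN_BATCH_QUEUE = "in the batch queue"
    RUNNING = "running"


class StationMaster:
    """Toy stand-in for the station master; all names here are hypothetical."""

    def __init__(self, max_jobs_per_user=2):
        self.max_jobs_per_user = max_jobs_per_user
        self._next_project_id = 1
        self._projects = {}  # project id -> (user, state)

    def submit_job(self, user, script):
        """Steps 1-2: record the project as queued; do not start it yet."""
        active = sum(1 for owner, _ in self._projects.values() if owner == user)
        if active >= self.max_jobs_per_user:
            raise RuntimeError("job limit reached for user %s" % user)
        project_id = self._next_project_id
        self._next_project_id += 1
        self._projects[project_id] = (user, ProjectState.IN_BATCH_QUEUE)
        # A real station would hand `script` to the batch system here (step 3).
        return project_id

    def establish_project(self, project_id):
        """Steps 4-5: called by the user job once the batch system runs it."""
        user, state = self._projects[project_id]
        if state is not ProjectState.IN_BATCH_QUEUE:
            raise ValueError("project %d is not waiting in the batch queue" % project_id)
        self._projects[project_id] = (user, ProjectState.RUNNING)
        return True  # tells the user job it may proceed (step 6)

    def state(self, project_id):
        return self._projects[project_id][1]
```

The point of the split is that `submit_job` never starts the project; only `establish_project`, called from inside the running batch job, moves it to "running". The per-user limit is one simple reading of the stated requirement.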
sam_batch_wrapper    Move batch adapter logic out of the station code into a new product, like sam_cp. Have people test what we have and see if it works. This is a low priority item unless needed at some sites.       
Site Optimizer  Sinisa  Develop site optimizer to govern file transfers and tape usage. Could become problem soon, will require time to develop.    sam_optimizer   
FCP  Chris  Using FCP to moderate intrastation file transfers. Chris has this working on clued0, but needs to be integrated into product.       
Reengineer cache management  Sinisa  Missing group and , Station revival, db server work    sam_db_server, sam_station   
x-fer Monitor  Sinisa, Diana, John  Work to upgrade the backend of the SC2001 info gathering scripts to load information into the new oracle tables using dcoracle. Maybe some changes to sam_admin tools for mining log files. May also want to break log files daily to avoid long processing times to extract information. Need to have intra-station transfers included as well as extra-station. John needs to build the oracle tables; some design is still needed, though some preliminary work is done.      2-3   
Monitoring and Info service    Part of decentralizing the station and having the information local. This is not writing a new information and monitoring server.       
SAM Admin and bootstrap CDF changes  Lauri, Sinisa  What we learned from this is input into the new config plan.  done       
New config Plan  Lauri, Sinisa  As per Lauri's document, or revision thereof. Needs to have a review set up with an external committee, including CDF, D0, PAT, Mengel, ???.       
builds  Lauri  standardizing builds. Goes with new config plan. requires a lot of design which is the bulk of the job. Benefit is anyone can build any piece of sam easily.  evolutionary       
Dynamic Station Installation  Igor +  Ability of SAM to be deployed, set up and dismantled dynamically. Needs to install, add configuration to database, and run. Needs all libraries, orbacus, etc. Igor will write a more detailed plan.       
de-centralize station             
Lyon Interfaces  Lee  Need requirements for interfaces Lyon needs to use SAM.       
Dzero-sam "initiative"  Lee, Mark Sosebee, DCD, Dzero  For identified d0 sites, benchmark network performance, establish and test working sam stations, systematically move data to these sites and measure performance and bottlenecks. Working with Networking (DCD) group to get started with tools developed by IEPM project (SLAC). Send list of station nodes to Frank Nagy so we can monitor network activity to them.        02/11/02 
farm proxy  Sinisa  Need to explore and develop a proxy server (or other solution) to enable running sam on distributed systems on a private network, e.g. behind a switch or firewall. Explore IP tunneling and VPNs w/ networking. Part of support for site autonomy. May have solution with distributed naming service running on gateway node. Sinisa is working on this w/ Princeton and Nijmegen.      1-2  02/11/02 
pick non-raw events  Matt  Fix dimensions to enable picking of not-raw events. Need to fix parentage for files. Involves schema changes if we want to use denormalized approach.      02/05/02 
de-centralized name service  Sinisa  Additional infrastructure to de-centralize station operation. This means a station would be more self sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to fnal central database system.  done, testing  sam_nameserver  02/12/02 
station site autonomy  Sinisa, Andrew  Additional infrastructure to de-centralize station operation. This means a station would be more self sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to fnal central database system.    sam_station, Sam_db_server    02/12/02 
d0mino backend  Chris,Sinisa  bring up sam station to manage new d0mino backend compute servers. Run the station server software on d0mino, requires some changes to project since home areas for jobs in the queue will be on linux and for  almost done    03/12/02 
reduce log file  Sinisa  Need to open a new sam log file every day. Can use sam log class, or configure so this will work. Master logger is done, new file every day. Still need to do archive.    sam_log    03/12/02 
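For the "new log file every day" part, a minimal sketch using Python's standard `logging.handlers.TimedRotatingFileHandler`. This is not the sam log class mentioned above; the function name `make_daily_logger` and the `keep_days` parameter are assumptions for illustration.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def make_daily_logger(name, log_path, keep_days=30):
    """Return a logger that starts a new file at midnight.

    Rotated files get a date suffix. `backupCount` deletes files beyond
    `keep_days`; it does not archive them anywhere.
    """
    handler = TimedRotatingFileHandler(log_path, when="midnight", backupCount=keep_days)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Note that this covers only the daily split. The archive step that the item lists as still open would need a separate job that copies rotated files to storage before they age out.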
D0 support  Lauri + others    ongoing  NA  NA   
CDF support  Lauri + Sinisa + others    ongoing  NA  NA   
sam user  Lauri  take over sam_user with Carmenita. Cleanup commands, error messages, adding new commands. Need to understand why so slow on d0mino  ongoing  sam_user  NA   
sam user speedup  Lauri  Need to understand why so slow on d0mino    sam_user  NA   
archive logs  Lee,Lauri,John  archiving of log files (waiting on sam on sun). Need stager and encp only, could have station running on other node. Sinisa will try to build stager on SUN, Lee will get encp for sun (seems to be available). Lauri will do final set up. Try on Ora3. Use central analysis station with only stager running on ora1 and ora3. Just need to come up with the metadata and tier for this.       
sam-at-a-glance  Lauri, Lee  Improve sam-at-a-glance so it runs on ora 1 and provides more up to date information. May require sam user to run on sun OS, or convert to use the name service status info (just ping the stations instead of sam dump). Need to add additional information to database: 1. known down, and also 2. monitoring level: high, medium, and low availability systems (see Lauri's mail describing this in detail). Lauri suggests turning this into just doing requested dumps.       
unit tests  Lauri,Chris  Produce unit tests for sam user interface. Tied to the sam parser task.    sam_user   
clued0  Chris + Sinisa  Continue testing of distributed sam on Clued0. Include implementing batch system and load testing with additional desktop node included.  end in sight     
file-status  Lauri, Steve, Diana, Matt  Add crummy file status and needed support features. Could use more enduring name, like unofficial or suspect. Matt's second priority. Needs response to Matt's mail from 11/13/01. Held brainstorming session Thurs Jan 17, Diana wrote notes. Storing of 'crummy' half finished files - proposal on how to use status of file. Investigation of what code would need to change in sam store (or whether it is just a little samadmin command you are allowed to do right after the store has succeeded). Investigate how to deal with --resubmit which wants to overwrite a crummy file - needs to call another samadmin command to first delete the file in pnfs space. Additional thought and discussion indicates that the way we use the current file status is incorrect, and some current statuses should be moved to file@location status. Additional statuses discussed at d0 include: incomplete, obsolete, superseded, user-added, unofficial. May be more or others  in design     
interim status  Diana, Lauri  Add new column and needed changes to use for file status.         
app_family + param type/name/value  Lee, Heidi, Milanson  Link application name/version with MC param type/name/value to provide a way to record generalized processing attributes. Need to know the name, and possibly attributes, of the top level RCP.  needs design  sam_user,sam_db_server, sam_db   
Documentation    Look through documentation and fix problems. Need sam quick reference page, to replace the quick start guide that is obsolete. sam get metadata,list definition --keywords, sam create dataset --keyword???, sam run project, sam submit may have problems, mc runjob new metadata, auto dest "sam store --descrip=...", add new phase needs to be documented. need to document metadata for luminosity and archive files, sam batch commands, psusp, files not delivered. python api, new dimensions and examples. Translation of status block. sam toonl should be documented. Sam station starting options through sam_bootstrap startup. new flags need to be documented. Questions about groups need to be answered in documentation. Hope CDF can help.    sam_doc   
diagnost page  Lauri  Break up the diagnostics page so it is consistent with dev,int,prd scheme. This makes it easier to maintain the db server and create new installations.  done  sam_web   
omniORB.py  Steve, Sinisa  continue to understand issues of omniORB.py use with sam. Steve provide detailed list of work to be done. Steve will produce list for discussion 12/04/2001. Steve has made some progress and can describe where he feels the problems are 1/28/2002. First step, adding new dbserver gen and dbserverbase. Changes to Sam bootstrap, will change the way we make db server.  start 6/10/02       
Backend DB situation  steve  (ask steve????)       
autodest  (Carmenita), Heidi, Lee  Autodestination with processed files needs to be resolved: bug in the server in constructing the path, pulling info from the parent that it should not. Load mapfile is very slow.  done, needs test on farms  sam_user,sam_db_server  v3.2   
autodest speed  Lauri  Takes too long to load autodest map. Very painful to debug problems in new map entries. Need to fix so it only verifies the new entries.  done  sam_admin   
get num copies  Carmenita  Get the number of copies for each file from the sam database. Need to decide where this is kept in sam.  Need ping from online  sam_user   
file_family  Carmenita  Add code to sam autodest so that the proposed path string uses an optional entry for "file_family=..." appended to the stream field. This has been requested by Gerry for the online direction of files to tape. Still some debate, but will provide flexibility for streaming decisions to be made later.  need ping from online  sam_user   
samadmin  Lauri, Diana  mark entire station as down, also might want node down, station down, fss down.  not critical  sam_admin   
Task list formatter    complete tasklist formatting script    sam_shift_tools   
sam manager  Sinisa  possible sam_manager work that may be needed. Pingable client. Check restart option works with --CPID on command line. Also desire to reuse Gabriele's api for ROOT. Gabriele might be able to do this.  eventually, not high priority  sam_manager   
Restart    Need to be able to recover projects after station crash. 1. application must be restartable, 2. batch system must coordinate with projects, 3. projects are restarted. Restart project known to be broken. User needs to close output file at last file boundary so work is not lost.       
d0mino-sam  fagan, lee  Add ability for remotely-initiated transfers to use d0mino-sam dedicated interface on d0mino. Do not believe this involves any mods to bbftp. Fagan does not know how to solve this yet. If it cannot be solved, then will need to set up an additional sam server dedicated to serving files to remote sites. Need full routing to take full advantage of this, especially to get files in the d0mino cache.       
data routing,  Sinisa, Andrew  Igor calls "global data replica work". Need design for ultimate file routing. May include incorporation of FSS into station server which brings other important features like fss cache management and persistency. Refer to Igor's email concerning the topic. Igor sent mail on Mon, 07 Jan 2002 16:46.    sam_station   
db upkeep  diana  continue upkeep and monitoring of d0 db instances         
Helpdesk Followup  Lauri,Carmenita  Need to follow up HD tickets assigned to sam and resolve and closeout      ongoing   
TH upgrades  Chris  Need documentation!!! Improve test Harness to reflect behaviour more consistent with central-analysis. For example, need simulated users to kill their jobs in the middle, and need many 10's of thousands of small files cached and reused many times. This will test the station revival more completely.  needs doc     
Pick events design  Lee, w/ D0 help  design for pick events using existing sam tools, and additional features for pooling requests, caching events, and cataloging.       
proj report    Need to provide physicists a comprehensive report of files delivered, and not delivered.      1-2   
sam_start_bbftp problem    Evidently on at least some Linux machines (nglas09 being one of them), the output of "ps -fu sam" gets truncated when you pipe it through "grep". This causes the sam_start_bbftp.sh/sam_stop_bbftp.sh scripts to "break" (send mail saying that the daemon is not running, when actually it is; and then trying to restart something that is already running).        06/21/02 
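One way to sidestep parsing `ps` output altogether is a liveness check against a recorded process id. The sketch below swaps in that different technique: it assumes the daemon's pid is written to a file, which the source does not state for bbftp, and the names `pid_is_running` and `daemon_is_running` are hypothetical.

```python
import errno
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not a pid
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except OSError as err:
        # EPERM means the process exists but is owned by another user.
        return err.errno == errno.EPERM
    return True


def daemon_is_running(pid_file):
    """Read a pid from `pid_file` and test whether that process is alive."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or malformed pid file
    return pid_is_running(pid)
```

Because nothing here depends on the width of a process listing, truncated `ps` lines cannot produce a false "daemon is not running" report. A stale pid file whose number has been reused by another process would still give a false positive.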
~sam in sam_bootstrap under sh.    I was in the process of restarting one of the db servers on fndaut1 and ran into a little problem that is sh/bash related. I typed this: bash-2.03$ run.sh start dbserver dbs_dev v4_0_3_3& and got this: bash-2.03$ /cdf/ups/prd/sam_bootstrap/v4_1_33/NULL/bin/run.sh: ~sam/private/dbserver__fndaut1__dbs_dev/pid: cannot create /cdf/ups/prd/sam_bootstrap/v4_1_33/NULL/bin/run.sh: ~sam/private/dbserver__fndaut1__dbs_dev/trace: cannot create It created a subdirectory '~sam' under private dir: bash-2.03$ ls -la total 84 drwxr-xr-x 28 sam g023 1024 Jun 4 10:32 . drwxrwxr-x 12 sam g023 512 May 31 10:30 .. drwxr-xr-x 3 sam g023 512 Jun 4 10:32 ~sam drwxr-xr-x 2 sam g023 512 Jun 4 09:07 conf There is some problem in how sh and bash handle the ~sam thing... I didn't look more into this, just used the bash version to start up the servers again. Thanks, Luciano        06/24/02 
no kinit unless needed    The sam_kerberos_rcp script needs to be updated so that it doesn't try to kinit if it doesn't need to (esp. for linux, according to DjFagan). -- lauri -------- Original Message -------- From: Laurelin of Middle Earth Subject: Re: [Fwd: Re: Kinit's failing on d0cs's (fwd)] To: Chris Jozwiak CC: fagan@fnal.gov Dave, I can modify the way sam_kerberos_rcp does the kinit similar to what you suggest below -- BUT, please tell me: what is the equivalent way of doing this when the script right now contains: unset KRB5CCNAME cmd="kinit -k -t $keytab_path $principal" eval $cmd || { echo "Cannot get kerberos principal"; exit 1; } In other words, since this forcefully unsets KRB5CCNAME, and we're using "kinit -k -t", how do I check the klist stuff? -- lauri Chris Jozwiak wrote: > > -----Forwarded Message----- > > From: David J. Fagan > To: Chris Jozwiak > Cc: Sinisa Veseli , fagan@large.fnal.gov > Subject: Re: Kinit's failing on d0cs's (fwd) > Date: 22 May 2002 14:36:25 -0500 > > So I guess this just means, Welcome to the world of Linux... > We will certainly look but this isn't going to be easy. > > How many did succeed vs the 138 that failed? > > Also I know this is a kludge BUT could you try a > > /usr/krb5/bin/klist -s $KRB5CCNAME > if [ $? -eq 1] > then > sleep ? > /usr/krb5/bin/kinit ...... > then > > In a 1 to 5 loop ? > > ------- Forwarded Message > > Sender: crawdad@gungnir.fnal.gov > To: "David J. Fagan" > Cc: nightwatch@fnal.gov, Chris Jozwiak , > Sinisa Veseli > Message-id: <200205221821.g4MIL4Q11533@gungnir.fnal.gov> > > > STDERR: kinit: Internal file credentials cache error when initializing cache > > I'm going to guess this was on Linux. I've seen Linux run out of > some system resource temporarily - like POSIX lock descriptors > perhaps - and give an error at this point in the Kerberos library. > Look for something configured too small and make it bigger. (Kind of > like VMS all over again.) 
> > I've also seen non-FRHL Linuces with kernels configured without > POSIX locks at all, which gets you "No locks available when > initializing cache." >        06/24/02 
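The "check first, kinit only if needed, retry up to five times" idea from the mail above might look like the following. It is a sketch under assumptions: `klist` and `kinit` are taken from PATH rather than `/usr/krb5/bin`, the function names and the `delay` value are made up, and it does not settle the open question of how the check interacts with the script's `unset KRB5CCNAME`.

```python
import subprocess
import time


def have_valid_ticket():
    """True when `klist -s` exits 0, i.e. the credentials cache holds valid tickets."""
    return subprocess.call(["klist", "-s"]) == 0


def obtain_ticket(keytab_path, principal):
    """Run `kinit -k -t <keytab> <principal>`; True on success."""
    return subprocess.call(["kinit", "-k", "-t", keytab_path, principal]) == 0


def ensure_ticket(check, renew, attempts=5, delay=2.0, sleep=time.sleep):
    """Call `renew` only when `check` fails; retry up to `attempts` times.

    `check` and `renew` are passed in so the loop can be exercised without
    a Kerberos installation.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True  # a valid ticket exists, so no kinit is needed
        if renew():
            return True
        if attempt < attempts:
            sleep(delay)  # give a transient resource shortage time to clear
    return False
```

A caller would use something like `ensure_ticket(have_valid_ticket, lambda: obtain_ticket(keytab_path, principal))`, and report "Cannot get kerberos principal" only when it returns False.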
cleanup sam_user exceptions    On Friday morning I started a sam job (cron) to clean the d0mino station cache. It is still running. (I killed a second one that started 10:15am today, sam-auto got an email; it didn't look too healthy, lots of exceptions, especially (unfortunately) in the error handling stuff). [Aside -- Lee, please add to the "to-do" list: we need to go through sam_user with an eye to exception handling, there are some unhealthy mixtures of old and new exception handling, and I can't tell which is which in all cases... ] Who takes care of the sam status_man impl stuff, and how is it supposed to work?] I do see signs of files being sent to the fss (see central-analysis dump from saag page); I'm not sure that I see a lot of enstore activity though (but I'm not on shift, really, so I haven't checked into it *that* much ...)    sam_user    06/24/02 
    Work on sam_manager and sam-root interface should consist of: 1. separate code that is independent of any experiment/application into a common base library and create a new package (e.g. sam_client_lib) in cdcvs. 2. build new D0-specific clients that have the same functionality now provided by sam_manager and the sam-root interface. 3. build new CDF-specific clients.        06/26/02