| entry |
owner |
short description of the problem |
status |
package |
priority |
date |
| station issues, debug + test  |
Sinisa, Andrew, Chris  |
Fix and test SAM station bugs as recorded in the release notes. Outstanding major issues (Sinisa, May 21, 2002):
1) Small projects that overlap with big ones started first are starved: the small one makes no progress until the files are delivered for the big one.
2) --route does not work with --constrain-delivery.
3) Unlocked files change group ownership upon station restart. This has been fixed temporarily by hardcoding the dzero group as the default for orphans.
4) End of stream does not work if there is a file delivery error after the consumer has been established.
5) The station should not retrieve locations for all files when a project starts, but do it on a need-to-know basis.  |
ongoing  |
  |
  |
  |
| debug cache algo  |
Sinisa  |
(one of station issues) There are indications that the station caching algorithm
is not working as desired. Needs to be debugged and fixed if this is true.  |
  |
  |
2  |
03/19/02  |
| CRC xfer  |
  |
Check CRC for file transfers. Especially needed for remote transfers; several corrupted files have been found.  |
  |
sam_user,sam_station  |
3  |
  |
| MC verify cron  |
  |
Need a cron job to test files from remote sites for corruption. Especially needed for remote transfers; several corrupted files have been found. Use dump event to read events. Get crc stuff from enstore.  |
  |
  |
1  |
  |
| SAM dump formats  |
  |
Review and update sam dump output formats and info.  |
  |
sam_station  |
3  |
  |
| Job submit to batch  |
  |
As per Sinisa's and Chris's notes in mail.
Here is the proposal (from Sinisa, May 8, 2002):
1. User submits a job to the station master.
2. Station updates the database, gets the project id from the db server,
marks it as "in the batch queue", and submits the user script to the
batch system, but does not start the project.
3. Batch system worries about scheduling the user job.
4. Once the user job starts, the first thing our API does
is not establishing the consumer, but "establishing the
project": this talks to the station master and asks it to start
the project.
5. Station actually starts the project master, marks it as "running"
in the database, and lets the user job know that it can proceed.
6. User job continues as before.
Requirement: batch queue management and restrictions to hold a single user to a limited number of jobs.  |
  |
sam_station  |
4  |
  |
| sam_batch_wrapper  |
  |
Move batch adapter logic out of the station code into a
new product, like sam_cp. Have people test what we have and see if it works. This is a low priority item unless needed at some sites.  |
  |
  |
4  |
  |
| Site Optimizer  |
Sinisa  |
Develop site optimizer to govern file transfers and tape usage. Could become a problem soon; will require time to develop.  |
  |
sam_optimizer  |
2  |
  |
| FCP  |
Chris  |
Using FCP to moderate intrastation file transfers. Chris has this working on clued0, but it needs to be integrated into the product.  |
  |
  |
2  |
  |
| Reengineer cache management  |
Sinisa  |
Missing group and , Station revival, db server work  |
  |
sam_db_server, sam_station  |
5  |
  |
| x-fer Monitor  |
Sinisa, Diana, John  |
Work to upgrade the backend of the SC2001 info gathering scripts to load information into the new oracle tables using dcoracle. Maybe some changes to sam_admin tools for mining log files. May also want to break log files daily to avoid long processing times to extract information. Need to have intra-station transfers included as well as extra-station. John needs to build the oracle tables; some design is needed, though some preliminary work has been done.  |
  |
  |
2-3  |
  |
| Monitoring and Info service  |
  |
Part of decentralizing the station and keeping the information local. This is not writing a new information and monitoring server.  |
  |
  |
5  |
  |
| SAM Admin and bootstrap CDF changes  |
Lauri, Sinisa  |
What we learned from this is input into the new config plan.  |
done  |
  |
  |
  |
| New config Plan  |
Lauri, Sinisa  |
As per Lauri's document, or revision thereof. Needs to have a review set up with an external committee, including CDF, D0, PAT, Mengel, ???.  |
  |
  |
3  |
  |
| builds  |
Lauri  |
Standardizing builds. Goes with new config plan.
Requires a lot of design, which is the bulk of the job. Benefit is that anyone can build any piece of sam easily.  |
evolutionary  |
  |
  |
  |
| Dynamic Station Installation  |
Igor +  |
Ability of SAM to be deployed, set up and dismantled dynamically. Needs to install, add configuration to the database, and run. Needs all libraries, orbacus, etc. Igor will write a more detailed plan.  |
  |
  |
2  |
  |
| de-centralize station  |
  |
  |
  |
  |
  |
  |
| Lyon Interfaces  |
Lee  |
Need requirements for the interfaces Lyon
needs in order to use SAM.  |
  |
  |
1  |
  |
| Dzero-sam "initiative"  |
Lee, Mark Sosebee, DCD, Dzero  |
For identified d0 sites, benchmark network performance, establish and test working sam stations, systematically move data to these sites and measure performance and bottlenecks. Working with Networking (DCD) group to get started with tools developed by IEPM project (SLAC). Send list of station nodes to Frank Nagy so we can monitor network activity to them.  |
  |
  |
  |
02/11/02  |
| farm proxy  |
Sinisa  |
Need to explore and develop a proxy server (or other solution) to enable running sam on distributed systems on a private network, e.g. behind a switch or firewall. Explore IP tunneling and VPNs w/ networking. Part of support for site autonomy. May have a solution with a distributed naming service running on a gateway node. Sinisa is working on this w/ Princeton and Nijmegen.  |
  |
  |
1-2  |
02/11/02  |
| pick non-raw events  |
Matt  |
Fix dimensions to enable picking of not-raw events. Need to fix parentage for files. Involves schema changes if we want to use denormalized approach.  |
  |
  |
2  |
02/05/02  |
| de-centralized name service  |
Sinisa  |
Additional infrastructure to de-centralize station operation. This means a station would be more self-sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to the fnal central database system.  |
done, testing  |
sam_nameserver  |
1  |
02/12/02  |
| station site autonomy  |
Sinisa, Andrew  |
Additional infrastructure to de-centralize station operation. This means a station would be more self-sufficient and could operate for some amount of time without access to the outside world. Also, failover to db alternatives to the fnal central database system.  |
  |
sam_station, sam_db_server  |
  |
02/12/02  |
| d0mino backend  |
Chris,Sinisa  |
bring up sam station to manage new d0mino backend compute servers. Run the station server software on d0mino, requires some changes
to project since home areas for jobs in the queue will be on linux and for  |
almost done  |
  |
1  |
03/12/02  |
| reduce log file  |
Sinisa  |
Need to open a new sam log file every day. Can use sam log class, or configure so this will work. Master logger is done, new file every day. Still need to do archive.  |
  |
sam_log  |
  |
03/12/02  |
| D0 support  |
Lauri + others  |
  |
ongoing  |
NA  |
NA  |
  |
| CDF support  |
Lauri + Sinisa + others  |
  |
ongoing  |
NA  |
NA  |
  |
| sam user  |
Lauri  |
take over sam_user with Carmenita. Cleanup commands, error messages, adding new commands. Need to understand why so slow on d0mino  |
ongoing  |
sam_user  |
NA  |
  |
| sam user speedup  |
Lauri  |
Need to understand why so slow on d0mino  |
  |
sam_user  |
NA  |
  |
| archive logs  |
Lee,Lauri,John  |
Archiving of log files (waiting on sam on sun). Need stager and encp only; could have the station running on another node. Sinisa will try to build stager on SUN, Lee will get encp for sun (seems to be available). Lauri will do final set up. Try on Ora3. Use central analysis station with only stager running on ora1 and ora3. Just need to come up with the metadata and tier for this.  |
  |
  |
4  |
  |
| sam-at-a-glance  |
Lauri, Lee  |
Improve sam-at-a-glance so it runs on ora1 and provides
more up-to-date information. May require sam user to run on sun OS, or convert
to use the name service status info (just ping the stations instead of sam dump). Need to add additional information to the database: 1. known down, and 2. monitoring level: high, medium, and low availability systems (see Lauri's mail describing this in detail). Lauri suggests turning this into just doing requested dumps.  |
  |
  |
5  |
  |
| unit tests  |
Lauri,Chris  |
Produce unit tests for sam user interface. Tied to the sam parser task.  |
  |
sam_user  |
4  |
  |
| clued0  |
Chris + Sinisa  |
Continue testing of distributed sam on Clued0. Include
implementing batch system and load testing with additional desktop node
included.  |
end in sight  |
  |
1  |
  |
| file-status  |
Lauri, Steve, Diana, Matt  |
Add crummy file status and needed support features. Could use a more enduring name, like unofficial or suspect. Matt's second priority.
Needs response to Matt's mail from 11/13/01. Held brainstorming session
Thurs Jan 17; Diana wrote notes. Storing of 'crummy' half-finished files: proposal on how to use the status of a file. Investigate what code would need to change in sam store (or whether it is just a little samadmin command you are allowed to run right after the store has succeeded). Investigate how to deal with --resubmit,
which wants to overwrite a crummy file; it needs to call another samadmin
command to first delete the file in pnfs space. Additional thought and
discussion indicate that the way we use the current file status is
incorrect, and some current statuses should be moved to file@location status.
Additional statuses discussed at d0 include: incomplete, obsolete, superseded,
user-added, unofficial. There may be more or others.  |
in design  |
  |
4  |
  |
| interim status  |
Diana, Lauri  |
Add new column and needed changes to use for file status.  |
  |
  |
  |
  |
| app_family + param type/name/value  |
Lee, Heidi, Milanson  |
Link application name/version with MC param type/name/value to provide a way to record generalized processing attributes. Need to know the name, and possibly attributes, of the top level RCP.  |
needs design  |
sam_user,sam_db_server, sam_db  |
1  |
  |
| Documentation  |
  |
Look through documentation and fix problems. Need a sam quick reference page to replace the quick start guide, which is obsolete.
sam get metadata, list definition --keywords, sam create dataset --keyword???, sam run project, sam submit may have problems; mc runjob new metadata, auto dest "sam store --descrip=...", add new phase need to be documented.
Need to document metadata for luminosity and archive files, sam batch
commands, psusp, files not delivered, python api, new dimensions and
examples. Translation of status block. sam toonl should be documented.
Sam station starting options through sam_bootstrap startup. New flags need to be documented. Questions about groups need to be answered in the documentation. Hope CDF can help.  |
  |
sam_doc  |
4  |
  |
| diagnost page  |
Lauri  |
Break up the diagnostics page so it is consistent with dev,int,prd scheme. This makes it easier to maintain the db server and create new installations.  |
done  |
sam_web  |
1  |
  |
| omniORB.py  |
Steve, Sinisa  |
Continue to understand issues of omniORB.py use with sam.
Steve to provide a detailed list of work to be done. Steve will produce the list for
discussion 12/04/2001. Steve has made some progress and can describe where he feels the problems are (1/28/2002).
First step: adding new dbserver gen and dbserverbase. Changes to Sam bootstrap will change the way we make the db server.  |
start 6/10/02  |
  |
  |
  |
| Backend DB situation  |
steve  |
(ask steve????)  |
  |
  |
1  |
  |
| autodest  |
(Carmenita), Heidi, Lee  |
Autodestination with processed files needs to be resolved:
bug in the server in constructing the path, pulling info from the
parent that it should not. Load mapfile is very slow.  |
done, needs test on farms  |
sam_user,sam_db_server  |
v3.2  |
  |
| autodest speed  |
Lauri  |
Takes too long to load autodest map. Very painful to debug problems in new map entries. Need to fix so it only verifies the new entries.  |
done  |
sam_admin  |
1  |
  |
| get num copies  |
Carmenita  |
get the number of copies for each file from the sam database
need to decide where this is kept in sam.  |
Need ping from online  |
sam_user  |
?  |
  |
| file_family  |
Carmenita  |
Add code to sam autodest so that the proposed path string
uses an optional entry for "file_family=..." appended to the stream field.
This has been requested by Gerry for the online direction of files to tape.
Still some debate, but will provide flexibility for streaming decisions
to be made later.  |
need ping from online  |
sam_user  |
?  |
  |
| samadmin  |
Lauri, Diana  |
mark entire station as down, also might want node down, station down, fss down.  |
not critical  |
sam_admin  |
4  |
  |
| Task list formatter  |
  |
complete tasklist formatting script  |
  |
sam_shift_tools  |
4  |
  |
| sam manager  |
Sinisa  |
Possible sam_manager work that may be needed. Pingable
client. Check restart option works with --CPID on command line. Also desire to reuse Gabriele's api for ROOT. Gabriele might be able to do this.  |
eventually, not high priority  |
sam_manager  |
5  |
  |
| Restart  |
  |
Need to be able to recover projects after station crash.
1. application must be restartable, 2. batch system must coordinate with projects, 3. projects are restarted. Restart project known to be broken. User needs to close output file at last file boundary so work is not lost.  |
  |
  |
4  |
  |
| d0mino-sam  |
fagan, lee  |
Add ability for remotely-initiated transfers to use d0mino-sam dedicated interface on d0mino. Do not believe this involves any mods to bbftp. Fagan does not know how to solve this yet. If it cannot be solved, then will need to set up an additional sam server dedicated to serving files to remote sites. Need full routing to take full advantage of this, especially to get files in the d0mino cache.  |
  |
  |
1  |
  |
| data routing,  |
Sinisa, Andrew  |
Igor calls "global data replica work". Need design for ultimate file routing. May include incorporation of FSS into station server which brings other important features like fss cache management and persistency. Refer to Igor's email concerning the topic. Igor sent mail on Mon, 07 Jan 2002 16:46.  |
  |
sam_station  |
4  |
  |
| db upkeep  |
diana  |
continue upkeep and monitoring of d0 db instances  |
  |
  |
  |
  |
| Helpdesk Followup  |
Lauri,Carmenita  |
Need to follow up HD tickets assigned to sam and resolve and closeout  |
ongoing  |
  |
  |
  |
| TH upgrades  |
Chris  |
Need documentation!!! Improve test Harness to reflect behaviour more consistent with central-analysis. For example, need simulated users to kill their jobs in the middle, and need many 10's of thousands of small files cached and reused many times. This will test the station revival more completely.  |
need doc  |
  |
3  |
  |
| Pick events design  |
Lee, w/ D0 help  |
design for pick events using existing sam tools, and additional
features for pooling requests, caching events, and cataloging.  |
  |
  |
3  |
  |
| proj report  |
  |
Need to provide physicists a comprehensive report of files delivered, and not delivered.  |
  |
  |
1-2  |
  |
| sam_start_bbftp problem  |
  |
Evidently on at least some Linux machines (nglas09
being one of them), the output of "ps -fu sam" gets truncated
when you pipe it through "grep".
This causes the sam_start_bbftp.sh/sam_stop_bbftp.sh scripts
to "break" (send mail saying that the daemon is not running,
when actually it is; and then trying to restart something
that is already running).  |
  |
  |
  |
06/21/02  |
| ~sam in sam_bootstrap under sh.  |
  |
I was in the process of restarting one of the db servers on fndaut1 and
ran into a little problem that is sh/bash related. I typed this:
bash-2.03$ run.sh start dbserver dbs_dev v4_0_3_3&
and got this:
bash-2.03$ /cdf/ups/prd/sam_bootstrap/v4_1_33/NULL/bin/run.sh:
~sam/private/dbserver__fndaut1__dbs_dev/pid: cannot create
/cdf/ups/prd/sam_bootstrap/v4_1_33/NULL/bin/run.sh:
~sam/private/dbserver__fndaut1__dbs_dev/trace: cannot create
It created a subdirectory '~sam' under private dir:
bash-2.03$ ls -la
total 84
drwxr-xr-x 28 sam g023 1024 Jun 4 10:32 .
drwxrwxr-x 12 sam g023 512 May 31 10:30 ..
drwxr-xr-x 3 sam g023 512 Jun 4 10:32 ~sam
drwxr-xr-x 2 sam g023 512 Jun 4 09:07 conf
There is some problem with how sh and bash handle the ~sam thing... I didn't
look more into this, just used the bash version to start up the servers
again. Thanks, Luciano  |
  |
  |
  |
06/24/02  |
| no kinit unless needed  |
  |
The sam_kerberos_rcp script needs to be updated so that
it doesn't try to kinit if it doesn't need to (esp.
for linux, according to DjFagan).
-- lauri
-------- Original Message --------
From: Laurelin of Middle Earth
Subject: Re: [Fwd: Re: Kinit's failing on d0cs's (fwd)]
To: Chris Jozwiak
CC: fagan@fnal.gov
Dave,
I can modify the way sam_kerberos_rcp does the kinit
similar to what you suggest below -- BUT, please tell me:
what is the equivalent way of doing this when the
script right now contains:
unset KRB5CCNAME
cmd="kinit -k -t $keytab_path $principal"
eval $cmd || { echo "Cannot get kerberos principal"; exit 1; }
In other words, since this forcefully unsets KRB5CCNAME,
and we're using "kinit -k -t", how do I check the klist
stuff?
-- lauri
Chris Jozwiak wrote:
>
> -----Forwarded Message-----
>
> From: David J. Fagan
> To: Chris Jozwiak
> Cc: Sinisa Veseli , fagan@large.fnal.gov
> Subject: Re: Kinit's failing on d0cs's (fwd)
> Date: 22 May 2002 14:36:25 -0500
>
> So I guess this just means, Welcome to the world of Linux...
> We will certainly look but this isn't going to be easy.
>
> How many did succeed vs the 138 that failed?
>
> Also I know this is a kludge BUT could you try a
>
> /usr/krb5/bin/klist -s $KRB5CCNAME
> if [ $? -eq 1 ]
> then
> sleep ?
> /usr/krb5/bin/kinit ......
> fi
>
> In a 1 to 5 loop ?
>
> ------- Forwarded Message
>
> Sender: crawdad@gungnir.fnal.gov
> To: "David J. Fagan"
> Cc: nightwatch@fnal.gov, Chris Jozwiak ,
> Sinisa Veseli
> Message-id: <200205221821.g4MIL4Q11533@gungnir.fnal.gov>
>
> > STDERR: kinit: Internal file credentials cache error when initializing cache
>
> I'm going to guess this was on Linux. I've seen Linux run out of
> some system resource temporarily - like POSIX lock descriptors
> perhaps - and give an error at this point in the Kerberos library.
> Look for something configured too small and make it bigger. (Kind of
> like VMS all over again.)
>
> I've also seen non-FRHL Linuces with kernels configured without
> POSIX locks at all, which gets you "No locks available when
> initializing cache."
>  |
  |
  |
  |
06/24/02  |
| cleanup sam_user exceptions  |
  |
On Friday morning I started a sam job (cron) to clean the d0mino
station cache.
It is still running. (I killed a second one that started 10:15am
today, sam-auto got an email; it didn't look too healthy, lots of
exceptions, especially (unfortunately) in the error handling stuff).
[Aside -- Lee, please add to the "to-do" list: we need to go
through sam_user with an eye to exception handling, there are
some unhealthy mixtures of old and new exception handling, and
I can't tell which is which in all cases... Who takes care
of the sam status_man impl stuff, and how is it supposed to
work?]
I do see signs of files being sent to the fss (see central-analysis
dump from saag page); I'm not sure that I see a lot of enstore
activity though (but I'm not on shift, really, so I haven't checked
into it *that* much ...)  |
  |
sam_user  |
  |
06/24/02  |
|   |
  |
Work on sam_manager and sam-root interface
should consist of:
1. separate code that is independent of any experiment/application into
a common base library and create a new package (e.g. sam_client_lib)
in cdcvs.
2. build new D0-specific clients that have the same functionality
that is now provided by sam_manager and the sam-root interface
3. build new CDF-specific clients  |
  |
  |
  |
06/26/02  |
|   |
  |
  |
  |
  |
  |
  |
|   |
  |
  |
  |
  |
  |
  |
|   |
  |
  |
  |
  |
  |
  |
|   |
  |
  |
  |
  |
  |
  |
|   |
  |
  |
  |
  |
  |
  |
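
For the "no kinit unless needed" entry, the question in the mail was how to check for an existing ticket when the script unsets KRB5CCNAME and uses "kinit -k -t". A minimal sketch, assuming an MIT-style klist whose -s option runs silently and exits 0 only when it finds a credentials cache with unexpired tickets: with KRB5CCNAME unset, "klist -s" with no cache argument checks the default cache, which is also where kinit writes. The function name ensure_ticket, the KLIST/KINIT override variables, and the retry count and delay are illustrative assumptions, not part of sam_kerberos_rcp.

```sh
#!/bin/sh
# Paths follow the mail above; KLIST/KINIT can be overridden (hypothetical knobs).
KLIST=${KLIST:-/usr/krb5/bin/klist}
KINIT=${KINIT:-/usr/krb5/bin/kinit}

# ensure_ticket KEYTAB PRINCIPAL [MAX_TRIES] [DELAY_SECONDS]
# Returns 0 if a valid ticket already exists or kinit succeeds, 1 otherwise.
ensure_ticket() {
    keytab_path=$1
    principal=$2
    max_tries=${3:-5}
    delay=${4:-2}

    unset KRB5CCNAME    # as in the existing script: use the default cache

    # Skip kinit entirely when the default cache already holds a valid ticket.
    if "$KLIST" -s; then
        return 0
    fi

    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$KINIT" -k -t "$keytab_path" "$principal"; then
            return 0
        fi
        # Pause before the next attempt, but not after the last one.
        if [ "$try" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
        try=$((try + 1))
    done

    echo "Cannot get kerberos principal" >&2
    return 1
}
```

A caller would replace the unconditional kinit with something like: ensure_ticket "$keytab_path" "$principal" || exit 1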
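
For the "sam_start_bbftp problem" entry, one way to avoid matching against truncated ps lines is to ask ps for unlimited-width output and only the command line. This is a sketch under assumptions: "-ww" is understood by procps and BSD ps but may not exist elsewhere, and the function names and the "bbftpd" pattern are illustrative, not taken from the sam_start_bbftp.sh script.

```sh
#!/bin/sh
# list_commands USER: print one full command line per process owned by USER.
# "-ww" requests unlimited width; "-o args=" prints only the command line,
# with no header. Both are assumptions about the local ps implementation.
list_commands() {
    ps -ww -u "$1" -o args=
}

# daemon_running PATTERN: read command lines on stdin and succeed if any
# of them contains PATTERN.
daemon_running() {
    grep -- "$1" >/dev/null
}

# Intended use (user name and pattern are illustrative):
#   list_commands sam | daemon_running bbftpd || echo "bbftpd is not running"
```

Keeping the listing and the matching separate makes it easy to see that the false alarm comes from the truncated listing rather than from grep.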
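
For the "~sam in sam_bootstrap under sh" entry, one plausible reading of the report is that the path reached the shell with a literal tilde. Tilde expansion applies only to an unquoted tilde at the start of a word as the shell parses it; a ~sam that arrives through a variable or a quoted string stays literal, and the classic Bourne sh does not expand tildes at all. A minimal sketch that sidesteps the issue by looking the home directory up explicitly, assuming a POSIX shell; the function name home_of is hypothetical, and the /etc/passwd fallback only sees local accounts.

```sh
#!/bin/sh
# home_of USER: print USER's home directory without relying on tilde expansion.
home_of() {
    # Prefer getent, which also consults NIS/LDAP where configured.
    dir=$(getent passwd "$1" 2>/dev/null | cut -d: -f6)
    if [ -z "$dir" ]; then
        # Fallback: local accounts only.
        dir=$(awk -F: -v u="$1" '$1 == u { print $6 }' /etc/passwd)
    fi
    [ -n "$dir" ] && echo "$dir"
}

# Intended use (path taken from the report above):
#   pidfile="$(home_of sam)/private/dbserver__fndaut1__dbs_dev/pid"
```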