SAM Triage Page
Notes about this page:
Check your status from the SAM registration page or the SAM data browsing page (on the appropriate web server -- d0db.fnal.gov or d0db-dev.fnal.gov).
Query the SAM data browsing page for valid dataset definitions and versions.
Query the SAM data browsing page for valid application families, names, and versions.
Check that the destination for the store is a valid /pnfs directory. Check the protection settings for that directory and the tape file family associated with it. This type of error should soon go away, since SAM will map all destinations to /pnfs space based on the file meta-data provided.
Look at instructions for Using SAM in the D0 Framework.
On d0mino (or another 'stable' SAM station) execute:
- $ setup [-qdev] sam
- $ sam locate foo
where the option in square brackets is needed only for a development version. This command tests the NameServer, the DbServer, and the database itself. You should see a message saying something similar to "No such file: foo", which indicates that the NameServer is running, the DbServer is running, and the database is up.
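The check described above can be scripted. The following is a minimal sketch, not part of SAM: the helper name `check_locate_output` is hypothetical, and it assumes the healthy reply contains the literal text "No such file" (the page only says the message is "something similar").

```shell
#!/bin/sh
# Hypothetical helper (not a SAM command): reads the text printed by
# "sam locate foo" on stdin and reports whether it matches the healthy case.
# Assumption: a healthy reply contains the literal string "No such file".
check_locate_output() {
    if grep -q "No such file"; then
        echo "OK: NameServer, DbServer and database responded"
        return 0
    fi
    echo "PROBLEM: expected 'No such file' message not found"
    return 1
}

# Intended use, after "setup [-qdev] sam" (not executed here):
#   sam locate foo 2>&1 | check_locate_output
```

A non-zero return only says that the expected message was missing; it does not say which of the three components failed, so the per-component checks on this page still apply.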
Use the SAM diagnostics page to inspect the Naming Service. If no Naming Service is operational, nothing else will work. Restart it, following the directions on the SAM diagnostics page. All other servers should re-register themselves within a few minutes. The database server frequently has trouble re-registering, so you should probably also restart it manually; see the instructions on the SAM shift guide and diagnostics pages.
Use the SAM data browsing page to inspect the database in question. Issue a simple query for anything. If the database is down, the error message will be clear. Send mail to ods-dba and/or the helpdesk.
Use SAM At A Glance to inspect the DbServer in question. Scroll down to the "SAMDServers" section and look for the name of the server. There should be a green "dot" in front of the name and the entry should look like this:
| Server | Host:Port | Version | Up Since |
| SAMDbServer.cdfsamstore_prd:SAMDbServer (Using: samdbs@cdfofprd) | fcdfdata064.fnal.gov:33641 | v7_5_0 | 11 Apr 2006 21:34:07 |
If there is no "SAMDbServer", follow the instructions for sam_bootstrap to restart it (and to determine why it failed).
View a history of work done by the File Storage Server using the command "sam dump fss". The dump can take a while, so you may want to redirect it to a file and then search that file with grep, tail, etc. In the dump output you can find errors returned by Enstore: look for "ERR" and "error".
Example, on d0mino:
- $ setup [-qdev] sam
- $ echo $SAM_STATION (make sure it is set to the SAM station in question)
- $ sam dump fss
where the option in square brackets is needed only for a development version. Good output (i.e., output that indicates a file storage server IS running) will look similar to:
- d0mino:> export SAM_STATION="central-analysis"
- d0mino:> sam dump fss
- Next Generation FSS at station central-analysis running on d0mino.fnal.gov 8 days 21 hours 4 minutes 27 seconds
- Configuration for operation retrial (count, interval/timeout)
- DBS contact: 3, 1 hours
- Opter contact: 1, 1 hours
- Authorization receipt:1, 1 hours
- Stager contact: 1, 1 hours
- Transfer (retrials upon timeout and upon failure): 3, 6 hours
- Relay (multi-stage routing only): 3, 1 hours
- File Storage Server Dump:
- Stagers are known at nodes: d0mino.fnal.gov
- No requests ever submitted
If there is NO file storage server, you will see something similar to:
- d0mino:> export SAM_STATION="bogus-sam-station"
- d0mino:> sam dump fss
- FSS at station "bogus-sam-station" is not registered with the Naming Service
- Check your station name ($SAM_STATION), then contact d0sam-admin@fnal.gov
- Fnorb.cos.naming.CosNaming.NotFound exception: {'why': missing_node, 'rest_of_name': [<Fnorb.cos.naming.CosNaming.NameComponent instance at 10538e88>]}
Things to check:
- Is the station name valid?
Check the SAM data browsing web page for the environment in question (prd, dev or int). If the station name is not valid, you will need to add it to the database.
- Is the file storage server registered with the naming service?
From SAM At A Glance, click on the name of the station in question. On the resulting page, you should see that an NGFSS:Sewer is registered. If there is no file storage server running, you will need to figure out why. Go to more detailed investigations of SAM store troubles.
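For the earlier step of searching a saved "sam dump fss" for Enstore errors, a small wrapper around grep may help. This is only a sketch: the function name `find_fss_errors` and the file path in the usage comment are hypothetical, and the two search strings are simply the ones this page says to look for.

```shell
#!/bin/sh
# Hypothetical helper: scan a saved "sam dump fss" file for the markers
# "ERR" and "error". $1 is the path to the saved dump.
# Prints matching lines prefixed with their line numbers.
# Returns 0 if at least one line matched, 1 if none did.
find_fss_errors() {
    grep -n -e "ERR" -e "error" "$1"
}

# Intended use (not executed here; the path is only an example):
#   sam dump fss > /tmp/fss_dump.txt 2>&1
#   find_fss_errors /tmp/fss_dump.txt | tail -20
```

Saving the dump once and searching the file avoids re-running the slow dump for every search.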
If a station is generating this error message, it is in bad shape and needs to be restarted. A restart will usually clear this kind of error, but only try restarting it once. If the restart fails to clear the error message, expert action is needed as additional restarts will not help.
When you use the "sam submit" command (or the "sam run project" command, which calls "sam submit" for you), a project is started on your behalf. Once the project has been started, your executable to process the project files is placed in a SUSPENDED state (PSUSP) in the sam_lo or sam_hi batch queue, until there are actual files in the cache for your job to process. Your batch job is resumed when [at least some of] the files are available.
If you start a job where the project definition includes NO files that are already in the cache, then your job will sit in the PSUSP state until some of the files can be staged. Conversely, if you start a project where some of the files are already in the cache, it should start to run pretty quickly (subject to queue parameters such as maximum number of running jobs, etc.).
If you are in the PSUSP state because the files weren't already in the cache, and then the files cannot be delivered for some reason (tape problems, corrupt files, etc.), your job will ultimately fail. (The error messages seen are not very clear on exactly WHY the job failed, however, and this is something that we need to work on -- getting better "feedback" information to you on WHY the files were not available, or even that this was the cause of the project's ultimate failure).
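To see which jobs are currently waiting in the PSUSP state, the listing from "bjobs" can be filtered. The sketch below is an illustration only: the function name `list_psusp_jobs` is hypothetical, and it assumes the default bjobs column layout (JOBID, USER, STAT, QUEUE, ...) with one header line; a site-specific output format would need different column numbers.

```shell
#!/bin/sh
# Hypothetical helper: read a bjobs-style listing on stdin and print the
# job ID and queue of every job whose state is PSUSP.
# Assumption: column 1 is JOBID, column 3 is STAT, column 4 is QUEUE,
# and the first line is a header.
list_psusp_jobs() {
    awk 'NR > 1 && $3 == "PSUSP" { print $1, $4 }'
}

# Intended use (not executed here):
#   bjobs | list_psusp_jobs
```

A job that stays in this list for a long time is a hint that file delivery, not the batch system, is the thing to investigate.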
Check the Enstore status pages.
Are there pending encp jobs in the queue, including one which might account for any hangs in "sam store" or project file delivery? This information is available from the encp History page.
Look at the enstore system-at-a-glance page. If it looks like Enstore is hung, movers or library managers have timed out, or there are irrecoverable errors reading and writing a tape, then send mail to enstore-admin, enstore, and d0sam-admin.
Check the enstore tape inventory page. For people writing to the null device, the SAM shift person can add more volumes via something similar to (on d0test):
- $ setup encp
- $ enstore vol --add <volume_name> <library> <storage_group> <file_family> <wrapper> <media_type> <volume_byte_capacity>
- (example:)
- $ enstore vol --add NUL100 samnull D0 none none null 20113227776
Check the listserv archives of these mailing lists: d0sam-admin, sam-auto, and sam-design.
On the machine where the problem is reported, issue the command "ls /pnfs/sam". If you cannot see this directory, then no operations involving Enstore will succeed because the pnfs file system is not mounted.
Look in "/etc/fstab". If "/pnfs" is listed in "/etc/fstab" (i.e., should normally be mounted), contact the system administrator for that machine:
| d0mino, d0test, other central d0 machines | d0-admin |
| DZero farms | farms-admin |
| DZero sam cluster | sam-design |
| d0ola, d0olb, d0olc | d0-online-admin |
If "/pnfs" is not configured to be mounted, contact d0sam-admin and enstore-admin to request the rights to do so. Please be aware that not all machines will be granted the right of direct access to Enstore and the robots.
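The pnfs checks above (can the directory be listed, and is a mount configured in "/etc/fstab") can be combined into one helper. This is a rough sketch under stated assumptions: the function name `check_pnfs` and its three result strings are hypothetical, and it treats any non-comment fstab entry whose mount point begins with the given path as "configured".

```shell
#!/bin/sh
# Hypothetical helper: classify the pnfs state of the local machine.
# $1: mount point to probe (default /pnfs)
# $2: fstab file to inspect (default /etc/fstab)
# Prints one of: mounted, configured-not-mounted, not-configured
check_pnfs() {
    mount_point=${1:-/pnfs}
    fstab=${2:-/etc/fstab}
    if ls "$mount_point/sam" >/dev/null 2>&1; then
        # The "sam" subdirectory is visible, so pnfs is mounted.
        echo "mounted"
    elif awk -v mp="$mount_point" \
            '$1 !~ /^#/ && index($2, mp) == 1 { found = 1 }
             END { exit !found }' "$fstab" 2>/dev/null; then
        # Listed in fstab but not visible: a case for the machine's administrator.
        echo "configured-not-mounted"
    else
        # No fstab entry: mounting rights have to be requested.
        echo "not-configured"
    fi
}

# Intended use (not executed here):
#   check_pnfs
```

The three results correspond to the three cases in the text: nothing to do, contact the system administrator for that machine, or contact d0sam-admin and enstore-admin.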
Try "ls /pnfs" from another machine where pnfs space is normally mounted (e.g., d0mino, d0test, d0bbin, d0ola, d0olb, etc.).
See the enstore triage instructions.
Each file store is done in the context of a particular SAM station. Although a SAM "station" is a concept which may span many machines, station servers for a station run on one particular machine. This is configured by the sam_bootstrap ups product. The sam_bootstrap product will start/stop/restart the necessary servers on behalf of a particular station on each machine where SAM servers are running.
[Need a discussion, preferably in the sam_bootstrap documentation, about which servers are needed, and why; in particular, when do you need a station, stager, and/or fss server.]
The Python scripts may still be sensitive to special characters in file names, although V1.5 and later are supposed to have fixed this by including some special Python modules.
Click on the name of the DbServer in question from the SAM At A Glance page. You should see at least one "Stager" running. If no Stager is running the SAM shifter might have to restart one (see the sam_bootstrap documentation for details on how to restart stagers and how to figure out why the stager did not restart itself in the first place).
Has a problem been reported about a specific project on a particular station? Here are some useful tools to try. In all cases, you'll need to get into the appropriate SAM environment first:
- $ setup [-qdev] sam
- $ echo $SAM_STATION (make sure it is set to the correct sam station)
where the option in square brackets is needed only for a development version. If you know the name of the project, you can look at its status:
- $ setenv SAM_PROJECT proj_name (csh/tcsh syntax; in sh/bash use: export SAM_PROJECT=proj_name)
- $ sam dump project
If you don't know the actual project name, but only know something about the project definition, the user, the group, etc., go to the SAM data browsing web pages and try to track down the project to see what you can find out about it.
Use the SAM Project Editor interface to browse existing dataset definitions. Check for valid username/group pairs, unique project names; make sure that the database server and NameServer are up (basic SAM system triage).
Is the SAM_STATION environment variable set correctly? Has "setup sam" been executed?
Not sure about this, need to figure out how to find out... this is in the process of changing...
| d0ora1_server_list.txt | list of servers to be started by sam_bootstrap |
| conf | used by sam_bootstrap to track history of server revisions |
| log | sam_bootstrap log files |
| dbserver__d0ora1__dev | log files for dbserver running on d0ora1, development environment |
| dbserver__d0ora1__dlsam_dev | log files for dbserver running on d0ora1, dlsam_dev environment |
| dbserver__d0ora1__vldb_dev | log files for dbserver running on d0ora1, vldb_dev environment |
| infoserver__d0ora1__dev | log files for infoserver running on d0ora1, development environment |
| nameservice__d0ora1__dev | log files for nameservice running on d0ora1, development environment |
| optimizer__d0ora1__dev | log files for optimizer running on d0ora1, development environment |
| logger__d0ora1__dev | master log files of everything except dbserver running on d0ora1, development environment |
View the appropriate log files, either using the diagnostics page or directly from the ~sam area on the server node. Look at the end of the trace file or the dbg* files and try to determine what is causing the crash.
... need some words...
... need more words...
... and even more words...
[OUTDATED - need to update!]
This is a rather confusing display, and it is being improved for future releases of SAM. For now, however, let me translate for the first project:
"project 64443_sam_ ... still wants ... 25/10 ... files"
The project currently has access to 10 files in cache which it has not released yet. None of these can be removed if space is needed until after the project consumes and releases them.
The project was running successfully, but then something went wrong on the job side that killed the job. Since it was killed, it no longer shows up when executing "bjobs". It was killed in such a way, however, that the station was not able to notice, so the station thinks the job is still alive and it continues to try to satisfy the project. After 24 hours, the station will decide that the job has been inactive for too long and will kill the project. In other words, the "job" (the thing submitted to the batch system) died, but the "project" is still alive within the station. The user should stop the project if this situation is noticed.