{The SAM Toucan}

SAM Triage Page

This checklist/troubleshooting list is divided into the following sections:

  1. SAM FAQ page
  2. SAM Documentation
  3. SAM User Registration
  4. SAM User Triage
  5. SAM System Triage
  6. SAM Batch Job Triage
  7. Enstore Triage
  8. Diagnosing File Store troubles
  9. Diagnosing User Project and file delivery troubles
  10. What went wrong -- fishing into logs, etc.
  11. Running jobs and wanted/used files.
  12. Some obscure error messages and what we have learned that they (usually) mean

Notes about this page:


Where can I find the SAM FAQ page?

The SAM FAQ page is at http://<this_machine>/sam_faq/cgi/faq, where <this_machine> is the name of the node that appears in the address of the page you are currently viewing.

Where can I find documentation about SAM?

All of the SAM documentation can be found under http://d0db.fnal.gov/sam.

How do I become a registered SAM user?

In order to use most of the features of SAM, each user needs to be registered with the system. To register, fill in the SAM AutoRegistration Form.

SAM User Triage

  1. Have you executed "setup sam"?
  2. Are you registered with the SAM system in the database environment you are trying to use (production or development)?
    Check your status from the SAM registration page or the SAM data browsing page (on the appropriate web server -- d0db.fnal.gov or d0db-dev.fnal.gov).
  3. Are you registered in the group you are trying to use via the --group=<groupName> command option?
    Check your status from the SAM registration page or the SAM data browsing page.
  4. Are you trying to start a project using a valid dataset definition?
    Query the SAM data browsing page for valid dataset definitions and versions.
  5. Are you trying to specify an invalid Application Family or Version?
    Query the SAM data browsing page for valid application families, names, and versions.
  6. Are you trying to store data to tape to an invalid pnfs destination?
    Check that the destination for the store is a valid /pnfs directory. Check the protection settings for that directory and the tape file family associated with it. This type of error should soon go away, since SAM will map all destinations to /pnfs space based on the file meta-data provided.
  7. Has your program been linked with sam_manager?
    Look at instructions for Using SAM in the D0 Framework.
  8. Does your RCP file for SAM specify that SAM is to be used for input or output files?
    Look at instructions for Using SAM in the D0 Framework.
  9. Do you understand how to use RCP and have you configured your RCP database correctly?
    Look at instructions for Using SAM in the D0 Framework.
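
Part of check 6 in the list above can be done mechanically before a store is attempted. The following Python sketch is illustrative only and is not part of SAM: the function name check_pnfs_destination and its messages are hypothetical. It verifies only that the destination normalizes to a path under /pnfs, that the directory exists, and that the current account can write to it. It cannot inspect the tape file family.

```python
import os
import posixpath


def check_pnfs_destination(path, isdir=os.path.isdir, access=os.access):
    """Return a list of problems with a tape-store destination (empty if none found).

    isdir and access are injectable so the check can be exercised on a
    machine where /pnfs is not mounted. (Hypothetical helper, not a SAM API.)
    """
    # Normalize first so that paths such as /pnfs/../tmp are not accepted.
    norm = posixpath.normpath(path)
    if norm != "/pnfs" and not norm.startswith("/pnfs/"):
        return ["destination is not under /pnfs"]
    if not isdir(norm):
        return ["destination is not an existing directory (is /pnfs mounted?)"]
    if not access(norm, os.W_OK):
        return ["destination directory is not writable by this account"]
    return []
```

An empty list means none of these three problems was found; it does not guarantee that the store itself will succeed.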

SAM System Triage

The instructions in this section are educational for all users in diagnosing a problem; however, only SAM administrators or SAM shift persons will actually be able to take any of the actions to correct the problem (e.g., restart servers, modify databases, etc.).

  1. First Level Triage: sam locate foo
    On d0mino (or another 'stable' SAM station) execute:

    • $ setup [-qdev] sam
    • $ sam locate foo

    where the option in square brackets is needed only for a development version. This command tests the NameServer, the DbServer, and the database itself. You should see a message saying something similar to "No such file: foo", which indicates that the NameServer is running, the DbServer is running, and the database is up.
  2. Is the CORBA Naming Service up for the environment you are working in?
    Use the SAM diagnostics page to inspect the Naming Service. If no Naming Service is operational, nothing else will work. Restart it, following the directions on the SAM diagnostics page. All other servers should re-register themselves within a few minutes. The database server frequently has trouble re-registering, so you should probably also restart it manually; see the instructions on the SAM shift guide and diagnostics pages.
  3. Is the Database up for the environment you are looking at (development, integration or production)?
    Use the SAM data browsing page to inspect the database in question. Issue a simple query for anything. If the database is down, the error message will be clear; in that case, send mail to ods-dba and/or the helpdesk.
  4. Is the Database server up for the environment you are looking at?
    Use SAM At A Glance to inspect the DbServer in question. Scroll down to the "SAMDServers" section and look for the name of the server. There should be a green "dot" in front of the name and the entry should look like this:

    Server                                                              Host:Port                   Version  Up Since
    SAMDbServer.cdfsamstore_prd:SAMDbServer (Using: samdbs@cdfofprd)    fcdfdata064.fnal.gov:33641  v7_5_0   11 Apr 2006 21:34:07

    If there is no "SAMDbServer", follow the instructions for sam_bootstrap to restart it (and to determine why it failed).
  5. Is there a File Storage Server running for the SAM 'station' where trouble is reported?
    View a history of work done by the File Storage Server using the command "sam dump fss". The command can take a while, so you may want to redirect its output to a file and then examine the file with grep, tail, etc. The dump output includes errors returned by Enstore; look for "ERR" and "error".
    Example, on d0mino:

    • $ setup [-qdev] sam
    • $ echo $SAM_STATION      (make sure it is set to the SAM station in question)
    • $ sam dump fss

    where the option in square brackets is needed only for a development version. Good output (i.e., output that indicates a file storage server IS running) will look similar to:

    • d0mino:> export SAM_STATION="central-analysis"
    • d0mino:> sam dump fss
    • Next Generation FSS at station central-analysis running on d0mino.fnal.gov 8 days 21 hours 4 minutes 27 seconds
    • Configuration for operation retrial (count, interval/timeout)
    • DBS contact: 3, 1 hours
    • Opter contact: 1, 1 hours
    • Authorization receipt:1, 1 hours
    • Stager contact: 1, 1 hours
    • Transfer (retrials upon timeout and upon failure): 3, 6 hours
    • Relay (multi-stage routing only): 3, 1 hours
    •  
    • File Storage Server Dump:
    • Stagers are known at nodes: d0mino.fnal.gov
    • No requests ever submitted

    If there is NO file storage server, you will see something similar to:

    • d0mino:> export SAM_STATION="bogus-sam-station"
    • d0mino:> sam dump fss
    • FSS at station "bogus-sam-station" is not registered with the Naming Service
    • Check your station name ($SAM_STATION), then contact d0sam-admin@fnal.gov
    • Fnorb.cos.naming.CosNaming.NotFound exception:  {'why': missing_node, 'rest_of_name': [<Fnorb.cos.naming.CosNaming.NameComponent instance at 10538e88>]}

    Things to check:

    1. Is the station name valid?
      Check the SAM data browsing web page for the environment in question (prd, dev or int). If the station name is not valid, you will need to add it to the database.
    2. Is the file storage server registered with the naming service?
      From SAM At A Glance, click on the name of the station in question. On the resulting page, you should see that an NGFSS:Sewer is registered. If there is no file storage server running, you will need to figure out why. Go to more detailed investigations of SAM store troubles.
  6. Are you getting an error message like "Failed to complete work on head"?
    If a station is generating this error message, it is in bad shape and needs to be restarted. A restart will usually clear this kind of error, but only try restarting it once. If the restart fails to clear the error message, expert action is needed as additional restarts will not help.
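
The "sam dump fss" check in item 5 can be partly automated once the output has been saved to a file. The Python sketch below is a minimal illustration, not part of SAM: the function name summarize_fss_dump and the status strings are hypothetical, and the patterns are taken only from the two sample outputs shown above, so other SAM versions may word these lines differently.

```python
import re

# Patterns copied from the sample "sam dump fss" outputs shown above.
RUNNING = re.compile(r"FSS at station \S+ running on")
NOT_REGISTERED = "is not registered with the Naming Service"
# The text above suggests searching the dump for "ERR" and "error".
ERROR_MARK = re.compile(r"ERR|error")


def summarize_fss_dump(text):
    """Return (status, error_lines) for saved "sam dump fss" output.

    status is "running", "not registered", or "unknown".
    (Hypothetical helper, not a SAM API.)
    """
    if NOT_REGISTERED in text:
        status = "not registered"
    elif RUNNING.search(text):
        status = "running"
    else:
        status = "unknown"
    error_lines = [line.strip() for line in text.splitlines()
                   if ERROR_MARK.search(line)]
    return status, error_lines
```

To use it, redirect the dump to a file of your choice with ordinary shell redirection, read the file, and pass its contents to summarize_fss_dump. A status of "unknown" means that neither sample message was recognized and the dump should be read by hand.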

SAM Batch Job Triage

  1. Why does my batch job sometimes sit in PSUSP state for a long time, while other times it starts right away?
    When you use the "sam submit" command (or the "sam run project" command, which calls "sam submit" for you), a project is started on your behalf. Once the project has been started, the executable that processes the project files is placed in a SUSPENDED state (PSUSP) in the sam_lo or sam_hi batch queue until there are actual files in the cache for your job to process. Your batch job is resumed when at least some of the files are available.

    If you start a job where the project definition includes NO files that are already in the cache, then your job will sit in the PSUSP state until some of the files can be staged. Conversely, if you start a project where some of the files are already in the cache, it should start to run pretty quickly (subject to queue parameters such as maximum number of running jobs, etc.).

    If your job is in the PSUSP state because the files weren't already in the cache, and the files then cannot be delivered for some reason (tape problems, corrupt files, etc.), your job will ultimately fail. The error messages currently do not make clear WHY the job failed. This is something that we need to work on: getting better "feedback" information to you on WHY the files were not available, or even that this was the cause of the project's ultimate failure.


Enstore Triage

  1. Is enstore available?
    Check the Enstore status pages.
  2. Can you see completed encp requests? Do they indicate errors? If so, what type -- tape read or write?
    Are there pending encp jobs in the queue, including one which might account for any hangs in "sam store" or project file delivery? This information is available from the encp History page.
  3. Can you see indications that a mover or library manager has timed out?
    Look at the enstore system-at-a-glance page. If it looks like Enstore is hung, movers or library managers have timed out, or there are irrecoverable errors reading or writing a tape, then send mail to enstore-admin, enstore, and d0sam-admin.
  4. "no new volumes available" error message:
    Check the enstore tape inventory page. For people writing to the null device, the SAM shift person can add more volumes via something similar to (on d0test):
    • $ setup encp
    • $ enstore vol --add <volume_name> <library> <storage_group> <file_family> <wrapper> <media_type> <volume_byte_capacity>
    • (example:)
    • $ enstore vol --add NUL100 samnull D0 none none null 20113227776

Trouble When Storing Files

Are you having trouble storing files into SAM? Follow these steps to try to diagnose and fix your problem.

  1. Has there been an email message to d0sam-admin, sam-auto, or sam-design indicating Enstore or one of the robots is down?
    Check the listserv archives of these mailing lists: d0sam-admin, sam-auto, and sam-design.
  2. Is the Enstore name space catalog operational?
    On the machine where the problem is reported, issue the command "ls /pnfs/sam". If you cannot see this directory, then no operations involving Enstore will succeed because the pnfs file system is not mounted.
  3. Is the pnfs filesystem configured to be mounted on this machine?
    Look in "/etc/fstab".

    If "/pnfs" is listed in "/etc/fstab" (i.e., should normally be mounted), contact the system administrator for that machine:
    Machines                                     Contact
    d0mino, d0test, other central d0 machines    d0-admin
    DZero farms                                  farms-admin
    DZero sam cluster                            sam-design
    d0ola, d0olb, d0olc                          d0-online-admin

    If "/pnfs" is not configured to be mounted, contact d0sam-admin and enstore-admin to request the rights to do so. Please be aware that not all machines will be granted the right of direct access to Enstore and the robots.

  4. Are there unexplained failures from one machine?
    Try "ls /pnfs" from another machine where pnfs space is normally mounted (e.g., d0mino, d0test, d0bbin, d0ola, d0olb, etc.).
  5. Is the Enstore system operational?
    See the enstore triage instructions.
  6. Is the environment variable SAM_STATION set to the correct station?
    Each file store is done in the context of a particular SAM station. Although a SAM "station" is a concept which may span many machines, station servers for a station run on one particular machine. This is configured by the sam_bootstrap ups product. The sam_bootstrap product will start/stop/restart the necessary servers on behalf of a particular station on each machine where SAM servers are running.

    [Need a discussion, preferably in the sam_bootstrap documentation, about which servers are needed, and why; in particular, when do you need a station, stager, and/or fss server.]

  7. Look carefully at error messages from the "sam store" command.
    The Python scripts may still be sensitive to file names containing special characters, although this has reportedly been fixed in V1.5 and later by the inclusion of some special Python modules.
  8. Look for an appropriate Stager.
    Click on the name of the DbServer in question from the SAM At A Glance page. You should see at least one "Stager" running. If no Stager is running the SAM shifter might have to restart one (see the sam_bootstrap documentation for details on how to restart stagers and how to figure out why the stager did not restart itself in the first place).
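
The "/etc/fstab" check in step 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not a SAM or Enstore tool: the function name pnfs_fstab_entries is hypothetical, and the code assumes the usual whitespace-separated fstab layout (device, mount point, filesystem type, ...). It reports only what is configured, not what is currently mounted.

```python
def pnfs_fstab_entries(fstab_text):
    """Return (device, mount_point, fs_type) for fstab entries at or under /pnfs.

    Hypothetical helper; assumes the standard whitespace-separated fstab format.
    """
    entries = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        device, mount_point, fs_type = fields[:3]
        if mount_point == "/pnfs" or mount_point.startswith("/pnfs/"):
            entries.append((device, mount_point, fs_type))
    return entries


# Typical use on the machine where the problem is reported:
#     with open("/etc/fstab") as f:
#         configured = pnfs_fstab_entries(f.read())
```

An empty result corresponds to the "not configured to be mounted" case described in step 3. A non-empty result combined with a failing "ls /pnfs/sam" points to the system administrator for that machine.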

Trouble in running a Project or accessing files from a Project

  1. Clues in tracking problems in particular projects
    Has a problem been reported about a specific project on a particular station? Here are some useful tools to try. In all cases, you'll need to get into the appropriate SAM environment first:
    • $ setup [-qdev] sam
    • $ echo $SAM_STATION      (make sure it is set to the correct sam station)

    where the option in square brackets is needed only for a development version. If you know the name of the project, you can look at its status:

    • $ setenv SAM_PROJECT proj_name      (csh/tcsh syntax; in sh/bash use: export SAM_PROJECT=proj_name)
    • $ sam dump project

    If you don't know the actual project name, but only know something about the project definition, the user, the group, etc., go to the SAM data browsing web pages and try to track down the project to see what you can find out about it.

  2. Is there trouble creating a dataset definition, or using a definition to create a dataset?
    Use the SAM Project Editor interface to browse existing dataset definitions. Check for valid username/group pairs and unique project names, and make sure that the database server and NameServer are up (basic SAM system triage).
  3. Is there trouble starting up a project?
    Is the SAM_STATION environment variable set correctly? Has "setup sam" been executed?
  4. Is a tape broken or otherwise not available?
    Not sure about this, need to figure out how to find out... this is in the process of changing...

What went wrong? Fishing into logs, dumps, etc.

Log files for all SAM servers are kept in separate directories under the ~sam account, typically (but not always) in ~sam/private. The directory names should be fairly self-explanatory; for example, on d0db-dev (a.k.a. d0ora1), at the present time we have the following files/directories:

d0ora1_server_list.txt         list of servers to be started by sam_bootstrap
conf                           used by sam_bootstrap to track history of server revisions
log                            sam_bootstrap log files
dbserver__d0ora1__dev          log files for dbserver running on d0ora1, development environment
dbserver__d0ora1__dlsam_dev    log files for dbserver running on d0ora1, dlsam_dev environment
dbserver__d0ora1__vldb_dev     log files for DbServer running on d0ora1, vldb_dev environment
infoserver__d0ora1__dev        log files for infoserver running on d0ora1, development environment
nameservice__d0ora1__dev       log files for nameservice running on d0ora1, development environment
optimizer__d0ora1__dev         log files for optimizer running on d0ora1, development environment
logger__d0ora1__dev            master log files of everything except dbserver running on d0ora1, development environment
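
The server log directories in this listing appear to follow the pattern <server>__<host>__<environment>, with the three parts separated by double underscores. That pattern is inferred from the names above rather than taken from the sam_bootstrap documentation, so treat the following Python sketch as an assumption-laden convenience for sorting directory names; parse_log_dir and ServerLogDir are hypothetical names.

```python
from typing import NamedTuple, Optional


class ServerLogDir(NamedTuple):
    server: str
    host: str
    environment: str


def parse_log_dir(name: str) -> Optional[ServerLogDir]:
    """Split a directory name of the assumed form server__host__environment.

    Returns None for names that do not have exactly three non-empty parts,
    such as "conf", "log", or "d0ora1_server_list.txt".
    """
    parts = name.split("__")
    if len(parts) != 3 or not all(parts):
        return None
    return ServerLogDir(*parts)
```

For example, "dbserver__d0ora1__dlsam_dev" splits into server "dbserver", host "d0ora1", and environment "dlsam_dev". Single underscores inside a part are left alone.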

  1. Did a server crash?
    View the appropriate log files, either using the diagnostics page or directly from the ~sam area on the server node. Look at the end of the trace file or the dbg* files and try to determine what caused the crash.
  2. Is the File Storage Server hung, crashed, or stuck somehow?
    ... need some words...
  3. Did a ProjectMaster disappear?
    ... need more words...
  4. Is a ProjectMaster running but hung?
    ... and even more words...
  5. Are there error messages like "Failed to complete work on head"?
    If a station is generating this error message, it is in bad shape and needs to be restarted. A restart will usually clear this kind of error, but only try restarting it once. If the restart fails to clear the error message, expert action is needed as additional restarts will not help.

Running jobs and wanted/used files

  1. Some have asked about the meaning of this sort of message:

    • project 64443_sam_(64443) user avdhesh.dzero started 16 May 13:54:54 UNIX
    • pid 29052978 still wants/currently uses 25/10 still unused(unlocked) 0 files
    • project 64500_sam_(64500) user avdhesh.dzero started 17 May 00:05:02 UNIX
    • pid 31804826 still wants/currently uses 30/0 still unused(unlocked) 0 files
    • project 64593_sam_(64593) user avdhesh.dzero started 17 May 10:34:33 UNIX
    • pid 44261956 still wants/currently uses 30/0 still unused(unlocked) 0 files

    [OUTDATED - need to update!]
    This is a rather confusing display, and it is being improved for future releases of SAM. For now, however, let me translate for the first project:

    "project 64443_sam_ ... still wants ... 25/10 ... files"

    The project currently has access to 10 files in cache which it has not released yet. None of these can be removed if space is needed until after the project consumes and releases them.
  2. Some then ask why this first job (64443_sam_) is still running, and why it has no corresponding job listing from "bjobs".
    The project was running successfully, but then something went wrong on the job side that killed the job. Since it was killed, it no longer shows up when executing "bjobs". It was killed in such a way, however, that the station was not able to notice, so the station thinks the job is still alive and it continues to try to satisfy the project. After 24 hours, the station will decide that the job has been inactive for too long and will kill the project. In other words, the "job" (the thing submitted to the batch system) died, but the "project" is still alive within the station. The user should stop the project if this situation is noticed.
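
When there are many projects, it can help to pull the numbers out of the status text shown in item 1. The Python sketch below is a hypothetical helper, not part of SAM. It assumes the raw text looks exactly like the sample above (without the bullet characters) and keeps the field names from the message itself, without attaching any further meaning to the counts.

```python
import re

# One entry spans two displayed lines, so whitespace (including the line
# break) is matched with \s+ between tokens.
PROJECT_STATUS = re.compile(
    r"project\s+(?P<project>\S+)\s+user\s+(?P<user>\S+)\s+"
    r"started\s+(?P<started>.+?)\s+UNIX\s+pid\s+(?P<pid>\d+)\s+"
    r"still wants/currently uses\s+(?P<still_wants>\d+)/(?P<currently_uses>\d+)\s+"
    r"still unused\(unlocked\)\s+(?P<unused_unlocked>\d+)\s+files"
)

INT_FIELDS = ("pid", "still_wants", "currently_uses", "unused_unlocked")


def parse_project_status(text):
    """Return one dict per project entry found in the status text."""
    rows = []
    for match in PROJECT_STATUS.finditer(text):
        row = match.groupdict()
        for key in INT_FIELDS:
            row[key] = int(row[key])
        rows.append(row)
    return rows
```

Each returned dict can then be compared against the "bjobs" listing by hand, for example to spot a project whose job no longer appears there, as in item 2.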

Various obscure error messages

This part not yet written

Send comments to sam-design
Last updated: $Date: 2007/03/22 18:33:02 $ by $Author: bellavan $