SAM Shift Guide



Page Legend:
  • Text that is printed on a terminal screen is displayed in code font. Filenames, usernames, and code package names are displayed in this font as well.
  • Text to be entered on a command line is displayed in kbd font (in most browsers this is identical to code font), and each line begins with a "$" symbol to represent the prompt. The symbol should not be typed.
  • Words surrounded by angle brackets ("<" and ">") are placeholders where the user needs to insert the appropriate parameter. The angle brackets should not be typed.
  • Words surrounded by square brackets ("[" and "]") are optional and may be omitted depending on the desired behavior. The square brackets should not be typed.
  • Email addresses are shown in and are in the domain unless otherwise indicated.
  • This page is optimized for display using Firefox, but other browsers should work as well. The page contains dynamic content and should be viewed with scripting enabled.

What You Need to Know to Start

To be able to diagnose and fix problems, the SAM shifter must first satisfy the following prerequisites:

  1. The shifter should know the URL of the main SAM page: http://projects.fnal.gov/samgrid
  2. The shifter should know the URL of the main Data Handling page: http://www-cdf.fnal.gov/upgrades/computing/dh/cdfdh_main.html
  3. The shifter should be fully familiar with the CDF SAM User Documentation
  4. The shifter should know the Shift schedule.

  5. The shifter should be subscribed to at least the following mailing lists:

    Additional mailing lists may be required depending on the experiment. Mailing lists are archived on the FNAL LISTSERV server. If the mail traffic is heavy on the lists, the shifter may choose to temporarily disable the mail delivery between shifts. Please review the LISTSERV Instructions for Users to see how to subscribe and for additional information.

    Note for DZero: Users should send email regarding SAM problems to . Messages sent to this list are processed through the plone Issue Tracker before being forwarded to all shifters. If a SAM-related problem email appears on one of the other lists, the shifter should

    1. email the user asking him/her to use in the future;
    2. enter the message into the Issue Tracker, or get the user to do so.

    Also note: Sometimes CDF users send queries to or instead of the CDF-specific lists ( and ). The general names are at present (4/10/05) aliased to the DØ lists. (We hope this will change soon.) For the time being, if a CDF query comes to the DØ list, email the user with the name of the correct CDF list to use.

  6. The shifter must be able to log in as username 'sam' on all of the "sam-critical" nodes. This involves having your username added to the .k5login file on the appropriate machines. NOTE: New shifters will not be added to the .k5login files for the sam account until they have become familiar with the system, the requirements, etc. If you are a new shifter, and cannot log in to the sam account yet, please ask an experienced shifter for assistance with steps to be performed as user 'sam'.

    The shifter must also obtain a special "username/root" principal in addition to the normal principal, either by filling out the FNAL kerberos form or by sending email to and asking for a special "username/root" principal. Make sure that your "username/root" principal has been added to the .k5login file for the sam account on the relevant nodes. To use this account, ssh (do not telnet using a cryptocard) to any kerberized node in the Fermilab domain with your normal username. Then:

    • $ kinit -F <username>/root

    where <username> is your normal user name. When prompted, enter your "username/root" password. To activate root privileges on the machine you are logged in to, type:

    • $ ksu sam

    To log in to another node with root privileges, type:

    • $ ssh sam@<nodename>

    where <nodename> is the name of the machine you want to log in to.

  7. The shifter should have an oracle username/password for SAM-related databases in order to add/modify database information. For CDF, this is the Offline Production database "cdfofprd". To request an account, send an email to (CC: ) which includes the following information:

    • User name
    • Email address
    • Phone number
    • User type (= SAM shifter)
    • Database (= cdfofdev, cdfofint, and cdfofprd)
    • Application (= SAM)

    The shifter should have an oracle username/password for SAM-related databases in order to add/modify database information. For DZero, this is the Offline Production database "d0ofprd1". To request an account, fill in the D0 Database User Access form, selecting:

    • User Type: Production User
    • Which Database?: D0 Offline Production
    • Which Application?: SAM

    The shifter should have an oracle username/password for SAM-related databases in order to add/modify database information. The method of making a request for an account varies by experiment.

    Once an account is created, you should change the initial password to one of your own creation as soon as possible. This can be done using a program called sqlplus. On an appropriate machine type:

    where <username> is your database user name and <dbname> is the name of the appropriate database machine. When prompted, enter your current password. Inside the sqlplus program, type this line to change your password:

    where <username> is your user name again, <new_password> is the new password you want to use, and <old_password> is the old password you are replacing. To exit sqlplus, type "quit". This process must be repeated for each database on which you are given an account.

  8. The shifter should register his/her personal Kerberos principal for read/write access to the "cdcvs" CVS repository (sometimes also referred to as the "olscvs" CVS repository). This access is needed both to update documentation and to add certificates to the SAM gridmap file. The steps to getting access to the repository are:
    Read access
    Send email to the with your Kerberos principal requesting access to the "cdcvs" repository. Include the name of your supervisor or sponsor.
    Write access
    Send email to the SAM representative for your experiment explaining that you are a SAM shifter and need write access for SAM packages.
  9. The shifter should request an account for the SAMGrid Issue Tracker. Send email to requesting an account.
  10. A new shifter should get write access to the "quality_data" user group in order to run RecoCert jobs. Send email to Supria Jian or Alan Jonckheere to be added to the group list.
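
The two commands referred to in the password-change instructions of item 7 are missing from this copy of the guide. As a hedged sketch only, the helper below prints what they would look like under standard Oracle SQL*Plus syntax: a "sqlplus <username>@<dbname>" login and an "alter user ... identified by ... replacing ..." statement. That the guide originally used exactly these forms is an assumption, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: print the SQL*Plus login command and the password-change
# statement. Both follow standard Oracle syntax; they are assumptions,
# not text recovered from this guide.

sqlplus_login_command() {
    # $1 = database user name, $2 = database name
    printf 'sqlplus %s@%s\n' "$1" "$2"
}

password_change_statement() {
    # $1 = user name, $2 = new password, $3 = old password
    # Prefer passing the literal placeholders and typing the real passwords
    # only inside sqlplus, so they never appear on a shell command line.
    printf 'alter user %s identified by %s replacing %s;\n' "$1" "$2" "$3"
}
```

The first printed line is typed at the shell prompt, after which sqlplus asks for the current password. The second line is typed at the sqlplus prompt. If either form is rejected, check with an expert rather than guessing.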



Shifter's Responsibilities

When encountering a problem on shift, first consult SAM Triage for assistance. Also try searching the SAM FAQ page, the dzero sam-admin archives, and the cdf sam-admin archives for precedents. Try to find out as much as possible about the problem so that others can help you more easily, and so that similar occurrences in the future can be solved more easily.

Shifters have several responsibilities, including monitoring many aspects of the system, as outlined here:

  1. Keep a log of your shift activity and email copies to the appropriate shifter list for your experiment.

  2. Attend the brief weekday teleconference on CDF SAM operations (Monday through Friday except Tuesday, 8:45am Chicago time) and Tuesday's 9:30am general SAM operations meeting (FCC1). The teleconference can be attended via either EVO or ESNET. For EVO, a meeting room has been set up in the 'CDF' community called 'SAM DH Operations'. For ESNET, use the meeting number 88SAMDH (8872634; the 'DH' stands for 'data handling'). For a simple telephone connection, dial +1-510-883-7860, wait for a signal, and then dial 88SAMDH. If these options do not work, send an email with an exact description of the problem to and .

  3. Know who is the current on-call SAM expert by checking the On-Call List.

  4. Monitor the mailing lists and the Issue Tracker for reports of problems or requests for information from users. Attempt to help when you can, especially in pointing people to existing documentation. If there is a problem you cannot fix or question you cannot answer, please provide an expert with as detailed a report as possible.

    We would like to see a shifter response to every issue on the list within some reasonable window. Some messages are user questions about how SAM works. Some messages are reports of problems. The educational aspect of answering these emails promptly and correctly is very important. Even if your response is that you do not know the answer or solution and that you need to send it "over to an expert", answer the email.

    If you do not know an answer immediately, please try to research the problem using the existing documentation and tools. If that fails and you turn the problem over to an expert, please follow up on it:

  5. Post important information regarding SAM operations to the user mailing lists. The content of such notices could be that some portion of the system is down, that there is a problem that is known and being worked on, that there is a downtime in another system that is affecting SAM, etc.

  6. Monitor the following web pages:

    SAM At A Glance
    The page displays the status of the NameServer, DbServer, and individual stations.
    Under the "SAMDbServers" section of the SAM-At-A-Glance page, some server listings are labelled with the connection machine they are "Using:" while others are labelled as "Dispatching" to other servers. In most cases, the "receiving" servers share the same name as the "dispatching" server with the addition of the string "_srv<N>_", where <N> is an integer number. In rare cases you will have to use the long way to find out which "receiving" servers the "dispatching" DbServers talk to.
    1. Note which machine the DbServer is running on (under the "Host:Port" heading).
    2. Log in to the executing node as user "sam" and do the usual setups.
    3. Look in the "private/*server_list*" file for that machine to get the "sam_config" configuration name ("<configname>") for the DbServer you are interested in. Each configuration name available is on its own line that begins with "dbs". Unfortunately there is no easy way to tell which configuration type a particular DbServer is using. Use the "ups inquire" command to see if the DbServer you are interested in is listed as "SAM_DB_SERVER_NAME" for a particular <configname>:
      • $ ups inquire sam_config -q <configname> | grep SAM_DB_SERVER_NAME
    4. Use the "ups inquire" command to find the "SAM_DB_SERVER_CONFIG_FILE" name ("<configfile>"):
      • $ ups inquire sam_config -q <configname> | grep SAM_DB_SERVER_CONFIG_FILE
    5. Look in the "private/conf/dbserver/<configfile>" file to find the servers that are being dispatched to. They are listed as "remoteServers":
      • $ grep remoteServers private/conf/dbserver/<configfile>

    A similar procedure can be used to find the log files for a station ("<stationname>").
    1. Note the "Host:Port" machine name ("<machinename>") for the station you are interested in.
    2. Log in to that machine as "sam".
    3. Look at the contents of the "private/<machinename>_server_list.txt" file to find the line beginning with "station" that has <stationname> as the fourth word in the line.
    4. The second word in that "station" line is the station log tag ("<logtag>"). The log files are therefore in the directory "station__<machinename>__<logtag>__<stationname>".

    NetStat
    A summary, by node, of the count of recent established network connections. We do not really have an example of what this plot should look like. Just keep an eye out for trends that do not return to the "average".

    In case of problems, try to:

  7. Update the SAM database as necessary. Requests usually come through email to the mailing list and consist of adding new nodes, new stations, new application families, new users (should auto-registration fail), etc. If a request cannot be done with the sam_admin commands and you are unfamiliar with the other ways of changing the database, do not do it!

  8. Restart web servers as necessary.

  9. Monitor the status of Enstore for your experiment.

  10. Monitor and assist in the RecoCert histogram verification procedure. The executable that generates certification plots is currently being run automatically on the production farms. There are three parts to the shifter's responsibility:

    1. Submit batch jobs via the script recocertSAMMerg.sh which merges ROOT output files into a single file per run number.
    2. When a batch job finishes, create overlay plots for the resulting output.
    3. Scan the overlay plots and report any observations.

    Please reference this instructional page for more details.

  11. Users create and check sam datasets via the command line or the web. The user_prd DbServers, the web_prd DbServers, and the web servers must be functioning to enable this. There are two items to check:

    1. The links below go through the SAMzilla server to display the log files for each DbServer. Check each to see if there are recent entries at the end of the log.

      If the log files do not have recent entries, try executing a dataset command. If the command fails, restart the appropriate server.

    2. Try to load the Dataset Definition Editor. If the page does not load, or cannot find a known dataset (for example, "testdde"), then restart the web server.

  12. Every 2 to 3 hours, check the disk space in the SAM logfile area. To do this:

  13. Update this documentation or at least point out to experts when documents are needed, out-of-date, etc.

  14. Additional monitoring tools are available on the SAM diagnostics page.

    - SAM Plots and Statistics
    The link sits just under the page title and leads to "SAM Production Plots" and "SAM Consumption Plots". These plots are useful for monitoring the behavior of SAM over time.
    - CORBA Name Service Polling
    This is in the third box on the left and is a quick way of having the naming service check the status of any individual registered station (through the drop-down text box), or multiple stations grouped by type (through the links).

    There are times, however, when these pages will indicate SAM problems when the real culprit is something in the web server configuration itself. If the web-based tools from the diagnostics page indicate a serious problem, it is good to double-check using a different tool (e.g., run an appropriate SAM command on a relatively stable system) before attempting any corrective action.



DbServers and Critical SAM Stations

A shifter should check that the DbServers and the critical SAM stations (labeled as "Monitor Level: Critical") are running using Sam At A Glance.

Check the last update time of the page (third line from the top) to be sure the information is up-to-date. You will need to scroll down to see all of the DbServers. You can click on a station name to bring up a status page for the processes running on that machine. Check the status of the Station Master and/or FSS components of the critical systems listed. They should be green.

On each station summary page, click on the "Master:Station" link to see a dump of the station and check the line that begins with "*** BEGIN DUMP STATION" to see how long the station has been running (you may need to scroll the screen to the right). If it has restarted recently, try to figure out why. If the station is not running, check the trace file of the corresponding DbServer to begin tracking down the problem. Take steps to restart the DbServer. If the bootstrap recovery does not work, contact an expert or mentor shifter to further diagnose the problem and restart things.

A comment about the "normal" stations -- These stations require minimal support. The local station administrator should handle the problems, not the SAM shifter. The administrator(s) of a given station can be found by issuing the commands:

where <station_name> is the name of the station you want information about.



Restarting a Web Server

The web server for each experiment is different. To check on the machine, look at SAM At A Glance, making sure to refresh the page so that you are not viewing a cached version. If the page has problems loading, it is likely that the 'apache' web server needs to be restarted. Other indications of trouble include reports of problems with the autoRegister cgi script, or blatant hangs, crashes, etc., of related web sites.

If you have to restart a web server, follow these instructions:

The above steps usually solve the problem, but if the output contains one or more java stack dumps complaining about "socket already in use", the java server is hosed. Usually a problem java server ends up hogging a port and is unable to kill itself. If this happens, the shifter must manually kill the remaining processes and explicitly restart apache.

Please see the SAM Webservers documentation for further details.



samTV Tasks

samTV is a great way to check if SAM file delivery is operating normally. Most objects on the page are hot (i.e. "clickable") to get additional information. Investigate nodes with a lot of red/errors. An example error diagnosis can be found here.

DZero SAM shifters should contact the experts if a problem is suspected with samTV.

The "samTV" monitoring web page runs on the web server machine and sometimes needs to be restarted even if the machine itself is okay. To check on the status of the process:

  1. Log in to the web server machine for your experiment.

  2. Type the following setup commands:

    • $ source ~/setups.sh
    • $ setup ups
    • $ setup sam v6_0
  3. To determine the samTV process ID, type:

    • $ ups inquire samTV

    and look near the end of the resulting printout.

  4. Type:

    • $ ps -efl | grep <pid> | grep py

    If grep found anything, then samTV is running and may just be slow. You can use

    • $ ups tailLog samTV

    to see what samTV is up to, but note that the log file may sit for many minutes (e.g. 10) without updating.

  5. If grep did not return anything, then samTV has died and needs to be restarted.
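
The grep in step 4 matches the process ID anywhere on the "ps" line, so it can also match unrelated processes. As an alternative sketch (not the guide's own method), the function below uses the standard "kill -0" test, which sends no signal and only reports whether a process with that ID exists and may be signalled by the current user. The function name is hypothetical, and the check does not confirm that the process is actually samTV.

```shell
#!/bin/sh
# Sketch: test whether a process ID is still alive.

pid_is_running() {
    # $1 = process ID, e.g. the one reported by "ups inquire samTV"
    # kill -0 delivers no signal; its exit status is 0 only if the
    # process exists and the caller is allowed to signal it.
    kill -0 "$1" 2>/dev/null
}
```

For example, "pid_is_running <pid> && echo running" prints "running" only while the process exists. Run it as the user that owns the samTV process, since a permission failure also reports "not running".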

To restart samTV (after having checked on the status of the process as described above):

  1. First, make a copy of the log file so that experts can look at it if necessary:

    • $ ups inquire samTV    (to look for the log directory)
    • $ cp <logdir>/*.log <yourdir>

    where <logdir> is the log directory provided by the "inquire" command, and <yourdir> is a personal directory where you have some spare disk room.

  2. To restart the samTV process, type:

    • $ ups start samTV

    You will see a warning message about the process having disappeared. This is normal. Type the same command again, and you should see samTV starting up.

  3. Wait about a minute, and then

    • $ ups inquire samTV

    should give a new process ID. You can watch the log file using "ups tailLog samTV", but note that it may sit at the same place for many minutes.

  4. It generally takes about 10 minutes before the web page is ready. When you see something similar to "--- At Wed Jul 30 13:07:22 2003 Waiting for 3000 seconds ---" as the last line in the log file, then the page is ready.
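
The readiness check in the last step can be sketched as a small test on the final line of the log. This assumes the log is one of the "<logdir>/*.log" files located earlier and that the ready message always contains "Waiting for <N> seconds", as in the example line above. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed when the last line of a samTV log is the "Waiting for
# <N> seconds" message that signals the web page is ready.

samtv_page_ready() {
    # $1 = path to the samTV log file
    tail -n 1 "$1" | grep -Eq 'Waiting for [0-9]+ seconds'
}
```

For example, "samtv_page_ready <logdir>/<logfile> && echo ready" prints "ready" once the message appears. Because the log can sit unchanged for many minutes, a negative result shortly after a restart is not by itself a sign of trouble.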

Occasionally a new SAM station is created that needs to be monitored via samTV. To add a station to samTV, follow these steps:

  1. Follow the "setup" instructions listed above under "check on the status"

  2. Type:

    • $ setup samTV
  3. Go to the current product version directory:

    • $ cd /home/sam/products/upsdb/samTV/<version>

    where <version> is the current version number of samTV. You can use "ups list samTV" to see the available version numbers.

  4. Edit the configuration file "<nodename>_sam_config_prd.table.dat" where <nodename> is the full name of the web server machine. Add the new station name to the list for parameter "SAMTV_STATIONS", using a colon (":") as a separator between station names. Save the edited file.

  5. Restart samTV as described above.
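
Editing the colon-separated "SAMTV_STATIONS" value by hand makes it easy to add a duplicate or leave a stray colon. The sketch below builds the new value from the old one. It works only on the value string, because the layout of the rest of the configuration file is not described here. The function name and the station names in the example are hypothetical.

```shell
#!/bin/sh
# Sketch: append a station to a colon-separated list unless it is
# already present. Prints the resulting list.

add_station() {
    # $1 = current SAMTV_STATIONS value, $2 = station name to add
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;        # already listed, unchanged
        "::")     printf '%s\n' "$2" ;;        # list was empty
        *)        printf '%s:%s\n' "$1" "$2" ;;
    esac
}
```

For example, add_station "stationA:stationB" "stationC" prints "stationA:stationB:stationC", while adding "stationB" again leaves the list unchanged. Paste the printed value into the configuration file in place of the old one.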



SAM Stations, DbServers, the Naming Service, and the Optimizer

Problems with the SAM Stations or DbServers should be detectable from SAM At A Glance. Entries that have a red dot in front of them might need to be restarted. If a problem is suspected with a specific Station or DbServer, examine the corresponding log files to get further information. To find the log files:

  1. Look at SAM At A Glance - the first column contains a Station or DbServer name, the second column contains the name of the node where the Station or DbServer runs.

  2. As user "sam", log in to the node you looked up in the previous step. Note: only a few people have permission to access this machine. If you try to log in and get a permission denied message, forward the information about the problem you are trying to solve to the experts and they will take care of it.

  3. Go to the ~sam/private directory, and look in the *_server_list.txt file.

  4. On the line which starts with "station", the second word is the <server_name> and the fourth word is the <station_name>. The log files are therefore in the directory "station__<machine_name>__<server_name>__<station_name>", where <machine_name> is the name of the node you are logged in to, and the other two you just looked up. The log file itself is called "trace".

  5. A similar procedure can be used to look up DbServer logs.
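
The lookup in steps 3 and 4 can be sketched as a single awk filter. It assumes the server_list file is whitespace-separated exactly as described above (first word "station", second word <server_name>, fourth word <station_name>). The function name and the sample file contents in any test of it are hypothetical.

```shell
#!/bin/sh
# Sketch: print the log directory name for a station, built from the
# matching "station" line of a *_server_list.txt file.

station_log_dir() {
    # $1 = path to the *_server_list.txt file
    # $2 = <machine_name>, $3 = <station_name>
    awk -v machine="$2" -v station="$3" '
        $1 == "station" && $4 == station {
            # word 2 of the line is the <server_name>
            printf "station__%s__%s__%s\n", machine, $2, station
            found = 1
            exit
        }
        END { exit(found ? 0 : 1) }   # non-zero status if no line matched
    ' "$1"
}
```

Run it on the node itself, for example station_log_dir ~sam/private/<machine_name>_server_list.txt <machine_name> <station_name>. Lines commented out with "#" are skipped because their first word is no longer exactly "station".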

If you need to look up the name of an administrator for a specific station, you can use the Stations Group Report.

Note that the cdf-fncdfsrv0 (farm) station is structured differently and you must read the file "HOWTO.readme" as sam@fncdfsrv0 for more information.

Even if all entries on the SAM-At-A-Glance page have a green indicator, there might still be a problem. If the monitoring system loses contact with a node for too long, that node's entry will simply disappear off of the SAM-At-A-Glance page. To check for this, compare the current page with the example snapshot to see if any nodes are missing.

Listed DbServers with a green dot might still have problems, so you need to look at the entire entry line. For example:

Server                                                  Host:Port             Version  Up Since
fake-server.prd:12345 (Using: some-database@dbmachine)  home2.fnal.gov:67890  v5_1_1   05 May 2007 12:20:39

is a working DbServer. If the comment in parentheses says that it is not connected, then the DbServer needs to be restarted. If you are not sure about this, there are several SAM commands which can be found under samDbServerInfo that can be used for further testing.

Also, CDF Data Handling At A Glance shows the number of DbServers not responding. Click on that number to get 24-hour history plots of connection activity and timing information (with points taken at 5 minute intervals).


Sometimes the SAM At A Glance page will have red text across the top indicating that information on the page has not been updated recently. First, refresh your browser window to make sure the message is not just from your own machine losing connection. If the message is still there after refreshing, either the naming service is not working or the cron job for one or more nodes is not working, and you have to determine which. Go to the NameService page and check for red dots down the left side of the page. Just a few dots indicate a problem with one or more cron jobs; an expert must be contacted to resolve a cron job problem. If there is a red dot in front of all of the DbServers, it is more likely that the naming service is not running and you have to restart it. Before restarting the naming service, please double check that it is dead by trying a command like "sam locate foo" (which will fail if there is a naming service problem).


To check the status of the optimizer, use the command

A working optimizer will respond with "alive". If the optimizer is in a bad state, you will get a CORBA exception which should give you some information about what is wrong. An expert must be contacted to resolve an optimizer problem.


Restarting Stations, DbServers, and the Name Service

If a DbServer or the Name Service has been inactive for more than half an hour, or there are errors in the log file, it should probably be restarted. Use the SAM-At-A-Glance page to find which processes are running on which machines.

Note to DZero shifters: Do NOT restart the farm DB server (D0FarmDbServer.prd:D0FarmDbServer) unless explicitly requested to do so by Mike Diesburg or Daniel Wicke. There are some long queries that must be run periodically that make the server appear to be bogged down. Restarting this server without need disrupts running jobs and queries.

To restart one of the DbServers and/or the Name Service:

  1. Log in as user 'sam' on the appropriate machine where the server is running.

  2. Execute the following:

    where <date_stamp> is the current date plus your initials in the format "YYMMDDHHII" where "YY" is the two-digit year, "MM" is the month, "DD" is the day, "HH" is the hour (out of 24), and "II" are your initials.

  3. Edit the server_list file with your favorite text editor, commenting out the station, server, or optimizer you need to restart by inserting a hash symbol ("#") at the beginning of its line. Make a note of the full name of the server as it appears in the server_list file to use in place of the variable <fullname> in the next step.

  4. Save a copy of the log/trace file by executing the following:

    Do not forget to substitute the correct value for <date_stamp>.

  5. Update the scripts that run the servers by executing:

    This should kill any instances of the process you are trying to restart.

  6. Edit the server_list file again, this time removing the comment hash symbol ("#") you added before.

  7. Update the scripts again to restart the processes by executing:

The procedure for restarting a SAM station is similar to that for restarting a DbServer, except the commented/uncommented line in the server_list file should contain the keyword "station".
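
The <date_stamp> used in steps 2 and 4 can be generated rather than typed by hand. The sketch below follows the "YYMMDDHHII" format given above, using the standard "date" format codes %y, %m, %d and %H. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: build a date stamp of the form YYMMDDHHII, where II are the
# shifter's initials appended to the current date and hour.

make_date_stamp() {
    # $1 = your initials
    printf '%s%s\n' "$(date +%y%m%d%H)" "$1"
}
```

Generate the stamp once at the start of the restart, for example stamp=$(make_date_stamp <initials>), and reuse that value. Calling the function again after the hour changes would produce a different stamp for the saved log copy.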

If the naming service does not restart, there might be a problem with its cached information. Check the trace file using, for example, the less command:

If the trace file contains a line beginning with "IOR" followed by a lot of numbers and letters, then the cache is ok. However, if the trace file contains something like:


	   System exception `COMM_FAILURE'
	   Reason: no such host is known
	   Completed: no
	   Minor code: 1330577422 (gethostbyname() failed)
	   Killed process: 29754
	 

then you might have a corrupted cache. In this case -- and only in this case -- delete the file called "log" which is found in the "sam" directory and restart the naming service as described above. It is recommended that you also restart the optimizer, the critical SAM station(s), and all the DbServers a few minutes after restarting the naming service. After deleting the log file and restarting the naming service all DbServers and stations are disconnected from the naming service and all queued jobs are lost, so be careful when doing this. After an hour or so, the stations and DbServers will have re-registered themselves with the naming service and things should be back to normal.
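
As a first look at the trace file, the two patterns described above can be checked mechanically. This is a sketch only: it scans the whole file, so a trace that contains both an older "IOR" line and a later "COMM_FAILURE" is reported as "mixed" and still has to be read by eye. The function name and the result labels are hypothetical.

```shell
#!/bin/sh
# Sketch: classify a naming service trace file by the two patterns the
# guide describes. Prints one of: cache-ok, possible-corrupt-cache,
# mixed, unknown.

classify_ns_trace() {
    # $1 = path to the naming service trace file
    has_ior=no
    has_fail=no
    grep -q '^IOR' "$1" && has_ior=yes            # a line beginning with IOR
    grep -q 'COMM_FAILURE' "$1" && has_fail=yes   # the exception shown above
    case "$has_ior/$has_fail" in
        yes/no)  echo "cache-ok" ;;
        no/yes)  echo "possible-corrupt-cache" ;;
        yes/yes) echo "mixed" ;;
        *)       echo "unknown" ;;
    esac
}
```

Only a "possible-corrupt-cache" result, confirmed by reading the trace, matches the one case in which deleting the "log" file is appropriate.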


Station Setup and Configuration

Occasionally you may be asked to help set up a new station or re-configure an old station. Instructions for station setup and configuration are available here.

A new page is under development here.



Monitoring Available Enstore Tapes

You can find enstore status information from the Fermilab Mass Storage System on the Enstore home page. Links exist for each experiment, and from the experiment page there is a link to tape inventory information.

If you are outside of the fnal.gov domain, the links on the Enstore pages may not work by simply clicking on them. However, you can still view static versions of the pages (generated every hour) by explicitly typing in the page addresses. The URLs for the pages mentioned above are:

The enstore group now adds new tapes so shifters should not need to worry about this, but if you notice problems (such as tapes marked "NOACCESS" on the Data-Handling-At-A-Glance page) send email to enstore-admin and helpdesk (CC: ).

If large numbers of tapes (more than ~10) become "NOACCESS" in the space of a couple hours or less, it could mean there is a serious hardware failure in one of the tape robots. The helpdesk should be called and asked to page the Enstore primary expert. Request that the expert keep you informed of the situation by email or phone, and you need to do the same for the admins of any affected experiments. If you call the helpdesk after hours, be sure to tell the message taker the keyword "" (spell it for them).

If a tape stays "NOACCESS" for more than one weekday and there is no comment in the detailed listing (page reached by clicking on the "Bad Tape" link, comment is the last column on the right), send email to enstore-admin (CC: ) asking about its status. A comment in the text listing indicates that an expert has looked at the problem at least once since they are the only ones who can enter the comments.

If a tape is marked as "NOTALLOWED", then it has been investigated by experts and needs to have more work done before it is usable again. Unless a tape has been in this state for a long time and users are demanding access to it, tapes in the "NOTALLOWED" state can be ignored.

Another serious but very rare tape status problem occurs when the encp synch cron job becomes corrupted and begins marking all tapes with state "DELETED". This would be signaled on the same web page as above (and also by catastrophic file delivery errors from projects accessing the tape!). If this happens, it is necessary to contact the on-call expert to get the cron job turned off and the tape status corrected.



Monitoring and Restarting Calibration Servers

There are a number of calibration DbServers, each running a separate user and farm instance.

  • CftDbServer.user_prd
  • CftDbServer_oracle.user_prd
  • ConfigDbServer.user_prd
  • CpsDbServer.user_prd
  • D0DbServer.user_prd
  • MuonMDTDbServer.user_prd
  • MuonMSCDbServer.user_prd
  • MuonPDTDbServer.user_prd
  • SmtDbServer.user_prd
  • SmtDbServer_oracle.user_prd
  • CftDbServer.farm_prd
  • CftDbServer_oracle.farm_prd
  • ConfigDbServer.farm_prd
  • CpsDbServer.farm_prd
  • D0DbServer.farm_prd
  • MuonMDTDbServer.farm_prd
  • MuonMSCDbServer.farm_prd
  • MuonPDTDbServer.farm_prd
  • SmtDbServer.farm_prd
  • SmtDbServer_oracle.farm_prd
  • ConfigDbServer (kept for backward compatibility)
  • TriggerDbServer

(Note: there are two servers for the CFT and SMT: a fast cache and a DbServer (whose name includes oracle). If there are problems with these, first restart the oracle server. If that does not work, restart the cache server.)

SAM NameService: Registered DbServers gives the status of these servers in addition to the regular SAM DbServers. Check that there is no red indicator beside the trigger server, and that all other calibration server indicators are green. If any red or yellow indicators persist for more than a couple of minutes, follow the procedures at "Stop and Start DØ DbServers" to restart the affected servers. Make sure that you go to the appropriate node to perform the restart operation! When some nodes have problems, some servers 'failover' to run on another node. The appropriate node that needs to be restarted is indicated in the NameService list under "Host".

If you receive user complaints that a particular server is not functioning correctly, or if there are problems with farm reconstruction which may be caused by a calibration server, then here is a test program that can be run on any machine with a DØ code release:

where <recent_release> is an optional release version number and <server_type> is either "--user" or "--farm". If the program hangs on a particular server, or exits with an error, it is likely you have found the problem.

Notes:

Log files for the calibration servers can be found at D0db dbServer Diagnostics. If a server shows no activity for some time, or its log file shows no sign of any successful queries, restart it.

If you are unable to resolve the problem then contact an expert at .

Shifters may ignore servers that are not standard user or production servers (i.e. that are not listed on SAM-At-A-Glance) or that are running on offsite hosts.

Before you can perform this duty effectively, please use the Stop and Start DØ DbServers instructions to practice stopping/starting the servers on d0dbsrv7.fnal.gov (the development node), logging in as user "d0db". After you succeed in doing so, you should let the mailing list know and ask to have your root Kerberos principal added to the appropriate places so that you will have the authorization to stop/start any calibration server when it is necessary.



SAM GridFTP

This section contains information for those who are allowed to approve requests for a service certificate. THIS IS NOT TRUE FOR ALL SAM SHIFTERS. The ones who are allowed to do the approval will know that they have been given this privilege.


How to register a new SAM GridFTP installation

GridFTP clients and servers use an X509-based Security Infrastructure (GSI) to establish a security context. Clients and servers have an X509 identity associated with them: their sam server certificate. Servers rely upon a local file (grid-mapfile) that lists the identity of the clients authorized to transfer files. At Fermilab, we maintain a central copy of all the known sam_gridftp clients in the sam_gsi_config CVS repository (see CVS repository for more info); the sam_gsi_config product provides tools for the SAM station administrator to securely install this central list at their local GridFTP servers.

Instructions for requesting a certificate are available on the "Installation of gridftp" webpage, under "Certificate renewal". If this is a new installation, you can ignore all the instructions about the 'old certificate'. Instructions for testing that sam_gsi_config and the certificate are installed properly are at "Test your certificate" and "Test of sam gridftp".

The installer of sam_gridftp is asked to send the identity of their new clients to . The identity of the client is defined by a string called the "certificate subject". To add a new client certificate subject (your default login is often sufficient) follow these steps:

  1. Set up cdcvs (whatever that entails for the machine you are using). Note: On clued0, a previously executed kinit <username>/root causes problems with CVS access. Either log out and log back in, or use kinit -f when returning to your default account.

  2. Execute this command:

  3. Edit the grid-security/grid-mapfile for your experiment by appending a new line of the form:

    where <full.machine.name> is the full alias of the new machine (so that a DNS can address it). It is VERY IMPORTANT that the certificate subject string be in quotes. The word "sam" at the end of the line means 'allow account file transfers for UID sam'.

  4. Execute the command:

A cron job for the user 'sam' will make the new entries available for distribution to remote institutions every hour, or you can follow the directions below for updating sam_gsi_config.
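
As a minimal sketch of the line appended in step 3, the helper functions below (hypothetical names, not part of sam_gsi_config) print a grid-mapfile entry in the standard layout: the certificate subject in double quotes, a space, and the local account. The subject shown in the usage comment is invented for illustration; use the one sent by the installer.

```shell
# Print one grid-mapfile line: the quoted certificate subject, a space,
# and the local account the transfers are mapped to (default: sam).
gridmap_entry() {
    subject=$1
    account=${2:-sam}
    printf '"%s" %s\n' "$subject" "$account"
}

# Append the entry to a grid-mapfile unless an identical line already exists.
add_gridmap_entry() {
    mapfile_path=$1
    entry=$(gridmap_entry "$2" "${3:-sam}")
    if ! grep -qxF -- "$entry" "$mapfile_path" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$mapfile_path"
    fi
}

# Example (hypothetical subject and path):
#   add_gridmap_entry grid-security/grid-mapfile \
#       "/DC=org/DC=example/OU=Services/CN=sam/node1.example.edu" sam
```

Building the line with printf keeps the quotes around the subject, which is the part most easily lost when editing by hand.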


How to re-register a SAM GridFTP installation when the service certificate has expired

When a SAM service certificate for a certain host expires, the station administrator can submit a request for a renewal certificate. In this case, follow the same instructions as for "How to register a new SAM GridFTP installation", but replace the old certificate subject with the new one instead of simply appending a new line.
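
The replacement can be sketched as follows, assuming the standard grid-mapfile layout of a quoted subject followed by the local account. The function name and the subjects in the usage comment are hypothetical. The file is rewritten in place, so its ownership and permissions are unchanged.

```shell
# Replace the line carrying an expired certificate subject with the new one.
replace_gridmap_entry() {
    mapfile_path=$1
    old_subject=$2
    new_subject=$3
    account=${4:-sam}
    [ -r "$mapfile_path" ] || return 1
    tmp_path=$(mktemp) || return 1
    # Keep every line that does not contain the old quoted subject.
    # grep exits non-zero when nothing is left to print, which is not an error here.
    grep -vF -- "\"$old_subject\"" "$mapfile_path" > "$tmp_path" || true
    printf '"%s" %s\n' "$new_subject" "$account" >> "$tmp_path"
    # Write back through the existing file so its mode and owner are preserved.
    cat "$tmp_path" > "$mapfile_path"
    rm -f "$tmp_path"
}

# Example (hypothetical subjects):
#   replace_gridmap_entry grid-security/grid-mapfile \
#       "/DC=org/DC=example/CN=sam/old.example.edu" \
#       "/DC=org/DC=example/CN=sam/new.example.edu"
```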


How to approve a service certificate

The following is the procedure that DOEGrids and SAM-Grid have agreed upon to approve a SAM service certificate:

  1. Contact the sam service certificate requester and verify that s/he has initiated the request. Valid communication mechanisms are: If the requester and the activities are known (for example, the requester attends regular meetings), then a simple email communication is sufficient.
  2. Record the details of the communication with the requester in an email to the helpdesk that approves the service certificate. The email must be digitally signed.

How to update the version of sam_gsi_config

Certificate renewal will sometimes require that the version of sam_gsi_config be upgraded. To do this, log onto the machine that needs to be upgraded and do the usual setups. Then execute:

where <new_version> is the new version number to install. If you just want the newest version number, this parameter can be omitted. Next:

This should complete the sam_gsi_config installation.



Other Tasks


Updating Documentation

Please note that the following does not apply to SAM FAQs which can be edited by clicking the "editable" link at the top of that webpage.

As a shifter, if you notice some incorrect, outdated, or missing documentation regarding SAM, you should update the documentation package appropriately. To do so, you must have access to the "cdcvs" code repository as described in What You Need to Know to Start. Remember, these actions should be done under your normal personal Kerberos principal (not a root principal). Create and change into a directory on the Linux machine of your choice into which you will download the files to be modified. Then execute these commands:

The ls command will show you the files available for editing. Most names are self-explanatory and all web page files carry the 'html' extension. In addition, the file name can be checked against what is displayed in your browser's address window. Edit files using your favorite program. When you are done with your changes, execute the following:

where <your_comments> should be a summary of the changes that were made and the reason(s) for making them, and <file_list> is a whitespace-separated list of the files to be affected by the CVS command. If you omit the list of files, all files in and below the current directory that have been modified will be updated in CVS. If you omit the -m option, CVS will open a text editor for comment entry when you execute the commit command. Saving and exiting from the editor will complete the update. Exiting without saving will abort the update. You can specify which text editor you want CVS to use by setting the environment variable "CVSEDITOR". The tag command identifies the updated files as available for general use. New versions of documents will automatically appear on the web servers within 24 hours (many appear within one hour).
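
The commit step can be wrapped in a small helper so that the comment and file list can be reviewed before anything is sent to the repository. This is only a sketch: commit_docs and DRY_RUN are hypothetical names, and only the standard "cvs commit -m" form is used. The tag step is not covered here because the tag name depends on the documentation package.

```shell
# Commit the listed files with a comment; with DRY_RUN=1, only print the command.
# With no file arguments, cvs commits every modified file in and below the
# current directory, as described above.
commit_docs() {
    message=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'cvs commit -m "%s"' "$message"
        for f in "$@"; do
            printf ' %s' "$f"
        done
        printf '\n'
    else
        cvs commit -m "$message" "$@"
    fi
}

# Example (hypothetical file names):
#   DRY_RUN=1
#   commit_docs "Fix outdated link on shift page" shift_guide.html
```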

For more information about CVS, please see the CVS home page. Additional information is also available on the SAM Webservers page.



Remote Station Registration Procedures

Command line interfaces are available through the "sam_admin" package for most common requests.

See the sam_admin documentation for more details.

An email request for a new remote station must include the following information:

  1. Station Name
  2. Full names of the nodes that will be included in the station
  3. Operating system of each node (usually "linux")
  4. Hardware type of each node (usually "pc")
  5. Full name of the node that will host the Station Master
  6. Username(s) for the station administrator(s)

Requests often include only the node name(s) and station name. It is usually safe to assume that the operating system is "linux", the hardware is "pc", and the administrator is whoever sent the email. If it is not specified, you can pick which node will host the Station Master. Finding a person's SAM username just requires a little research.

When issuing samadmin commands, you will be prompted for a "username" and "password". Be sure to use the username and password you normally use on the production database machine.

Here are the steps to register a new station:

  1. Use the following samadmin command from a capable machine to register the station:

    where <station_name> is the name of the new station, <user_names> is a space-separated list of the new administrator username(s), and <description> is a brief description of the station. As an example, if someone from 'X University' is requesting the station, just use "XU Station" as a description. Note that the quotation marks ("...") are necessary for each option's argument.

  2. If there is more than one machine in the new station, you have to register each node individually. To register a node in the production database, use:

    where <hardware> is the hardware type (probably "pc"), <operating_system> is the operating system type (probably "linux"), and <node_name> is the name of a machine in the new station.

  3. Set up an upload cache area on d0rsam01 for the new station by logging in to any clued0 machine as 'sam' and running:

    This creates the upload directories on d0rsam01. The add_d0rsam01_store_dir script should print out the rest of the commands which need to be run, or you can continue following these instructions. Add the location of the new upload area to the database with the following two commands:

  4. For a station to be able to transfer files from Fermilab, the station routing options must be set in a text-based configuration file that will be read by "sam_bootstrap". Make sure the new station's admin includes the following options in the "station" entry of that file:

    --routing-station=<file_pattern>::<router_name>
    where <file_pattern> should be a file location pattern written as a basic regular expression and <router_name> is the name of the proxy router the station should be using.
    --routing-group=<group_name>
    where <group_name> is the name of the experimental group who owns the station.
    --routing-user=<dummy_username>
    where <dummy_username> is a dummy user 'account' name under which the router will hold intermediate files. It should be a name that identifies the station.

    The following settings are optional (i.e. there is a pre-defined default):

    --routing-station-metrics=<router_name>::<concurrent_files>
    where <router_name> is again the name of the proxy router the station should be using, and <concurrent_files> is the number of files to be transferred concurrently from the router (the default is 1).
    --routing-station-group=<router_name>::<dummy_groupname>
    where <router_name> is the same as above and <dummy_groupname> is a dummy 'account' group under which the router will hold intermediate files. The dummy user 'account' listed above must belong to this dummy group, and this group overrides the file's original group(?).
    --routing-public-node=<stager_node>
    where <stager_node> is the name of a node in the station to be used to stage intermediate files. The default is any available node.

    Here is a fictional example:

  5. Remote stations should use gridftp for file transfers. The use of bbftp is no longer supported.

  6. SAM stations in Germany should use "d0karlsruhe" as a proxy.
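
As a rough sketch of the three required routing options from step 4, the function below prints them in the form given there. The function name and the values in the usage comment are made up for illustration, and the layout of the rest of the "station" entry in the sam_bootstrap configuration file is not shown.

```shell
# Print the three required routing options, one per line.
routing_options() {
    file_pattern=$1      # basic regular expression matching file locations
    router_name=$2       # proxy router the station should use
    group_name=$3        # experimental group that owns the station
    dummy_username=$4    # dummy 'account' name that identifies the station
    printf '%s\n' "--routing-station=${file_pattern}::${router_name}"
    printf '%s\n' "--routing-group=${group_name}"
    printf '%s\n' "--routing-user=${dummy_username}"
}

# Example (hypothetical values):
#   routing_options '.*' d0karlsruhe dzero xu_station
```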


Installation instructions for the end-user are available here:

Additional information can be found on the CDF Grid Computing page



Marking SAM files as bad

Sometimes you will get a request for some files in SAM to be marked as "bad". This is usually a legitimate request, but if you are unsure, contact a SAM expert for approval before you modify anything. There are three scripts available here with which you can modify a SAM file's status. It is recommended that you use the scripts instead of typing commands directly on the command line because doing so better protects your database password. Just make sure that the permissions on the script file itself are set so that only you can read it (e.g. "-rwx------"), and do not leave your password in the file unless you are actively using it.
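
One possible way to set and verify the "-rwx------" permissions is sketched below. The function names are hypothetical; chmod 700 is the numeric form of that mode.

```shell
# Restrict a script to its owner (read, write, execute for the owner only).
lock_down_script() {
    chmod 700 "$1"
}

# Succeed only if ls reports the mode as exactly -rwx------.
is_owner_only() {
    mode=$(ls -ld "$1" | cut -c1-10)
    [ "$mode" = "-rwx------" ]
}

# Example:
#   lock_down_script declare_sam_file_bad-singlefile.sh
#   is_owner_only declare_sam_file_bad-singlefile.sh && echo "permissions OK"
```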

In each script, there are parameters which must be updated before each use of the script. The parameters are described in the comments within the file. To save a script to a local directory, you can use one of two methods:

Here are the scripts:

declare_sam_file_bad-singlefile.sh
Script for setting the "status" of a SAM file to "bad". Use this version when you have a single file.
declare_sam_file_bad-multifiles.sh
Script for setting the "status" of a SAM file to "bad". Use this version when you have a short list of files.
declare_sam_file_bad-listfile.sh
Script for setting the "status" of a SAM file to "bad". Use this version when you have a text file containing a list of the file names to be modified.



How to add a new application version in SAM

To add a new application version in SAM you need the following input from the user: appVersion, appFamily and appName. Execute the following commands after having done 'kinit -F username/root':

where "xxx", "yyy" and "zzz" are replaced with the information from the user. The output will be the new applicationFamilyId. (For an example, see Issue #3550 in the SAMGrid Issue Tracker.) To check that the new application version is really there, you can search for the applicationFamilyId on "this page".



Original Author: Kin Yip
Last modified: Fri Sep 28 12:18:07 CST 2007
$Author: kinyip $
$Revision: 1.178 $