SAM Shift Guide
The contents of this guide can vary depending on the experiment.
The SAM shift person, in order to be able to diagnose and/or fix problems, must first make sure that the following particulars are satisfied:
The shifter should be subscribed to at least the following mailing lists:
Additional mailing lists may be required depending on the experiment. Mailing lists are archived on the FNAL LISTSERV server. If the mail traffic is heavy on the lists, the shifter may choose to temporarily disable the mail delivery between shifts. Please review the LISTSERV Instructions for Users to see how to subscribe and for additional information.
Note for DZero: Users should send email regarding SAM problems to d0sam-admin. Messages sent to this list are processed through the plone Issue Tracker before being forwarded to all shifters. If a SAM-related problem email appears on one of the other lists, the shifter should
Also note: Sometimes CDF users send queries to sam-admin or sam-users instead of the CDF-specific lists (cdfsam-admin and cdfsam-users). The general names are at present (4/10/05) aliased to the DØ lists. (We hope this will change soon.) For the time being, if a CDF query comes to the DØ list, email the user with the name of the correct CDF list to use.
The shifter must be able to log in as username
'sam' on all of the "sam-critical" nodes. This
involves having your username added to the .k5login file
on the appropriate machines.
NOTE: New shifters will not be added to the .k5login
files for the sam account until they have become familiar with the
system, the requirements, etc. If you are a new shifter, and cannot
log in to the sam account yet, please ask an experienced shifter
for assistance with steps to be performed as user
'sam'.
The shifter must also obtain a special "username/root"
principal in addition to your normal principal by filling out the
FNAL kerberos form or by sending email to
compdiv and asking for a special
"username/root" principal. Make sure that your
"username/root" principal has been added to the
.k5login file for the sam account on
the relevant nodes. To use this account, ssh
(do not telnet using a cryptocard) to any
kerberized node in the Fermilab domain with your normal username.
Then:
where <username> is your normal
user name. When prompted, enter your "username/root"
password. To activate root privileges on the machine
you are logged in to, type:
To log in to another node with root privileges, type:
where <nodename> is the name of
the machine you want to log in to.
The shifter should have an oracle username/password for SAM-related
databases in order to add/modify database information. For CDF,
this is the Offline Production database "cdfofprd".
To request an account, send an email to
cdfdb-support
(CC: cdf_dh_help) which
includes the following information:
The shifter should have an oracle username/password for SAM-related
databases in order to add/modify database information.
For DZero, this is the Offline Production database
"d0ofprd1". To request an account, fill in the
D0 Database
User Access form, selecting:
The shifter should have an oracle username/password for SAM-related databases in order to add/modify database information. The method of making a request for an account varies by experiment.
Once an account is created, you should change
the initial password to one of your own creation as soon as possible.
This can be done using a program called sqlplus. On
an appropriate machine type:
where <username> is your database user
name and <dbname> is the name of the
appropriate database machine. When prompted, enter your current
password. Inside the sqlplus program, type this line to change your
password:
where <username> is your user name
again, <new_password> is the new
password you want to use, and
<old_password> is the old password you
are replacing. To exit sqlplus, type "quit".
This process must be repeated for each database on which you
are given an account.
- Read access
- Send email to the helpdesk with your Kerberos principal requesting access to the "cdcvs" repository. Include the name of your supervisor or sponsor.
- Write access
- Send email to the SAM representative for your experiment explaining that you are a SAM shifter and need write access for SAM packages.
When encountering a problem on shift, first consult SAM Triage for assistance. Also try searching the SAM FAQ page, the dzero sam-admin archives, and the cdf sam-admin archives for precedents. Try to find out as much as possible about the problem so that others can help you more easily, and so that similar occurrences in the future can be solved more easily.
Shifters have several responsibilities, including monitoring many aspects of the system, as outlined here:
Keep a log of your shift activity and email copies to the appropriate shifter list for your experiment.
Attend the brief, semi-daily teleconference on CDF SAM operations (Monday through Friday except Tuesday, 8:45am Chicago time) and Tuesday's 9:30am general SAM operations meeting (FCC1). The teleconference can be attended via either EVO or ESNET. For EVO, a meeting room has been set up in the 'CDF' community called 'SAM DH Operations'. For ESNET, use the meeting number 88SAMDH (8872634; the 'DH' stands for 'data handling'). For a simple telephone connection, dial +1-510-883-7860, wait for a signal, and then dial 88SAMDH. If these options do not work, send an email with an exact description of the problem to sam-design and cdf_dh_help.
Know who the current on-call SAM expert is by checking the On-Call List.
Monitor the mailing lists and the Issue Tracker for reports of problems or requests for information from users. Attempt to help when you can, especially in pointing people to existing documentation. If there is a problem you cannot fix or question you cannot answer, please provide an expert with as detailed a report as possible.
We would like to see a shifter response to every issue on the list within some reasonable window. Some messages are user questions about how SAM works. Some messages are reports of problems. The educational aspect of answering these emails promptly and correctly is very important. Even if your response is that you do not know the answer or solution and that you need to send it "over to an expert", answer the email.
If you do not know an answer immediately, please try to research the problem using the existing documentation and tools. If that fails and you turn the problem over to an expert, please follow up on it:
Post important information regarding SAM operations to the user mailing lists. The content of such notices could be that some portion of the system is down, that there is a problem that is known and being worked on, that there is a downtime in another system that is affecting SAM, etc.
Monitor the following web pages:
- SAM At A Glance
- The page displays the status of the NameServer, DbServer, and individual stations.
Under the "SAMDbServers" section of the SAM-At-A-Glance page, some server listings are labelled with the connection machine they are "Using:" while others are labelled as "Dispatching" to other servers. In most cases, the "receiving" servers share the same name as the "dispatching" server with the addition of the string "_srv<N>_", where <N> is an integer. In rare cases you will have to use the long way to find out which "receiving" servers the "dispatching" DbServers talk to:
- Note which machine the DbServer is running on (under the "Host:Port" heading).
- Log in to the executing node as user "sam" and do the usual setups.
- Look in the "private/*server_list*" file for that machine to get the "sam_config" configuration name ("<configname>") for the DbServer you are interested in. Each available configuration name is on its own line that begins with "dbs". Unfortunately there is no easy way to tell which configuration a particular DbServer is using. Use the "ups inquire" command to see if the DbServer you are interested in is listed as "SAM_DB_SERVER_NAME" for a particular <configname>:
- $ ups inquire sam_config -q <configname> | grep SAM_DB_SERVER_NAME
- Use the "ups inquire" command to find the "SAM_DB_SERVER_CONFIG_FILE" name ("<configfile>"):
- $ ups inquire sam_config -q <configname> | grep SAM_DB_SERVER_CONFIG_FILE
- Look in the "private/conf/dbserver/<configfile>" file to find the servers that are being dispatched to. They are listed as "remoteServers":
- $ grep remoteServers private/conf/dbserver/<configfile>
A similar procedure can be used to find the log files for a station ("<stationname>").
- Note the "Host:Port" machine name ("<machinename>") for the station you are interested in.
- Log in to that machine as "sam".
- Look at the contents of the "private/<machinename>_server_list.txt" file to find the line beginning with "station" that has <stationname> as the fourth word in the line.
- The second word in that "station" line is the station log tag ("<logtag>"). The log files are therefore in the directory "station__<machinename>__<logtag>__<stationname>".
- NetStat
- A summary, by node, of the count of recent established network connections. We do not really have an example of what this plot should look like. Just keep an eye out for trends that do not return to the "average".
In case of problems, try to:
Update the SAM database as necessary.
Usually requests come through email to the
sam-users
mailing list and consist of adding new nodes, new stations,
new application families, new users (should auto-registration fail),
etc. If there is a request that cannot be done with the
sam_admin
commands and you are unfamiliar with the other ways of changing the
database, do not do it!
Restart web servers as necessary.
Monitor the status of Enstore for your experiment.
Check the "Enstore System Summary" and see if the status dots are green. These dots are useful and can tell you if the system is in trouble. Lots of red and yellow is bad; however, an occasional red mover is not uncommon. If the system is in very bad shape for more than 20 minutes, or minor problems persist for more than 4 hours during the business day, then send an email to enstore-admin (CC: cdf_dh_help) describing the problem. Enstore support has its own monitoring and often catches problems without our help. If you send email, do not be surprised if you get a reply saying they are already working on it. An unnecessary email is preferable to having a problem go unnoticed, but also try to be patient.
Check the "encp History" page, under "Files Transferred" (scroll half way down the page), and make sure requests are being processed (hit the browser's refresh button and see if the entry at the top of the list changes). You can investigate any errors by clicking on the error for additional information before deciding how to proceed. When in doubt about an error, you can try sending email to enstore-admin with a CC to cdf_dh_help.
You should expect to see transfers from the following nodes:
- d0olf-gb-1 to <volumename>
- d0rsam01 to/from <volumename>
- d0srv<nnn> to/from <volumename> (cabsrv1 & cabsrv2 disk caches)
If a category is missing from the list, check the Enstore SAM-At-A-Glance page to see if the "Servers" and "Library Managers" are green. Also check the "Movers". If more than one or two non-crossed-out movers are yellow or red, contact the Enstore primary expert (detailed instructions here).
A few CDF tapes are in the public enstore area, which is called "STKen". Other experiments have much larger amounts of data on this system and they tend to monitor it pretty well. Still, you should check that requests are being processed (you have to hit the refresh button to make sure you are seeing the latest information). Anything that has been in the "dequeued" state for more than an hour indicates a problem and should be reported via a call/email to the helpdesk (if open) and cdf_dh_help or contact the on-call SAM expert.
Monitor and assist in the Recocert histogram verification procedure. The executable that generates certification plots is currently being run automatically on the production farms. There are three parts to the shifter's responsibility:
recocertSAMMerg.sh
which merges ROOT output files into a single file per run number.
Please reference this instructional page for more details.
Users create and check sam datasets via the command line or the web. The user_prd DbServers, the web_prd DbServers, and the web servers must be functioning to enable this. There are two items to check:
The links below go through the SAMzilla server to display the log files for each DbServer. Check each to see if there are recent entries at the end of the log.
If the log files do not have recent entries, try executing a dataset command. If the command fails, restart the appropriate server.
Try to load the
Dataset Definition Editor.
If the page does not load, or cannot find a known dataset (for
example, "testdde"), then
restart the web server.
Every 2 to 3 hours, check the disk space in the SAM logfile area. To do this:
Log in to d0ora2.fnal.gov as user
"sam".
Execute the following commands:
From the "df" command, you should see something
like this:
| Filesystem | Size | Used | Avail | Use% | Mounted on |
| /dev/dsk/c5t0d2s0 | 32G | 17G | 15G | 52% | /system/disk36 |
If the "Use%" gets to 100% on this disk, SAM
worldwide will halt, and the cleanup procedure is tedious.
If you see the "Use%" approaching 100%, take action!
From the directory ~sam/private, try:
to show directories with more than 1G of log files. Delete the oldest files first, as these have probably already been transferred to tape for storage during the nightly purges. Files from the most recent three days are generally kept on disk for debugging purposes, but if the disk is in imminent danger of filling up, the earliest log files are the most dispensable.
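The "Use%" check described above can be scripted. The following is a minimal sketch, not part of the official shift tools: the function name warn_if_full and the default threshold of 90 are assumptions chosen for illustration. It reads "df" output on standard input and prints a warning for each filesystem at or above the threshold.

```shell
#!/bin/sh
# Hypothetical helper: warn about filesystems whose Use% is at or above a
# threshold. Reads `df` output on stdin; the default of 90 is an assumed value.
warn_if_full() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR == 1 { next }                    # skip the header row
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {     # the Use% column
                    pct = $i
                    sub(/%/, "", pct)
                    if (pct + 0 >= limit + 0)
                        printf "WARNING: %s is %s%% full\n", $NF, pct
                }
            }
        }'
}
```

For example, "df -h /system/disk36 | warn_if_full 90" prints nothing while the disk is at 52% and prints a warning line once usage reaches 90% or more.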
Update this documentation or at least point out to experts when documents are needed, out-of-date, etc.
Additional monitoring tools are available on the SAM diagnostics page.
- - SAM Plots and Statistics
- The link sits just under the page title and leads to "SAM Production Plots" and "SAM Consumption Plots". These plots are useful for monitoring the behavior of SAM over time.
- - CORBA Name Service Polling
- This is in the third box on the left and is a quick way of having the naming service check the status of any individual registered station (through the drop-down text box), or multiple stations grouped by type (through the links).
There are times, however, when these pages will indicate SAM problems when the real culprit is something in the web server configuration itself. If the web-based tools from the diagnostics page indicate a serious problem, it is good to double-check using a different tool (e.g., run an appropriate SAM command on a relatively stable system) before attempting any corrective action.
A shifter should check that the DbServers and the critical SAM stations (labeled as "Monitor Level: Critical") are running using SAM At A Glance.
Check the last update time of the page (third line from the top) to be sure the information is up-to-date. You will need to scroll down to see all of the DbServers. You can click on a station name to bring up a status page for the processes running on that machine. Check the status of the Station Master and/or FSS components of the critical systems listed. They should be green.
On each station summary page, click on the
"Master:Station" link to see a dump of the station and check the line
that begins with "*** BEGIN DUMP STATION"
to see how long the station has been running (you may need to scroll
the screen to the right). If it has restarted recently, try to
figure out why. If the station is not running, check the trace file of
the corresponding DbServer to begin tracking down the problem. Take
steps to restart the DbServer. If the bootstrap recovery does not work,
contact an expert or mentor shifter to further diagnose the problem and
restart things.
A comment about the "normal" stations -- These stations require minimal support. The local station administrator should handle the problems, not the SAM shifter. The administrator(s) of a given station can be found by issuing the commands:
where <station_name> is the name of the station you want information about.
The webserver for each experiment is different. To check on the machine, look at SAM At A Glance, making sure to refresh the page so that you are not viewing a cached version. If the page has problems loading, it is likely that the 'apache' web server needs to be restarted. Other indications of trouble include reports of problems with the autoRegister cgi script, or blatant hangs, crashes, etc., of related web sites.
If you have to restart a web server, follow these instructions:
The above steps usually solve the problem, but if the output contains
one or more java stack dumps complaining about
"socket already in use", the java server is hosed. Usually a
problem java server ends up hogging a port and is unable to kill itself.
If this happens, the shifter must manually kill the remaining processes
and explicitly restart apache.
Please see the SAM Webservers documentation for further details.
samTV is a great way to check if SAM file delivery is operating normally. Most objects on the page are hot (i.e. "clickable") to get additional information. Investigate nodes with a lot of red/errors. An example error diagnosis can be found here.
DZero SAM shifters should contact the experts if a problem is suspected with samTV.
The "samTV" monitoring web page runs on the web server
machine and sometimes needs to be restarted even if the machine itself
is okay. To check on the status of the process:
Log in to the web server machine for your experiment.
Type the following setup commands:
To determine the samTV process ID, type:
and look near the end of the resulting printout.
Type:
If grep found anything, then samTV is
running and may just be slow. You can use
to see what samTV is up to, but note that the log file
may sit for many minutes (e.g. 10) without updating.
If grep did not return anything, then
samTV has died and needs to be restarted.
To restart samTV (after having checked on the status of the process as described above):
First, make a copy of the log file so that experts can look at it if necessary:
where <logdir> is the log directory
provided by the "inquire" command, and
<yourdir> is a personal directory where
you have some spare disk room.
To restart the samTV process, type:
You will see a warning message about the process having
disappeared. This is normal. Type the same command again, and
you should see samTV starting up.
Wait about a minute, and then
should give a new process ID. You can watch the log file using
"ups tailLog samTV", but note that it may sit at the
same place for many minutes.
It generally takes about 10 minutes before the web page is ready.
When you see something similar to
"--- At Wed Jul 30 13:07:22 2003 Waiting for 3000
seconds ---" as the last line in the log file, then the
page is ready.
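Rather than watching the log by eye, the "Waiting for ... seconds" marker can be tested for directly. This is a sketch only: the function name samtv_page_ready is hypothetical, the log file path must be supplied by the caller (it is not given here), and the pattern is modelled on the single example line quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the last line of the
# given log file looks like the "Waiting for N seconds" marker.
samtv_page_ready() {
    tail -n 1 "$1" | grep -Eq '^--- At .* Waiting for [0-9]+ seconds ---[[:space:]]*$'
}
```

A shifter could run 'samtv_page_ready <logfile> && echo "page ready"' every minute or so after a restart. Because the log can sit unchanged for many minutes, a failing check shortly after a restart is not by itself a sign of trouble.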
Occasionally a new SAM station is created that needs to be monitored
via samTV. To add a station to samTV,
follow these steps:
Follow the "setup" instructions listed above under "check on the status"
Type:
Go to the current product version directory:
where <version> is the current version
number of samTV. You can use
"ups list samTV" to see the available version
numbers.
Edit the configuration file
"<nodename>_sam_config_prd.table.dat"
where <nodename> is the full
name of the web server machine. Add the new station name to
the list for parameter "SAMTV_STATIONS", using a colon
(":") as a
separator between station names. Save the edited file.
Restart samTV as described above.
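When editing the "SAMTV_STATIONS" value by hand it is easy to add a duplicate entry or a stray colon. The sketch below shows one way to build the new colon-separated value safely. The function name add_station and the station names in the usage note are made up for illustration, and the sketch handles only the value itself, not the surrounding syntax of the configuration file.

```shell
#!/bin/sh
# Hypothetical helper: append a station name to a colon-separated list,
# leaving the list unchanged if the station is already present.
add_station() {
    list="$1"
    new="$2"
    case ":$list:" in
        *":$new:"*) printf '%s\n' "$list" ;;        # already listed
        "::")       printf '%s\n' "$new" ;;         # list was empty
        *)          printf '%s\n' "$list:$new" ;;   # append with a colon
    esac
}
```

For example, 'add_station "stationA:stationB" stationC' prints "stationA:stationB:stationC", which can then be pasted into the configuration file.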
Problems with the SAM Stations or DbServers should be detectable from SAM At A Glance. Entries that have a red dot in front of them might need to be restarted. If a problem is suspected with a specific Station or DbServer, examine the corresponding log files to get further information. To find the log files:
Look at SAM At A Glance - the first column contains a Station or DbServer name, the second column contains the name of the node where the Station or DbServer runs.
As user "sam", log in to the node you looked up in the
previous step.
Note: only a few people have permission to access this machine.
If you try to log in and get a permission denied message,
forward the information about the problem you are trying to solve
to the experts
and they will take care of it.
Go to the ~sam/private directory, and
look in the *_server_list.txt file.
On the line which starts with "station", the second
word is the <server_name> and the
fourth word is the <station_name>.
The log files are therefore in the directory
"station__<machine_name>__<server_name>__<station_name>",
where <machine_name> is the name of the
node you are logged in to, and the other two you just looked up. The
log file itself is called "trace".
A similar procedure can be used to look up DbServer logs.
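The lookup described above (second and fourth words of the "station" line) can be sketched as a small shell function. The name station_log_dir is hypothetical, and the sketch assumes the fields in the server_list file are separated by whitespace. It prints only the directory name, since the parent directory is not specified here.

```shell
#!/bin/sh
# Hypothetical helper: print the log directory name for a station, given the
# server_list file, the machine name, and the station name.
# Assumes whitespace-separated fields; commented-out lines are skipped
# because their first word is not exactly "station".
station_log_dir() {
    server_list="$1"
    machine="$2"
    station="$3"
    awk -v m="$machine" -v s="$station" '
        $1 == "station" && $4 == s {
            printf "station__%s__%s__%s\n", m, $2, s
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }' "$server_list"
}
```

Usage would be along the lines of 'station_log_dir ~sam/private/<machine_name>_server_list.txt <machine_name> <station_name>'. The function returns a non-zero status if no matching "station" line is found.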
If you need to look up the name of an administrator for a specific station, you can use the Stations Group Report.
Note that the cdf-fncdfsrv0 (farm) station is
structured differently and you must read the file
"HOWTO.readme" as sam@fncdfsrv0
for more information.
Even if all entries on the SAM-At-A-Glance page have a green indicator, there might still be a problem. If the monitoring system loses contact with a node for too long, that node's entry will simply disappear off of the SAM-At-A-Glance page. To check for this, compare the current page with the example snapshot to see if any nodes are missing.
Listed DbServers with a green dot might still have problems, so you need to look at the entire entry line. For example:
| Server | Host:Port | Version | Up Since |
| fake-server.prd:12345 (Using: some-database@dbmachine) | home2.fnal.gov:67890 | v5_1_1 | 05 May 2007 12:20:39 |
is a working DbServer. If the comment in parentheses says that it is not connected, then the DbServer needs to be restarted. If you are not sure about this, there are several SAM commands which can be found under samDbServerInfo that can be used for further testing.
Also, CDF Data Handling At A Glance shows the number of DbServers not responding. Click on that number to get 24-hour history plots of connection activity and timing information (with points taken at 5 minute intervals).
Sometimes the SAM At A Glance page
will have red text across the top indicating that information on the page
has not been updated recently. First, refresh your browser window to make
sure the message is not just from your own machine losing connection.
Upon refreshing, if the message is still there it indicates either the
naming service is not working, or the cron job for one or more
nodes is not working. You have to determine which type of process is
broken. Go to the
NameService
page
and check for red dots down the left side of the page. Just a few dots
indicate a problem with one or more cron jobs. An
expert must be contacted
to resolve a cron job problem. If there is a red dot in front of all of
the DbServers, it is more likely that the naming service is not running
and you have to restart it. Before restarting the naming service, please
double check that it is dead by trying a command like
"sam locate foo"
(which will fail if there is a naming service problem).
To check the status of the optimizer, use the command
A working optimizer will respond with "alive". If the
optimizer is in a bad state, you will get a CORBA exception which should
give you some information about what is wrong. An
expert must be contacted
to resolve an optimizer problem.
If a DbServer or the Name Service has been inactive for more than half an hour, or there are errors in the log file, it should probably be restarted. Use the SAM-At-A-Glance page to find which processes are running on which machines.
Note to DZero shifters: Do NOT restart the farm DB server
(D0FarmDbServer.prd:D0FarmDbServer) unless explicitly
requested to do so by Mike Diesburg or Daniel Wicke. There are some
long queries that must be run periodically that make the server appear
to be bogged down. Restarting this server without need disrupts
running jobs and queries.
To restart one of the DbServers and/or the Name Service:
Log in as user 'sam' on the appropriate machine where the server is running.
Execute the following:
where <date_stamp> is the current date plus your initials in the format "YYMMDDHHII" where "YY" is the two-digit year, "MM" is the month, "DD" is the day, "HH" is the hour (out of 24), and "II" are your initials.
Edit the server_list file with your favorite text
editor, commenting out the station, server, or optimizer you need
to restart by inserting a hash symbol ("#") at the beginning of its
line. Make a note of the full name of the server as it appears in
the server_list file to use in place of the variable
<fullname> in the next step.
Save a copy of the log/trace file by executing the following:
Do not forget to substitute the correct value for <date_stamp>.
Update the scripts that run the servers by executing:
This should kill any instances of the process you are trying to restart.
Edit the server_list file again, this time removing
the comment hash symbol ("#") you added before.
Update the scripts again to restart the processes by executing:
The procedure for restarting a SAM station is similar to that for
restarting a DbServer, except the
commented/uncommented line in the server_list file should
contain the keyword "station".
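The <date_stamp> used in the restart steps ("YYMMDDHHII") can be generated instead of typed by hand, which avoids format slips. A minimal sketch follows. The function name make_date_stamp is hypothetical, and it takes the hour from the local clock of the machine you are logged in to.

```shell
#!/bin/sh
# Hypothetical helper: build a YYMMDDHHII stamp from the current local time
# and the initials passed as the first argument.
make_date_stamp() {
    initials="$1"
    printf '%s%s\n' "$(date +%y%m%d%H)" "$initials"
}
```

For example, 'stamp=$(make_date_stamp ab)' stores an eight-digit date and hour followed by the initials "ab", ready to be substituted for <date_stamp>.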
If the naming service does not restart, there might be a problem with
its cached information. Check the trace file using, for example, the
less command:
If the trace file contains a line beginning with "IOR" followed by a lot of numbers and letters, then the cache is ok. However, if the trace file contains something like:
System exception `COMM_FAILURE'
Reason: no such host is known
Completed: no
Minor code: 1330577422 (gethostbyname() failed)
Killed process: 29754
then you might have a corrupted cache. In this case -- and
only in this case -- delete the file called
"log" which is found in the "sam" directory
and restart the naming service as described above. It is recommended
that you also restart the optimizer, the critical SAM station(s), and
all the DbServers a few minutes after restarting the naming service.
After deleting the log file and restarting the naming
service all DbServers and stations are disconnected from the naming
service and all queued jobs are lost,
so be careful when doing this. After an hour or so, the stations and
DbServers will have re-registered themselves with the naming service
and things should be back to normal.
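Because deleting the cache has such a high cost, it can help to check the trace file mechanically before deciding. The sketch below only reports what it finds and never deletes anything. The function name naming_cache_state and the three result words are assumptions for illustration, and the checks follow the order given above (a line beginning with "IOR" first, then "COMM_FAILURE").

```shell
#!/bin/sh
# Hypothetical helper: classify a naming service trace file.
# Prints "ok" if a line begins with IOR, "suspect" if COMM_FAILURE appears,
# and "unknown" otherwise. It does not modify or delete any file.
naming_cache_state() {
    trace="$1"
    if grep -q '^IOR' "$trace"; then
        echo ok
    elif grep -q 'COMM_FAILURE' "$trace"; then
        echo suspect
    else
        echo unknown
    fi
}
```

A result of "suspect" is only a hint that the cache may be corrupted. Read the trace file yourself, and consult an expert if in doubt, before deleting the "log" file.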
Occasionally you may be asked to help set up a new station or re-configure an old station. Instructions for station setup and configuration are available here.
A new page is under development here.
You can find enstore status information from the Fermilab Mass Storage System on the Enstore home page. Links exist for each experiment, and from the experiment page there is a link to tape inventory information.
If you are outside of the fnal.gov domain, the links on the Enstore pages may not work by simply clicking on them. However, you can still view static versions of the pages (generated every hour) by explicitly typing in the page addresses. The URLs for the pages mentioned above are:
The enstore group now adds new tapes so shifters should not need to worry about this, but if you notice problems (such as tapes marked "NOACCESS" on the Data-Handling-At-A-Glance page) send email to enstore-admin and helpdesk (CC: cdf_dh_help).
If large numbers of tapes (more than ~10) become "NOACCESS" within a couple of hours or less, it could mean there is a serious hardware failure in one of the tape robots. The helpdesk should be called and asked to page the Enstore primary expert. Request that the expert keep you informed of the situation by email or phone, and do the same yourself for the admins of any affected experiments. If you call the helpdesk after hours, be sure to tell the message taker the keyword "" (spell it for them).
If a tape stays "NOACCESS" for more than one weekday and there is no comment in the detailed listing (page reached by clicking on the "Bad Tape" link, comment is the last column on the right), send email to enstore-admin (CC: cdf_dh_help) asking about its status. A comment in the text listing indicates that an expert has looked at the problem at least once since they are the only ones who can enter the comments.
If a tape is marked as "NOTALLOWED", then it has been investigated by experts and needs to have more work done before it is usable again. Unless a tape has been in this state for a long time and users are demanding access to it, tapes in the "NOTALLOWED" state can be ignored.
Another serious but very rare tape status problem occurs when the encp synch cron job becomes corrupted and begins marking all tapes with state "DELETED". This would be signaled on the same web page as above (and also by catastrophic file delivery errors from projects accessing the tape!). If this happens, it is necessary to contact the on-call expert to get the cron job turned off and the tape status corrected.
There are a number of calibration DbServers, each running a separate user and farm instance.
(Note: there are two servers for the CFT and SMT: a fast cache
and a DbServer (whose name includes oracle). If there
are problems with these, first restart the oracle
server. If that does not work, restart the cache server.)
SAM NameService:
Registered DbServers gives the status of these servers in
addition to the regular SAM DbServers. Check that there is no red
indicator beside the trigger server, and that all other calibration
server indicators are green. If any red or yellow indicators persist
for more than a couple of minutes, follow the procedures at
"Stop and Start DØ DbServers" to restart the affected
servers.
Make sure that you go to the appropriate node to perform the
restart operation!
When some nodes have problems, some servers 'failover' to run on
another node. The appropriate node that needs to be restarted is
indicated in the NameService list under "Host".
If you receive user complaints that a particular server is not functioning correctly, or if there are problems with farm reconstruction which may be caused by a calibration server, then here is a test program that can be run on any machine with a DØ code release:
where <recent_release> is an optional
release version number and <server_type>
is either "--user" or "--farm". If the
program hangs on a particular server, or exits with an error, it is
likely you have found the problem.
Notes:
- See /d0dist/dist/releases/ for the available choices for <recent_release>.
- ... "Connected to" line for the CPS or prints a few error messages before querying the CPS data then this is normal and can be ignored.
Log files for the calibration servers can be found at D0db dbServer Diagnostics. If the server shows no activity for some time, or there is no sign of any successful queries in the logfile, then restart it.
If you are unable to resolve the problem then contact an expert at d0db-support.
Shifters may ignore servers that are not standard user or production servers (i.e. that are not listed on SAM-At-A-Glance) or that are running on offsite hosts.
Before you can perform this duty
effectively, please use the
Stop and Start DØ DbServers instructions to practice
stopping/starting the servers on d0dbsrv7.fnal.gov (the
development node), logging in as user "d0db". After you
succeed in doing so, you should let the
css-dsg
mailing list know and ask to have your root Kerberos principal added to
the appropriate places so that you will have the authorization to
stop/start any calibration server when it is necessary.
This section contains information for those who are allowed to approve requests for a service certificate. THIS IS NOT TRUE FOR ALL SAM SHIFTERS. Those who are allowed to approve requests will know that they have been given this privilege.
GridFTP clients and servers use an X509-based Security Infrastructure
(GSI) to establish a security context. Clients and servers have an X509
identity associated with them: their sam server certificate.
Servers rely upon a local file (grid-mapfile) that lists the identity of
the clients authorized to transfer files. At Fermilab, we maintain a
central copy of all the known sam_gridftp clients in the
sam_gsi_config CVS repository (see
CVS repository for more info);
the sam_gsi_config product provides tools for the SAM
station administrator to securely install this central list at their
local GridFTP servers.
Instructions for requesting a certificate are available on the
"Installation of gridftp" webpage, under "Certificate renewal".
Obviously if this is a new installation you can ignore all the
instructions about the 'old certificate'.
There are instructions for testing that sam_gsi_config and
the certificate are installed properly, at
"Test your certificate" and
"Test of sam gridftp".
The installer of sam_gridftp is asked to send the identity
of their new clients to
their experiment's sam-admin list.
The identity of the client is defined by a string called the "certificate
subject". To add a new client certificate subject (your default login is
often sufficient) follow these steps:
Set up cdcvs (whatever that entails for the machine you are using).
Note: On clued0, a previously executed
kinit <username>/root causes problems
with CVS access. Either log out and log back in, or use
kinit -f when returning to your default account.
Execute this command:
Edit the grid-security/grid-mapfile for your experiment by appending a new line of the form:
where <full.machine.name> is the full
alias of the new machine (so that DNS can resolve it). It is
VERY IMPORTANT that the certificate subject string be in
quotes. The word "sam" at the end of the line means
'allow account file transfers for UID sam'.
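As a minimal sketch of the append step, the helper below writes a line consisting of the quoted certificate subject followed by the account name, and skips the append if that exact line is already present. The name `add_subject`, the temporary file, and the subject string are all made up for illustration; use the real subject sent by the installer and the grid-mapfile from your checked-out repository.

```shell
#!/bin/sh
# add_subject FILE SUBJECT [ACCOUNT]
# Hypothetical helper: appends '"SUBJECT" ACCOUNT' to FILE unless
# that exact line is already there. ACCOUNT defaults to sam.
add_subject() {
    file=$1
    subject=$2
    account=${3:-sam}
    # The subject must be wrapped in double quotes.
    entry=$(printf '"%s" %s' "$subject" "$account")
    if grep -q -x -F -- "$entry" "$file" 2>/dev/null; then
        echo "already present: $entry"
    else
        printf '%s\n' "$entry" >> "$file"
        echo "appended: $entry"
    fi
}

# Demonstration against a scratch file with a made-up subject.
demo_file=$(mktemp)
add_subject "$demo_file" "/DC=org/DC=example/OU=Services/CN=sam/host.example.edu"
cat "$demo_file"
```

Checking for an existing line first keeps the file free of duplicate entries if the same request is processed twice.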
Execute the command:
A cron job for the user 'sam' will make the new entries
available for distribution to remote institutions every hour, or you
can follow the
directions below for updating
sam_gsi_config.
When a SAM service certificate for a certain host expires, the station administrator can submit a request for a renewal certificate. In this case, follow the same instructions as for "How to register a new SAM GridFTP installation", but replace the old certificate subject with the new one instead of simply appending a new line.
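For a renewal, the edit is a replacement rather than an append. The sketch below shows one way to do that while leaving every other line untouched; `replace_subject` is a hypothetical helper, the subjects are made up, and it assumes each entry begins with the quoted subject at the start of the line.

```shell
#!/bin/sh
# replace_subject FILE OLD_SUBJECT NEW_SUBJECT [ACCOUNT]
# Hypothetical helper: rewrites the line that starts with the quoted
# OLD_SUBJECT as '"NEW_SUBJECT" ACCOUNT'. ACCOUNT defaults to sam.
replace_subject() {
    file=$1
    tmp=$(mktemp) || return 1
    # Values are passed through the environment so awk treats them as
    # literal strings rather than as patterns.
    OLD_ENTRY="\"$2\"" NEW_ENTRY="\"$3\" ${4:-sam}" awk '
        index($0, ENVIRON["OLD_ENTRY"]) == 1 { print ENVIRON["NEW_ENTRY"]; next }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration against a scratch file with made-up subjects.
demo_file=$(mktemp)
printf '%s\n' '"/CN=old/host.example.edu" sam' > "$demo_file"
replace_subject "$demo_file" "/CN=old/host.example.edu" "/CN=new/host.example.edu"
cat "$demo_file"
```

Writing the result back with `cat` instead of moving the temporary file keeps the original file's permissions.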
The following is the procedure that DOEGrids and SAM-Grid have agreed upon to approve a SAM service certificate:
Certificate renewal will sometimes require that the version
of sam_gsi_config be upgraded. To do this, log onto the
machine that needs to be upgraded and do the usual setups. Then
execute:
where <new_version> is the new version number to install. If you just want the newest version number, this parameter can be omitted. Next:
This should complete the sam_gsi_config installation.
Please note that the following does not apply to SAM FAQs which can be edited by clicking the "editable" link at the top of that webpage.
As a shifter, if you notice some incorrect, outdated, or missing documentation regarding SAM, you should update the documentation package appropriately. To do so, you must have access to the "cdcvs" code repository as described in What You Need to Know to Start. Remember, these actions should be done under your normal personal Kerberos principal (not a root principal). Create and change into a directory on the Linux machine of your choice into which you will download the files to be modified. Then execute these commands:
The ls command will show you the files available for
editing. Most names are self-explanatory and all web page files carry the
'html' extension. In addition, the file name can be checked
against what is displayed in your browser's address window. Edit files
using your favorite program. When you are done with your changes, execute
the following:
where <your_comments> should be a summary
of the changes that were made and the reason(s) for making them, and
<file_list> is a white-space-separated
list of the files to be affected by the CVS command. If you omit the
list of files, all files in and below the current directory that have
been modified will be updated in CVS. If you omit the -m
option, CVS will open a text editor for comment entry when you execute
the commit command. Saving and exiting from the editor will
complete the updating. Exiting without saving will abort the updating.
You can specify which text editor you want CVS
to use by setting the environment variable "CVSEDITOR". The
tag command identifies the updated files as available for
general use. New versions of documents will automatically appear on the
web servers within 24 hours (many appear within one hour).
For more information about CVS, please see the CVS home page. Additional information is also available on the SAM Webservers page.
Command line interfaces are available through the
"sam_admin" package for most common requests.
See the sam_admin documentation for more details.
An email request for a new remote station must include the following information:
Many times a request will only include the node name(s) and station name. It is pretty safe to assume the operating system is "linux", the hardware is "pc", and the administrator is whoever sent the email. If it is not specified, you can pick which node will host the Station Master. Finding a person's SAM username requires only a little research.
When issuing samadmin commands, you will
be prompted for a "username" and "password". Be sure to use the
username and password you normally use on the
production database machine.
Here are the steps to register a new station:
Use the following samadmin command from a capable
machine to register the station:
where <station_name> is the name of the
new station, <user_names> is a
space-separated list of the new administrator username(s), and
<description> is a brief description
of the station. As an example, if someone
from 'X University' is requesting the station, just use
"XU Station" as a description. Note that the quotation
marks ("...") are necessary for each option's argument.
If there is more than one machine in the new station, you have to register each node individually. To register a node in the production database, use:
where <hardware> is the hardware type
(probably "pc"),
<operating_system> is
the operating system type (probably "linux"), and
<node_name> is the name of a machine in
the new station.
Set up an upload cache area on d0rsam01 for the new
station by logging in to any clued0 machine as 'sam'
and running:
This creates the upload directories on d0rsam01. The
add_d0rsam01_store_dir script should print out the rest
of the commands which need to be run, or you can continue following
these instructions. Add the location of the new upload area to the
database with the following two commands:
For a station to be able to transfer files from Fermilab, the
station routing options must be set in a text-based configuration
file that will be read by "sam_bootstrap". Make sure the
new station's admin includes the following options in the
"station" entry of that file:
--routing-station=<file_pattern>::<router_name>
  where <file_pattern> should be a file location pattern written as a basic regular expression and <router_name> is the name of the proxy router the station should be using.
--routing-group=<group_name>
  where <group_name> is the name of the experimental group who owns the station.
--routing-user=<dummy_username>
  where <dummy_username> is a dummy user 'account' name under which the router will hold intermediate files. It should be a name that identifies the station.
The following settings are optional (i.e. there is a pre-defined default):
--routing-station-metrics=<router_name>::<concurrent_files>
  where <router_name> is again the name of the proxy router the station should be using, and <concurrent_files> is the number of files to be transferred concurrently from the router (the default is 1).
--routing-station-group=<router_name>::<dummy_groupname>
  where <router_name> is the same as above and <dummy_groupname> is a dummy 'account' group under which the router will hold intermediate files. The dummy user 'account' listed above must belong to this dummy group, and this group overrides the file's original group(?).
--routing-public-node=<stager_node>
  where <stager_node> is the name of a node in the station to be used to stage intermediate files. The default is any available node.
Here is a fictional example:
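As a minimal sketch of the three required options only, the snippet below assembles them from placeholder values. Every value is made up for illustration, the helper name `routing_options` is hypothetical, and the surrounding syntax of the file read by sam_bootstrap is not shown.

```shell
#!/bin/sh
# All values below are made-up placeholders, not real SAM names.
file_pattern='^/pnfs/example/'     # basic regular expression for file locations
router_name='example-router'       # proxy router the station should use
group_name='examplegroup'          # experimental group that owns the station
dummy_username='xu-station'        # dummy account that identifies the station

# Print the three required routing options, one per line.
routing_options() {
    printf '%s\n' \
        "--routing-station=${file_pattern}::${router_name}" \
        "--routing-group=${group_name}" \
        "--routing-user=${dummy_username}"
}

routing_options
```

Note the double colon that separates the file pattern from the router name in the first option.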
Remote stations should use gridftp for file transfers. The use of bbftp is no longer supported.
SAM stations in Germany should use "d0karlsruhe" as
a proxy.
Installation instructions for the end-user are available here:
Additional information can be found on the CDF Grid Computing page.
Sometimes you will get a request for some files in SAM to be marked as
"bad". This is usually a legitimate request, but if you are unsure,
contact a SAM expert for approval before you modify anything. There
are three scripts available here with which you can modify a SAM file's
status. It is recommended that you use scripts instead of typing
commands directly on the command line because it is more secure for
your database password. Just make sure that the permissions on the script
file itself are set so that only you can read it (e.g.
"-rwx------"),
and do not leave your password in the file unless you are actively using
it.
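The permission setting described above can be applied with `chmod 700`. This is only a sketch: the script name is hypothetical, and an empty placeholder file in a scratch directory stands in for the script you saved.

```shell
#!/bin/sh
# Scratch directory and placeholder file; the file name is hypothetical.
workdir=$(mktemp -d)
script="$workdir/mark_files_bad.sh"
: > "$script"

# Owner may read, write, and execute; group and others get no access.
chmod 700 "$script"

# Show the first ten characters of the mode string, which should
# match the "-rwx------" pattern mentioned above.
ls -l "$script" | cut -c1-10
```

Run the `ls -l` check again after editing the script, since some editors recreate the file with default permissions when saving.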
In each script, there are parameters which must be updated before each use of the script. The parameters are described in the comments within the file. To save a script to a local directory, you can use one of two methods:
Here are the scripts:
To add a new application version in SAM you need the following input from the user: appVersion, appFamily and appName. Execute the following commands after having done 'kinit -F username/root':
where "xxx", "yyy" and "zzz" are replaced with the information from the user. The output will be the new applicationFamilyId. (For an example, see Issue #3550 in the SAMGrid Issue Tracker.) To check that the new application version is really there, you can search for the applicationFamilyId on "this page".