Shutdown Procedures for UNDHEP Cluster

For a normal shutdown:

Our VAX cluster has a special id that lets an unprivileged user shut
the system down in a normal fashion.  This id is a captive id: it
never gets to the DCL prompt.  It has all the privileges, but ^Y is
ignored until the shutdown actually starts, and from then on ^Y logs
the id off.  This is so that a shutdown can be aborted if the user
changes his mind.  The shutdown prompts you with several questions;
see below.  There is no way to use the id for any purpose other than
system shutdown.

The id works only from a LAT terminal, a VWS VT220 window, local
DECnet, or either console terminal beside the tape drives.  The id is
SYSHUTDOWN and the password is UNDHEP.  Jim will be notified whenever
the id is used.  Please use this id if you need to shut any node down
and the system is otherwise normal, e.g., after an air conditioner
failure, disk problem, or other major hardware failure.  Because it is
a captive account, the id does not work from the DECwindows login, but
it can be used via SET HOST 0 from a terminal window on any cluster
node.
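
As a sketch of the SET HOST 0 route from a terminal window on any
cluster node (the prompts shown are the standard VMS login prompts;
any banner text on our systems is omitted):

    $ SET HOST 0
    Username: SYSHUTDOWN
    Password:

The password is not echoed.  After login the id goes into the
shutdown questions rather than to a DCL prompt.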

The HEP Operator account may also be used for shutdown.  It is also
restricted to local terminal use.  Either enter $ SHUTDOWN and answer
the questions, or run one of the following: $ @ND:SATREBOOT or
$ @ND:SATSHUTDOWN.  From this account on the boot node, any or all
satellites may be shut down using TELL or SYSMAN and these ND: command
files.
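
A minimal sketch of the SYSMAN route, run from the HEP Operator
account on the boot node.  Here nodename is a placeholder for the
satellite, and it is an assumption that ND:SATSHUTDOWN needs no
parameters and that the ND: logical is defined on the target node;
check the command file before relying on this.

    $ RUN SYS$SYSTEM:SYSMAN
    SYSMAN> SET ENVIRONMENT/NODE=nodename
    SYSMAN> DO @ND:SATSHUTDOWN
    SYSMAN> EXIT

To act on several satellites at once, give a list, e.g.
SET ENVIRONMENT/NODE=(node1,node2).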

For a full cluster shutdown, shut down the satellite VAXstations first, 
then the boot node, usually UNDHD0. 

For this normal shutdown, respond with RETURN (accepting the default)
to all questions except the reason and the shutdown options.  For the
reason, give the real reason for the shutdown.  On each satellite, use
the REMOVE_NODE and SAVE_FEEDBACK shutdown options.  On the boot node,
use the CLUSTER_SHUTDOWN and SAVE_FEEDBACK options if the whole
cluster is going down, or REMOVE_NODE and SAVE_FEEDBACK if only the
boot node is going down and coming back shortly.  Do not shut the boot
node down, other than for a reboot, unless all satellites are shut
down first.  CLUSTER_SHUTDOWN does not complete the shutdown until all
satellites have completed their shutdowns.  (You only need the first
3 characters of any option.)
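
For illustration, the dialogue for a satellite leaving the cluster
looks roughly like the following.  The prompt wording is approximate
and varies with the VMS version, and the responses shown (an air
conditioner failure as the reason, REM,SAV as the options) are only
an example.

    How many minutes until final shutdown [0]: <RETURN>
    Reason for shutdown [Standalone]: Air conditioner failure
    Do you want to spin down the disk volumes [NO]? <RETURN>
    Do you want to invoke the site-specific shutdown procedure [YES]? <RETURN>
    Should an automatic system reboot be performed [NO]? <RETURN>
    When will the system be rebooted [later]: <RETURN>
    Shutdown options [NONE]: REM,SAV

On the boot node, for a full cluster shutdown, the last response
would be CLU,SAV instead.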

When the system says SHUTDOWN COMPLETE, use the system halt switch or
power switch to halt the system.  On the VAXstation 3200, the Halt switch on
the control panel latches in, but does not halt until it is pressed
again to pop it out.  Then turn the system off (if desired) with the
front panel switch.  To fully turn off UNDHE4, also turn off the power
switch on the expansion Qbus box, immediately below the UNDHE4 processor 
box, as well as the associated disk and tape box switches.  On the 
VAXstation 3520, the Halt switch on the control panel does not latch and 
halts the processor on one push.  On the VAXstation 2000s, the Halt switch 
is a small pinhead push switch against the top of the unit at the back, 
next to the cable connections.  On the VAXstation 4000-90, the Halt switch
is behind a door on the front panel.  Boot the system as described in the 
VAX Restart Procedures note. 


For a shutdown to recover from a system lock-up:

On the VAXstations, when the system is nominally up but the keyboard
and mouse are locked or ignored, check the consoles to see if they
give any reason for the failure of the VAXstation.  Also check to
see if LAT sessions on that node are still active.  If any session is
still alive, and you can log in, use the above procedure to shut down.
Also check the affected system from another node with the VMS command
SHOW SYSTEM/NODE=nodename.  Try to capture this on the hardcopy console of
UNDHD0 or with Ctrl-Print Screen on the UNDHE4 console, if possible.
If nothing is active, use the Halt button and then reboot.
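
For example, from a working node (nodename is a placeholder for the
affected system):

    $ SHOW SYSTEM/NODE=nodename
    $ SHOW CLUSTER

SHOW CLUSTER is suggested here only as an extra check of whether the
node is still listed as a cluster member.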

In peculiar circumstances such as repeated hangs of the same sort, try to 
get a dump of the system in question.  Refer to the emergency shutdown 
commands in the back of the UNDHEP console manual.  These commands must 
be entered on the workstation console at the >>> prompt.  

The Restart switches on the 3200s and 3520 cause the processor to
execute the power-on self-tests and then attempt to boot.  Restart
may sometimes be used instead of Halt and Boot.

				JMB, 7/15/93