Recovery Procedures for VMS System Disk Crash
							10/29/98, JMB

The following procedures were necessary for recovery from system disk
crashes on the UNDHEP VAX system on 8/24/98 and 10/27/98.

Turn off UNDHD0, UNDHE4 and UNDHE9.

Power on UNDHD0, the system disk, the CD drive, and the tape drive.

At the console prompt (>>>), check that the system disk, CD, and tape
drive are visible.
	>>> sho dev

Boot StandAlone Backup from the VMS V6.2 Binaries CD
	>>> b dka400

Restore system disk from most recent backup tape.
	Check saveset filename on backup printout in drawer in 402a.
	$ backup/image/init mka500:undhd0dka0.bck dka0:

Check default boot device, and reset if necessary.
	>>> show boot
	>>> set boot dka0

Turn on all other disks.

Boot system.
	>>> b

Log in as SYSTEM.

Check that the queue manager is running.
	$ show queue/all
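
As a sketch, assuming the standard OpenVMS V6.2 SHOW QUEUE qualifiers are
available here, the queue manager state can also be displayed directly:
	$ show queue/manager/full	! state and node of the queue manager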

Start queues & manager
	$ enable autostart/queues
	$ show queue/all

If not started yet:
	$ start/queue/manager
	$ show queue/all

If still not started:
	$ start/queue/manager/new	! creates a new, empty queue database
	$ show queue/all

If a new queue manager was started and no queues are present:

Check and Define Batch queues
	$ nd
	$ e startbatq.com
	$ @ND:startbatq
	$ enable autostart	! may be necessary if new queue manager

Configure Internet
	$ @UCX$config
		Do all steps of the configuration, selecting enable if offered.
		Do not change any configuration parameters.
		At the end, select Start UCX.
		This starts the lpd and smtp queues.

Start AppleTalk and queues
	$ @sys$startup:msa$startup

Check & define Print queues (must be done after UCX & MSA queues)
	$ show queue/all
	$ e startprnt.com
	$ @ND:startprnt
	$ show queue/all		! all queues should be present now.

Check & change NCD startup disk
	$ sd sys$startup
	$ e ncd_systartup.com

Check & change ND disk logical names (DU00 & DU01)
	$ nd
	$ e ndlognams.com

Reboot UNDHD0
	$ @nd:satreboot

Boot rest of cluster
	Power on UNDHE9 and UNDHE4
	Explicitly boot UNDHE9 (autoboot fails with some sort of SCSI error)
	>>> b

Check that all queues are running
	$ sho queue/all
     or
	$ bqa
	$ prq

Start any stopped queues (may be done on any node)
	$ start/queue queuename

Check logical names DU00, DU01, SYS0, and SYS1 on all systems
	$ sysman
	SYSMAN> set env/clus
	SYSMAN> set prof/priv=all
	SYSMAN> do sho log du00		! repeat for du01, sys0, and sys1
		If translations are not consistent or correct, redo on
		all affected systems, checking values given in
		ND:ndlognams.com, and copy definition line(s) from it:
	SYSMAN> do DEFINE/sys/exec name translation
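
A minimal sketch of a command procedure that prints all four logical names
in one pass, so that a single SYSMAN DO covers every node. The file name
ND:chklognams.com is hypothetical, and looking only in the system table
(LNM$SYSTEM_TABLE) is an assumption based on the DEFINE/sys line above:
	$ ! ND:chklognams.com (hypothetical name, not an existing site file)
	$ names = "DU00,DU01,SYS0,SYS1"
	$ i = 0
	$loop:
	$ name = f$element(i, ",", names)	! returns "," past the last item
	$ if name .eqs. "," then goto done
	$ value = f$trnlnm(name, "LNM$SYSTEM_TABLE")
	$ if value .eqs. "" then value = "(undefined)"
	$ write sys$output f$fao("!8AS = !AS", name, value)
	$ i = i + 1
	$ goto loop
	$done:
	$ exit
It would be run on all nodes with:
	SYSMAN> do @nd:chklognams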

Verify that server processes are running.
	$ ss #		! where # is the last digit of the node name
     or on each node:
	$ sho sys
		Look for NCD_XDM_D and NCD_FS_D
		Look for MSA and UCX process names

Check that all disks are mounted on each node:
	$ sho dev d
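
A sketch of the same check run from one node, assuming the SYSMAN cluster
environment used earlier (set env/clus) is in effect:
	SYSMAN> do sho dev d/mounted	! lists only the mounted disks on each node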

Resubmit the timed jobs RESTARTLJ and RESTARTSMTP.
	$ @nd:restartlj sub
	$ @nd:restartsmtp sub
		These jobs resubmit themselves every 6 hours or daily.

Submit the disk rebuild job, which will run tomorrow morning.
	$ @nd:rebuilddisks sub

Tomorrow, place the resubmitted disk rebuild job on hold.
(It needs to run after each system crash of any node, and can be
left on hold until needed.)
	$ bq t		! displays timed jobs
	$ set entry/hold ###	! ### is the entry number of the rebuild job
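
When the rebuild is next needed, the held entry can presumably be released
instead of submitting a new job (a sketch; it assumes the held entry is
still in the queue and was submitted from the SYSTEM account):
	$ show entry			! lists your batch and print entries
	$ set entry/release ###		! ### is the entry number shown above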

At this point the cluster should be up with all normal functions available.