Troubleshooting
Figuring out what exactly isn’t working
To narrow down the failing component, use xctrl status to see if any MagAO-X apps are not running. The typical MagAO-X app is started by xctrl startup based on a line in a config file in /opt/MagAOX/config/proclist_$MAGAOX_ROLE.txt. This proclist determines which application to start and which config file from /opt/MagAOX/config should be supplied as the -n option (see Standard options). It also uses sudo to run the process as user xsup, regardless of which user called xctrl startup.
Many, if not all, MagAO-X apps are intended to run “forever” (i.e. until shutdown). If the process is dead, you can attach to the tmux session that’s the parent of the process in question with xctrl inspect PROCNAME (where PROCNAME is the name of the failed process). This will occasionally reveal error messages that did not get to the log.
For example, if trippLitePDU is started by xctrl startup with config specified by -n pdu0 and there’s a syntax error in /opt/MagAOX/config/pdu0.conf preventing startup, you can attach to the tmux session with
yourlogin$ xctrl inspect pdu0
The errors before exit, if any, will be in the log. The last few lines of the log can be checked with logdump -f pdu0. The command that started the app will be of the form /opt/MagAOX/bin/$appName -n $configName. You can use the up-arrow key in the tmux session to retrieve it from the shell history and try to relaunch once you’ve corrected whatever error was preventing startup.
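If the shell history in the tmux session is not available, the launch line can be rebuilt by hand from the pattern above. The sketch below does only that. The helper name magaox_launch_cmd is hypothetical (it is not part of xctrl); the /opt/MagAOX/bin/$appName -n $configName form is the only part taken from the text.

```shell
# Sketch: compose the launch command described above.
# magaox_launch_cmd is a hypothetical helper, not an xctrl feature.
magaox_launch_cmd() {
    # $1 = application name (e.g. trippLitePDU)
    # $2 = config name passed to -n (e.g. pdu0)
    if [ "$#" -ne 2 ]; then
        echo "usage: magaox_launch_cmd APPNAME CONFIGNAME" >&2
        return 1
    fi
    printf '/opt/MagAOX/bin/%s -n %s\n' "$1" "$2"
}

# Example for the pdu0 case discussed above.
magaox_launch_cmd trippLitePDU pdu0
# prints: /opt/MagAOX/bin/trippLitePDU -n pdu0
```

The printed line is meant to be pasted into the tmux session opened by xctrl inspect, so the app runs in the same place xctrl would have started it.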
Addressing specific issues
Loop failing to close for no apparent reason and/or intermittent failures of CACAO calibration process
Believe it or not, this can be a sign of insufficient disk space. Consult df -h and see if any of the filesystems have a Use% of 100%. This can also be checked in INDI with the sysMonitor process for the relevant computer (sysMonRTC, sysMonICC).
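To pick out nearly full filesystems without reading the whole table, a filter along the following lines may help. It is a sketch and not part of the MagAO-X tooling: the function name full_filesystems and the 95% threshold in the usage line are assumptions. It reads the POSIX-format output of df -P so that the columns stay in fixed positions, and it assumes mount points do not contain spaces.

```shell
# full_filesystems [THRESHOLD]: read `df -P` output on stdin and print the
# mount point and Use% of every filesystem at or above THRESHOLD percent
# (default 100). Hypothetical helper, shown for illustration only.
full_filesystems() {
    awk -v limit="${1:-100}" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)               # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Typical use on a live system (95 is an arbitrary example threshold):
#   df -P | full_filesystems 95
```

An empty result means no filesystem has reached the threshold.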
Lockup / Missing GPUs / nvidia-smi errors
Our computers with PCIe expansion cards will occasionally lock up, or will lose a GPU (GPU has fallen off the bus errors). Sometimes running nvidia-smi fails with
Unable to determine the device handle for GPU 0000:8C:00.0: GPU is lost. Reboot the system to recover this GPU.
GPU telemetry will also disappear from the monitoring dashboard.
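The messages quoted above can be detected mechanically. The following is a minimal sketch: the function name gpu_lost is hypothetical, and the patterns are limited to the strings quoted in this section. It reads text on stdin, which could be nvidia-smi output or, as an assumption, kernel log text.

```shell
# gpu_lost: succeed (exit 0) if the text on stdin contains one of the GPU
# failure messages quoted in this section. Hypothetical helper.
gpu_lost() {
    grep -E -q 'GPU is lost|Unable to determine the device handle|fallen off the bus'
}

# Typical use on a live system:
#   if nvidia-smi 2>&1 | gpu_lost; then echo "GPU problem detected"; fi
```

A match only says that one of these messages appeared; the recovery steps are the ones in the procedure that follows.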
1) If the system is responding:
    1.1) If you were using the system, rest any attached hardware and begin camera warmup. (You don’t have to wait for them to reach the warmup temperature.) (For RTC: woofer, tweeter, ttmmod, ttmpupil, and camwfs.)
    1.2) Shut down (requires sudo)
    [user@exaoN ~]$ sudo shutdown -h now
    1.3) Now “press the power button” using the Moxa IO unit (see the ICC or RTC Power-On section for that computer in the System Power On procedure)
2) If the system is not responding, GPUs continue to fall off the bus, or nvidia-smi errors persist after following the procedure above:
    2.1) If you can, perform steps 1.1 and 1.2 above to bring the system down in an orderly fashion.
    2.2) Power down pdu0.comprtc or pdu.compicc (e.g. with pwrGUI)
    2.3) Wait at least 10 seconds.
    2.4) Now perform all of the ICC or RTC Power-On steps from the System Power On procedure.
OCAM connectivity / bad data
OCAM connects over two CameraLink connections. CameraLink #1 carries serial communication with the detector, so if you’re able to command the camera but your data appear bad in rtimv camwfs, the culprit is likely the CameraLink #2 cable. Reseat it, then on ICC run xctrl restart camwfs and restart rtimv.
Alpao DM not responding
Make sure it has been initialized. There is an initialize_alpao systemd unit that runs at boot and initializes the interface card. Successful execution looks like this in systemctl status initialize_alpao output:
$ systemctl status initialize_alpao
● initialize_alpao.service - Initialize Alpao interface card
Loaded: loaded (/opt/MagAOX/config/initialize_alpao.service; enabled; vendor preset: disabled)
Active: active (exited) since Sun 2019-09-29 11:18:34 MST; 20min ago
Process: 4449 ExecStart=/opt/MagAOX/config/initialize_alpao.sh (code=exited, status=0/SUCCESS)
Main PID: 4449 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/initialize_alpao.service
Sep 29 11:18:34 exao3.as.arizona.edu systemd[1]: Started Initialize Alpao interface card.
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: Ref.ID | Model | RSW1 | Type | Device No.
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 1 | PEX-292144 | 0 | DI | 17
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 2 | PEX-292144 | 0 | DO | 18
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================
The script is saved at /opt/MagAOX/config/initialize_alpao.sh, if you want to see what it’s doing. Note that executing it again will appear to fail with a message about not finding cards to initialize if the cards have been previously initialized.
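The two markers of success in the status output above are the Active: active (exited) line and status=0/SUCCESS on the ExecStart line. A check for both could be sketched as follows. The function name alpao_init_ok is hypothetical, and the patterns are taken only from the sample output shown in this section.

```shell
# alpao_init_ok: read `systemctl status initialize_alpao` output on stdin and
# succeed only if the unit is "active (exited)" and its ExecStart command
# returned status 0. Hypothetical helper based on the sample output above.
alpao_init_ok() {
    awk '
        /Active: active \(exited\)/      { active = 1 }
        /ExecStart=.*status=0\/SUCCESS/  { ok = 1 }
        END { exit !(active && ok) }
    '
}

# Typical use on a live system:
#   if systemctl status initialize_alpao | alpao_init_ok; then
#       echo "Alpao interface card initialized"
#   fi
```

This only confirms that the boot-time unit ran cleanly; it does not talk to the interface card itself.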
DM Latency and Communication Troubleshooting
There are various ways that the shared memory interprocess communication between the deformable mirrors, loop control(s), and the hardware control processes can stop functioning properly.
Examples with known fixes:
Inability to set or zero flat or test from the dm control gui
This likely points to a bad semaphore. Simply release DM, then re-initialize, and it usually clears. If not, go to more general steps below.
Excessive latency, occurs especially for ALPAOs
This usually requires a power cycle of the driver itself. Release the DM, then use the power control GUI to turn off, then on the DM driver.
Skipped commands
This is possibly caused by collisions on a semaphore, meaning more than one process is monitoring a given semaphore. This can be diagnosed with streamCTRL. If this is not the case, a full software shutdown (both cacao and magao-x) and clearing the /milk/shm and /dev/shm directories (rm *), then restarting, should clear the problem. See step 5 below.
General Troubleshooting
General troubleshooting steps, in order of severity (try the lower-numbered ones first if you don’t have a clear idea what the problem is):
1) Release, then initialize from the dmCtrl GUI.
2) Release, then restart the DM controller software, e.g. for the woofer:
rtc$ xctrl restart dmwoofer
3) Restart the CACAO process that combines the DM shmims:
    first stop the DM controller (see above)
    restart dmcomb (or testbed equivalent) using fpsCTRL:
        run fpsCTRL
        select process to restart with arrow keys
        hit lower-case r to stop the process
        hit upper-case R to start it again
    restart the DM controller (see above)
    Note: this may cause problems in some other processes due to shmim recreation.
4) Power cycle the DM
    release from the dmCtrl GUI
    turn off the power with the pwrCtrl GUI, then turn it back on
    if it doesn’t happen automatically, initialize the DM from the GUI when it has power
    if this does not fix the problem, try steps 1-3 again.
5) Full Software Restart
    Place all hardware controlled from this computer in a safe condition:
        rest modttm and ttmpupil
        start camera warmup (in case you can’t get software back up)
        release all DMs controlled from this computer
    Shut down all software with:
    rtc$ xctrl shutdown
    rtc$ tmux kill-server # for cacao processes not managed by xctrl
    Clear all shared memory:
    rtc$ cd /milk/shm
    rtc$ sudo rm *
    rtc$ cd /dev/shm
    rtc$ sudo rm *
    Now restart software and restore hardware to operating condition.
6) Reboot
    This is a last resort. This may be necessary if a problem has developed in the device driver, for instance.
    Follow the procedure for computer reboot. Ensure all hardware is in a safe condition, including powered-off if needed, before rebooting.
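Because the shared-memory clearing in the Full Software Restart step removes files with rm *, it can be worth listing what is there first. The sketch below deletes nothing. The function name list_shm_files is hypothetical; it prints the non-hidden, non-directory entries directly inside each directory, which is approximately the set that rm * would remove.

```shell
# list_shm_files DIR...: print the non-hidden, non-directory entries directly
# inside each DIR, i.e. approximately what `rm *` would remove there.
# Hypothetical helper; it never deletes anything.
list_shm_files() {
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "skipping $dir: not a directory" >&2
            continue
        fi
        find "$dir" -mindepth 1 -maxdepth 1 ! -type d ! -name '.*'
    done
}

# Typical use before clearing shared memory:
#   list_shm_files /milk/shm /dev/shm
```

If the listing still shows streams that belong to running processes, the software shutdown earlier in that step has probably not completed.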
EDT Framegrabber Problems (camwfs)
The EDT PCIe framegrabber occasionally stops responding. The main symptom of this is no data from camwfs, and no response on the serial over camera link.
If camwfs stops responding on serial (evident in logs, probably frame corruption), first shut down the controlling application.
$ xctrl shutdown camwfs
You will next need to switch from user xsup to yourself:
$ su <your-user-name>
<password>
then do these steps to reload the EDT driver:
$ cd /opt/EDTpdv
$ sudo ./edt_unload
$ sudo ./edt_load
This will reset the kernel module and restore operation. Now return to xsup and restart the controlling application:
$ exit
$ xctrl startup camwfs #<-change if a different camera
After this occurs, you will need to re-start the CACAO loop processes so they re-connect to the camwfs shmim.
Camsci1/2 not responding
If camsci1 and/or camsci2 stop responding, first attempt to restart the control software with xctrl restart. If this does not restore operation, the PICam library needs to be reset. Perform the following steps:
1) Turn power off for both cameras. Note that you will not be able to verify detector temperature, but this cannot be avoided.
2) Stop both camsci control processes. Either use xctrl or go to the tmux session and use ctrl-c.
3) In a terminal on ICC, go to /opt/MagAOX/source/MagAOX/apps/picamCtrl and run the script cleanPI.sh as root. This removes lock files.
4) Re-start both control processes.
5) Power up both cameras.
Killing INDI zombies
If the indiserver (itself a subprocess of xindiserver) crashes uncleanly, the associated xindidriver processes may become orphans (i.e. reparented to PID 1, init). This will prevent xindiserver from starting again until these processes have been killed. (There will be output in logdump suggesting you kill the zombies.)
xctrl includes a built-in zombie hunter, and should do this for you. Should you still be plagued by zombies, the manual version follows.
The following shell command will kill them:
$ kill $(ps -elf | awk '{if ($5 == 1){print $4" "$5" "$15}}' | grep MagAOX/drivers | awk '{print $1}')
To check if any remain use
$ ps -elf | awk '{if ($5 == 1){print $4" "$5" "$15}}' | grep MagAOX/drivers
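The two pipelines above share the same filter: in ps -elf output, field 4 is the PID, field 5 is the parent PID, and field 15 is the start of the command. A single-awk variant of that filter could look like the sketch below. The function name orphaned_driver_pids is hypothetical, and the guard around kill is an addition so that nothing is run when no orphans are found.

```shell
# orphaned_driver_pids: read `ps -elf` output on stdin and print the PID of
# every process whose parent is PID 1 and whose command path contains
# MagAOX/drivers. Hypothetical helper mirroring the pipeline above.
orphaned_driver_pids() {
    awk '$5 == 1 && $15 ~ /MagAOX\/drivers/ { print $4 }'
}

# Typical use on a live system:
#   ps -elf | orphaned_driver_pids            # list any remaining orphans
#   pids=$(ps -elf | orphaned_driver_pids)
#   [ -n "$pids" ] && kill $pids              # only call kill if any exist
```

Like the original command, this relies on the column layout of ps -elf and would need adjusting if ps is invoked with different format options.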
Difficulties with NVIDIA proprietary drivers
When installing, ensure you have systemctl set-default multi-user.target and a display is connected only to the VGA header provided by the motherboard.
If NVIDIA graphical output did work, and now doesn’t: Your kernel may have been updated, requiring a rebuild of the NVIDIA driver. Having dkms installed should prevent needing to do this, but an uninstall and reinstall over SSH will also remedy it.
Runfile installs can be uninstalled with /usr/local/cuda/bin/cuda-uninstaller. This may leave a vestigial /usr/local/cudaXX.YY folder (where XX.YY is a version number) that can most likely be safely removed. (It’s probably just some temporary files that the installer didn’t create and is too polite to remove.)
Computer Fails to Boot
There may be several reasons for this.
Examples with known fixes:
Startup screen frozen at “initalizing” and Q-Code A9
This probably means that the BIOS has lost its setup, and is trying to use a GPU for video display.
Shut down and fully power down.
If you have a new mobo CR2032 battery, replace it now
Remove GPUs (i.e. by disconnecting the PCIe expansion cable from the host card on the mobo).
Install the VGA cable on the mobo (see manual for location)
Alternatively, you may be able to plug a monitor into the GPU
Boot, and press the del key over and over again until you see “Enter Setup” in the lower right corner.
Follow the BIOS setup guide
Reboot (F10, save settings).
Now shut down, fully power down, and reinstall/reconnect all GPUs.
Reboot.
USB Device Communication Problems
If USB controlled devices, such as filter wheels, focus stages, and rotation stages, have errors such as:
ERRNO: -42001 [Unknown error -42001] >TTY: tcgetattr returned error
or:
USB Device 0403:6001:A9EF0AMU not found in udev
or similar, try these things:
Note: As of 2024A we are seeing occasional near-total scrambling of USB communications at LCO, probably due to grounding problems. If many, or essentially all, USB devices appear to be having problems, skip to step 3.
1) Power cycle the problem device.
    Note that not all USB devices have power control. In this case skip to step 2.
    Be sure to power cycle both main power and the USB power if necessary.
2) If power cycling the device did not fix it (or it doesn’t have power control), next restart the software controller. This may be necessary after power-cycling if the USB device was re-enumerated on the motherboard.
    Use xctrl restart xxxx where xxxx is the name of the device.
    Watch the logs to see if the device is “found in udev”.
3) If the above steps do not work, the USB hub associated with the device may need to be reset.
    The following devices are not on the main USB hub, but plugged directly into the computer:
        rhtweeter (RTC)
        ttmpupil (RTC)
        usbdu0 (RTC)
        rhncpc (ICC)
        temprack: lower and upper (ICC)
        usbdu1 (ICC)
    For these devices you can try unplugging and replugging their USB cables directly on the motherboard.
    If the above direct connection devices are not fixed by re-plugging, the computer will have to be rebooted. Follow the procedure for doing so.
    Most USB devices are connected to the main 16-port USB hub. This can be remotely power cycled to reboot it.
        Power off dcpwr from the pdu using pwrGUI. Wait a couple seconds, and power it back on.
        This will cause all of the USB devices to get new addresses/tty numbers, so the software will have to be restarted. It’s probably easiest at this point to use xctrl restart all on ICC instead of restarting them one-by-one.
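When deciding whether one device or many are affected, it can help to pull the two error messages quoted at the start of this section out of the logs. The sketch below does that for text on stdin. The function names usb_errors and missing_usb_ids are hypothetical, and the patterns cover only the two messages quoted in this section.

```shell
# usb_errors: print only the log lines on stdin that match the two USB error
# messages quoted in this section. Hypothetical helper.
usb_errors() {
    grep -E 'tcgetattr returned error|not found in udev'
}

# missing_usb_ids: print just the device identifier from each
# "USB Device <id> not found in udev" line on stdin. Hypothetical helper.
missing_usb_ids() {
    sed -n 's/.*USB Device \([^ ][^ ]*\) not found in udev.*/\1/p'
}

# Typical use, with xxxx standing for the device name as in step 2:
#   logdump -f xxxx | usb_errors
```

Many different identifiers reported at once would point toward the hub or the site-wide problem described in the note rather than a single device.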