Troubleshooting

Illustration of MagAO-X with motto: "Having more things just means more things can go right"

Figuring out what exactly isn’t working

To narrow down the failing component, use xctrl status to see if any MagAO-X apps are not running. The typical MagAO-X app is started by xctrl startup based on a line in a config file in /opt/MagAOX/config/proclist_$MAGAOX_ROLE.txt. This proclist determines which application to start and which config file from /opt/MagAOX/config should be supplied as the -n option (see Standard options). It also uses sudo to run the process as user xsup, regardless of which user called xctrl startup.

Many, if not all, MagAO-X apps are intended to run “forever” (i.e., until shutdown). If the process is dead, you can attach to the tmux session that’s the parent of the process in question with xctrl inspect PROCNAME (where PROCNAME is the name of the failed process). This will occasionally reveal error messages that did not get to the log.

For example, if trippLitePDU is started by xctrl startup with config specified by -n pdu0 and there’s a syntax error in /opt/MagAOX/config/pdu0.conf preventing startup, you can attach to the tmux session with

yourlogin$ xctrl inspect pdu0

The errors before exit, if any, will be in the log. The last few lines of the log can be checked with logdump -f pdu0. The command that started the app will be of the form /opt/MagAOX/bin/$appName -n $configName. You can use the up-arrow key in the tmux session to retrieve it from the shell history and try to relaunch once you’ve corrected whatever error was preventing startup.
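As an illustration of the command form described above, the following sketch assembles the relaunch command from an app name and a config name. The helper name magaox_launch_cmd is hypothetical (it is not part of MagAO-X); it only prints the command so you can inspect it before running it yourself in the tmux session.

```shell
# Hypothetical helper: print the command line of the form
# /opt/MagAOX/bin/$appName -n $configName described above.
magaox_launch_cmd() {
    appName="$1"      # e.g. trippLitePDU
    configName="$2"   # e.g. pdu0 (config read from /opt/MagAOX/config/pdu0.conf)
    if [ -z "$appName" ] || [ -z "$configName" ]; then
        echo "usage: magaox_launch_cmd APPNAME CONFIGNAME" >&2
        return 1
    fi
    printf '%s\n' "/opt/MagAOX/bin/$appName -n $configName"
}

# Example: magaox_launch_cmd trippLitePDU pdu0
```

Printing rather than executing keeps the choice of user (xsup) and tmux session in your hands.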

Addressing specific issues

Shared memory image problems with “No space left on device” errors

When starting MagAO-X apps or CACAO apps that use shared memory images, the ImageStreamIO library will try to create shared memory images on /milk/shm. This can fail with an error like:

ERROR [ FILE: /opt/MagAOX/source/cacao/src/ImageStreamIO/ImageStreamIO.c   FUNCTION: ImageStreamIO_createIm_gpu   LINE: 521 ]
C Error: No space left on device

Indeed, if you use df -h, you’ll see that /milk/shm is full:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
[...]
tmpfs            63G   63G     0 100% /milk/shm
[...]

The solution is to shut down the software using the shared memory images and then clear /milk/shm.

you$ xsupify
xsup$ cd /milk/shm
xsup$ rm *

If rerunning df -h still doesn’t show any space available, something is probably holding a reference to the files. (See this SuperUser question.) You should reboot the computer with sudo reboot (having already shut down or rested any hardware).
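Because rm * is unforgiving when run from the wrong directory, a guarded variant may be worth considering. The sketch below is an assumption-laden illustration, not an existing MagAO-X tool: the function name clear_shm_dir and the CLEAR_SHM_FORCE variable are hypothetical. It refuses any path other than /milk/shm or /dev/shm and only lists files unless told to delete.

```shell
# Hypothetical helper: list (default) or remove the regular files directly
# inside a shared-memory directory. Any path other than the two named in
# this guide is refused. Set CLEAR_SHM_FORCE=1 to actually delete.
clear_shm_dir() {
    dir="$1"
    case "$dir" in
        /milk/shm|/dev/shm) ;;                        # allowed targets only
        *) echo "refusing to clear '$dir'" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || { echo "$dir is not a directory" >&2; return 1; }
    if [ "${CLEAR_SHM_FORCE:-0}" = 1 ]; then
        # -mindepth/-maxdepth are GNU/BSD find extensions (fine on Linux)
        find "$dir" -mindepth 1 -maxdepth 1 -type f -exec rm -f {} +
    else
        find "$dir" -mindepth 1 -maxdepth 1 -type f   # dry run: print only
    fi
}

# Example (as xsup, after shutting down the software):
#   clear_shm_dir /milk/shm                      # show what would be removed
#   CLEAR_SHM_FORCE=1 clear_shm_dir /milk/shm    # remove it
```

Unlike rm *, find also matches hidden files, and the function does not call sudo, so run it as a user with permission to delete the files.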

Loop failing to close for no apparent reason and/or intermittent failures of CACAO calibration process

Believe it or not, this can be a sign of insufficient disk space. Consult df -h and see if any of the filesystems have Use% of 100%. This can also be checked in INDI with the sysMonitor process for the relevant computer (sysMonRTC, sysMonICC).
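Scanning df output by eye is easy to get wrong on a machine with many mounts. The sketch below filters it for you; full_filesystems is a hypothetical helper name, and it assumes the POSIX output format produced by df -P (one line per filesystem, capacity in the fifth column, mount point in the sixth).

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# of every filesystem whose capacity is at or above a threshold (default 100).
full_filesystems() {
    threshold="${1:-100}"
    awk -v t="$threshold" '
        NR > 1 {                  # skip the header line
            use = $5              # Capacity column, e.g. "100%"
            sub(/%/, "", use)
            if (use + 0 >= t + 0) print $6
        }'
}

# Example: df -P | full_filesystems 95
```

Mount points or device names containing spaces would be split incorrectly by this simple field-based parse.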

Lockup / Missing GPUs / nvidia-smi errors

Our computers with PCIe expansion cards will occasionally lock up or lose a GPU (“GPU has fallen off the bus” errors). Sometimes running nvidia-smi fails with “Unable to determine the device handle for GPU 0000:8C:00.0: GPU is lost.  Reboot the system to recover this GPU”. GPU telemetry will also disappear from the monitoring dashboard.

  1. If the system is responding:

    1. If you were using the system, rest any attached hardware and begin camera warmup. (You don’t have to wait for them to reach the warmup temperature.) (For RTC: woofer, tweeter, ttmmod, ttmpupil, and camwfs.)

    2. Shut down (requires sudo)

      [user@exaoN ~]$ sudo shutdown -h now
      
    3. Now “press the power button” using the Moxa IO unit (see the ICC or RTC Power-On section for that computer in the System Power On procedure)

  2. If the system is not responding, GPUs continue to fall off the bus, or nvidia-smi errors persist after following the procedure above:

    1. If you can, perform steps 1.1 and 1.2 above to bring the system down in an orderly fashion.

    2. Power down pdu0.comprtc or pdu.compicc (e.g. with pwrGUI)

    3. Wait at least 10 seconds.

    4. Now perform all of the ICC or RTC Power-On steps from the System Power On procedure.
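To confirm that you are looking at this failure mode before starting the procedure above, you can search command output for the messages quoted in this section. This is a minimal sketch: gpu_fault_reported is a hypothetical helper name, and the suggestion that the “fallen off the bus” message appears in the kernel log (dmesg) is an assumption about where the NVIDIA driver reports it.

```shell
# Hypothetical helper: scan text on stdin (nvidia-smi output or kernel log)
# for the GPU-loss messages quoted above. Exit status 0 means a match.
gpu_fault_reported() {
    grep -E -q 'GPU is lost|GPU has fallen off the bus|Unable to determine the device handle'
}

# Examples:
#   nvidia-smi 2>&1 | gpu_fault_reported && echo "GPU fault reported by nvidia-smi"
#   dmesg | gpu_fault_reported && echo "GPU fault seen in kernel log"   # may need sudo
```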

OCAM connectivity / bad data

OCAM connects over two CameraLink connections. CameraLink #1 carries serial communication with the detector, so if you’re able to command the camera but your data appear bad in rtimv camwfs, the culprit is likely the CameraLink #2 cable. Reseat that cable, run xctrl restart camwfs on ICC, and restart rtimv.

Alpao DM not responding

Make sure it has been initialized. There is an initialize_alpao systemd unit that runs at boot and initializes the interface card. Successful execution looks like this in systemctl status initialize_alpao output:

$ systemctl status initialize_alpao
● initialize_alpao.service - Initialize Alpao interface card
   Loaded: loaded (/opt/MagAOX/config/initialize_alpao.service; enabled; vendor preset: disabled)
   Active: active (exited) since Sun 2019-09-29 11:18:34 MST; 20min ago
  Process: 4449 ExecStart=/opt/MagAOX/config/initialize_alpao.sh (code=exited, status=0/SUCCESS)
 Main PID: 4449 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/initialize_alpao.service

Sep 29 11:18:34 exao3.as.arizona.edu systemd[1]: Started Initialize Alpao interface card.
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: Ref.ID | Model                          | RSW1 |  Type | Device No.
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 1 | PEX-292144                     |    0 |    DI |    17
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 2 | PEX-292144                     |    0 |    DO |    18
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================

The script is saved at /opt/MagAOX/config/initialize_alpao.sh, if you want to see what it’s doing. Note that executing it again will appear to fail with a message about not finding cards to initialize if the cards have been previously initialized.
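Since re-running the script is not a reliable test, checking the recorded exit status of the unit is the safer route. The sketch below looks for the ExecStart line shown in the example output above; alpao_init_succeeded is a hypothetical helper name, and it assumes the status output keeps the “status=0/SUCCESS” wording seen there.

```shell
# Hypothetical helper: read `systemctl status initialize_alpao` output on
# stdin and succeed only if the ExecStart line reports status=0/SUCCESS.
alpao_init_succeeded() {
    grep -q 'ExecStart=.*status=0/SUCCESS'
}

# Example:
#   systemctl status initialize_alpao | alpao_init_succeeded \
#       || echo "initialize_alpao did not report success"
```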

DM Latency and Communication Troubleshooting

There are various ways that the shared memory interprocess communication between the deformable mirrors, loop control(s), and the hardware control processes can stop functioning properly.

Examples with known fixes:

  • Inability to set or zero flat or test from the dm control gui

    • This likely points to a bad semaphore. Simply release DM, then re-initialize, and it usually clears. If not, go to more general steps below.

  • Excessive latency, occurs especially for ALPAOs

    • This usually requires a power cycle of the driver itself. Release the DM, then use the power control GUI to turn off, then on the DM driver.

  • Skipped commands

    • This is possibly caused by collisions on a semaphore, meaning more than one process is monitoring a given semaphore. This can be diagnosed with streamCTRL. If this is not the case, a full software shutdown (both cacao and magao-x) and clearing the /milk/shm and /dev/shm directories (rm *), then restarting, should clear the problem. See step 5 below.

General Troubleshooting

General troubleshooting steps, in order of severity (try the lower-numbered ones first if you don’t have a clear idea what the problem is):

  1. Release, then initialize from the dmCtrl GUI

  2. Release, then restart the DM controller software, e.g. for the woofer:

    rtc$ xctrl restart dmwoofer

  3. Restart the CACAO process that combines the DM shmims:

    • first stop the DM controller (see above)

    • restart dmcomb (or testbed equivalent) using fpsCTRL

      • run fpsCTRL

      • select process to restart with arrow keys

      • hit lower-case r to stop the process

      • hit upper-case R to start it again

    • restart the DM controller (see above)

    Note: this may cause problems in some other processes due to shmim recreation.

  4. Power cycle the DM

    • release from the dmCtrl GUI

    • turn off the power with the pwrCtrl GUI, then turn it back on

    • if it doesn’t happen automatically, initialize the DM from the GUI when it has power

    • if this does not fix the problem, try steps 1-3 again.

  5. Full Software Restart

    • Place all hardware controlled from this computer in a safe condition

      • rest modttm and ttmpupil

      • start camera warmup (in case you can’t get software back up)

      • release all DMs controlled from this computer

    • Shut down all software with:

      rtc$ xctrl shutdown
      rtc$ tmux kill-server  # for cacao processes not managed by xctrl

    • Clear all shared memory:

      rtc$ cd /milk/shm
      rtc$ sudo rm *
      rtc$ cd /dev/shm
      rtc$ sudo rm *

    • Now restart software and restore hardware to operating condition

  6. Reboot

    • This is a last resort. It may be necessary if, for instance, a problem has developed in the device driver.

    • Follow procedure for computer reboot. Ensure all hardware is in a safe condition, including powered-off if needed, before rebooting.

EDT Framegrabber Problems (camwfs)

The EDT PCIe framegrabber occasionally stops responding. The main symptom of this is no data from camwfs, and no response on the serial over camera link.

If camwfs stops responding on serial (evident in logs, probably frame corruption), first shut down the controlling application.

$ xctrl shutdown camwfs

You will next need to switch from user xsup to yourself:

$ su <your-user-name>
<password>

then do these steps to reload the EDT driver:

$ cd /opt/EDTpdv
$ sudo ./edt_unload
$ sudo ./edt_load

This will reset the kernel module and restore operation. Now return to xsup and restart the controlling application:

$ exit
$ xctrl startup camwfs #<-change if a different camera

After this occurs, you will need to re-start the CACAO loop processes so they re-connect to the camwfs shmim.

Camsci1/2 not responding

If camsci1 and/or camsci2 stop responding, first attempt to restart the control software with xctrl restart. If this does not restore operation, the PICam library needs to be reset. Perform the following steps:

  1. Turn power off for both cameras. Note that you will not be able to verify detector temperature, but this cannot be avoided.

  2. Stop both camsci control processes. Either use xctrl or go to the tmux session and use ctrl-c.

  3. In a terminal on ICC, go to /opt/MagAOX/source/MagAOX/apps/picamCtrl and run the script cleanPI.sh as root. This removes lock files.

  4. Re-start both control processes.

  5. Power up both cameras

Killing INDI zombies

If indiserver (itself a subprocess of xindiserver) crashes uncleanly, the associated xindidriver processes may become orphans (i.e., reparented to PID 1, init). This will prevent xindiserver from starting again until these processes have been killed. (There will be output in logdump suggesting you kill the zombies.)

xctrl includes a built-in zombie hunter, and should do this for you. Should you still be plagued by zombies, the manual version follows.

The following shell command will kill them:

$ kill $(ps -elf | awk '{if ($5 == 1){print $4" "$5" "$15}}' | grep MagAOX/drivers | awk '{print $1}')

To check if any remain use

$ ps -elf | awk '{if ($5 == 1){print $4" "$5" "$15}}' | grep MagAOX/drivers
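The commands above depend on the command name landing in column 15 of ps -elf output, which can shift if an earlier column is empty or contains a space. As an alternative sketch (not what xctrl does internally, as far as this guide states), you can ask ps for exactly the columns you need. The helper name orphaned_indi_drivers is hypothetical.

```shell
# Hypothetical helper: read `ps -eo pid=,ppid=,args=` output on stdin and
# print the PID of each process whose parent is PID 1 and whose command
# line mentions MagAOX/drivers.
orphaned_indi_drivers() {
    awk '$2 == 1 && /MagAOX\/drivers/ { print $1 }'
}

# Example (list first, then kill only if the list looks right):
#   ps -eo pid=,ppid=,args= | orphaned_indi_drivers
#   ps -eo pid=,ppid=,args= | orphaned_indi_drivers | xargs -r kill
```

The -r flag to xargs (skip running kill when the list is empty) is a GNU extension, which is available on the Linux systems this guide covers.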

Difficulties with NVIDIA proprietary drivers

  1. When installing, ensure you have run systemctl set-default multi-user.target and that a display is connected only to the VGA header provided by the motherboard

  2. If NVIDIA graphical output did work, and now doesn’t: Your kernel may have been updated, requiring a rebuild of the NVIDIA driver. Having dkms installed should prevent needing to do this, but an uninstall and reinstall over SSH will also remedy it.

  3. Runfile installs can be uninstalled with /usr/local/cuda/bin/cuda-uninstaller. This may leave a vestigial /usr/local/cudaXX.YY folder (where XX.YY is a version number) that can most likely be safely removed. (It’s probably just some temporary files that the installer didn’t create and is too polite to remove.)

Computer Fails to Boot

There may be several reasons for this.

Examples with known fixes:

  • Startup screen frozen at “initializing” and Q-Code A9

    • This probably means that the BIOS has lost its setup, and is trying to use a GPU for video display

    • Shut down and fully power down.

    • If you have a new mobo CR2032 battery, replace it now

    • Remove GPUs (i.e. by disconnecting the PCIe expansion cable from the host card on the mobo).

    • Install the VGA cable on the mobo (see manual for location)

    • Alternatively, you may be able to plug a monitor into the GPU

    • Boot, and press the del key over and over again until you see “Enter Setup” in the lower right corner.

    • Follow the BIOS setup guide

    • Reboot (F10, save settings).

    • Now shut down, fully power down, and reinstall/reconnect all GPUs.

    • Reboot.

USB Device Communication Problems

If USB controlled devices, such as filter wheels, focus stages, and rotation stages, have errors such as:

ERRNO: -42001 [Unknown error -42001] >TTY: tcgetattr returned error

or:

USB Device 0403:6001:A9EF0AMU not found in udev

or similar, try these things:

Note

As of 2024A we are seeing occasional near-total scrambling of USB communications at LCO, probably due to grounding problems. If many (essentially all) USB devices appear to be having problems, skip to step 3.

  1. Power cycle the problem device.

    • Note that not all USB devices have power control. In this case skip to step 2.

    • Be sure to power cycle both main power and the USB power if necessary

  2. If power cycling the device did not fix it (or it doesn’t have power control), next restart the software controller. This may be necessary after power-cycling if the USB device was re-enumerated on the motherboard.

    • Use xctrl restart xxxx where xxxx is the name of the device

    • watch the logs to see if the device is “found in udev”

  3. If the above steps do not work, the USB hub associated with the device may need to be reset.

    • The following devices are not on the main USB hub, but plugged directly into the computer

      • rhtweeter (RTC)

      • ttmpupil (RTC)

      • usbdu0 (RTC)

      • rhncpc (ICC)

      • temprack: lower and upper (ICC)

      • usbdu1 (ICC)

      For these devices you can try unplugging and replugging their USB cables directly on the motherboard

    • If the above direct connection devices are not fixed by re-plugging, the computer will have to be rebooted. Follow the procedure for doing so.

    • Most USB devices are connected to the main 16-port USB hub. This can be remotely power cycled to reboot it.

      • Power off dcpwr from the pdu using pwrGUI. Wait a couple seconds, and power it back on.

      • This will cause all of the USB devices to get new addresses/tty numbers, so the software will have to be restarted. It’s probably easiest at this point to use xctrl restart all on ICC instead of restarting them one-by-one.
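When working through the steps above it can help to check whether the operating system sees the device at all. The sketch below splits the identifier from the “not found in udev” message into its parts. It assumes, without confirmation from the MagAO-X source, that the identifier has the form vendor:product:serial; split_usb_id is a hypothetical helper name.

```shell
# Hypothetical helper: split a "vendor:product:serial" identifier, as printed
# in the "not found in udev" log message, into its three fields.
# Assumption: the message uses that field order.
split_usb_id() {
    id="$1"
    case "$id" in
        *:*:*) ;;
        *) echo "expected VENDOR:PRODUCT:SERIAL, got '$id'" >&2; return 1 ;;
    esac
    vendor="${id%%:*}"        # text before the first colon
    rest="${id#*:}"
    product="${rest%%:*}"     # text between the first and second colon
    serial="${rest#*:}"       # everything after the second colon
    printf 'vendor=%s product=%s serial=%s\n' "$vendor" "$product" "$serial"
}

# Example:
#   split_usb_id 0403:6001:A9EF0AMU
#   lsusb -d 0403:6001    # is any device with this vendor:product enumerated?
```

If lsusb lists nothing for the vendor and product pair, restarting the software controller is unlikely to help until the device or hub has been power cycled.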