Troubleshooting

Illustration of MagAO-X with motto: "Having more things just means more things can go right"

Figuring out what exactly isn’t working

To narrow down the failing component, use xctrl status to see if any MagAO-X apps are not running. The typical MagAO-X app is started by xctrl startup based on a line in a config file in /opt/MagAOX/config/proclist_$MAGAOX_ROLE.txt. This proclist determines which application to start and which config file from /opt/MagAOX/config should be supplied as the -n option (see Standard options). It also uses sudo to run the process as user xsup, regardless of which user called xctrl startup.

Many, if not all, MagAO-X apps are intended to run “forever” (i.e., until shutdown). If the process is dead, you can attach to the tmux session that’s the parent of the process in question with xctrl inspect PROCNAME (where PROCNAME is the name of the failed process). This will occasionally reveal error messages that did not get to the log.

For example, if trippLitePDU is started by xctrl startup with config specified by -n pdu0 and there’s a syntax error in /opt/MagAOX/config/pdu0.conf preventing startup, you can attach to the tmux session with

yourlogin$ xctrl inspect pdu0

The errors before exit, if any, will be in the log. The last few lines of the log can be checked with logdump -f pdu0. The command that started the app will be of the form /opt/MagAOX/bin/$appName -n $configName. You can use the up-arrow key in the tmux session to retrieve it from the shell history and try to relaunch once you’ve corrected whatever error was preventing startup.
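As a minimal sketch of the naming convention described above, the two helpers below build the launch command and the config file path from an app name and a config name. The function names `magaox_launch_cmd` and `magaox_config_path` are hypothetical and exist only for illustration; they are not part of MagAO-X.

```shell
# Print the command line of the form /opt/MagAOX/bin/$appName -n $configName.
# $1 = app name (binary in /opt/MagAOX/bin), $2 = config name (the -n option).
magaox_launch_cmd() {
    printf '%s\n' "/opt/MagAOX/bin/$1 -n $2"
}

# Print the config file path for a config name, e.g. pdu0 -> .../pdu0.conf.
magaox_config_path() {
    printf '%s\n' "/opt/MagAOX/config/$1.conf"
}
```

For the example above, `magaox_launch_cmd trippLitePDU pdu0` prints the command you would expect to find in the tmux session's shell history, and `magaox_config_path pdu0` prints the file to check for syntax errors.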

Addressing specific issues

Missing `camflowfs` or `camllowfs`

When booting ICC, ensure that camflowfs and camllowfs are already powered on. Otherwise, you will have to reboot ICC to successfully connect.

Things to try:

Hotplugging

begin with both cameras powered off

on exao3/ICC: xctrl shutdown camflowfs camllowfs

power on both cameras

as a non-xsup user, on exao3/ICC, run sudo /opt/pvcam/drivers/in-kernel/pcie/hotplug_pcie.sh

verify the number of “active cameras” it reports, and that none say no name found, rescan or PC reboot needed

on exao3/ICC: xctrl startup camflowfs camllowfs

Outcome 1: successful hotplugging

The camflowfs.fsm / camllowfs.fsm property will change to OPERATING and images will start to appear.

Outcome 2: unsuccessful hotplugging

When hotplugging is unsuccessful, you will see output like the following, which includes no name found, rescan or PC reboot needed.

$ sudo /opt/pvcam/drivers/in-kernel/pcie/hotplug_pcie.sh
Detected 1 supported card(s)
0000:41:00.0 - upstream port (Dolphin PXH832 card [10b5:8733])
   0000:42:09.0 - downstream port
      No camera found
   0000:42:08.0 - downstream port
      No camera found

Found 0 active camera(s)
cat: /sys/module/pvcam_pcie/refcnt: No such file or directory

Performing hot-reset of downstream ports...
0000:42:08.0
0000:42:09.0
Removing active cameras...
Rescanning downstream ports...
0000:42:08.0
0000:42:09.0

Detected 1 supported card(s)
0000:41:00.0 - upstream port (Dolphin PXH832 card [10b5:8733])
   0000:42:09.0 - downstream port
      0000:44:00.0 - camera rev. 8 (no name found, rescan or PC reboot needed)
   0000:42:08.0 - downstream port
      0000:43:00.0 - camera rev. 8 (no name found, rescan or PC reboot needed)

Found 2 active camera(s)
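The check described in the hotplugging steps can be sketched as a small filter over the script's output. This is an illustration only: `hotplug_ok` is a hypothetical helper name, and it relies solely on the two strings visible in the sample output above (the "no name found" marker and the "Found N active camera(s)" line). The exact format of a successful run is not shown in this section, so treat the result as a hint rather than a verdict.

```shell
# Read hotplug_pcie.sh output on stdin.
# Returns 1 if any line carries the "no name found" marker, or if the last
# "Found N active camera(s)" line reports zero cameras (or is missing).
# Returns 0 otherwise.
hotplug_ok() {
    awk '
        /no name found, rescan or PC reboot needed/ { bad = 1 }
        /^Found [0-9]+ active camera\(s\)/          { count = $2 }
        END {
            if (bad || count + 0 == 0) exit 1
            exit 0
        }
    '
}
```

One way to use it is to save the script output to a file while you read it, for example with `sudo /opt/pvcam/drivers/in-kernel/pcie/hotplug_pcie.sh | tee /tmp/hotplug.log`, and then run `hotplug_ok < /tmp/hotplug.log && echo "cameras look named"`.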

This section TODO

Shared memory image problems with “No space left on device” errors

When starting MagAO-X apps or CACAO apps that use shared memory images, the ImageStreamIO library will try to create shared memory images on /milk/shm. This can fail with an error like:

ERROR [ FILE: /opt/MagAOX/source/cacao/src/ImageStreamIO/ImageStreamIO.c   FUNCTION: ImageStreamIO_createIm_gpu   LINE: 521 ]
C Error: No space left on device

Indeed, if you use df -h, you’ll see that /milk/shm is full:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
[...]
tmpfs            63G   63G     0 100% /milk/shm
[...]

The solution is to shut down and then clear /milk/shm.

you$ xsupify
xsup$ cd /milk/shm
xsup$ rm *

If rerunning df -h still doesn’t show any space available, something is probably holding a reference to the files. (See this SuperUser question.) You should reboot the computer with sudo reboot (having already shut down / rested any hardware).
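Note that `rm *` acts on whatever the current directory happens to be, so if the preceding `cd` fails it can remove the wrong files. A more defensive variant, sketched below under stated assumptions, takes the directory as an argument and only lists files unless told otherwise. The helper name `clear_dir_files` and the `DRY_RUN` variable are hypothetical. Unlike `rm *`, this version also matches dotfiles and skips symlinks, and it assumes a `find` that supports `-maxdepth` (GNU and BSD do).

```shell
# Remove the regular files directly inside one directory, without relying on
# the current working directory.
# With DRY_RUN=1 (the default) the files are only listed, not removed.
clear_dir_files() {
    if [ ! -d "$1" ]; then
        echo "clear_dir_files: $1 is not a directory" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        find "$1" -maxdepth 1 -type f -print
    else
        find "$1" -maxdepth 1 -type f -exec rm -f -- {} +
    fi
}
```

Run `clear_dir_files /milk/shm` first to review the list, then `DRY_RUN=0 clear_dir_files /milk/shm` as xsup to actually remove the files.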

Loop failing to close for no apparent reason and/or intermittent failures of CACAO calibration process

Believe it or not, this can be a sign of insufficient disk space. Consult df -h and see if any of the filesystems have Use% of 100%. This can also be checked in INDI with the sysMonitor process for the relevant compute (sysMonRTC, sysMonICC).
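Scanning `df` output by eye is easy to get wrong on a machine with many mounts. The sketch below filters the portable `df -P` format for filesystems at or above a threshold. `full_filesystems` is a hypothetical helper name, and the filter assumes mount points without spaces.

```shell
# Read `df -P` output on stdin and print the mount point of every filesystem
# whose capacity column is at or above the threshold in $1 (default: 100).
full_filesystems() {
    awk -v limit="${1:-100}" '
        NR > 1 {                      # skip the header line
            use = $5                  # capacity column, e.g. "100%"
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) print $6
        }
    '
}
```

Typical use is `df -P | full_filesystems`, or `df -P | full_filesystems 95` to catch filesystems that are nearly full.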

Lockup / Missing GPUs / `nvidia-smi` errors

Our computers with PCIe expansion cards will occasionally lock up, or will lose a GPU (GPU has fallen off the bus errors). Sometimes running nvidia-smi fails with Unable to determine the device handle for GPU 0000:8C:00.0: GPU is lost. Reboot the system to recover this GPU. GPU telemetry will also disappear from the monitoring dashboard.

If the system is responding:
1. If you were using the system, rest any attached hardware (for RTC: woofer, tweeter, ttmmod, ttmpupil, and camwfs) and begin camera warmup. (You don’t have to wait for the cameras to reach the warmup temperature.)
2. Shut down (requires sudo):
```
[user@exaoN ~]$ sudo shutdown -h now
```
3. Now “press the power button” using the Moxa IO unit (see the ICC or RTC Power-On section for that computer in the System Power On procedure).

If the system is not responding, GPUs continue to fall off the bus, or nvidia-smi errors persist after following the procedure above:
1. If you can, perform steps 1 and 2 above to bring the system down in an orderly fashion.
2. Power down pdu0.comprtc or pdu.compicc (e.g. with pwrGUI)
3. Wait at least 10 seconds.
4. Now perform all of the ICC or RTC Power-On steps from the System Power On procedure.

OCAM connectivity / bad data

OCAM connects over two CameraLink connections. CameraLink #1 carries serial communication with the detector, so if you’re able to command the camera but your data appear bad in rtimv camwfs, the culprit is likely the CameraLink #2 cable. Reseat the cable, run xctrl restart camwfs on ICC, and restart rtimv.

Alpao DM not responding

Make sure it has been initialized. There is an initialize_alpao systemd unit that runs at boot and initializes the interface card. Successful execution looks like this in systemctl status initialize_alpao output:

$ systemctl status initialize_alpao
● initialize_alpao.service - Initialize Alpao interface card
   Loaded: loaded (/opt/MagAOX/config/initialize_alpao.service; enabled; vendor preset: disabled)
   Active: active (exited) since Sun 2019-09-29 11:18:34 MST; 20min ago
  Process: 4449 ExecStart=/opt/MagAOX/config/initialize_alpao.sh (code=exited, status=0/SUCCESS)
 Main PID: 4449 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/initialize_alpao.service

Sep 29 11:18:34 exao3.as.arizona.edu systemd[1]: Started Initialize Alpao interface card.
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: Ref.ID | Model                          | RSW1 |  Type | Device No.
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 1 | PEX-292144                     |    0 |    DI |    17
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: --------------------------------------------------------------------
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: 2 | PEX-292144                     |    0 |    DO |    18
Sep 29 11:18:35 exao3.as.arizona.edu initialize_alpao.sh[4449]: ====================================================================

The script is saved at /opt/MagAOX/config/initialize_alpao.sh, if you want to see what it’s doing. Note that executing it again will appear to fail with a message about not finding cards to initialize if the cards have been previously initialized.

DM Latency and Communication Troubleshooting

There are various ways that the shared memory interprocess communication between the deformable mirrors, loop control(s), and the hardware control processes can stop functioning properly.

Examples with known fixes:

Inability to set or zero flat or test from the dm control gui
- This likely points to a bad semaphore. Simply release DM, then re-initialize, and it usually clears. If not, go to more general steps below.
Excessive latency, occurs especially for ALPAOs
- This usually requires a power cycle of the driver itself. Release the DM, then use the power control GUI to turn off, then on the DM driver.
Skipped commands
- This is possibly caused by collisions on a semaphore, meaning more than one process is monitoring a given semaphore. This can be diagnosed with streamCTRL. If this is not the case, a full software shutdown (both cacao and magao-x) and clearing the /milk/shm and /dev/shm directories (rm *), then restarting, should clear the problem. See step 5 below.

General Troubleshooting

General troubleshooting steps, in order of severity (try the lower-numbered ones first if you don’t have a clear idea what the problem is):

1. Release, then initialize from the dmCtrl GUI.
2. Release, then restart the DM controller software, e.g. for the woofer:

rtc$ xctrl restart dmwoofer

3. Restart the CACAO process that combines the DM shmims:
- first stop the DM controller (see above)
- restart dmcomb (or testbed equivalent) using fpsCTRL
  - run fpsCTRL
  - select process to restart with arrow keys
  - hit lower-case r to stop the process
  - hit upper-case R to start it again
- restart the DM controller (see above)
Note: this may cause problems in some other processes due to shmim recreation.
4. Power cycle the DM
- release from the dmCtrl GUI
- turn off the power with the pwrCtrl GUI, then turn it back on
- if it doesn’t happen automatically, initialize the DM from the GUI when it has power
- if this does not fix the problem, try steps 1-3 again.
5. Full Software Restart
- Place all hardware controlled from this computer in a safe condition
  - rest ttmmod and ttmpupil
  - start camera warmup (in case you can’t get software back up)
  - release all DMs controlled from this computer
- Shut down all software with:
```
rtc$ xctrl shutdown
rtc$ tmux kill-server  # for cacao processes not managed by xctrl
```
- Clear all shared memory:
```
rtc$ cd /milk/shm
rtc$ sudo rm *
rtc$ cd /dev/shm
rtc$ sudo rm *
```
- Now restart software and restore hardware to operating condition
6. Reboot
- This is a last resort. It may be necessary if, for instance, a problem has developed in the device driver.
- Follow procedure for computer reboot. Ensure all hardware is in a safe condition, including powered-off if needed, before rebooting.

EDT Framegrabber Problems (camwfs)

The EDT PCIe framegrabber occasionally stops responding. The main symptom of this is no data from camwfs, and no response on the serial over camera link.

If camwfs stops responding on serial (evident in logs, probably frame corruption), first shut down the controlling application.

$ xctrl shutdown camwfs

You will next need to switch from user xsup to yourself:

$ su <your-user-name>
<password>

then do these steps to reload the EDT driver:

$ cd /opt/EDTpdv
$ sudo ./edt_unload
$ sudo ./edt_load

This will reset the kernel module and restore operation. Now return to xsup and restart the controlling application:

$ exit
$ xctrl startup camwfs #<-change if a different camera

After this occurs, you will need to re-start the CACAO loop processes so they re-connect to the camwfs shmim.

Camsci1/2 not responding

If camsci1 and/or camsci2 stop responding, first attempt to restart the control software with xctrl restart. If this does not restore operation, the PICam library needs to be reset. Perform the following steps:

Turn power off for both cameras. Note that you will not be able to verify detector temperature, but this cannot be avoided.
Stop both camsci control processes. Either use xctrl or go to the tmux session and use ctrl-c.
In a terminal on ICC, go to /opt/MagAOX/source/MagAOX/apps/picamCtrl and run the script cleanPI.sh as root. This removes lock files.
Re-start both control processes.
Power up both cameras

rtimv not updating

If rtimv is not showing images:

If remote, check that you have opened the right tunnel, usually to port 9000 on AOC.
Check that the camera is actually updating. You can use:
[icc/rtc]$ milk-shmimmon camname
on the computer which hosts the camera. If not, try restarting the camera control software and then take camera specific troubleshooting steps.
Check that the mzmqServer process on the computer which hosts the camera is alive

Check that the mzmqClient process for the computer which hosts the camera is alive on AOC
Check that frames are arriving on AOC with:
aoc$ milk-shmimmon camname
If remote, check that mzmqServerAOC is alive on AOC
If you are using the -Z option with rtimv (milkzmq direct), try:
Start a milkzmqClient instance:
$ milkzmqClient -p 9000 localhost camname #assumes you have the tunnel open
It will tell you if it begins receiving images, and you will see a file show up in /milk/shm.

Next try using milk-shmimmon camname or rtimv camname.

Killing INDI zombies

If indiserver (itself a subprocess of xindiserver) crashes uncleanly, the associated xindidriver processes may become orphans, i.e. reparented to PID 1 (init). This will prevent xindiserver from starting again until these processes have been killed. (There will be output in logdump suggesting you kill the zombies.)

Often using xctrl to restart the INDI server will fail due to the zombies. You can restart the INDI server efficiently by going to the tmux session for it:

$ tmux a -t isRTC

where you change isRTC to the correct process name, such as isAOC or isICC. You need to be xsup. If the INDI server is down you will see error messages followed by a prompt. The following script will kill the zombies:

$ killIndiZombies

Now you can up-arrow to find the INDI server command. It should be:

$ /opt/MagAOX/bin/xindiserver -n isRTC

Hit enter to start up the server. Once it is running, exit the tmux session with ctrl-b then d. Now you can test that the INDI server is running with

$ getINDI

which should return all the INDI properties on that machine.
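To see which processes are orphaned before killing anything, you can filter `ps` output for processes whose parent is PID 1, following the description above. This is a sketch only and is not how killIndiZombies works internally. The helper name `orphaned_processes` is hypothetical, and it assumes the drivers show up in `ps` under the command name xindidriver, which may not hold if they are launched under differently named links; pass another name as the first argument in that case.

```shell
# Read `ps -e -o pid= -o ppid= -o comm=` output on stdin and print the PID of
# every process with parent PID 1 whose command name equals $1
# (default: xindidriver).
orphaned_processes() {
    awk -v name="${1:-xindidriver}" '$2 == 1 && $3 == name { print $1 }'
}
```

For example, `ps -e -o pid= -o ppid= -o comm= | orphaned_processes` lists candidate PIDs. An empty result means no orphans were found under that name.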

Difficulties with NVIDIA proprietary drivers

When installing, ensure you have run systemctl set-default multi-user.target and that a display is connected only to the VGA header provided by the motherboard.
If NVIDIA graphical output did work, and now doesn’t: Your kernel may have been updated, requiring a rebuild of the NVIDIA driver. Having dkms installed should prevent needing to do this, but an uninstall and reinstall over SSH will also remedy it.
Runfile installs can be uninstalled with /usr/local/cuda/bin/cuda-uninstaller. This may leave a vestigial /usr/local/cudaXX.YY folder (where XX.YY is a version number) that can most likely be safely removed. (It’s probably just some temporary files that the installer didn’t create and is too polite to remove.)

Computer Fails to Boot

There may be several reasons for this.

Examples with known fixes:

Startup screen frozen at “initializing” and Q-Code A9
- This probably means that the BIOS has lost its setup, and is trying to use a GPU for video display
- Shut down and fully power down.
- If you have a new mobo CR2032 battery, replace it now
- Remove GPUs (i.e. by disconnecting the PCIe expansion cable from the host card on the mobo).
- Install the VGA cable on the mobo (see manual for location)
- Alternatively, you may be able to plug a monitor into the GPU
- Boot, and press the del key over and over again until you see “Enter Setup” in the lower right corner.
- Follow the BIOS setup guide
- Reboot (F10, save settings).
- Now shut down, fully power down, and reinstall/reconnect all GPUs.
- Reboot.

USB Device Communication Problems

If USB controlled devices, such as filter wheels, focus stages, and rotation stages, have errors such as:

ERRNO: -42001 [Unknown error -42001] >TTY: tcgetattr returned error

or:

USB Device 0403:6001:A9EF0AMU not found in udev

or similar, try these things:

Note

As of 2024A we are seeing occasional near-total scrambling of USB communications at LCO, probably due to grounding problems. If many or essentially all USB devices appear to be having problems, skip to step 3.

1. Power cycle the problem device.
- Note that not all USB devices have power control. In this case skip to step 2.
- Be sure to power cycle both main power and the USB power if necessary
2. If power cycling the device did not fix it (or it doesn’t have power control), next restart the software controller. This may be necessary after power-cycling if the USB device was re-enumerated on the motherboard.
- Use xctrl restart xxxx where xxxx is the name of the device
- watch the logs to see if the device is “found in udev”
3. If the above steps do not work, the USB hub associated with the device may need to be reset.
- The following devices are not on the main USB hub, but plugged directly into the computer
  - rhtweeter (RTC)
  - ttmpupil (RTC)
  - usbdu0 (RTC)
  - rhncpc (ICC)
  - temprack: lower and upper (ICC)
  - usbdu1 (ICC)
  For these devices you can try unplugging and replugging their USB cables directly on the motherboard
- If the above direct connection devices are not fixed by re-plugging, the computer will have to be rebooted. Follow the procedure for doing so.
- Most USB devices are connected to the main 16-port USB hub. This can be remotely power cycled to reboot it.
  - Power off dcpwr from the pdu using pwrGUI. Wait a couple seconds, and power it back on.
  - This will cause all of the USB devices to get new addresses/tty numbers, so the software will have to be restarted. It’s probably easiest at this point to use xctrl restart all on ICC instead of restarting them one-by-one.
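When working through the steps above, it can help to pull the device identifier out of the "not found in udev" message so you can check whether the device is enumerated at all. The sketch below assumes the identifier has the form vendor:product:serial, which is inferred from the sample message and not confirmed by this section. `usb_id_from_log` is a hypothetical helper name.

```shell
# Read log lines on stdin. For each line of the form
#   "USB Device VVVV:PPPP:SERIAL not found in udev"
# print "VVVV PPPP SERIAL" (assumed to be vendor ID, product ID, serial).
usb_id_from_log() {
    sed -n 's/.*USB Device \([0-9A-Fa-f]\{4\}\):\([0-9A-Fa-f]\{4\}\):\([^ ]*\) not found in udev.*/\1 \2 \3/p'
}
```

For the sample message this prints `0403 6001 A9EF0AMU`. You can then run `lsusb -d 0403:6001` to see whether any device with that vendor and product ID is currently visible to the computer. If nothing is listed, a software restart alone is unlikely to help.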
