Shell "disconnects" if left unattended for long periods (hours) under 10.11/10.12

I'm running into a scenario where I lose communication with a Mode 2 shell under OS X 10.11 or 10.12 if the pending task is paused overnight (the user isn't around to respond to the prompt for interaction). At first we thought this was AppNap, but we've set the task flags to turn that off.
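(For the curious: the usual way to turn App Nap off programmatically is the NSProcessInfo activity API. A minimal Xojo sketch of that approach via declares is below; the constant value, reason string, and property name are illustrative, not necessarily the exact flags we set.)

    ' Sketch: opting out of App Nap via NSProcessInfo beginActivityWithOptions:reason:
    ' (illustrative only; mActivityToken should be a Ptr property so the returned
    ' activity token is kept alive for as long as the task runs).
    Declare Function NSClassFromString Lib "Foundation" (clsName As CFStringRef) As Ptr
    Declare Function processInfo Lib "Foundation" Selector "processInfo" (cls As Ptr) As Ptr
    Declare Function beginActivity Lib "Foundation" Selector "beginActivityWithOptions:reason:" (obj As Ptr, options As UInt64, reason As CFStringRef) As Ptr

    Const kNSActivityUserInitiated = &hFFFFFF  ' includes idle-system-sleep disabled

    Dim info As Ptr = processInfo(NSClassFromString("NSProcessInfo"))
    mActivityToken = beginActivity(info, kNSActivityUserInitiated, "Long-running tape task")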

During the "idle" period, we are pinging the shell by calling Poll from a Timer that fires every 30 seconds.
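Stripped down, the poll looks something like this (mShell and PollTimer are placeholder names, not the real project code):

    ' PollTimer.Period = 30000, PollTimer.Mode = Timer.ModeMultiple
    Sub PollTimer_Action(sender As Timer)
      If mShell <> Nil And mShell.IsRunning Then
        mShell.Poll  ' service the asynchronous (Mode 2) shell's buffers and events
      End If
    End Sub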

  • Is this possibly a Xojo timer issue after running for so long?
  • Is this possibly the Xojo Shell getting lost after no interaction for so long?
  • Is this just Apple playing their unnecessary priority shuffle games?
  • Something else?

Thanks,
Tim

@Dave S - Thanks, but that was the first thing I thought of - that maybe the timeout value had somehow gotten set in OS X. But it's actually set explicitly to -1 when the task is created.

Good idea, though :slight_smile:

I deleted my answer when I realized you said macOS; Timeout only applies to Windows (per the LR).

I saw that you’d removed it after I replied. But, you can set it under OS X and Linux as well. I don’t know what effect it would have, but I’ve gotten into the habit of setting it to -1 by default.

Perhaps it could be this?

From the LR.

Thanks, Derk, but definitely not. The process had been running for around 6-8 hours and this only happens under 10.11/10.12.

Fixed instance?
Hmm, I have 2 shells running for months on a GUI app. The console app is not made in Xojo; the GUI is.

Are both made in Xojo?

Nope - the backend tools are written in C. If the task is continuous (such as when using an automated changer), the problem does not occur over days or even weeks in one user’s case (280TB of data).

The issue is that the primary tool finishes a section of the task, we see the prompt and present a popup dialog for the user to swap media, start a separate poll loop checking to see whether the media has been changed, and signal the primary helper to continue if so. If the user responds to this within an hour or less, things work as expected and the task continues. However, if it sits longer, the probability of the thread "dying / hanging" increases the longer the task waits, with 5 hours seeming to be the guaranteed fail point.
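Roughly, the hand-off looks like this (heavily simplified, with invented names - the real prompt text and helpers are different):

    ' mShell is the Mode 2 shell running the primary tool; mMediaTimer is the
    ' secondary 30-second poll; MediaHasChanged wraps the media check.
    Sub mShell_DataAvailable(sender As Shell)
      If InStr(sender.ReadAll, "insert next volume") > 0 Then
        ShowSwapMediaDialog                    ' non-blocking prompt to swap media
        mMediaTimer.Mode = Timer.ModeMultiple  ' start the secondary poll loop
      End If
    End Sub

    Sub mMediaTimer_Action(sender As Timer)
      If MediaHasChanged Then
        sender.Mode = Timer.ModeOff
        mShell.WriteLine ""                    ' signal the primary helper to continue
      End If
    End Sub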

Running the same tasks in a terminal directly will wait for 5 days and still pick up when the media change is noted.

No memory leaks or anything like that to be found?

It seems odd; try to see if the dialog is blocking somehow. Has the application been in "hibernate" mode when this happened?
Perhaps you can use the Deactivate event of the main window (set a boolean) to hold off on showing the prompt until the Activate event of the main window has been raised (reset the boolean).

Now that I dig further, it has to be something to do with the secondary poll task (a Timer set to a 30-second period). My primary shell is not showing as zombied, so I can only assume that it is still sane. Because the secondary poll task is Timer related, I'll add some log output for every 10th poll to see if I can find a point where the timer stops firing.
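Something along these lines in the Timer's Action event (mPollCount is just a counter property added for the logging):

    ' Heartbeat logging so the system log shows when (or if) the timer stops firing.
    mPollCount = mPollCount + 1
    If mPollCount Mod 10 = 0 Then
      System.DebugLog "Media poll still firing, count = " + Str(mPollCount)
    End If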

Just to reaffirm - this does not happen on systems running 10.6.8 - 10.9.5.

Tim

I also run shell tasks for helper applications that can run for months at a time and haven't had any problems on newer OS versions. I've had users keep a shell open for a full calendar year without difficulty, so it's got to be some feature of the app you're running in the shell. In my case some of the shell apps are written in C and some are written in Xojo, but all are written by us.

The timer you use to poll the shell isn't really a "ping"; it just checks the buffers for any data that needs to be transferred. I don't use a poll timer like that and have never had a shell go silent on me. The only way to really "poll" something is to send it a data packet of some kind and confirm it received it by having it send back a response. We do that: if there is no communication for a while, we send a ping packet that the helper should answer with a ping response. Just polling the shell is not the same and does nothing to verify that the process is still running properly.
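In very rough terms (the strings and names here are just for illustration):

    ' If the helper has been quiet for a while, send a ping and expect a pong back.
    Sub SendPingIfQuiet()
      If Microseconds - mLastTraffic > 60 * 1000000 Then  ' no traffic for a minute
        mShell.WriteLine "PING"
      End If
    End Sub

    Sub mShell_DataAvailable(sender As Shell)
      mLastTraffic = Microseconds
      If InStr(sender.ReadAll, "PONG") > 0 Then
        ' the helper answered, so the process is alive and servicing its input
      End If
    End Sub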

The timer-based secondary "poll" task is actually calling a function that reads the state of a tape drive on the user's system, monitoring for a specific SCSI Sense value (a not-ready to ready transition), so my use of "Poll" here was a bit misleading - sorry.

Ah, I get it then, it's actually doing something :slight_smile: Then I don't know, but it should be possible to hold a shell open for a very long time.

Shooting in the dark, but if it is the timer getting overflowed somehow, I would try to set the timer not for 30 seconds, but for 60 seconds. The issue should then present at twice the time.

I would not entirely discount AppNap just yet either. It is awfully suspicious that the issue manifests exactly when AppNap could be doing its thing. Maybe you want to log what the timer does and see when it eventually freezes. With all the changes in El Capitan and Sierra, it is quite possible AppNap is still a factor.

AppNap could be it as @Michel Bujardet suggests. Or it could be your disk getting into sleep mode while the shell is active. If the Mac is unattended, system settings apply.

You could check System Preferences -> Energy Saver and post the settings there (a screenshot). We'd all benefit if we knew what it was.

@Michel Bujardet - The task that runs in the timer action takes around 400ms, so I would hope that 30 seconds would be plenty (and it is in 10.10 and earlier).

@Derk Jochems - I sent the following settings to one user to exterminate AppNap like the bug that it is. This also sets a sysctl flag that stops the kernel from dropping priorities, which was causing some tasks to zombie (that is a real Unix state).

    defaults write NSGlobalDomain NSAppSleepDisabled -bool YES
    sudo sysctl -w debug.lowpri_throttle_enabled=0
    sudo sh -c 'echo "debug.lowpri_throttle_enabled=0" >> /etc/sysctl.conf'

Fingers crossed.

Whatever happened to just letting the kernel do its job?

The idea was not that the execution time was too long, but that a bug in the timer itself could happen after a number of cycles.