Linux desktop app main thread frozen, other threads still executing

Hello, I have a Linux desktop app that has become frozen (UI and event handlers are unresponsive); however, other threads are still executing.

GDB backtrace shows the main thread is blocked in pthread_cond_wait, i.e. waiting on a condition variable (and its associated mutex).

Thread 1 (Thread 0x7fb785b040 (LWP 192663) "redacted"):
#0  0x0000007fb791b694 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x220be434) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x220be434) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x220be434, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x0000007fb791e1d0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x220be3d8, cond=0x220be408) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x220be408, mutex=0x220be3d8) at ./nptl/pthread_cond_wait.c:618
#5  0x0000007fba042b14 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#6  0x0000007fba040d74 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#7  0x0000007fba03dbb4 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#8  0x0000007fba03b1f0 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#9  0x0000007fba03ccd0 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#10 0x0000007fba022424 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#11 0x00000000006f4b14 in DesktopApplication._CallFunctionWithExceptionHandling%%o<DesktopApplication>p ()
#12 0x0000007fba022230 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#13 0x0000007fba0224d0 in ?? () from /redacted Libs/XojoGUIFrameworkARM64.so
#14 0x0000007fba01f328 in RuntimeRun () from /redacted Libs/XojoGUIFrameworkARM64.so
#15 0x000000000075c860 in REALbasic._RuntimeRun ()
#16 0x000000000170eb90 in _Main ()
#17 0x000000000170e3ac in main ()
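
As a reading aid for frame #0, here is a minimal Python sketch that decodes the `op=393` argument using the operation codes and flags from `<linux/futex.h>`. The helper name `decode_futex_op` is made up for illustration, and the table only lists a few common operations.

```python
# Futex operation codes and flag bits as defined in <linux/futex.h>.
# Only a handful of operations are listed; the helper name is hypothetical.
FUTEX_CMDS = {
    0: "FUTEX_WAIT",
    1: "FUTEX_WAKE",
    9: "FUTEX_WAIT_BITSET",
    10: "FUTEX_WAKE_BITSET",
}
FUTEX_PRIVATE_FLAG = 128
FUTEX_CLOCK_REALTIME = 256


def decode_futex_op(op):
    """Split a raw futex op value into its command name and flag names."""
    flags = []
    if op & FUTEX_PRIVATE_FLAG:
        flags.append("FUTEX_PRIVATE_FLAG")
    if op & FUTEX_CLOCK_REALTIME:
        flags.append("FUTEX_CLOCK_REALTIME")
    cmd = op & ~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
    name = FUTEX_CMDS.get(cmd, "unknown(%d)" % cmd)
    return " | ".join([name] + flags)


print(decode_futex_op(393))
# prints: FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME
```

Together with `abstime=0x0` in the same frame, this reads as a wait with no timeout: the thread sleeps until something wakes that futex word.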

I can see other threads are starting and completing.

[New Thread 0x7f8606ee40 (LWP 789853)]
[Detaching after vfork from child process 789854]
[Detaching after vfork from child process 789856]
[Thread 0x7f8606ee40 (LWP 789853) exited]
[Detaching after vfork from child process 789859]
[Detaching after vfork from child process 789865]

I can see that Shell.Execute commands are being run from these threads via htop tree view.

Does anyone have some insights on where to look or possible commands to run to gain more information? I currently have the process in GDB but it took about a month for it to hang so I want to make sure I don’t miss this opportunity.

Thanks!

What shell mode are you using? Synchronous shells can cause the main thread to lock up, even when used from within a thread.

Thanks for the reply, all my threads are async.

Fair enough.

If the main thread is waiting for a mutex, then the questions are which UI areas use them and, the really hard part, what is keeping that mutex locked. Could it be that the threads simply swamp the UI and never allow it to get a turn?

I have not used any mutexes in my app so it must be part of the framework. I have a few threads but most of their time is spent sleeping. I can see that 2 of the normally active threads are still firing but the rest are waiting for data from various sources like TCPSocket, Async Shell, etc. I can’t see any signs that any of the threads are taking more than their fair share. I also have a clock with seconds that is fired by a timer, it had not changed in the 5 days I let it sit frozen before I had to restart the app.

Hmm… So I take it this isn’t some beta version of Xojo you are running?

Extensive logging is all you can do, I guess.
What threads are they? Can you give more details?

Nope, this is Xojo 2024r2.1

Memory leak? Disk space issue? Things like that can take a while to kick in.

I have tried my best at logging and debugging via GDB, however, with no event loop (frozen), my options are limited.

There are some data processing threads that wait on DataAvailable events to process data from various sources like TCP/UDP Socket, Async Shell, and Serial. None of them fire, as they rely on the event loop to run.

I also have 2 threads that do calculations and set outputs based on the data processed from the other threads. These threads use Thread.Sleep to execute on an interval and appear to be running as I can see the outputs and trace in GDB.

I did not see any increase in memory or lack of disk space. Another note is that the issue appears to be happening on both Linux x86 and ARM (both 64-bit). I have not observed the app hanging on macOS; however, I believe that is more due to a lack of the signals/conditions that might cause the issue (I do not have any macOS machines deployed) than to the architecture.

You can monitor and log some items from the Xojo Runtime module. MemoryUsed and ObjectCount in particular will help you spot memory leaks. If your ObjectCount keeps going up, then objects are likely not being destroyed as they should. This happens in a few circumstances:

  • Very tight loops for a long duration. Fails to allow Xojo to tidy memory on its own.
  • Circular references. Object A points at Object B and vice versa. You have to be very careful about keeping references to A in B and B in A.

This could sometimes be a thread that holds a reference to a data item used by the main thread, where the main thread holds a reference to the thread. There are a number of ways of dealing with this:

  1. Use a WeakRef on one side to automatically handle it.
  2. Be careful to destroy the reference to the main UI object when the thread finishes running.

Thanks for the pointers. I have made an extensive effort to prevent all of these issues throughout development of the app; however, it still seems to hang.

@Alex_Bombay

That mutex you see in the stack could just be the cooperative thread scheduler. That is, because only one thread is allowed to run at a time, a mutex is used to make sure that is the case. In which case…

Do you call Self.Sleep from within long running processes periodically to allow the main thread to run?

All of my data processing threads call Thread.Pause and then wait for more data. Each of these processing threads has a limited buffer and will disconnect and wait for a bit before reconnecting if too much data is coming in (buffer overflow).

The 2 calculation threads have a Thread.Sleep in them.

I have had another hang.

Thread 1 (Thread 0x78d1e73a6040 (LWP 157369) "redacted"):
#0  0x000078d1e9298d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x31e47e8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x31e47e8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x31e47e8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x000078d1e929b7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x31e4798, cond=0x31e47c0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x31e47c0, mutex=0x31e4798) at ./nptl/pthread_cond_wait.c:627
#5  0x000078d1e99ccf10 in ?? () from /redacted Libs/XojoGUIFramework64.so
#6  0x000078d1e99caf6c in ?? () from /redacted Libs/XojoGUIFramework64.so
#7  0x000078d1e99c911a in ?? () from /redacted Libs/XojoGUIFramework64.so
#8  0x000078d1e99be7bf in ?? () from /redacted Libs/XojoGUIFramework64.so
#9  0x00000000007d99c8 in DesktopApplication._CallFunctionWithExceptionHandling%%o<DesktopApplication>p ()
#10 0x000078d1e99be6cb in ?? () from /redacted Libs/XojoGUIFramework64.so
#11 0x000078d1e99be8b2 in ?? () from /redacted Libs/XojoGUIFramework64.so
#12 0x000078d1e99bd406 in RuntimeRun () from /redacted Libs/XojoGUIFramework64.so
#13 0x0000000000849063 in REALbasic._RuntimeRun ()
#14 0x0000000001b3e96c in _Main ()
#15 0x0000000001b3e1c3 in main ()

I put a watch on the condition and then tried to release it with a signal to see if it might continue somewhere else or at least crash; however, it just goes right back to waiting.

(gdb) print ___pthread_cond_signal(0x31e47c0)

All the other threads appear to be working correctly. They wake after the right amount of time, do their thing and go back to sleep. Looking at the condition and pthread_cond_wait source (very convoluted) it would appear there are 2 waiters?

cond.__data = {
	__wseq = {
		__value64 = 680560182,
		__value32 = {
			__low = 680560182,
			__high = 0
		}
	},
	__g1_start = {
		__value64 = 680560178,
		__value32 = {
			__low = 680560178,
			__high = 0
		}
	},
	__g_refs = {
		2,
		0
	},
	__g_size = {
		0,
		0
	},
	__g1_orig_size = 4,
	__wrefs = 10,
	__g_signals = {
		0,
		0
	}
}
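
On the "2 waiters?" question, here is a rough Python sketch for reading the counters above. It assumes the bit layout described in the comments of glibc's nptl/pthread_cond_common.c for the condvar implementation that has these fields: the waiter count sits in bits 3 and above of `__wrefs`, the low bit of each `__g_refs` entry is a wake-request flag, the low bit of `__wseq` and `__g1_start` is a group index, and the low two bits of `__g1_orig_size` are an internal lock. The function name `decode_condvar` and the dictionary keys are made up for illustration.

```python
def decode_condvar(wseq, g1_start, g_refs, g1_orig_size, wrefs):
    """Strip the flag bits from raw pthread_cond_t counters.

    Assumes the layout described in glibc's pthread_cond_common.c
    comments; check it against the glibc source for the version in use.
    """
    return {
        # Bits 3 and up of __wrefs count the waiters.
        "waiters": wrefs >> 3,
        # Bit 1 of __wrefs selects the clock, bit 0 marks process-shared.
        "clock": "CLOCK_MONOTONIC" if wrefs & 2 else "CLOCK_REALTIME",
        "process_shared": bool(wrefs & 1),
        # The low bit of each __g_refs entry is a flag, not part of the count.
        "group_refs": [r >> 1 for r in g_refs],
        # The low bit of __wseq and __g1_start is the index of the current G2.
        "wseq_position": wseq >> 1,
        "g1_start_position": g1_start >> 1,
        # The low two bits of __g1_orig_size are the condvar-internal lock.
        "g1_orig_size": g1_orig_size >> 2,
    }


# Values copied from the dump above.
info = decode_condvar(
    wseq=680560182,
    g1_start=680560178,
    g_refs=[2, 0],
    g1_orig_size=4,
    wrefs=10,
)
for key, value in info.items():
    print(key, "=", value)
```

Under this reading, `__wrefs = 10` and `__g_refs = {2, 0}` would each correspond to a single registered waiter rather than two, which would match the main thread being the only thread parked on this condition variable.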

I can see that “RuntimeBackgroundTask” is being called from other threads so it should be yielding time back to the main thread, however, nothing in the main thread is being called (no event handlers, gtk_main_iteration_do, etc.)

If I make an app that does some blocking operation on another thread (synchronous serial), it seems to still at least be in some sort of event loop (gtk_main_iteration_do):

#0  0x00007ffff5098d61 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7107dc) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x7107dc) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7107dc, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff509b7dd in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x710788, cond=0x7107b0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7107b0, mutex=0x710788) at ./nptl/pthread_cond_wait.c:627
#5  0x00007ffff57ccf10 in ?? () from /redacted Libs/XojoGUIFramework64.so
#6  0x00007ffff57caf6c in ?? () from /redacted Libs/XojoGUIFramework64.so
#7  0x00007ffff57c911a in ?? () from /redacted Libs/XojoGUIFramework64.so
#8  0x00007ffff565e3f4 in ?? () from /redacted Libs/XojoGUIFramework64.so
#9  0x00007ffff4545522 in g_timeout_dispatch (source=source@entry=0x741b50, callback=<optimized out>, user_data=<optimized out>) at ../../../glib/gmain.c:4989
#10 0x00007ffff454448e in g_main_dispatch (context=0x72a1e0) at ../../../glib/gmain.c:3344
#11 0x00007ffff45a3717 in g_main_context_dispatch_unlocked (context=0x72a1e0) at ../../../glib/gmain.c:4152
#12 g_main_context_iterate_unlocked.isra.0 (context=context@entry=0x72a1e0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../../../glib/gmain.c:4217
#13 0x00007ffff4543a53 in g_main_context_iteration (context=0x72a1e0, context@entry=0x0, may_block=may_block@entry=1) at ../../../glib/gmain.c:4282
#14 0x00007ffff49fec6d in gtk_main_iteration_do (blocking=1) at ../../../gtk/gtkmain.c:1457
#15 0x00007ffff57be78a in ?? () from /redacted Libs/XojoGUIFramework64.so
#16 0x0000000000618b58 in DesktopApplication._CallFunctionWithExceptionHandling%%o<DesktopApplication>p ()
#17 0x00007ffff57be6cb in ?? () from /redacted Libs/XojoGUIFramework64.so
#18 0x00007ffff57be8b2 in ?? () from /redacted Libs/XojoGUIFramework64.so
#19 0x00007ffff57bd406 in RuntimeRun () from /redacted Libs/XojoGUIFramework64.so
#20 0x0000000000646853 in REALbasic._RuntimeRun ()
#21 0x00000000006e8a4a in _Main ()
#22 0x00000000006e8213 in main ()

Not sure where to go with this, I feel like the function at “0x000078d1e99be7bf” would give some clues, however, I can’t seem to place it exactly in the disassembly but it seems to be stuck at

sub_3ca4cb(0x1, rsi, rdx, 0x0);

in

int sub_3be747() {
    if (*(int8_t *)byte_28c8048 == 0x0) {
            if (__cxa_guard_acquire(byte_28c8048) != 0x0) {
                    *qword_28c8040 = g_main_context_new();
                    __cxa_guard_release(byte_28c8048);
            }
    }
    if (*(int8_t *)byte_28c8058 == 0x0) {
            if (__cxa_guard_acquire(byte_28c8058) != 0x0) {
                    rsi = 0x1;
                    *qword_28c8050 = g_main_loop_new(*qword_28c8040, rsi);
                    __cxa_guard_release(byte_28c8058);
            }
    }
    sub_41ed5c();
    COND = sub_25e6fb() != 0x0;
    rax = *(int8_t *)byte_28be564 & 0xff;
    if (COND) {
            rax = 0x0;
    }
    gtk_main_iteration_do(rax & 0xff);
    if (*qword_28c8038 != 0x0) {
            rdi = *qword_28c8038;
            (*(*rdi + 0x1b8))(rdi);
    }
    sub_342441();
    RunFireSerial();
    sub_3e268e(0x0);
    sub_25e422(0x1);
    sub_3ca4cb(0x1, rsi, rdx, 0x0);
    rax = 0x28c0e40;
    if (*(int8_t *)rax != 0x0) {
            rax = 0x28c0e58;
            if (*rax == 0x0) {
                    rax = 0x28c0e58;
                    if ((*(int8_t *)byte_28c8030 & 0x1) == 0x0) {
                            rax = *0x28c0e28;
                            if (rax != 0x0) {
                                    rax = *0x28c0e28;
                                    if (*(int8_t *)(rax + 0x59) != 0x0) {
                                            rax = sub_25d5a1(0x0, 0x1);
                                    }
                            }
                    }
            }
    }
    return rax;
}

The thread scheduling code has been radically changed in the latest version of Xojo to accommodate preemptive threads. Are you running/compiling your app using the newest version?

Hey Eric,

This is still 2024r2.1; I did not want to throw more variables at the problem with preemptive threads. There were about 6 weeks before this hang (the longest it has gone); however, it has been as short as a few days, and on average it is around 3-4 weeks. Whether the issue is in my logic or in the framework, it would be nice to know how to avoid it, or that it will be fixed when updating to newer versions.

The infrequency makes it very hard to know if any changes will have fixed it. So far I have bypassed anything that might be causing it: made all shells async, avoided any UI-blocking operations, etc.

Without a more specific idea of what is causing the hang, there doesn’t seem to be much chance that you’ll know whether they’ve fixed it. It’s great that you’ve managed to get it to go so long without crashing, although this does limit your ability to troubleshoot it and notice any relevant patterns that may be helpful.

I think at this point that you may as well start using the new version, since you’ll be forced to at some point no matter what. If your issue is fixed, that’s a bonus. :grin: If not, it seems unlikely that it will get much worse.

I am quite surprised how well everything else is working under the hood, all the other threads continue like normal, if only I could get the event handlers to fire.

I have the ability to read and write files, I was thinking of some possible functions I could add to debug further. Maybe watch to see if a file exists and then run DoEvents, try to dump the runtime, or cause a crash to see where it spills out from.
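
The file-watch idea can be sketched as follows. This is Python rather than Xojo, purely to illustrate the pattern: one of the still-running worker threads polls for a trigger file and, when it appears, removes it and runs a diagnostic action once. The name `check_trigger`, the file name and the action are all hypothetical.

```python
import os
import tempfile


def check_trigger(path, action):
    """Run `action` once if the trigger file exists, then remove the file.

    Returns True when the action ran, False when there was no trigger.
    Intended to be called periodically from a thread that is still running.
    """
    if not os.path.exists(path):
        return False
    # Remove first so a slow or failing action is not re-triggered next poll.
    os.remove(path)
    action()
    return True


# Small demonstration using a temporary directory.
events = []
with tempfile.TemporaryDirectory() as tmp:
    trigger = os.path.join(tmp, "dump.trigger")  # hypothetical file name

    # No trigger file yet, so nothing runs.
    ran_without_file = check_trigger(trigger, lambda: events.append("dump"))

    # Create the trigger file, as an operator would from a shell.
    open(trigger, "w").close()
    ran_with_file = check_trigger(trigger, lambda: events.append("dump"))

print(ran_without_file, ran_with_file, events)
```

In the app itself, the action would be whatever diagnostic is chosen (writing a state dump, forcing a crash), called from one of the interval threads that is known to keep running during the hang.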