Getting segfaults on Linux with a web app

Hello all,

We’ve been getting regular segfaults:
segfault at f4772ffc ip 00000000f63114eb sp 00000000f4773000 error 6 in RBConsoleFramework.so[f60e2000+14a1000]
segfault at f462effc ip 00000000f77481ac sp 00000000f462efe4 error 6 in libpthread-2.13.so[f7740000+15000]
So far only RBConsoleFramework.so and libpthread-2.13.so have been mentioned.
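
As a rough aid for reading kernel lines like the two above, here is a minimal Python sketch that splits one into its fields. The function and field names are made up for illustration; the bit meanings are the standard x86 page-fault error code bits (bit 0: page present, bit 1: write, bit 2: user mode). It only restates what the log line already says and is not a diagnosis.

```python
import re

# Matches the kernel "segfault at ..." lines quoted above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<module>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the fields of one segfault log line in a readable form."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    addr, ip, sp, base, err = (
        int(m.group(k), 16) for k in ("addr", "ip", "sp", "base", "err")
    )
    return {
        "module": m.group("module"),
        # Offset of the faulting instruction inside the mapped module.
        "ip_offset": hex(ip - base),
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
        # Distance of the faulting address from the stack pointer, in bytes.
        "addr_minus_sp": addr - sp,
    }
```

For the first line above this reports a user-mode write to a page that is not mapped, 4 bytes below sp, at offset 0x22f4eb into RBConsoleFramework.so. A write just under the stack pointer at a page boundary is often what a thread running out of stack looks like, but that is only a guess from a single log line.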

General info:
Debian 7.0 (nothing special installed, a pretty “raw” distro)
No apparent or relevant memory leaks (memory usage is stable and consistent with the number of users)
System memory is plentiful; usage doesn’t even reach 50%.

WebApp Info:
Works as an API for our iOS mobile app through “HandleSpecialURL”, and also serves our site. The mobile app sends users’ locations, Facebook data, requests, etc. to the server quite often.
Has a few threads for delayed saves when the info is not so important, e.g. users’ Facebook data is added to a dictionary “cache” and then a thread iterates over it, saving the updates.

32Bit Libs (ldd seems fine(?))
(ldd on webapp file)
linux-gate.so.1 => (0xf77a0000)
libdl.so.2 => /lib/i386-linux-gnu/i686/cmov/libdl.so.2 (0xf7789000)
libpthread.so.0 => /lib/i386-linux-gnu/i686/cmov/libpthread.so.0 (0xf7770000)
libc.so.6 => /lib/i386-linux-gnu/i686/cmov/libc.so.6 (0xf760b000)
/lib/ld-linux.so.2 (0xf77a1000)

(ldd on RBConsoleFramework.so)
linux-gate.so.1 => (0xf7767000)
libglib-2.0.so.0 => /lib/i386-linux-gnu/libglib-2.0.so.0 (0xf61a1000)
libpthread.so.0 => /lib/i386-linux-gnu/i686/cmov/libpthread.so.0 (0xf6188000)
libdl.so.2 => /lib/i386-linux-gnu/i686/cmov/libdl.so.2 (0xf6183000)
libm.so.6 => /lib/i386-linux-gnu/i686/cmov/libm.so.6 (0xf615d000)
libc++.so.1 => /root/webapp/1.12_dev/popbumwebserver Libs/./libc++.so.1 (0xf5fd9000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xf5fbc000)
libc.so.6 => /lib/i386-linux-gnu/i686/cmov/libc.so.6 (0xf5e58000)
libpcre.so.3 => /lib/i386-linux-gnu/libpcre.so.3 (0xf5e19000)
librt.so.1 => /lib/i386-linux-gnu/i686/cmov/librt.so.1 (0xf5e10000)
/lib/ld-linux.so.2 (0xf7768000)

I can’t find an easy way to debug this, and I don’t know what would normally cause segfaults within a Xojo application.

My question is: what is most likely causing these segfaults?

1- Is it possible that we forgot to implement a critical section somewhere? Could this cause segfaults?
2- Could sleeping/waking threads too often cause problems?
3- Is having too many threads (fewer than 10) a problem?

Anyway, I’m trying to log everything, but as this only happens in production, the logs get quite big quite fast, so knowing at least what to look for could save a lot of time…

Thanks to everyone and anyone who managed to read this compact post.

[quote=192604:@Yosef Coelho]1- Is it possible that we forgot to implement a critical section somewhere? Could this cause segfaults?
2- Could sleeping/waking threads too often cause problems?
3- Is having too many threads (fewer than 10) a problem?[/quote]

No. Just about any time your app crashes, declares and plugins aside, it’s a bug.

Could you get a core dump of when it happens and send that to us somehow?

No declares and no third-party plugins (only the plugins included with Xojo: MySQL etc.).

I will look into sending the core dump; I’m not sure we are allowed to do that.

Hi,

I just tried debugging it with gdb, and it seems the Xojo binaries have no symbol tables. How can I debug it with gdb?

We can’t send the dump, unfortunately. Do you have some internal tool or way to debug this?

We figured out more or less where it’s hanging and it’s a method called very often but still we can’t figure out the pattern that generates the crash…

Yosef,

This isn’t a direct answer to your question, but may be something to try.
I ran into random, unexplained segfaults in a web app I was using in much the same way. What fixed it for me was breaking out the method that was causing the problems into a helper console app and calling it via the shell from the original web app. It added a small amount of response time (<1 sec), but it executes flawlessly 100% of the time.

Just a suggestion of something to try - hope it helps.
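
To make the idea concrete, here is a minimal sketch in Python rather than Xojo. `run_helper` and the stand-in `HELPER` command are made up for illustration; the real helper would be your own console app. The point is only that if the helper crashes, the parent sees a failed call instead of going down with it.

```python
import subprocess
import sys


def run_helper(argv, payload, timeout=5.0):
    """Run a helper process, send it payload on stdin, return its stdout."""
    result = subprocess.run(
        argv, input=payload, capture_output=True, text=True, timeout=timeout
    )
    # A crash in the helper (including a segfault) shows up as a non-zero
    # return code here; the calling process keeps running.
    if result.returncode != 0:
        raise RuntimeError(
            f"helper exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout.strip()


# Stand-in helper for the sketch: upper-cases whatever it reads on stdin.
HELPER = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]

if __name__ == "__main__":
    print(run_helper(HELPER, "lookup id"))
```

The cost is one process launch per call, which matches the extra response time mentioned above.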

Thanks John, we will look into that.

It all appears to happen in a GetID method, which uses a MySQL prepared statement to retrieve “one” integer id from the DB. It apparently always happens exactly on the “ps.SQLSelect” call.
This method generally works, even retrieving ids from the same table, right up to the hard crash, so I’m out of options on the Xojo side without input from someone inside Xojo…

Is there a reason why one would get a segfault from calling ps.SQLSelect?

The binding is quite simple (two doubles), and it works several times before the crash.

You aren’t sharing one connection across the entire web app, are you?

[quote=192634:@Yosef Coelho]We can’t send the dump, unfortunately. Do you have some internal tool or way to debug this?

We figured out more or less where it’s hanging and it’s a method called very often but still we can’t figure out the pattern that generates the crash…[/quote]

Can you get a backtrace for all threads? The gdb command to do this is “thread apply all bt”.
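
If it is easier to collect this from a script than from an interactive session, something like the sketch below should work. `--batch` and `-ex` are standard gdb options; the function names and the executable and core file paths are placeholders, not anything from this thread.

```python
import subprocess


def gdb_backtrace_argv(executable, core_file):
    """Build a gdb command line that prints all thread backtraces and exits."""
    return ["gdb", "--batch", "-ex", "thread apply all bt", executable, core_file]


def dump_backtraces(executable, core_file):
    """Run gdb non-interactively and return whatever it printed."""
    result = subprocess.run(
        gdb_backtrace_argv(executable, core_file),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths; substitute the real binary and core dump.
    print(dump_backtraces("./webapp", "./core"))
```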

Norman: No, I have a connection pool with critical sections for getting and freeing the connections.
E.g.: something needs to be saved -> (enters critical section) -> GetFreeDB -> (leaves CS) -> saves -> (enters CS) -> FreesDB -> (leaves CS)
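
For anyone following along, the pattern described above looks roughly like this when sketched in Python (not Xojo, whose threads are cooperative and behave differently). The class and method names are made up; the lock plays the role of the critical section and is held only while a connection is taken from or returned to the pool, not during the save itself.

```python
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Hands out each connection to at most one caller at a time."""

    def __init__(self, connections):
        self._free = list(connections)
        self._lock = threading.Lock()  # stands in for the critical section

    def acquire(self):
        with self._lock:  # enter CS -> GetFreeDB -> leave CS
            if not self._free:
                raise RuntimeError("no free connection")
            return self._free.pop()

    def release(self, conn):
        with self._lock:  # enter CS -> FreesDB -> leave CS
            self._free.append(conn)

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn  # the save happens here, outside the lock
        finally:
            # Returned exactly once, even if the save raises.
            self.release(conn)
```

One thing worth checking in any pool like this is that every path returns the connection exactly once; a connection freed twice can end up handed to two threads at the same time.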

The prepared statement itself is not a static variable or a property shared among different threads.

Joe, I will take a look at getting that done.

Is it related to the MySQL plugin? Is it a good idea to take a look at the source and maybe recompile it? What parameters, compiler version, etc. should I use to get the same plugin Xojo offers?

Thanks for your support

It is almost certainly related to the database plugin and its interaction with the threading system. I’m not sure where the source for that is or how to build it for Linux. I would assume there’s a makefile right there that can be used.

Could it be some problem with the data I’m applying to the PS? In this case it’s only two doubles but who knows…

I will try “filtering” the data before applying it to the PS.

Will also take a look at the plugin source, hopefully I can find something…

IIRC last time I peeked at a similar issue it ended up being due to some thread safety problems (either in the MySQL plugin or the main framework).

Ok,

I spent 10 minutes in the plugin source (C++, hateful language but… :-P) and it seems pretty straightforward and rather simple. Still, I’m not sure how to fix it… (possibly because of the limited time I spent on it)

I need a quick fix for this… Maybe if I surround the method in question with a global critical section it will work? I can deal with the performance issues (if any) by just spawning more instances etc…

Anyway, if anyone can help with a tweaked MySQL plugin, that would be great… I’ve read about MBS etc., but I would prefer to just swap one file and get it working…

Thanks all for the support and time.

P.S. Our deployment is just one executable file that spawns copies of itself on different ports depending on demand. Could the fact that one executable launches itself again be related, in terms of 32-bit, memory, threads, sub-threads, etc.? I’m not a computer scientist (probably easy to see), so I’m not sure how the OS handles such “self-launches”, especially for 32-bit, or how Linux handles 32-bit processes within a 64-bit environment… Anyway.

Is it worth spinning up a true 32-bit machine, or is this most likely unrelated? I have automated the spinning up of new machines, but only for 64-bit, so it would be a rather manual task…

Well, I just did a ps -e -o pid,args --forest and the spawned processes are listed under the first one… Again: I don’t understand much about it, but I feel it’s related… I will try spawning the processes in a different way (manually :-D) and see what happens…

Could it be that this limits the total memory available? I mean, I thought that each process could use up to the 32-bit limit, but maybe they have a collective upper limit?

Does anyone understand this well?

Joe Ranieri, Following what you requested:

It seems to my untrained eyes that it’s a deadlock, am I right?

Thread 7 (Thread 0xf4360b70 (LWP 15846)):
#0  0xf7740b30 in __kernel_vsyscall ()
#1  0xf771e36b in read () from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#2  0xf494e9ad in ?? ()
   from /PATH/MySQLCommunityPlugin.so
#3  0xf494ec71 in my_net_read ()
   from /PATH/MySQLCommunityPlugin.so
#4  0xf4942d44 in cli_safe_read ()
   from /PATH/MySQLCommunityPlugin.so
#5  0xf49452d4 in ?? ()
   from /PATH/MySQLCommunityPlugin.so
#6  0xf4949c4b in cli_stmt_execute ()
   from /PATH/MySQLCommunityPlugin.so
#7  0xf494b0ed in mysql_stmt_execute ()
   from /PATH/MySQLCommunityPlugin.so
#8  0xf493b982 in MySQLStatement::SQLSelectHelper(REALarrayStruct*, bool) ()
   from /PATH/MySQLCommunityPlugin.so
#9  0xf493bb54 in MySQLStatement::SQLSelect(REALarrayStruct*) ()
   from /PATH/MySQLCommunityPlugin.so
#10 0xf493a99c in ?? ()
   from /PATH/MySQLCommunityPlugin.so
#11 0xf57843dc in ?? ()
Thread 6 (Thread 0xf4706b70 (LWP 12908)):
#0  0xf7740b30 in __kernel_vsyscall ()
#1  0xf771df32 in __lll_lock_wait ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#2  0xf77193cb in _L_lock_728 ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#3  0xf77191f1 in pthread_mutex_lock ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#4  0xf6322818 in ?? ()
   from /PATH/RBConsoleFramework.so
#5  0xf54813e0 in ?? ()
#6  0xf5caf3dd in ?? ()
#7  0xf6323d04 in ?? ()
   from /PATH/RBConsoleFramework.so
#8  0xf7716c39 in start_thread ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#9  0xf7682c6e in clone () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
Thread 5 (Thread 0xf4787b70 (LWP 12905)):
#0  0xf7740b30 in __kernel_vsyscall ()
#1  0xf771df32 in __lll_lock_wait ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#2  0xf77193cb in _L_lock_728 ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#3  0xf77191f1 in pthread_mutex_lock ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#4  0xf6322818 in ?? ()
   from /PATH/RBConsoleFramework.so
#5  0xf54813e0 in ?? ()
#6  0xf5d1bccb in ?? ()
#7  0xf6323d04 in ?? ()
   from /PATH/RBConsoleFramework.so
#8  0xf7716c39 in start_thread ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#9  0xf7682c6e in clone () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
Thread 4 (Thread 0xf75ab6c0 (LWP 12897)):
#0  0xf7740b30 in __kernel_vsyscall ()
#1  0xf771df32 in __lll_lock_wait ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#2  0xf77193cb in _L_lock_728 ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#3  0xf77191f1 in pthread_mutex_lock ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#4  0xf6322818 in ?? ()
   from /PATH/RBConsoleFramework.so
#5  0xf6267d10 in RuntimeDoEvents ()
   from /PATH/RBConsoleFramework.so
#6  0xf547e33b in ?? ()
#7  0xf5571329 in ?? ()
#8  0xf6269108 in CallConsoleApplicationRunEvent() ()
   from /PATH/RBConsoleFramework.so
#9  0xf544843e in ?? ()
#10 0xf547e56c in ?? ()
#11 0xf63198b7 in CallFunctionWithExceptionHandling(void (*)()) ()
   from /PATH/RBConsoleFramework.so
#12 0xf6318c4e in RuntimeRun ()
   from /PATH/RBConsoleFramework.so
Thread 3 (Thread 0xf4808b70 (LWP 12904)):
#0  0xf7740b30 in __kernel_vsyscall ()
#1  0xf771df32 in __lll_lock_wait ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#2  0xf77193cb in _L_lock_728 ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#3  0xf77191f1 in pthread_mutex_lock ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#4  0xf6322818 in ?? ()
   from /PATH/RBConsoleFramework.so
#5  0xf54813e0 in ?? ()
#6  0xf5d54f2c in ?? ()
#7  0xf6323d04 in ?? ()
   from /PATH/RBConsoleFramework.so
#8  0xf7716c39 in start_thread ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#9  0xf7682c6e in clone () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
Thread 2 (Thread 0xf4889b70 (LWP 12901)):
#0  0xf7740b30 in __kernel_vsyscall ()
#1  0xf771df32 in __lll_lock_wait ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#2  0xf77193cb in _L_lock_728 ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#3  0xf77191f1 in pthread_mutex_lock ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#4  0xf6322818 in ?? ()
   from /PATH/RBConsoleFramework.so
#5  0xf54813e0 in ?? ()
#6  0xf5d3649b in ?? ()
#7  0xf6323d04 in ?? ()
   from /PATH/RBConsoleFramework.so
#8  0xf7716c39 in start_thread ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#9  0xf7682c6e in clone () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
Thread 1 (Thread 0xf4685b70 (LWP 12909)):
#0  0xf77191ac in pthread_mutex_lock ()
   from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
#1  0xf63244f0 in ?? ()
   from /PATH/RBConsoleFramework.so
#2  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#3  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#4  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#5  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#6  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#7  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#8  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#9  0xf63249cc in ?? ()
   from /PATH/RBConsoleFramework.so
#10 0xf63249cc in ?? ()

What thread did it crash in?

Most threads are waiting inside of pthread_mutex_lock because that’s how Xojo’s cooperative threads work.

Program terminated with signal 11, Segmentation fault.
#0  0xf77191ac in pthread_mutex_lock () from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0

So I suppose Thread #1?

Thread 1 (Thread 0xf4685b70 (LWP 12909)):
#0  0xf77191ac in pthread_mutex_lock () from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
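
One thing that stands out in Thread 1's backtrace is that frames #2 through #10 all show the same address, 0xf63249cc. A small Python sketch like the one below (names are made up) can pull such repeats out of a long dump. A repeated return address can mean runaway recursion, but without symbols it can just as well mean gdb failed to unwind the stack properly, so treat it as a hint at most.

```python
import re
from collections import Counter

# Matches gdb frame lines such as "#3  0xf63249cc in ?? ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-f]+) in (\S+)")


def repeated_frames(backtrace, min_repeats=3):
    """Return {address: count} for addresses seen at least min_repeats times."""
    addresses = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            addresses.append(match.group(2))
    counts = Counter(addresses)
    return {addr: n for addr, n in counts.items() if n >= min_repeats}
```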

Question: Could opening “too many” prepared statements too quickly be a problem? I mean, there is a bug, but can it be avoided by keeping a cache of prepared statements and reusing them?
I ask because it happens during the most frequent action (maybe just statistics playing its role), saving a user’s location, and it’s always while getting the ID of a location (to see if it exists), never in any other part of the saving process.
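
To illustrate what I mean by a cache, here is a rough sketch in Python rather than Xojo. `StatementCache` and the `prepare` callable are hypothetical stand-ins for whatever the database API offers. Since a MySQL prepared statement belongs to the connection that created it, the cache would have to be kept per connection, never shared across the pool.

```python
class StatementCache:
    """Keeps one prepared statement per SQL string for a single connection."""

    def __init__(self, prepare):
        # prepare is a hypothetical callable: SQL text -> prepared statement.
        self._prepare = prepare
        self._statements = {}

    def get(self, sql):
        """Return the cached statement for sql, preparing it on first use."""
        statement = self._statements.get(sql)
        if statement is None:
            statement = self._prepare(sql)
            self._statements[sql] = statement
        return statement
```

With this, the GetID query would be prepared once per connection and only re-bound and executed on later calls. Whether that sidesteps the crash is untested.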

I’ve checked the DB and the prepared statements are being closed, so it’s not reaching the prepared statement limit.

I’m quite lost; I’m not sure there is any way forward other than implementing a whole TCP/IP class for handling MySQL.