Web apps on AWS Lightsail: CPU utilisation blowouts

I am running three web apps on an AWS Lightsail instance, deployed using Lifeboat, and I am sharing this to see if anyone else has experienced anything similar.

Normally my Lightsail server runs at around 3–6% CPU utilisation and might hit 10% during peak business hours, which is harmless, so I do not think I need to upgrade my instance. Then suddenly it will jump to 60–80–100% and sustain that. I have alarms set, but by the time one triggers the entire server is unresponsive, so troubleshooting is impossible. I can only Stop and Start the instance (reboot doesn't work).

When I run top under normal circumstances, my busiest web app starts at around 5% memory utilisation; after about five days that creeps into the 50s and hovers up and down at that level, giving no indication of any impending trouble.

A pattern is beginning to emerge: the event has occurred at these exact gaps, in days: 8, 20, 13, 12, 10, 10, 11 (it happened a few times before that, but I have since upgraded the instance to a more powerful one). I have let it run its course for long enough to be sure it is not going to go away, and it now seems I can set my watch to it. It doesn't always occur during our website's busiest times (we are B2B, so we don't expect much site traffic outside of business hours).

I could try rebooting the instance daily (or weekly), but that's only a band-aid.
I will run a script to write the output of top to a file and see if I can catch its last entries before the instance carks it.
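Something roughly like this cron-driven snapshot should do it; the log path, the head count and the once-a-minute schedule are just assumptions to adjust:

    #!/bin/bash
    # top-snapshot.sh - append a timestamped batch-mode snapshot of the busiest
    # processes, so the last entries before a lock-up survive on disk.
    # Schedule it from cron, e.g.:  * * * * * /home/ubuntu/top-snapshot.sh
    LOG=/home/ubuntu/top-snapshots.log    # assumed path
    {
      date '+%Y-%m-%d %H:%M:%S'
      top -b -n 1 | head -n 20            # -b = batch mode, -n 1 = one pass
      echo '---'
    } >> "$LOG"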

If you can set your watch by it, then it sounds like it might be a cron job that’s causing the issue. I don’t know of any timed mechanisms within the Xojo framework that would provide a similar effect.
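A few places worth checking for scheduled jobs on a stock Ubuntu/Debian instance (paths assume a standard install):

    crontab -l                       # current user's crontab
    sudo crontab -l -u root          # root's crontab
    cat /etc/crontab                 # system-wide table
    ls /etc/cron.d /etc/cron.daily /etc/cron.weekly /etc/cron.monthly
    systemctl list-timers --all      # systemd timers can also fire on a schedule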

It's an exaggeration to say I can set my watch to it; it never occurs at the same time of day. It now seems to have settled into occurring every 10–11 days, which would make one think I should know when to watch for it… :rofl:

Ah, OK. Sorry for the noise, then. Maybe @Tim_Parnell can help out.

For what it’s worth, I usually have a weekly cron job on my servers that runs at a weekend off-peak time to perform maintenance and restart apps. I don’t, however, use Lightsail or Lifeboat (currently).
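As a sketch only, that entry looks something like the line below; "mywebapp" is a placeholder for however your app is registered as a service:

    # root crontab: restart the web app service early Sunday morning (off-peak)
    30 3 * * 0 systemctl restart mywebapp >> /var/log/weekly-maintenance.log 2>&1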

1 Like

Have you tried a swapfile? This seemed to help mitigate the mysterious Xojo Web Eats Up 100% of the CPU issue that I had been investigating earlier this year. My only other workaround was to set up EC2 instances and have CloudWatch do a reboot when the CPU spiked. That doubled the cost of hosting a Xojo Web app.

My web app deployment tool called Lifeboat can help you set up a swapfile for your instance as well as tons of other useful things. Might be worth checking out if you haven’t already :slight_smile:
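If you'd rather do it by hand, the usual recipe on a Linux instance looks roughly like this (the 1 GB size and the /swapfile path are just assumptions):

    sudo fallocate -l 1G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # persist across reboots
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab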

2 Likes

"using Lifeboat" (emphasis mine)

@Tim_Parnell I think he’s already a customer? :upside_down_face:

1 Like

so it remains mysterious

Yep, I have a swapfile; I am the guy who suggested you add that feature to Lifeboat. :stuck_out_tongue_winking_eye:

Thanks for the CloudWatch suggestion. I don't need to go there yet, as I can act quickly enough from the alerts. Long term I may need to, or run a second instance for failover; either way it doubles the cost.
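For future reference, I gather the EC2 reboot workaround is roughly a CloudWatch alarm like the one below (the instance ID, region and thresholds are placeholders, not something I have set up):

    aws cloudwatch put-metric-alarm \
      --alarm-name "cpu-spike-reboot" \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --statistic Average --period 300 --evaluation-periods 2 \
      --threshold 90 --comparison-operator GreaterThanThreshold \
      --alarm-actions arn:aws:automate:us-east-1:ec2:reboot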

Something interesting happened yesterday. At 3.00 pm, about three hours after I had to restart it, I had top running in a shell window just in case, and I noticed my web app suddenly jump to 95% CPU for a good 10 minutes or so, then drop back to normal. Memory use peaked around 25%. This is the first time that's ever happened (that I have seen).
This also coincided with an equally big spike in inbound network traffic. I had never noticed this correlation before, as the instance normally falls over and stops reporting metrics too soon.
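To catch that correlation next time, I'm thinking of logging load and inbound bytes side by side with something like this (the interface name and log path are assumptions; check yours with ip a):

    #!/bin/bash
    # cpu-net-log.sh - record load average and received bytes every 30 seconds
    IFACE=eth0                          # assumed interface name
    LOG=/home/ubuntu/cpu-net.log        # assumed log path
    while true; do
      RX=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
      LOAD=$(cut -d' ' -f1-3 /proc/loadavg)
      echo "$(date '+%F %T')  load=$LOAD  rx_bytes=$RX" >> "$LOG"
      sleep 30
    done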

So there is my clue: either too many clients at once or some nefarious rogue, and I am leaning towards the latter. I am getting my developer to add the Google Analytics code to the app so I can get some details about that.
[Screenshots: instance metric graphs from 22 Sep 2021, 9.16 am and 9.02 am]
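In the meantime, a couple of quick checks I plan to run when it spikes, assuming nginx is fronting the apps (the access log path is an assumption):

    # requests per client IP in the proxy access log
    sudo awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

    # established connections per peer while the spike is happening
    ss -tn | awk 'NR>1 {sub(/:[0-9]+$/, "", $5); print $5}' | sort | uniq -c | sort -rn | head -20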

I’m seeing CPU problems on macOS, and I wonder if they are related.
Scenario: a Web2 app built with Xojo 2021 R2.1, which has been running successfully for over a week, with both light and heavy usage.
At the moment, there are no sessions connected, and yet the app is idling at about 32% CPU usage. A sample shows this:

    2530 Thread_41941606   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2514 start  (in libdyld.dylib) + 1  [0x7fff5a4bb015]
    + ! 2514 main  (in Curve2) + 19  [0x10a49adf3]
    + !   2514 _Main  (in Curve2) + 536  [0x10a49b588]
    + !     2514 REALbasic._RuntimeRun  (in Curve2) + 19  [0x109f4ffb3]
    + !       2514 RuntimeRun  (in rbframework.dylib) + 53  [0x10a9cca68]
    + !         2514 CallFunctionWithExceptionHandling(void (*)())  (in rbframework.dylib) + 134  [0x10aa34f60]
    + !           2514 ConsoleApplication._CallFunctionWithExceptionHandling%%o<ConsoleApplication>p  (in Curve2) + 181  [0x109e65445]
    + !             2514 CallConsoleApplicationRunEventHelper()  (in rbframework.dylib) + 338  [0x10a9591dd]
    + !               2509 WebApplication.Event_Run%i8%o<WebApplication>A1s  (in Curve2) + 15180  [0x109f042cc]
    + !               : 2508 ConsoleApplication.DoEvents%%o<ConsoleApplication>i8  (in Curve2) + 11  [0x109e6525b]
    + !               : | 1532 ???  (in rbframework.dylib)  load address 0x10a7a1000 + 0x1cb295  [0x10a96c295]
    + !               : | + 1527 ???  (in rbframework.dylib)  load address 0x10a7a1000 + 0x1ccf10  [0x10a96df10]
    + !               : | + ! 1525 SleepToSystem(long)  (in rbframework.dylib) + 141  [0x10a96cebd]
    + !               : | + ! : 1524 xojo::ConditionVariable::WaitFor(xojo::UniqueLock&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000l> >)  (in rbframework.dylib) + 109  [0x10a963257]
    + !               : | + ! : | 1517 _pthread_cond_wait  (in libsystem_pthread.dylib) + 789  [0x7fff5a7d45c2]
    + !               : | + ! : | + 1507 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff5a60ba16]

I know there may be significant differences between the Linux and macOS versions, but could there be a similar failure mode?

Edit: memory usage was not high (85 MB), but when I restarted the app it dropped to 30 MB and CPU dropped back to normal (idling at under 1%).
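For anyone wanting to grab the same kind of trace, the built-in sample tool on macOS can produce it (Curve2 is the process name from above; the output path is arbitrary):

    # 10-second call-stack sample of the running app, written to a file
    sample Curve2 10 -file ~/Desktop/curve2-sample.txt

    # quick CPU/memory check on the same process
    top -pid $(pgrep -x Curve2) -l 2 -stats pid,cpu,mem,command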

1 Like

Yeah sorry, I don’t have a heck of a lot more information.

In my cases, I also believe it was HandleURL related. In one instance it was possibly a bad actor, but I didn't want to jump to that conclusion with an app I had literally just turned on (new instance, new domain).

When this happens, the instance locks up and I have to force a shutdown to regain control on EC2. I haven't seen many issues since adding the swapfile (thanks for that awesome suggestion). Still, it was a non-zero number.

Would love more information, or experiences with this issue if you have any!

1 Like

Well, the next "predictable" cycle of 10–11 days came up last week, and I was watching carefully, running top and saving the output to a text file, and guess what? It knew I was watching, so it didn't happen. None the wiser at this time.

I just rebooted it today, as it needed to complete a kernel update, and the "idle" CPU % dropped from a constant ~5% to ~1.5%.

Mike, I shared your info with my programmer (who is not me), and none of it rings a bell for him.

Maybe I will just reboot it daily or weekly. I was going to resort to doing that anyway, but it looks like I won't be able to solve this.
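If I do go the scheduled reboot route, it's a one-line root crontab entry; the 4 a.m. Sunday slot is just a placeholder for our quietest time:

    # root crontab: weekly reboot as a stopgap until the real cause is found
    0 4 * * 0 /sbin/shutdown -r now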