I’m having an issue when load balancing with HAProxy: it seems to drop connections to the app and you get the “app offline” message.
Looking at the HAProxy stats page, it starts showing the nodes as going DOWN. This isn’t actually the case - they are still running fine, and they appear as online again within a few seconds with no sessions.
I can’t see anything obvious in the Haproxy logs.
My config is
defaults
log global
mode http
option httplog
option dontlognull
retries 3
option redispatch
maxconn 2000
timeout connect 15s # tried 5s, same issue
timeout client 3m # tried 50s, same issue
timeout server 3m # tried 50s, same issue
How long before you get the offline message in the browser? Have you tried making the timeouts longer, like 15m just to make sure it has nothing to do with the timeout length?
Have you tested the app directly, without going through HAProxy? Is it a new app or one that was working before?
What happens if you comment out the httpchk so HAProxy doesn’t mark the app down?
I need to see more of your config, because you are only showing the defaults section.
Try this though - if you have “option httpchk” anywhere (it can be in defaults, listen or backend), comment it out and replace it with “option http-keep-alive”. Let me know what happens. If it doesn’t help, post here as much of your haproxy.cfg as you can.
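For example, the change would look something like this (a sketch with a hypothetical backend name and placeholder addresses - adapt it to your own config):

backend mybackend
    # option httpchk OPTIONS /   <- commented out per the suggestion above
    option http-keep-alive
    server node1 127.0.0.1:8080 check
    server node2 127.0.0.1:8081 check

With httpchk commented out, the “check” keyword on each server line still performs health checks, just at the TCP level rather than with an HTTP request.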
Also - a question. Are you using HAProxy 1.5?
Also - comment out your “timeout connect” and “timeout client” and change your “timeout server” to 90s.
I need to update that post because it is for haproxy 1.4 and an older version of xojo. I have made several tweaks since then for 1.5 and the current version of Xojo.
Also - what is your server OS? Do you know if you are running any type of Security Enhanced Linux?
When HAProxy has reported a node as down, I’ve tested the node directly by going to its IP/port, and it’s up.
The web app is started manually from the command line; it’s standalone, not CGI.
Without the check at the node level it appears to run fine. I have had two of the nodes crash, and the only info I can get is from the messages log, but I believe this is a separate issue and will open another topic on it.
Apr 16 11:10:44 scb kernel: SCB[21433]: segfault at 90 ip b76e3836 sp bfac6ee0 error 4
Apr 16 13:45:17 scb kernel: SCB[7987]: segfault at 654c7375 ip 0132b902 sp 00a59460 error 4 in XojoConsoleFramework32.so[ff8000+755000]
This has happened to me as well. The node is not going down, it is just not answering the httpchk. If you refresh the stats page you can watch it as it “goes up and down” repeatedly. I’m not sure what changed along the way to make this no longer consistent.
If you disable this option, HAProxy still does health checks - it just checks whether it can open a TCP connection or not. That’s not as nice (you can do a lot with httpchk), but it still works if your app crashes or otherwise becomes entirely unresponsive.
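In config terms, that fallback looks like this (a sketch with placeholder addresses and timings - not taken from the poster’s actual file):

backend mybackend
    # No "option httpchk" in this backend, so the "check" keyword
    # below performs a plain TCP connect check instead of an HTTP request.
    server node1 127.0.0.1:8080 check inter 2000 rise 2 fall 3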
Let’s keep experimenting with different tuning options and see if we can’t figure out what the hold-up is. If I find out anything else I will come back here and post it. I am working on a project right now that this was affecting.
OK - I played around with this and I have another workaround: if you require more failed httpchk health checks before considering the app instance down, you can continue to use httpchk. At least, it is working for me right now.
Here is what I mean, in your backend config:
backend mybackend
option httpchk OPTIONS /
option forwardfor
option http-keep-alive
cookie serverid insert indirect nocache
server node1 127.0.0.1:8080 check cookie node1 inter 2000 rise 2 fall 5
server node2 127.0.0.1:8081 check cookie node2 inter 2000 rise 2 fall 5
server node3 127.0.0.1:8082 check cookie node3 inter 2000 rise 2 fall 5
The parameters “inter 2000 rise 2 fall 5” set the check interval to every two seconds; the server has to fail 5 checks in a row to be considered down, and pass 2 in a row to be considered up again. The default for “fall” is 3 - just adding two more seems to eliminate the false positives. This means your server would have to be unresponsive for 10 seconds before HAProxy stops routing requests to it, which seems very reasonable.
I have noticed this only affects my apps when they are under load. If I am just clicking around the app they never go down, but when I am inserting 30,000 rows into a database the health checks begin to fail. I think it is just latency, so extending the number of allowed failures seems like a decent trade-off.
@Liz Sykes - let me know if you try this and what results you get.
timeout server 90s
timeout client 15s
timeout connect 15s
There is no benefit in a long connect or client value - the server value may need to be longer if your application runs very slowly, but 90s is pretty long already.
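Put together with the rest of the original defaults, the suggested section would look something like this (values taken from the discussion above - adjust “timeout server” to suit your app’s slowest requests):

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option redispatch
    retries 3
    maxconn 2000
    timeout connect 15s
    timeout client 15s
    timeout server 90s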
My app is doing nothing special; in fact there is no user input.
It takes data from an IPC socket every 5 seconds, writes it to some labels, and then pushes that out to the clients using the example from the Xojo Chat app.