I’m having an issue when load balancing with HAProxy: it seems to drop connections to the app and you get the “app offline” message.
Looking at the HAProxy stats page, it starts showing the nodes as going DOWN. This isn’t actually the case - they are still running fine, and they appear as online again within a few seconds with no sessions.
I can’t see anything obvious in the Haproxy logs.
My config is
defaults
log global
mode http
option httplog
option dontlognull
retries 3
option redispatch
maxconn 2000
timeout connect 15s # tried 5s, same issue
timeout client 3m # tried 50s, same issue
timeout server 3m # tried 50s, same issue
How long before you get the offline message in the browser? Have you tried making the timeouts longer, like 15m just to make sure it has nothing to do with the timeout length?
Have you tested the app directly, without going through HAProxy? Is it a new app or one that was working before?
What happens if you comment out the httpchk so HAProxy doesn’t mark the app down?
I need to see more of your config, because you are only showing the defaults section.
Try this though - if you have “option httpchk” anywhere (it can be in defaults, listen or backend), comment it out and replace it with “option http-keep-alive”. Let me know what happens. If it doesn’t help, post here as much of your haproxy.cfg as you can.
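For example, the change would look something like this (a sketch with a hypothetical backend name and placeholder addresses - adapt it to your own config):

backend mybackend
    # option httpchk OPTIONS /   <- commented out per the suggestion above
    option http-keep-alive
    server node1 127.0.0.1:8080 check
    server node2 127.0.0.1:8081 check

With httpchk commented out, the “check” keyword on each server line still performs health checks, just at the TCP level rather than with an HTTP request.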
Also - a question. Are you using HAProxy 1.5?
Also - comment out your “timeout connect” and “timeout client” and change your “timeout server” to 90s.
I need to update that post because it is for haproxy 1.4 and an older version of xojo. I have made several tweaks since then for 1.5 and the current version of Xojo.
Also - what is your server OS? Do you know if you are running any type of Security Enhanced Linux?
When HAProxy has reported a node as down, I’ve tested the node directly by going to its IP/port, and it’s up.
The web app is started manually from the command line; it’s standalone, not CGI.
Without the check at the node level it appears to run fine. I have had two of the nodes crash, and the only info I can get is from the messages log, but I believe this is a separate issue and will open another topic on it.
Apr 16 11:10:44 scb kernel: SCB[21433]: segfault at 90 ip b76e3836 sp bfac6ee0 error 4
Apr 16 13:45:17 scb kernel: SCB[7987]: segfault at 654c7375 ip 0132b902 sp 00a59460 error 4 in XojoConsoleFramework32.so[ff8000+755000]
This has happened to me as well. The node is not going down, it is just not answering the httpchk. If you refresh the stats page you can watch it as it “goes up and down” repeatedly. I’m not sure what changed along the way to make this no longer consistent.
If you disable this option, HAProxy still does health checks - it just checks whether it can open a TCP connection or not. That’s not as nice (you can do a lot with httpchk), but it still works if your app crashes or otherwise becomes entirely unresponsive.
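In config terms, that fallback looks like this (a sketch with placeholder addresses and timings - not taken from the poster’s actual file):

backend mybackend
    # No "option httpchk" in this backend, so the "check" keyword
    # below performs a plain TCP connect check instead of an HTTP request.
    server node1 127.0.0.1:8080 check inter 2000 rise 2 fall 3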
Let’s keep experimenting with different tuning options and see if we can’t figure out what the hold-up is. If I find out anything else I will come back here and post it. I am working on a project right now that this was affecting.
OK - I played around with this and I have another workaround: if you require more failed httpchk health checks before considering the app instance down, you can continue to use httpchk. At least, it is working for me right now.
Here is what I mean, in your backend config:
backend mybackend
option httpchk OPTIONS /
option forwardfor
option http-keep-alive
cookie serverid insert indirect nocache
server node1 127.0.0.1:8080 check cookie node1 inter 2000 rise 2 fall 5
server node2 127.0.0.1:8081 check cookie node2 inter 2000 rise 2 fall 5
server node3 127.0.0.1:8082 check cookie node3 inter 2000 rise 2 fall 5
The parameters “inter 2000 rise 2 fall 5” set the check interval to every two seconds; the server has to fail 5 checks in a row to be considered down, and pass 2 in a row to be considered up again. The default for “fall” is 3 - just adding two more seems to eliminate the false positives. This means your server would have to be unresponsive for 10 seconds before HAProxy stops routing requests to it, which seems very reasonable.
I have noticed this only affects my apps when they are under load. If I am just clicking around the app they never go down, but when I am inserting 30,000 rows into a database the health checks begin to fail. I think it is just latency, so extending the number of allowed failures seems like a decent trade-off.
@Liz Sykes - let me know if you try this and what results you get.
timeout server 90s
timeout client 15s
timeout connect 15s
There is no benefit in a long connect or client value - the server value may need to be longer if your application runs very slowly, but 90s is pretty long already.
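Put together with the rest of the original defaults, the suggested section would look something like this (values taken from the discussion above - adjust “timeout server” to suit your app’s slowest requests):

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option redispatch
    retries 3
    maxconn 2000
    timeout connect 15s
    timeout client 15s
    timeout server 90s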
My app is doing nothing special; in fact there is no user input.
It takes data from an IPC socket every 5 seconds, writes it to some labels, and then pushes that out to the clients using the example from the Xojo Chat app.