Multiple SSLSocket I/O in Threads: the slowest connection universally sets the pace for all connections

Hello there!

I’d really appreciate your help with this; I can’t figure it out, and I suspect it’s more of a “feature” than a bug.
I guess the people most likely to explain this would be the Xojo engineers who know how the Xojo Framework works under the hood.

The situation is as follows:

There’s this webapi-oriented server that I’m making, which you can find here
It is a ground-up implementation of the HTTP protocol, using a ServerSocket and SSLSockets.
One of its main features is that every new socket fires up a thread that has access to that socket, so it can do its I/O stuff.
In this particular application, it’s doing filesystem access and manipulation via a web api. Uploads, downloads, renames, deletes, pretty straightforward…
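
(For orientation, the per-connection wiring would look roughly like this. This is a sketch using the class names that appear later in this thread, not the exact repository code:)

// In the ServerSocket subclass: hand the framework a fresh
// SSLSocket subclass instance for each incoming connection
Function AddSocket() As TCPSocket
  Return New ipsc_Connection
End Function

// In the SSLSocket subclass: once connected, fire up the worker
// thread that will own this socket's I/O
Sub Connected()
  WorkerThread = New ipsc_ConnectionThread(Me)
  WorkerThread.Start
End Sub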

Unsurprisingly, it implements file download functionality: The GET method of the /files/ endpoint opens a BinaryStream to a file and passes chunks of data from the file stream to the socket, until EOF.
I’ve tried different variations of the code, and I’ve concluded that this one appears to be optimal in terms of performance/concurrency:

(This is just the loop where TX happens. The method is in endpoint_files.GET if you care to dive into it)

try
  
  stream = BinaryStream.Open(file)
  Readable = true // it is readable
  
  WorkerThread.SocketRef.PrepareResponseHeaders_SendBinaryFile(FileSize , file.Name)
  WorkerThread.SocketRef.RespondOK(true)
  
  while not stream.EndOfFile
    
    chunk = stream.Read(ipsc_Lib.SocketChunkSize * n)  // adjust n to taste, currently 4
    WorkerThread.YieldToNext
    
    if not WorkerThread.SocketRef.IsConnected then exit while  // freezes on connection drops without it, in this exact place
    WorkerThread.SocketRef.Write(chunk)
    WorkerThread.BytesSent = WorkerThread.BytesSent + chunk.Bytes
    
    WorkerThread.SocketRef.Flush  // without this, it is all one big data packet
    
    WorkerThread.YieldToNext
    
  wend
  
Catch e as IOException
  
  stream.Close 
  
  if not Readable or WorkerThread.BytesSent = 0 then // nothing has been sent, we can respond in error
    WorkerThread.SocketRef.RespondInError(423 , "Unreadable file , IO error " + e.ErrorNumber.ToString)  // locked
    Return
  end if
  // we got an IO error while we had already started sending an OK response;
  // we just kill the connection and hope the client can detect it's incomplete
  WorkerThread.SocketRef.Disconnect
  WorkerThread.SocketRef.Close
  Return
end try

When the server and clients are both on localhost (or on different gigabit-ethernet-connected devices), it works blazingly fast, even with multiple concurrent downloads.
The problem manifests clearly when one of the concurrent downloads takes place via a connection that’s much slower than the rest.
What happens then is that the slowest transfer sets the pace for all transfers active at the time.
A way to demonstrate that is by printing a timestamped message every time a socket’s SendComplete event fires.
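
(The logging itself is something like this in the connection class’s SendComplete event handler; the exact timestamp source is an assumption on my part:)

Sub SendComplete(UserAborted as Boolean)
  System.DebugLog("ipsc_Connection.SendComplete : " _
    + Format(System.Microseconds / 1000, "#") _
    + " : connection " + Me.Handle.ToString _
    + " - UserAborted = " + UserAborted.ToString)
End Sub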

In the first example, we have two open sockets (840,792) sending an 845MB file to clients that also run on the same machine as the server.
The big number is a milliseconds timestamp. Notice that the interval between two SendCompletes is around 20ms for both sockets.

ipsc_Connection.SendComplete : 1768851272 : connection 840 - UserAborted = False
ipsc_Connection.SendComplete : 1768851300 : connection 792 - UserAborted = False
ipsc_Connection.SendComplete : 1768851301 : connection 840 - UserAborted = False
ipsc_Connection.SendComplete : 1768851322 : connection 792 - UserAborted = False
ipsc_Connection.SendComplete : 1768851322 : connection 840 - UserAborted = False
ipsc_Connection.SendComplete : 1768851342 : connection 792 - UserAborted = False

In the second example, we also have two open sockets sending the same file.
The first one (852) is with a client that runs on the same machine. That should have been a fast one.
The second (720) is with a client that runs on a laptop connected to the home LAN over wifi, which is a much slower link.
What happens is that while the slow transfer is active, they both run at pretty much the same pace.
When the slow one finishes or gets aborted, the fast one skyrockets, as it normally would.

ipsc_Connection.SendComplete : 1772574921 : connection 852 - UserAborted = False
ipsc_Connection.SendComplete : 1772576537 : connection 720 - UserAborted = False
ipsc_Connection.SendComplete : 1772576537 : connection 852 - UserAborted = False
ipsc_Connection.SendComplete : 1772577772 : connection 720 - UserAborted = False
ipsc_Connection.SendComplete : 1772577777 : connection 852 - UserAborted = False
ipsc_Connection.SendComplete : 1772578738 : connection 720 - UserAborted = False
ipsc_Connection.SendComplete : 1772578739 : connection 852 - UserAborted = False
ipsc_Connection.SendComplete : 1772579723 : connection 720 - UserAborted = False

The exact same thing happens when more than two connections are involved.
Now, I was under the impression that every Socket works independently of the rest, but this is obviously not the case.

I guess there can be 3+1 causes:

  1. My code (obviously)
  2. The threading system
  3. The sockets system
  4. A good combination of some/all of the above

Could someone be kind enough to shed some light on the situation?
Windows 10, Xojo 2021R2 btw.

Thanks!

George

There’s no “real” thread system in Xojo. Xojo uses just one OS thread for everything, simulating threads by time-slicing, so it doesn’t take advantage of CPU cores. Xojo code never runs in parallel, and Xojo says it does this so that less experienced devs don’t run into clashing parallel conditions. Xojo is not adequate for this task. More threads mean progressively worse sluggishness, as you noticed.

I wonder if workers would help here?

Yes, I’m aware of that. If the threading model plays a role in this case, I guess it’s going to be for more complicated reasons. When multiple threads (tried up to 5) are downloading a large file over a fast link, performance does noticeably degrade, but it’s still not that bad.
The problem is when one of the connections is over a slow link: that seems to affect all active transfers at that time.
The other thing I’ve noticed is that performance is slightly better when the socket threads have the lowest priority. In my example they are all assigned a priority of 1 in their constructor.
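
(For reference, the constructor does little more than this; the exact signature is from memory:)

Sub Constructor(conn as ipsc_Connection)
  SocketRef = conn
  Me.Priority = 1  // Thread.LowestPriority
End Sub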

Something’s not right here. That code you posted above… is that being called from inside the thread?

Workers wouldn’t help because:

  1. Workers are Desktop-only. The server is a service application, as all servers usually are.
  2. Even if workers were available for console applications (which would be great), I don’t think the main application’s ServerSocket would be able to pass a new connection to an SSLSocket running on the helper application. Haven’t tried it, but I’d instinctively say it can’t.

But generally, as an architecture, it’s not bad at all. PostgreSQL does this: fires up a new separate process for every session and lets the OS handle allocation of CPU resources.

What about forking?

The main application wouldn’t be handling the connections at all, you’d fire up the worker, give it the URL and the worker (or fork) would then return the result.

The thread calls the following routing method, passing it a reference to itself:

Public Sub RouteRequest(WorkerThread as ipsc_ConnectionThread)
  select case WorkerThread.SocketRef.RequestPath.NthField("/" , 2).Lowercase
    
  case "files"
    
    dim files as new endpoint_files(WorkerThread , RootFolder)
    
  case "folders"
    
    dim folders as new endpoint_folders(WorkerThread , RootFolder)
    
  case "opensockets" // just for debugging, method is irrelevant
    
    dim sockets() as TCPSocket = Server.ActiveConnections
    dim SocketHandles as String = "Active socket handles at " + DateTime.Now.SQLDateTime + EndOfLine + EndOfLine
    for i as Integer = 0 to Sockets.Ubound
      SocketHandles = SocketHandles + Sockets(i).Handle.ToString + EndOfLine
    next i
    WorkerThread.SocketRef.RespondOK(SocketHandles)
    
  else
    WorkerThread.SocketRef.RespondInError(501)  // not implemented
  end select
  
End Sub

The actual processing happens in the endpoint_* classes, which are also passed a reference to the thread. And the thread contains a reference to the socket.

So it’s methods calling other methods, with the starting point being the thread’s Run event.
That’s not a source of problems, right?

Ok, I’ve read and re-read your code, and I think you’re working against yourself. My guess is that your code is actually what’s slowing you down.

If this were me…

  1. Subclass thread
  2. Have a property on the thread for the file being downloaded
  3. Have the code above in the Run event (or a method called from run) which does this:
while not stream.EndOfFile
  SocketRef.Write(stream.Read(ipsc_Lib.SocketChunkSize * n))
  Self.Sleep(10)
wend

Yes, ultimately the data may be sent in a single chunk, but that’s up to the negotiation between the socket and the client. Calling Flush actually causes speed loss because all you’re doing is causing the whole process to stop while that socket sends its data.

IMHO, the only reason for using flush is if the client really needs to stream the data… like audio or video.

On putting the thread to sleep: this is among the first things I tried. While doing so, I was getting the exact same data transfer rate I would get if I did the whole thing in the Web Framework’s HandleURL event: ~4MB/sec. I wanted more, and I got more with the way I did it: >100MB/sec.

On chopping the data by calling Flush often: it makes the server more responsive in handling concurrent requests. If I don’t chop, especially in the case of large files (or large datasets anyway), a new request isn’t handled until the current one finishes.

I can try your recommendations again and get back with the results, but the code you see is the result of a long process of trial and error.

Sam, I’m not sure I understand what you mean. It’s a server; it’s listening for incoming connections. It doesn’t have any URL to follow. That’s what the client does.
Am I missing something?

It was just an idea, but if it would / could work, I’d imagine that the main app would launch an instance of the worker, which listens for a connection.

Upon the worker being connected to, it then notifies the main application, which launches another instance of the worker, which then opens up listening for a connection.

It’s important to remember that sockets in general have a buffer for incoming and outgoing data. That is, for incoming data, socket data is queued into a buffer before the DataAvailable event fires and is held there until you actually read from the socket. This is why you can use lookahead and then read it later.
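
(For example, a request parser can peek at the buffer with Lookahead until a complete header has arrived, and only then consume it with a real read. A minimal sketch:)

Sub DataAvailable()
  // peek at the buffered data without removing it from the buffer
  dim peek as String = Me.Lookahead(Encodings.UTF8)
  
  // once the blank line that ends the HTTP headers is in the buffer,
  // actually consume the data
  if peek.IndexOf(EndOfLine.Windows + EndOfLine.Windows) >= 0 then
    dim request as String = Me.ReadAll(Encodings.UTF8)
    // ... parse the request ...
  end if
End Sub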

As for the outgoing buffer: any time you write data, it is also put into a buffer and fed to the client at a rate the client can handle, which also doesn’t bog down other sockets and threads. Calling Flush may make a single socket perform better, but it also pauses the rest of the app while doing so, and performance for the rest of the users will suffer.

Remember, sockets all run their code on the main thread, regardless of where you write to them, so they’re all sharing the main thread’s time. That main thread is also responsible for keeping everything else working cooperatively, including the app itself.

So anyway, yes, you might get individual sockets up to 100Mbps locally, but once you have multiple sockets running, you’ll end up being limited by the slowest connection the way you’re doing it. Allowing the socket to send data off as it sees fit will get you much better concurrent performance overall.

…at least that’s what we found when updating the web framework for web 2.

And no, workers don’t help you here. There’s no way to pass a socket generated by a server socket off to a worker to allow it to run independently of the main app at this point.

This is a place where load balancing your app is really helpful.
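
(One way to let each socket pace itself without Flush is to use BytesLeftToSend as backpressure. A sketch along those lines, with the threshold and sleep interval being arbitrary choices:)

while not stream.EndOfFile
  // only queue another chunk once the outgoing buffer has mostly drained,
  // so a slow client throttles its own thread instead of blocking everyone
  if SocketRef.BytesLeftToSend < ipsc_Lib.SocketChunkSize then
    SocketRef.Write(stream.Read(ipsc_Lib.SocketChunkSize))
  end if
  Self.Sleep(10)
wend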

On the subject of Flush, @Greg_O_Lone , I ended up with my own version which goes like this:


While True
  
  me.Poll  // get the latest status from the socket
  
  if not me.IsConnected then Return False
  if me.LastErrorCode > 0 then Return False
  
  if me.BytesLeftToSend <= 0 then Return True
  
Wend

Is this a waste of time or are there any other downsides? I have, or may have at times, multiple threads doing SSLSocket I/O.

I would mark this post as the answer:


Okay, first of all: Greg, thank you for taking the time, I appreciate it!
I tried the minimal setup you suggested. The code in the thread’s Run event has been updated in the GitHub repository and is as follows:

while not stream.EndOfFile
  
  SocketRef.write(stream.read(ipsc_Lib.SocketChunkSize * n))  // n=4 , SocketChunkSize = 1048576 
  
  if endpoint.IndexOf("flush") > 0 then
    SocketRef.Flush
  end if
  
  Self.sleep(10)
  
wend

As you can see, this code allows for two modes of operation, one with Flush invoked and another without.
I ran these 7 test scenarios, all trying to download a 2GB file using Mozilla Firefox as the client:

  1. No flush, 1 local client: 4MB/sec (same throughput I’d get with Web1 or Web2 2021R2 and much better than the 700kb/sec when Web2 was first introduced, thanks for fixing that!)
  2. No flush, 4 local clients: all run at 4MB/sec
  3. No flush, 4 local clients, 2 remote clients over slow wifi: locals run at a steady 4MB/sec , remotes run at 1.5MB/sec. The issue of “SYnchronized-THroughputs” does not manifest. (Shall we call it SYTH for short?)
  4. With flush, 1 local client: 100 MB/sec. Proud of Xojo!
  5. With flush, 5 local clients: declining from 50MB/sec to ~25MB/sec. Very proud of Xojo!
  6. With flush, 1 local client, 3 remote over gigabit ethernet: local at 40MB/sec declining , remotes at ~30MB/sec. Enthusiastic about Xojo!
  7. With flush, 1 local client, 1 remote over slow wifi: local at 2MB/sec , remote at 2MB/sec. Clearly a bad case of SYTH.

Now, there are two ways of looking at the experimental data:

  1. I’m doing things wrong: If I don’t want any SYTH, I should not Flush. Period.
    The problem in this case is that I’m really not happy with the flat throughput of 4MB/sec. And it makes people who accuse Xojo of being a “toy language” happy. I’m not happy with that either.

  2. I’m not doing things wrong: And I’m getting tangible results I can rub in the face of Xojo detractors: Xojo can prove somewhat acceptably performant in the servers game too.
    But deep inside the Framework, there’s something that’s causing SYTH. And it might be worth having someone take a closer look into it, because it’s going to make Xojo a better product.

The outline you presented (and that Rick A suggests I mark as the answer) looks to me like a solid explanation of how performance degrades as connections increase, something that’s neither surprising nor, in my view, a problem: that’s life, and that’s what load balancers are for, as you correctly pointed out.
But, to my understanding at least, it does not account for SYTH: 5 clients over a fast, uniform link run beautifully at 25MB/sec, while 2 clients over links of different bandwidths both run 10 times slower. This is something I still find strange, and had Xojo been my product, I’d be curious to find out why.

Anyway, I think I can live with the possibility of SYTH much better than I can live with the certainty of 4MB/sec over optical fiber.
Shall I file a bug report just for the formality of it?


Slow transfer rates cause slow services and slow events. Any thread in a “busy wait” condition, not sleeping, causes a “global wait”, so the entire system degrades. So any time you add a slow transfer to the pool, the slow operations affect the whole system and the entire pool degrades.
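
(For instance, if the polling loop posted earlier ran inside the worker thread’s own code, sleeping between polls would stop it from starving the scheduler. A sketch:)

While True
  SocketRef.Poll
  if not SocketRef.IsConnected or SocketRef.LastErrorCode > 0 then Return False
  if SocketRef.BytesLeftToSend <= 0 then Return True
  
  // sleep instead of spinning, so other threads and sockets get time
  Self.Sleep(5)
Wend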

I can’t see “a cure” under the current cooperative threading design. So it doesn’t need a bug report: it is “by design”, and a report would not trigger any action, as this is an already known fact and requests for a preemptive model have been made in the past with no results.
A way to mitigate the effect would be heavy use of features that the Xojo framework and language currently lack, like Futures, Async and Await.

Now, that’s a potential explanation!
Then my question to Greg becomes: does the root of the SYTH evil lie in the threading architecture?

If you read again just this part of the explanation he wrote, don’t you conclude that it does?