Remote Sync Advice

I am undertaking my first attempt at a local/remote db sync in an app. I am looking to sync a client (local) database with a server (remote) database using a RESTful API on the server. The remote db can be accessed by multiple devices (desktop, laptop, phone) by the same userID. I see three basic conditions when the sync starts:

  1. the UUID is in both the local and the remote dbs
  2. the UUID is in the local db only
  3. the UUID is in the remote db only

It seems to me that the only way to determine which UUIDs belong in each group is to have the client send a list of all of the UUIDs in the local db to the server. I don’t see this being a real issue while the list remains short, but I also don’t know when the list crosses the line into “long” and starts being a performance issue.
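Concretely, the three-way split I have in mind is just set algebra once the server has the client's list; a minimal sketch (names are illustrative, not from any particular framework):

```python
# Hypothetical sketch: server-side partitioning of client-supplied UUIDs
# into the three groups listed above.

def partition_uuids(client_uuids, server_uuids):
    """Split UUIDs into (both, local_only, remote_only) using set algebra."""
    client, server = set(client_uuids), set(server_uuids)
    return (
        client & server,   # 1. in both local and remote
        client - server,   # 2. local only
        server - client,   # 3. remote only
    )

both, local_only, remote_only = partition_uuids(
    ["a", "b", "c"],      # UUIDs sent up by the client
    ["b", "c", "d"],      # UUIDs already known to the server
)
```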

Is there a better way to approach this?

If you keep a last-modified date on each record, then you can find the things that have changed since the last time you sync’d your device.

“Find me all records that have been updated since the last sync.” This will catch new and updated records, assuming you’ve got that data.
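As a rough sketch of that query, assuming a simple records table with a maintained modified_at column (the schema and names are made up for illustration; ISO-8601 UTC strings compare correctly as text):

```python
import sqlite3

# Assumed schema: records(uuid TEXT, payload TEXT, modified_at TEXT).
# Timestamps are ISO-8601 strings in UTC so lexical comparison works.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (uuid TEXT PRIMARY KEY, payload TEXT, modified_at TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [("u1", "old", "2024-01-01T00:00:00Z"),
     ("u2", "new", "2024-03-01T00:00:00Z")],
)

last_sync = "2024-02-01T00:00:00Z"
changed = conn.execute(
    "SELECT uuid FROM records WHERE modified_at > ?", (last_sync,)
).fetchall()
# Catches both new and updated rows, as long as modified_at is maintained.
```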

Using last sync time and mod dates (GMT, of course) was my first thought. But while I was researching syncing I read that you can’t always trust the clients’ time settings. I guess most devices these days will automatically get the right time, but that one sync where the time is off can really mess things up.

The better approach (according to what I read) is to track revision numbers (handed out by the server) for each record and compare the client and server revision numbers to determine which one is more recent. This is what I was going to implement, but then when I got into the details I ended up with the question in my original post.
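A minimal sketch of that comparison, assuming the server hands out revisions and the client additionally tracks a dirty flag for local edits (the dirty flag is my assumption, not something from the article):

```python
# Sketch: decide per-record sync action from server-issued revisions.
# Since only the server issues revisions, client_rev never exceeds server_rev.

def sync_action(client_rev, client_dirty, server_rev):
    if client_rev == server_rev:
        return "push" if client_dirty else "in-sync"
    if client_dirty:
        return "conflict"  # server moved on while we held unsent local edits
    return "pull"          # server has a newer accepted revision
```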

@Bob Keeney I know you have worked on many different kinds of apps - have you ever done an app with local/remote sync? If so, did you use the mod date/last sync date approach?

If you add a timestamp from the server during the handshake you can adjust your sync time to suit.

I’m assuming that clients OWN their data so there will be no conflicts between clients updates.

Yes. All dates are GMT 0, so all databases are using the same timezone. Networked computers are usually pretty accurate, but you could query the server to get its time and see if there is any significant difference between the local clock and the server clock.
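A sketch of that handshake adjustment (fetching the server's time over the API is left out; the values below are hard-coded for illustration):

```python
from datetime import datetime, timedelta, timezone

# Sketch: compare the server's reported UTC time with the client's clock
# during the handshake, and carry the difference forward as a correction.

def clock_delta(server_now: datetime, client_now: datetime) -> timedelta:
    """Offset to add to client timestamps to express them in server time."""
    return server_now - client_now

server_now = datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)  # from handshake
client_now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # local clock
delta = clock_delta(server_now, client_now)
```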

This is something I have used previously and it works very well:

http://havrl.blogspot.com/2013/08/synchronization-algorithm-for.html?m=1

Steve

@Stephen Pardoe thanks for the link. That is a concise summary of what I had found in all my web reading AND put together as an algorithm (which I hadn’t actually found anywhere else).

@Bob Keeney & @Wayne Golding thanks for your advice about comparing the server time and the client time during sync.

After reading that link from Stephen, it seems to me that the algorithm presented there can be used with either a timestamp or a revision number. What was missing from my initial approach was tracking new and dirty records on the client side, which then necessitated throwing everything at the server and letting the server figure it out.

I think I know what to do now, and I sure appreciate everyone’s help.

Oh, I don’t think I saw anybody mention it here: don’t use sequential row IDs for a primary key. You’ll have a hell of a time trying to sync when a new record has been created on each side. A UUID works well in some databases like pgsql and is tolerable in SQLite. PG will store the UUID as a 128-bit number, so it’s fast and compact. But SQLite is more complex because it has no support for 128-bit numbers. You’d have to store it as formatted text, which has issues with case sensitivity, or as binary, which isn’t human-readable and is harder to deal with.
Ugh, never mind, I just reread the original post.
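To illustrate the SQLite trade-off mentioned above, here is a small sketch storing the same UUID both ways, as lower-cased text and as a 16-byte blob:

```python
import sqlite3
import uuid

# The two SQLite options: text (readable, but case sensitivity matters in
# lookups) versus a 16-byte blob (compact, but not human-readable).
u = uuid.uuid4()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id_text TEXT, id_blob BLOB)")
conn.execute("INSERT INTO t VALUES (?, ?)", (str(u).lower(), u.bytes))

text_val, blob_val = conn.execute("SELECT id_text, id_blob FROM t").fetchone()
assert uuid.UUID(text_val) == uuid.UUID(bytes=blob_val) == u
assert len(blob_val) == 16   # 16 bytes of blob vs 36 characters of text
```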

Which database are you using?

MySQL on Xojo Cloud.

[quote=432344:@Clifford Antrim]I think I know what to do now, and I sure appreciate everyone’s help.[/quote]
Would you mind sharing more about this? I just read the link you mentioned and it sounds quite interesting. I’ve needed something like this and have been looking for a good solution for a while.

In my case, because of mobility needs, I’m looking at something like SQLite for both the master app and the client apps, with the master acting as a REST server. Writes are quite low and reads dominate, so that should do the job. I’m also looking at the possibility of an offline mode that syncs once back online, since the master itself can be mobile or away from the site, and a full sync is required once it returns.

thanks

@Aurelian Negrea - because of other priorities I haven’t made much progress with my app since I asked the question, but I have settled on a basic approach of using last modified timestamps for each record (client and server), and a global last sync’d timestamp for the client.

The greatest advantage to timestamps is that you can simply ask the server for anything that has been modified since that last sync time, and send to the server anything that has been modified on the client since the last sync time. If you don’t use timestamps, then you need to compare the entire contents of the client and the server to each other to see what has changed.

Using timestamps does require that the server time and the client time be the same. For that you need to do two things - store all sync and modified timestamps as GMT 0 so that you can compare apples to apples, AND ensure the clocks on the server and the client are synchronized. Storing as GMT 0 is easy enough. To synchronize the two clocks, you can have the client query the server for its current time at the start of the sync, compare that to the client’s time, and get a time delta between the two that you can apply to all the timestamps on the client side.
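Putting those pieces together, one sync round might look roughly like this; fetch_server_changes and push_client_changes are placeholders for the REST endpoints, and the record layout is assumed:

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch of one timestamp-based sync round, under the
# assumptions above: everything is UTC, and the clock delta from the
# handshake shifts client timestamps onto the server's clock.

def sync_round(last_sync, clock_delta, local_records,
               fetch_server_changes, push_client_changes, now):
    # 1. Pull everything the server has modified since our last sync.
    for record in fetch_server_changes(last_sync):
        local_records[record["uuid"]] = record

    # 2. Push everything we modified since last sync, expressed in
    #    server time so the comparison is apples to apples.
    dirty = [r for r in local_records.values()
             if "modified_at_local" in r
             and r["modified_at_local"] + clock_delta > last_sync]
    push_client_changes(dirty)

    # 3. Record the new last-sync marker in server time.
    return now + clock_delta
```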

I don’t know if any of that really answers your questions, but that was the crux of my original question (to use timestamps or revision numbers) and I have settled on timestamps.

I’ve actually just spent the last two months implementing just this sort of thing.

I use a combination of timestamps, revisions, and a sequence number to track the last change synchronized by either the client or server. As a general principle, all our tables have a DTCreated and DTModified timestamp. We did this almost a decade ago without a clear need, but it’s saved us so much trouble down the road.

I’d strongly suggest taking a look at the couchDB/couchbase replication protocol. You can simplify a number of implementation aspects, but it’s a good overview of the core concerns and some of the edge cases you might not have considered. Creating missing assets is the simple part; handling the changes, merges, conflicts and tombstones is the harder part.

@Julian Mussi - thanks for the tip. I will look into couchDB. I knew merges and conflicts would complicate things, but I haven’t gotten to them yet so I hadn’t thought about them much. By tombstones, I’m guessing you mean “deleted” items?

Yes, sorry. When something is deleted you need to record that it is no longer there rather than just deleting the record, as deleting outright creates an ambiguous scenario: is it a missing asset on the client, or do you actually want to remove the asset from all connected clients?
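A minimal sketch of the tombstone approach, using a soft-delete flag on an assumed schema so that a deletion syncs like any other update:

```python
import sqlite3

# Deletions are recorded, not erased, so a later sync can tell
# "deleted everywhere" apart from "missing here". Schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE records (
    uuid TEXT PRIMARY KEY,
    payload TEXT,
    deleted INTEGER DEFAULT 0,   -- tombstone flag
    modified_at TEXT)""")
conn.execute("INSERT INTO records VALUES ('u1', 'data', 0, '2024-01-01T00:00:00Z')")

# A "delete" sets the flag and bumps modified_at instead of removing the
# row, so the change propagates to other clients like any other update.
conn.execute("UPDATE records SET deleted = 1, "
             "modified_at = '2024-02-01T00:00:00Z' WHERE uuid = 'u1'")
```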

Honestly, it’s the edge cases that make replication/synchronization complicated. I would think about them now, as you’ll likely find that you need to go back and add additional components to your solution to deal with them. In many cases these are non-trivial.

To address your original question, I would use both a rev number and a timestamp (sequence ID). The timestamp will track which changes have been accepted by the server or client. You need to track this because if you ask for everything since the last update but for some reason encounter an error and only write the first five of ten items, you need to be able to pick up where you left off.
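That resume logic can be sketched like this (names are illustrative):

```python
# Sketch: persist the last sequence number that was successfully applied,
# so a failed batch picks up where it left off instead of replaying
# (or skipping) items.

def apply_changes(changes, apply_one, last_applied_seq):
    """changes: list of (seq, record) pairs sorted by seq ascending."""
    for seq, record in changes:
        if seq <= last_applied_seq:
            continue                  # already applied in an earlier attempt
        apply_one(record)             # may raise partway through the batch
        last_applied_seq = seq        # in real code, persist this durably
    return last_applied_seq
```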

Revision is important because you need some way of knowing if you’re trying to commit data from a client to a server where the data is “out of date”.

Say two clients and the server have REV_1 of some asset. Client A makes a change and tries to commit it. For whatever reason the client connection drops and the change isn’t immediately pushed. Meanwhile, client B also makes a change, but this one is accepted by the server, and now our asset is REV_2. When client A reconnects, it is trying to update REV_2 data with REV_1 data. That same client will also receive an update transaction, since client B has beaten it to the server. You now have a conflict scenario, but without rev numbers you won’t know how to handle the changes. Using only timestamps, client A made the first change, but client B was the first to update the server. Which one is newer? I guarantee that you will run into this scenario at some point, and it’s much better to engineer for that eventuality from the start.
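That scenario maps to a compare-and-set check on the server; a sketch (the store layout is my assumption):

```python
# Sketch: the server accepts an update only if the client's base revision
# matches the current revision (compare-and-set); otherwise it reports a
# conflict to be resolved by whatever policy you choose.

def try_commit(store, key, base_rev, new_payload):
    current_rev, _ = store[key]
    if base_rev != current_rev:
        return ("conflict", current_rev)        # client was editing stale data
    store[key] = (current_rev + 1, new_payload)
    return ("accepted", current_rev + 1)

store = {"asset": (1, "v1")}                    # both clients start from REV_1
result_b = try_commit(store, "asset", 1, "from B")  # client B gets in first
result_a = try_commit(store, "asset", 1, "from A")  # client A is now stale
```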

You can still resort to simple handling, such as first one in wins, but there needs to be some logic there.

The couchDB protocol (use the newer version) does a good job of bringing to light all the potential problems. You can then work through how you want to handle them. In an always-connected scenario, e.g., one where changes aren’t allowed unless connected to the server, it’s a bit simpler. However, if you’re building a mobile app or even a desktop app, you may want to allow disconnected scenarios.

Thanks, guys, for the advice. Honestly, we had some partners and customers using couchDB and it was quite a nightmare; I have no idea whether their setup was poorly done or the database itself is a mess, but that type of database is very hard for me to maintain, and the amount of duplicated data in their database was quite a wasteful use of space in this case. As for the protocol, I’ll have a look at how it works and what the best way to handle this is.

Thanks again.

Poor db design and/or poor data practices will make sync a LOT harder.