Seeking input on a massive database project

That’s a very long time for a migration of static data (i.e., not transforming running code into new code). I don’t know if a problem like this suffers from the limitations of the Mythical Man Month, but if adding more people would speed it up, I would do that in a heartbeat even if it costs more.

The real problem with such a long migration timeline is that both sides will certainly drift, as issues on each system are addressed and features added. Each change adds to the chances of invalidating existing work; hence, the expense of adding more people to speed it up may well be worth it.

It’s hard to tell from what you’ve said, but I doubt a “reasonably accurate” price is possible at this point. If I were you, I’d consider trying to knock this project down into small pieces:

- Inspection and evaluation.
- Scoping.
- Work on a small corner to get an idea of how difficult the whole project is going to be.
- Break down the transition by target table (i.e., this month we’re going to migrate customer contact information; next month, order history; the month after that, pricing tables; etc.).
- Testing and validation on a parallel track once target tables begin to be populated, with time each month for the devs to cycle back and fix any issues identified.

The smaller you can make the chunks, the more accurate your pricing will be.

You might even model this project as a series of ongoing subprojects to populate target tables (or groups of tables simultaneously) without a solid end date; each project gets completed and the next one is scoped out for time and cost as your understanding of the system evolves.

And if they complain about the range of pricing you’re giving them being too wide, politely ask them to estimate how much money and time went into creating the monstrosity in the first place. :slight_smile:

If I have to migrate something old, I always end up starting from scratch.
In the end it takes much less time than trying to bend the old system into the shape it should be.

Unfortunately that doesn’t work for customer data with years of history. Folding the data into our main db is, effectively, leaving the old code of the company we’re merging with behind. We don’t want to maintain an awful code base or database system, nor try to make it better. We already have better. We just have to get the customer data over.

But that data is hundreds of tables, thousands of columns, millions of records.

Thank you again for your responses.

In fact, you should consider refining your original post from this forum and posting it on a DB forum where the right people can see it and apply for the work.

I’m going to post it in a few places to try to get more ideas. Again, we’ve had good luck with Xojo people in the past. The one rough bid and set of ideas we’ve got came from a consultant who wrote some Xojo data processing apps for us before. And maybe I’m wrong, but I’m convinced this will take custom software. He had suggested a Xojo web app that all of us could use to review the column mappings, and one or more Xojo applications to execute those mappings and move the data.

That’s a very long time for a migration of static data (i.e., not transforming running code into new code). I don’t know if a problem like this suffers from the limitations of the Mythical Man Month, but if adding more people would speed it up, I would do that in a heartbeat even if it costs more.

It’s a lot of data and mapping and transformation. I do think more people could help, at least with figuring out what goes where and with testing. But testing alone is why I’m thinking a year. It will take many rounds of testing, with our team exercising every screen and report while the transfer team figures out parts of the transfer, and then many more rounds before we go live. There’s no way we do this transfer until there have been a dozen dry runs of all the data that went off without a hitch.

The real problem with such a long migration timeline is that both sides will certainly drift, as issues on each system are addressed and features added.

As I’ve said, we can’t freeze the databases for the entire process, but we are going to control drift as much as possible. We know the person or team helping us can’t hit a rapidly moving target. Both companies have mature software suites. While we maintain the old database, the only thing we will be focusing on and changing is security (which I won’t go into, but you can imagine given the state of the db schema). And we will minimize new features/tables in the new one. I’ve been thinking about that since your last post, and we will definitely have something set up where the developer(s) helping us can review and veto schema changes, plus at least a 90-day schema freeze prior to the actual move.

But point taken that if more people can shorten the time, there will be less drift anyway. Thank you again!

That’s what I thought. And what I need to know. Thank you!

Just curious, what is the size in GB of the MySQL database that needs to be migrated?
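If you have access to information_schema, a rough figure is one query away (a sketch only; the schema name below is a placeholder):

```sql
-- Approximate on-disk size of the legacy schema, in GB.
-- information_schema figures are estimates, especially for InnoDB,
-- but they're close enough for planning purposes.
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM   information_schema.tables
WHERE  table_schema = 'legacy_db'   -- placeholder: the MySQL schema to be migrated
GROUP  BY table_schema;
```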

EDIT:
I don’t have much more to say than what was written here.
Although it might help to add a TinyInt column named “is_migrated” to each table, to mark rows that have already been migrated, if the final solution is to break things down by customer instead of table by table.
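If you go that route, it needn’t be more than something like this (a MySQL sketch; the table, column, and ID values are made up):

```sql
-- Sketch only: add a migration marker to one source table.
ALTER TABLE customers
  ADD COLUMN is_migrated TINYINT NOT NULL DEFAULT 0;

-- After a customer's rows have been copied and verified on the target side:
UPDATE customers
   SET is_migrated = 1
 WHERE customer_id = 12345;  -- the customer just migrated
```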

Very good point, thank you. Unfortunately, when we move the data it has to be somewhat ‘all at once’. I don’t see how we can break it down by table, though we can break it down by customer and redirect on login, so that we’re not cutting over all records at once or running into massive tech support issues if something goes wrong. But until that time we will maintain complete copies of each stack, so for development and testing we can tackle it in small pieces as you suggest.

I will have to find out. I have the record counts in my notes, but not the size in GB.

Thank you for the suggestion. One thing the rough-bid consultant brought up is that we will need very complete logging of everything that happens. He was envisioning a third database to store all the information about how to do the transfer, such as column mappings, and to also store logs detailing what has been transferred and where it went: all the record IDs on both sides. If table A feeds table 1, that would just be two IDs. But if tables A, B, and C feed tables 1 and 2 as part of a single ‘record transfer’, using ‘record’ in the human sense, then all those IDs would be stored. I liked that idea a lot, and so did the other managers involved.
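Purely as a sketch of how I picture those control tables (every name below is invented, and the real thing would need more columns):

```sql
-- Drives the migration tooling: where each source column ends up.
CREATE TABLE column_map (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    source_table   VARCHAR(128) NOT NULL,
    source_column  VARCHAR(128) NOT NULL,
    target_table   VARCHAR(128) NOT NULL,
    target_column  VARCHAR(128) NOT NULL,
    transform      TEXT NULL,     -- notes or an expression describing any conversion
    reviewed_by    VARCHAR(64) NULL
);

-- One row per source/target ID pair involved in a human-sense "record transfer".
CREATE TABLE transfer_log (
    id             BIGINT AUTO_INCREMENT PRIMARY KEY,
    batch_id       INT NOT NULL,  -- groups the "A, B, C feed 1 and 2" style transfers
    source_table   VARCHAR(128) NOT NULL,
    source_id      BIGINT NOT NULL,
    target_table   VARCHAR(128) NOT NULL,
    target_id      BIGINT NOT NULL,
    transferred_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_source (source_table, source_id),
    INDEX idx_target (target_table, target_id)
);
```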


Oh, I was actually suggesting breaking down the analysis and migration mechanism into digestible chunks, not actually migrating the data. Essentially, the monthly deliverable will be knowledge and skills development: “we now understand this swath of the system and can populate these tables in the target system”. The final deliverable is “we migrated the data into the target tables”.

The more I think about it, the more I’m convinced that there are going to be several leaps of faith here due to the scope of the project. You’re going to need to hire a team partially on your belief that they can do the job, the customer is going to need to be patient and have faith that you will successfully complete the project, etc.

Oh, I was actually suggesting breaking down the analysis and migration mechanism into digestible chunks, not actually migrating the data. Essentially, the monthly deliverable will be knowledge and skills development: “we now understand this swath of the system and can populate these tables in the target system”. The final deliverable is “we migrated the data into the target tables”.

Excellent advice, that’s how I’m imagining it as well.

You may also want to check how much code resides in triggers and functions in the original MySQL DB. The DB I created in my current job has around 800 custom PL/pgSQL functions (which my Xojo Web app calls); the entire, highly complex business logic sits in those. Modern DB systems are development environments in themselves. It would be impossible to migrate it to another DB engine, I guess (without the help of AI, which is why I mentioned Postgres.ai in my earlier post).
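If it helps, information_schema will tell you quickly how much of that there is (the schema name below is a placeholder):

```sql
-- Count triggers and stored routines in the legacy MySQL schema.
SELECT 'TRIGGER' AS kind, COUNT(*) AS how_many
FROM   information_schema.triggers
WHERE  trigger_schema = 'legacy_db'
UNION ALL
SELECT routine_type, COUNT(*)
FROM   information_schema.routines
WHERE  routine_schema = 'legacy_db'
GROUP  BY routine_type;
```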

That’s one bit of good news: the MySQL database uses a number of views, but no triggers or functions. And hardly any constraints, which is part of the reason why it’s a mess. The views provide insight into some of the insanity going on, but don’t need to be transferred.

That is very good news indeed. I was wondering if they have any views set up.

Have you considered scraping the data from a web front end?

OK, now that you’ve stopped laughing, think about it: there’s an entire software stack sitting in front of this morass of database columns that is currently doing all the work you’re contemplating. It might actually speed up the process of getting the data out if you simply read it directly from what the customers are already seeing.

Granted, there might be bits and bobs here and there that are important and not exposed to the customer, but I bet you could line those up with the front end data without too much difficulty.


I don’t think that would work for everything, but…I’ve made a note, because it might prove easier to do it that way for some of the data. Thank you!

Since it appears that both of the companies are in the same, or very similar, businesses, you can expect that the same customers, suppliers, etc. are going to be present in both databases. So, you’ll have to allow for the weeding out or merging of the duplicates. Many years ago, I worked on merging two much smaller databases, and I found that dealing with duplicates probably ate up more time than anything else. You would think that it should be relatively easy to compare names and addresses and other supposedly unique information to do this, but in my case it all fell apart because the creators of the original databases hadn’t set up very strict enforcement of data types for data entry. AI might help with this though.
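Even a crude normalized comparison can surface candidates for human review; something along these lines (all table and column names here are invented, and real matching will need fuzzier rules than exact equality):

```sql
-- First-pass candidate duplicates between the two customer tables.
SELECT a.customer_id AS ours,
       b.customer_id AS theirs,
       a.email,
       a.last_name,
       a.postal_code
FROM   our_customers   AS a
JOIN   their_customers AS b
       ON LOWER(TRIM(a.email)) = LOWER(TRIM(b.email))
       OR (    LOWER(TRIM(a.last_name)) = LOWER(TRIM(b.last_name))
           AND TRIM(a.postal_code)      = TRIM(b.postal_code));
```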

That’s a very good point, and while we’ve talked about customer de-duplication, this has to be done on more tables. Thank you.

It also occurs to me that there will be data privacy legal liability issues, if, for example, two truly unique customers are incorrectly identified as being the same person and their personal data are merged into the same record.

We had this problem at my old company: users would forget they even had an account with us and set up a new one. We ended up creating a manual “Merge Customers” button in our admin interface.

Are you sure you want/need to merge the databases?

PostgreSQL has functionality named Foreign Data Wrappers (FDW) that enables you to attach third-party data sources and interrogate them as if they were tables in your own database (disclaimer: I have never used this functionality in PostgreSQL, but I have used similar functionality with another RDBMS). I have Googled it, and there is an FDW for MySQL, imaginatively named mysql_fdw.

Assuming your primary motivation for merging the databases is so you (or your customers) can interrogate history, using an FDW should enable you to create UNION views that combine the old data with the new without ever actually polluting your nice shiny tables with the crufty stuff.
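Since I haven’t used mysql_fdw myself, treat this as a rough sketch of the shape it would take on the PostgreSQL side (all connection details and table/column names below are made up):

```sql
-- Attach the legacy MySQL database as foreign tables in PostgreSQL.
CREATE EXTENSION mysql_fdw;

CREATE SERVER legacy_mysql
    FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host '10.0.0.5', port '3306');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER legacy_mysql
    OPTIONS (username 'readonly_user', password 'changeme');

CREATE FOREIGN TABLE legacy_orders (
    order_id    bigint,
    customer_id bigint,
    placed_at   timestamp,
    total       numeric
)
SERVER legacy_mysql
OPTIONS (dbname 'legacy_db', table_name 'orders');

-- Combined history, without copying anything into the new tables.
CREATE VIEW all_orders AS
    SELECT order_id, customer_id, placed_at, total FROM orders
    UNION ALL
    SELECT order_id, customer_id, placed_at, total FROM legacy_orders;
```

The obvious trade-off is that queries against the foreign tables are only as fast and as available as the old MySQL server, so this works better as a read-only window into history than as a permanent merge.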