Sorry for the length, but this is a complicated project with a lot of information.
I’m posting this under a ‘throwaway’ account, but may post it to the jobs section under my normal account once I have more information and the job is formally written up. I’ve been asked not to reveal anything specific about the companies until we’re ready to post the job officially (hence the throwaway account, to head off guessing). But I need to get some opinions, ideas, and direction before we can finalize and post it.
Given the size, nature, and complexity of the job, I’m at a bit of a loss right now as to how to properly approach and estimate it. So I’m looking for opinions on how best to estimate it, and hopefully some very rough estimates. I’ve had good luck with the Xojo community in the past, and I believe this will require custom software, which can be in any language. I don’t know of any off-the-shelf tools that can handle everything we’re going to need.
The goal is to transfer as much data as possible from a MySQL database into an existing PostgreSQL database (company merger) so that the user applications which sit on top of MySQL can be retired, and all customers can be part of one company and use the software which sits on PostgreSQL. The schemas are independent designs; any similarities come from similarity of function from the company and end-user perspective. Table and column names are different, and there are many instances of the same data being broken down into tables/columns in a different way. But there are also a lot of tables which represent the same data, just with different column names that have to be identified in order to transfer the data.
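To make that concrete, here is the kind of mapping that has to be identified by hand. Every name below is invented for illustration; the real mapping would run to thousands of entries, and building it is a big part of the job:

```python
# Hypothetical examples of table/column matching -- none of these are real
# names from either schema, just the shape of the problem.
COLUMN_MAP = {
    # MySQL source          ->  PostgreSQL target
    "t_cust.cust_nm":           "customers.name",      # same data, different names
    "t_cust.addr_ln1":          "addresses.street",    # one source table split across targets
    "t_veh.owner_ref":          "vehicles.owner_id",   # FK value that must be translated
}
```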
MySQL size and complexity: over 450 tables with over 8,000 columns. Unfortunately it is very, very poorly documented, and we’ve found that what little documentation is available isn’t reliable. It is not normalized, meaning there are lots of copies of some of the data. This means not all tables/columns will need to be transferred, but we don’t have a good estimate of what does need to be transferred at this time. Whoever helps us will need to identify redundant tables and columns as part of the job. There are just over 200 million records, all tables combined.
PostgreSQL size and complexity: over 300 tables with just over 4,000 columns. We’re still in the process of properly documenting this database schema, but one advantage is that whoever helps with the data transfer will have access to the team that created and maintains the PostgreSQL database and the software which uses it. The PostgreSQL database isn’t perfect, but is much better than the MySQL one and much closer to proper normalization. That’s the reason for the different table/column counts for data which is otherwise very much the same in terms of purpose and function.
Challenges: The MySQL database has a lot of issues:

- There are orphaned tables and partially orphaned tables (i.e. no new data goes in, but some old data is still referenced).
- There are inconsistent foreign keys, meaning that a record in table A might link back to table B or C depending on other data in the record. Some of the views are a mess because of this, but at least they indicate what is going on.
- There are cases of data that should all be in one table being split among multiple tables with the same columns. As best we can tell this was done for performance, because of the inconsistent foreign keys and poor indexing.
- There are incorrect column types, such as a string used to hold a boolean or number, a number used for a boolean, and dates stored as numbers or, worse, as strings with inconsistent formatting (this is a big pain point).
- There are some misnamed columns, as if the team who designed it decided to just repurpose a column instead of renaming it and changing any downstream code.

The MySQL database is honestly an example of what not to do with a database schema.
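The date and boolean mess in particular is the kind of thing we assume custom code will handle. A minimal sketch of what I mean (Python here purely for illustration); the format list and truthy/falsy strings are guesses that would have to be confirmed by profiling the actual data:

```python
from datetime import datetime

# Assumed formats -- the real list has to come from profiling the MySQL data.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y", "%Y%m%d"]

def parse_messy_date(value):
    """Try each known format in turn; return None so a human can review."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # unrecognized format: log it rather than guess

def parse_string_bool(value):
    """Coerce 'Y'/'N', '1'/'0', 'true'/'false' style values to real booleans."""
    text = str(value).strip().lower()
    if text in {"y", "yes", "1", "true", "t"}:
        return True
    if text in {"n", "no", "0", "false", "f", ""}:
        return False
    return None  # unknown encoding: surface it instead of silently mapping
```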
There will be many examples where data from table A has to be split into tables 1, 2, and 3, or where data from tables A, B, and C has to be combined into table 1. (Not actual table names, you get the idea.) Foreign key relationships have to be maintained, so that if you have a “car” and an “owner” in one database, they share the relationship in the other. But database IDs do not have to be reused, and in fact cannot be; we just need to preserve the relationships, which means translating IDs during the transfer. There are also a few examples of JSON use in each database, but naturally they don’t match, i.e. some source JSON will have to be parsed out to columns, and some source columns will have to be packaged as JSON. There’s not much of this, but it is there. The person or team who ends up helping us will be free to create additional views in MySQL to assist in the transition.
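For the ID translation, the core pattern is simple even if the scale isn’t. A rough sketch with DB-API-style cursors and invented table/column names (real code would have to batch; 200 million rows can’t go row-at-a-time):

```python
# Both mysql-connector and psycopg2 cursors accept the %s paramstyle, so the
# pattern reads the same on both sides. All names here are invented.

def transfer_owners_and_cars(mysql_cur, pg_cur):
    id_map = {}  # old MySQL owner_id -> new PostgreSQL id

    mysql_cur.execute("SELECT owner_id, owner_nm FROM t_owner")
    for old_id, name in mysql_cur.fetchall():
        pg_cur.execute(
            "INSERT INTO owners (name) VALUES (%s) RETURNING id", (name,)
        )
        id_map[old_id] = pg_cur.fetchone()[0]  # remember the translation

    mysql_cur.execute("SELECT vin, owner_ref FROM t_car")
    for vin, owner_ref in mysql_cur.fetchall():
        pg_cur.execute(
            "INSERT INTO cars (vin, owner_id) VALUES (%s, %s)",
            (vin, id_map[owner_ref]),  # relationship preserved, ID translated
        )
```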
The naming conventions defy an algorithmic approach to matching tables and columns which have, or should have, the same data. But most are pretty obvious to a human being. There will be cases where the person handling this has to get feedback from us to determine what to do. Whatever person or team assists us will have isolated test environments set up so that they can do research and figure out tricky columns using the application GUIs, with no fear of impacting operations.
We expect that someone will have to write one or more tools to specify how the data should be transferred and then actually perform the transfer. There’s no way this can be done with views and a simple SQL file export/import. There’s going to be a lot of research, and a need to present the results of that research to us for review (custom web application?). We expect there to be a long period where parts of the data are copied to a test environment for thorough testing by us, with feedback to the person or team we hire so that they can make adjustments and then move on to the next part or set of tables. Once testing and review show everything is correct…or as good as it can be…we expect to move the live data over several nights. This transition period can’t last too long, so performance is also an issue. During the transition, customers who log in will be automatically redirected to the correct servers and database (we will handle that).
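To illustrate the sort of tool I have in mind: something declarative, where a hand-written spec drives each transfer step and the spec itself captures all the research. Everything below is invented; it’s a sketch of the idea, not a design:

```python
# A hand-built spec: source query, target table, per-column transforms.
# Invented names throughout; writing and validating the real spec against
# both schemas is where most of the effort would go.

TRANSFER_SPEC = [
    {
        "source_sql": "SELECT cust_nm, sign_dt FROM t_cust",
        "target": "customers",
        "columns": {  # target column -> transform over a source row dict
            "name":      lambda row: row["cust_nm"].strip(),
            "signed_on": lambda row: parse_messy_date(row["sign_dt"]),  # earlier sketch
        },
    },
]

def run_step(step, mysql_cur, pg_cur):
    targets = list(step["columns"])
    insert_sql = "INSERT INTO {} ({}) VALUES ({})".format(
        step["target"], ", ".join(targets), ", ".join(["%s"] * len(targets))
    )
    mysql_cur.execute(step["source_sql"])  # assumes a dict-style cursor
    for row in mysql_cur:
        pg_cur.execute(insert_sql, tuple(step["columns"][c](row) for c in targets))
```

Each spec entry could then be tested in isolation against the test environment before the corresponding tables are signed off.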
The thing is, I’m not sure of the best approach to estimating this, and I don’t know how to recognize an estimate that’s too low or too high. I have one estimate that uses a minutes-per-table-and-column approach, plus time for other things like meetings and writing the applications that are needed, and it landed at roughly 2,500 man-hours. (For a sense of scale: at an average of just 15 minutes per column, the 8,000 MySQL columns alone come to 2,000 hours before any meetings or tooling.) Sometimes I look at this and think it’s high, but other times I think it might be low. We can’t expect a perfectly accurate estimate; we just want to be in the ballpark so that we can plan our transition and how many people we need to hire. We understand that this might require more than one person and take a year or more.
Thoughts? How would you approach estimating and accomplishing this? Any rough ideas as to how long something like this should take based on the table/column counts?