Sorry for the length, but this is a complicated project with a lot of information.
I’m posting this under a ‘throwaway’ account, but may post it to the jobs section under my normal account once I have more information and the job is formally written up. I’ve been asked not to reveal anything specific about the companies until we’re ready to post the job officially (hence the throwaway account, to head off guessing). But I need to get some opinions, ideas, and direction before we can finalize and post it.
Given the size, nature, and complexity of the job, I’m at a bit of a loss right now as to how to properly approach and estimate it. So I’m looking for opinions on how best to estimate it, and hopefully some very rough estimates. I’ve had good luck with the Xojo community in the past, and I believe this will require custom software, which can be in any language. I don’t know of any off-the-shelf tools that can handle everything we’re going to need.
The goal is to transfer as much data as possible from a MySQL database into an existing PostgreSQL database (company merger) so that the user applications which sit on top of MySQL can be retired, and all customers can be part of one company and use the software which sits on PostgreSQL. The schemas are independent designs; any similarities come from similarity of function from the company and end-user perspective. Table and column names are different, and there are many instances of the same data being broken down into tables/columns in a different way. But there are also a lot of tables which represent the same data, just with different column names that have to be identified in order to transfer the data.
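To make that concrete, here is the kind of mapping that has to be identified by hand. Every name below is invented for illustration; the real mapping would run to thousands of entries, and building it is a big part of the job:

```python
# Hypothetical examples of table/column matching -- none of these are real
# names from either schema, just the shape of the problem.
COLUMN_MAP = {
    # MySQL source          ->  PostgreSQL target
    "t_cust.cust_nm":           "customers.name",      # same data, different names
    "t_cust.addr_ln1":          "addresses.street",    # one source table split across targets
    "t_veh.owner_ref":          "vehicles.owner_id",   # FK value that must be translated
}
```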
MySQL size and complexity: over 450 tables with over 8,000 columns. Unfortunately it is very, very poorly documented, and we’ve found that what little documentation is available isn’t reliable. It is not normalized, meaning there are lots of copies of some of the data. This means not all tables/columns will need to be transferred, but we don’t have a good estimate of what does need to be transferred at this time. Whoever helps us will need to identify redundant tables and columns as part of the job. There are just over 200 million records, all tables combined.
PostgreSQL size and complexity: over 300 tables with just over 4,000 columns. We’re still in the process of properly documenting this database schema, but one advantage is that whoever helps with the data transfer will have access to the team that created and maintains the PostgreSQL database and the software which uses it. The PostgreSQL database isn’t perfect, but is much better than the MySQL one and much closer to proper normalization. That’s the reason for the different table/column counts for data which is otherwise very much the same in terms of purpose and function.
Challenges: The MySQL database has a lot of issues:

- There are orphaned tables and partially orphaned tables (i.e. no new data goes in, but some old data is still referenced).
- There are inconsistent foreign keys, meaning that a record in table A might link back to table B or C depending on other data in the record. Some of the views are a mess because of this, but at least they indicate what is going on.
- There are cases of data that should all be in one table being split among multiple tables with the same columns. As best we can tell this was done for performance, because of the inconsistent foreign keys and poor indexing.
- There are incorrect column types, such as a string used to hold a boolean or number, a number used for a boolean, and dates stored as numbers or, worse, as strings with inconsistent formatting (this is a big pain point).
- There are some misnamed columns, as if the team who designed it decided to just repurpose a column instead of renaming it and changing any downstream code.

The MySQL database is honestly an example of what not to do with a database schema.
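The date and boolean mess in particular is the kind of thing we assume custom code will handle. A minimal sketch of what I mean (Python here purely for illustration); the format list and truthy/falsy strings are guesses that would have to be confirmed by profiling the actual data:

```python
from datetime import datetime

# Assumed formats -- the real list has to come from profiling the MySQL data.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y", "%Y%m%d"]

def parse_messy_date(value):
    """Try each known format in turn; return None so a human can review."""
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # unrecognized format: log it rather than guess

def parse_string_bool(value):
    """Coerce 'Y'/'N', '1'/'0', 'true'/'false' style values to real booleans."""
    text = str(value).strip().lower()
    if text in {"y", "yes", "1", "true", "t"}:
        return True
    if text in {"n", "no", "0", "false", "f", ""}:
        return False
    return None  # unknown encoding: surface it instead of silently mapping
```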
There will be many examples where data from table A has to be split into tables 1, 2, and 3, or where data from tables A, B, and C has to be combined into table 1. (Not actual table names, you get the idea.) Foreign key relationships have to be maintained, so that if you have a “car” and an “owner” in one database, they share the relationship in the other. But database IDs do not have to be reused, and in fact cannot be; we just need to preserve the relationships, which means translating IDs during the transfer. There are also a few examples of JSON use in each database, but naturally they don’t match, i.e. some source JSON will have to be parsed out to columns, and some source columns will have to be packaged as JSON. There’s not much of this, but it is there. The person or team who ends up helping us will be free to create additional views in MySQL to assist in the transition.
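For the ID translation, the core pattern is simple even if the scale isn’t. A rough sketch with DB-API-style cursors and invented table/column names (real code would have to batch; 200 million rows can’t go row-at-a-time):

```python
# Both mysql-connector and psycopg2 cursors accept the %s paramstyle, so the
# pattern reads the same on both sides. All names here are invented.

def transfer_owners_and_cars(mysql_cur, pg_cur):
    id_map = {}  # old MySQL owner_id -> new PostgreSQL id

    mysql_cur.execute("SELECT owner_id, owner_nm FROM t_owner")
    for old_id, name in mysql_cur.fetchall():
        pg_cur.execute(
            "INSERT INTO owners (name) VALUES (%s) RETURNING id", (name,)
        )
        id_map[old_id] = pg_cur.fetchone()[0]  # remember the translation

    mysql_cur.execute("SELECT vin, owner_ref FROM t_car")
    for vin, owner_ref in mysql_cur.fetchall():
        pg_cur.execute(
            "INSERT INTO cars (vin, owner_id) VALUES (%s, %s)",
            (vin, id_map[owner_ref]),  # relationship preserved, ID translated
        )
```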
The naming conventions defy an algorithmic approach to matching tables and columns which have, or should have, the same data. But most are pretty obvious to a human being. There will be cases where the person handling this has to get feedback from us to determine what to do. Whatever person or team assists us will have isolated test environments set up so that they can do research and figure out tricky columns using the application GUIs, with no fear of impacting operations.
We expect that someone will have to write one or more tools to specify how the data should be transferred and then actually perform the transfer. There’s no way this can be done with views and a simple SQL file export/import. There’s going to be a lot of research, and a need to present the results of that research to us for review (custom web application?). We expect there to be a long period where parts of the data are copied to a test environment for thorough testing by us, with feedback to the person or team we hire so that they can make adjustments and then move on to the next part or set of tables. Once testing and review show everything is correct…or as good as it can be…we expect to move the live data over several nights. This transition period can’t last too long, so performance is also an issue. During the transition, customers who log in will be automatically redirected to the correct servers and database (we will handle that).
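To illustrate the sort of tool I have in mind: something declarative, where a hand-written spec drives each transfer step and the spec itself captures all the research. Everything below is invented; it’s a sketch of the idea, not a design:

```python
# A hand-built spec: source query, target table, per-column transforms.
# Invented names throughout; writing and validating the real spec against
# both schemas is where most of the effort would go.

TRANSFER_SPEC = [
    {
        "source_sql": "SELECT cust_nm, sign_dt FROM t_cust",
        "target": "customers",
        "columns": {  # target column -> transform over a source row dict
            "name":      lambda row: row["cust_nm"].strip(),
            "signed_on": lambda row: parse_messy_date(row["sign_dt"]),  # earlier sketch
        },
    },
]

def run_step(step, mysql_cur, pg_cur):
    targets = list(step["columns"])
    insert_sql = "INSERT INTO {} ({}) VALUES ({})".format(
        step["target"], ", ".join(targets), ", ".join(["%s"] * len(targets))
    )
    mysql_cur.execute(step["source_sql"])  # assumes a dict-style cursor
    for row in mysql_cur:
        pg_cur.execute(insert_sql, tuple(step["columns"][c](row) for c in targets))
```

Each spec entry could then be tested in isolation against the test environment before the corresponding tables are signed off.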
The thing is, I’m not sure of the best approach to estimating this, and I don’t know how to recognize an estimate that’s too low or too high. I have one estimate that uses a minutes-per-table-and-column approach, plus time for other things like meetings and writing the applications that are needed, and it landed at roughly 2,500 man-hours. (For a sense of scale: at an average of just 15 minutes per column, the 8,000 MySQL columns alone come to 2,000 hours before any meetings or tooling.) Sometimes I look at this and think it’s high, but other times I think it might be low. We can’t expect a perfectly accurate estimate; we just want to be in the ballpark so that we can plan our transition and how many people we need to hire. We understand that this might require more than one person and take a year or more.
Thoughts? How would you approach estimating and accomplishing this? Any rough ideas as to how long something like this should take based on the table/column counts?