Default Primary Key Integer?

Never, ever, ever depend on this always being true. It’s true NOW, but what about 10 years from now?

A primary key should never have meaning to a human. It’s meant strictly for the database - not you. Giving it meaning leads to bad stuff happening to your data, because then people want to manually change the primary key to fix some silly human error, and you’ve just lost your data integrity.

From experience - this is a BAD idea.

Policy numbers are a human invention. Make them a separate field. The primary key is used by the database to do database things. Sure, they might stay the same forever, but do you want to take the chance of screwing up your data?
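To make that concrete, here’s a minimal sketch of the idea (table and column names are hypothetical, PostgreSQL-flavored syntax):

```sql
-- Surrogate primary key: owned by the database, never shown to humans.
CREATE TABLE policy (
    id            BIGSERIAL PRIMARY KEY,   -- meaningless, auto-generated
    policy_number VARCHAR(20) NOT NULL,    -- the human-facing identifier
    insured_name  VARCHAR(100) NOT NULL
);

-- The policy number stays unique and fast to look up, but it can be
-- corrected in one place without touching any foreign keys.
CREATE UNIQUE INDEX policy_number_idx ON policy (policy_number);
```

If someone fat-fingers a policy number, you fix one column in one row; every child table still points at `id`.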

Um - insurance companies typically use operating or policy company numbers, and those form part of the policy number. When we acquire another company, they get either a new policy or operating company number, so of course there is never, ever an issue with duplicates.

Unless the company we acquired had duplicates in their data, which often occurs when someone is using a synthesized key with no meaning and inserts the same policy twice - or even 184 times, which I have seen happen.

Airlines typically also use meaningful data as a primary key, as do many other service companies.

I am trying to wrap my head around why you guys think that using meaningful data as a primary key is a “bad practice” - but I can’t find a software or database engineering reason for prohibiting it, and I see many reasons to encourage that usage, at least in particular cases.

(shrug) Not that important; you don’t work in my shop nor I in yours, so no need for conflict. But there are dissenting opinions out there.

Perhaps there is some Xojo / RealBasic reason that makes it undesirable?

[quote=194730:@Bob Keeney]Never, ever, ever depend on this always being true. It’s true NOW, but what about 10 years from now?
[/quote]

Out of several million policies, the earliest one I can find here is from 1948. We have not had any problems with that, though the policy number it is held under now is not the same one it was issued under, of course. That company no longer exists and hasn’t for close to half a century. I can still tell you the original policy number, but it is just a reference.

Perhaps a better example - databases often describe something physical in a one-to-many relation. One policy can have literally hundreds of scanned images associated with it. The physical access to those scanned images can, for reasons of speed and accuracy, be described as a specific offset in a specific file in a specific path on a specific system, and that location is in fact the primary key. Why? Because there are literally billions of these images. And yes, file maintenance can be a chore when someone tags an image as belonging to the wrong policy. But it is neither insurmountable nor all that difficult, and files do occasionally have to be reorganized.
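As a rough sketch of what that location-as-key scheme might look like (names are hypothetical; the real system obviously differs):

```sql
-- The physical location IS the identity: system, container file, offset.
CREATE TABLE image_location (
    system_id   SMALLINT     NOT NULL,  -- which storage system
    file_path   VARCHAR(255) NOT NULL,  -- container file on that system
    byte_offset BIGINT       NOT NULL,  -- where the image starts in the file
    policy_id   BIGINT       NOT NULL,  -- the policy this scan belongs to
    PRIMARY KEY (system_id, file_path, byte_offset)
);
```

Here the key doubles as the retrieval instruction, so no extra lookup is needed to find the bytes.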

In this case, we want the images to be diverse, though not quite hashed. Speed is the key there. We need them stashed in rather large container files for backup purposes. Stash a quarter of a million images in a single file, and it takes a fraction of the time to back up and restore that it would to do the same with a quarter million individual files. Add in data compression and de-duplication, and you have a working system.

When you factor in that we write a compressed header on each image, it also means we can recreate the indexes from the data should we ever face a disaster that requires it. That is a good use of space in terms of disaster recovery.

But a synthetic meaningless primary key in this instance would translate to a LOT of wasted storage space, a lot more DASD activity to retrieve the file, and more backup bandwidth. That adds up.

As I said, in a small database like we are looking at here, I don’t think it would be an issue. But I tend to look at larger systems. Indeed, one of the things I want to do with Xojo is write a nice interface into that image system. :wink:

Personal experience. I’ve had to pick up the pieces after developers who used primary keys that could be changed by a human. They thought they were doing the right thing, but all they did was screw up their data - much of it unrecoverable except by reverting to a backup.

You obviously are dealing with really, really, really large databases and a ton of data. Most of us mere mortal developers aren’t dealing with huge data, nor do we have a dedicated staff to make sure those systems are maintained, tweaked, and backed up. When teaching new developers, I would rather have them design their DBs safely until they get to the point where experts are brought in. That ain’t us.

So perhaps for you that makes sense, but I don’t recommend this approach. For most people, disk space is cheap, and they’ll never get to the point where it becomes an issue.

Your key then is a compound primary key, so it’s part company + policy number?

I’ve worked on systems like that.
Now your primary keys have LOTS of meaning.
I used to work on one where the account was operating company + operating-company account ID.
And then they bought another company and had to merge their accounts in. They mostly bought for offshore tax reasons - like what Apple, Google, etc. do with Double Irish arrangements, Cayman subsidiaries, and so on. Legal but … questionable as far as the intent of the laws goes, but when you have armies of tax guys to figure this stuff out…

That made a mess, since the operating company numbers were long and some account numbers short, so you had no way to segregate what was what: 001001002 might be company 001 + account 001002, or operating company 0010 + account 01002.
You have no idea the joy that caused :slight_smile:
It made doing consolidated financial statements a real exercise.
It resulted in running parallel accounting systems for about 4 years so the charts of accounts could be redesigned across all 185 operating entities and companies.
Anyone here worked in accounting & know how hard it is to design a chart of accounts for ONE decently large company?
Wanna try 185 at once?
Masochist was a perfect title for that job.
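For what it’s worth, the ambiguity disappears if the two parts stay separate columns in a compound key instead of being concatenated into one string. A sketch, with hypothetical names:

```sql
-- Concatenated: '001001002' could be 001+001002 or 0010+01002. Ambiguous.
-- Compound: the parts are never glued together, so there is nothing to mis-split.
CREATE TABLE account (
    operating_company CHAR(4)     NOT NULL,
    account_id        VARCHAR(10) NOT NULL,
    PRIMARY KEY (operating_company, account_id)
);
```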

No, it has nothing to do with Xojo. It’s just a general DB design principle I prefer. Having worked on large Oracle databases in my pre-Xojo days, I regularly ran into situations where primary keys that were supposed to always be unique and never need changing ended up not being unique and requiring changes. Using an artificial primary key instead, and just storing the other value (the alternate key) in its own column (indexed for fast lookups), always worked better for our software.

@Paul Raulerson there’s a big difference between an integer primary key and a unique constraint. Your use case begs for a unique key, but not a primary key.

LOL! Whoa guys, I am the merest of mere mortals. And today’s PCs are handling some ginormous amounts of data, so saving storage and processing is always a good idea, IMNSHO. :wink:

Tim - a unique constraint can and often does involve significant database activity. In Norman’s example, we would avoid that by choosing not to run parallel systems; that allows a simple and easy cross-reference, and we don’t wind up with a mess like Norman experienced. I bet if Norman were to magically go back in time and do the same project all over again, he would do things differently. We learn. (grin)

Guys - this argument between synthetic and natural keys has been going on forever, since databases started. I believe C. J. Date argued it. Even COBOL programmers argue about it. The best advice I would give you or anyone is: do not make a religion of either choice; use the right choice for each situation. Natural keys are best in some situations, synthetic keys in others.

Paul L. is obviously concerned with the stability of (primary) keys. Well, the fact is, keys are going to change regardless of what design you use. One had best build change control into the design from the very beginning. Controlling change to primary keys is not any more difficult than controlling change to a secondary index. It is definitely one factor to consider when choosing what type of primary key to use.

Natural data, which is real data with meaning, is very often a very good choice. And just like synthetic keys (maybe called surrogate keys these days; it depends upon whom you talk to…), they have good sides and bad sides.

I reacted to the absolute, authoritarian views expressed - because there is no commandment to use synthetic keys in all cases, or even in most cases; it all depends upon the design. In fact, it is quite probably wrong to use synthetic primary keys in all cases. I certainly think so.

Mere mortals experience the same issues as anyone else. If you code an application that makes it possible or easy for people to go in and change things up, they will. Like the fellow around here using the Access database whose client wanted to be able to change the data easily. That would be something I would run away from.

And inevitably, unrestricted and uncontrolled change access to data will screw things up. So, if you possibly can, do not allow it to happen.

The same idea applies, though a little less strictly, to ad-hoc reporting. Have you ever designed a system with ad-hoc reporting, only to have some yahoo design a report that gives erroneous or misleading results? Boy howdy - I sure have, despite taking great care to make it “impossible.”

-Paul

That would sure work. I’ve used Oracle a lot, but I speak native DB2 more comfortably. The usual reason I hear for (over)using synthetic (surrogate) keys in Oracle is join efficiency - you only need one column from a table to do a join. In practice (and only in my experience) this doesn’t prove true, and there is often an efficiency loss when “best practice” requires synthetic keys everywhere.
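For readers not steeped in this debate, the join-width argument being referred to looks roughly like this (hypothetical names, each query assuming the corresponding schema design):

```sql
-- With a surrogate key, the child table carries one FK column:
SELECT *
FROM claim c
JOIN policy p ON p.id = c.policy_id;

-- With a compound natural key, every child table carries (and joins on) both parts:
SELECT *
FROM claim c
JOIN policy p ON p.operating_company = c.operating_company
             AND p.policy_number     = c.policy_number;
```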

I do have a story about a guy who insisted on surrogate keys based upon a timestamp. Think right-growing index… (grin) To be fair, that was in a data collection system that ingested a large number of inputs each second. But it was rather comical nonetheless. That particular system did have a natural key that made sense to use, but there are plenty of systems that do not.

I see a new “Pet Peeves” thread growing here.

[quote=194753:@Paul Raulerson]
I do have a story about a guy who insisted on surrogate keys based upon a timestamp. Think right-growing index… (grin) To be fair, that was in a data collection system that ingested a large number of inputs each second. But it was rather comical nonetheless. That particular system did have a natural key that made sense to use, but there are plenty of systems that do not.[/quote]
A poor algorithm for implementing surrogates doesn’t make surrogates a bad idea.

I worked on an expert system analyzing SCADA data, gathering about 70,000 data points a minute, 24/7/365, and the number of data points was growing quite rapidly when I left (last I heard from the guys I knew operating it, it was over 150K points a minute).
That’s 30+ billion rows annually, with ridiculous retention rules (it’s a fully regulated business in Canada, the kind where they make you keep everything for “the life of the business + 10 years”).
We started by using the natural key (which was unique and still is after 15 years - until they buy another pipeline or something, then who knows).
Natural keys caused severe hot spots in the DB, as they were also indexed by default, and the indexes were very right-slanted, resulting in horrid performance - we could not insert 70,000 rows a minute because of the overhead of maintaining the natural indexes: page splits, index rebalancing, etc.
We absolutely HAD to use a non-natural primary key, then force the indexing in such a way that we did not cause hot spots in the DB pages or indexes. After that, 70,000 rows a minute was a breeze and no queries suffered. Yes, people were querying it while data came in - that was a prerequisite - no OLAP / OLTP split here.

The natural keys were still perfectly valid, but making them the actual primary key meant you either had unindexed primary keys - which is kind of obtuse and in many ways counter to why you wanted them to BE primary keys. Our Oracle DBA looked at us really sideways when we asked if we could have an unindexed primary key :stuck_out_tongue:
Or you used something that held no meaning, indexed in a way that would not cause hot spots.
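For anyone who hits the same wall: one Oracle-specific fix for exactly this right-growing hot-spot problem is a reverse key index, which stores the key bytes reversed so consecutive values land on different index blocks. A sketch with hypothetical names (the trade-off is that index range scans suffer):

```sql
CREATE TABLE scada_reading (
    reading_id NUMBER    PRIMARY KEY,   -- surrogate, no meaning
    point_id   NUMBER    NOT NULL,
    read_at    TIMESTAMP NOT NULL,
    value      NUMBER    NOT NULL
);

-- REVERSE spreads monotonically increasing values across index blocks,
-- avoiding the hot spot on the right-most leaf block.
CREATE INDEX scada_read_at_idx ON scada_reading (read_at) REVERSE;
```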

Analyzing the problems with natural keys, and where / when they will get you in trouble, is a VERY good habit to get into.
And you can avoid them by simply not using them, even when they seem “right”.

SSN in the USA “seems” like a great key for a “person” (if privacy laws allow it), as it should be unique - but that’s not true. (We never used the Canadian counterpart in the payroll systems I worked on here in Canada either, for similar reasons.)
There are recorded instances of companies with large payrolls having several people with the same actual SSN (fake ID, illegal immigration, identity theft, you name it).

The other problem with natural keys is that your data is now coupled to that key & that data.
And if you need to retool or revamp how you do business, that key is related to a LOT of other data because it has meaning.
And that meaning and all its repercussions are IN your schema, everywhere.

Suppose all of a sudden your business had to change the policy number to a different data type.
Say it needed to change to support alphanumeric values where today it is only numeric.
You have to change it EVERYWHERE it’s used as a foreign key, because that business decision is embedded IN your schema.
Not the case with a surrogate. It has NO meaning and so remains unaffected by business changes like this.
You change the ONE spot where policy number is an alternate key and you’re probably good to go.

Now, any single row in a table may have many alternate keys - and policy number might be a good one, fully indexed.
That is OK.
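To illustrate the migration-cost difference (hypothetical names, PostgreSQL-flavored ALTER syntax):

```sql
-- With a surrogate key, the policy number lives in exactly one column.
-- Widening it to alphanumeric is a single statement; no child table changes,
-- because every foreign key references policy.id instead.
ALTER TABLE policy ALTER COLUMN policy_number TYPE VARCHAR(30);

-- With policy_number as the primary key, the same business change means
-- altering the column in EVERY table that carries it as a foreign key,
-- typically after dropping and recreating each FK constraint.
```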

You might have been lucky enough never to have had to make any of the changes that others have mentioned, but they do happen, even when natural keys seem right. The places where I’ve had data architects who understood this never had any such issues. Where they didn’t, these problems would come up every once in a while, usually as the result of some big acquisition and one of the zillions of subsystems that were supposed to just “drop right in” - famous last words.
And part of the merger work to meld all the data together often included “fix this damned problem once & for all so it never happens again.” And it made the merger harder to achieve, cost more, and take longer.

You’re welcome to my scars if you want ’em.

If you insist. The system I was referencing happened to be a SCADA system as well, and was ingesting between seven and nine thousand transactions per second. That was just the primary transactions - many subsequent actions were triggered by those primary transactions. Synthetic keys brought that system to its knees, while a natural key was much more controllable. The use of synthetic keys was not a bad algorithm; it was a poor design choice in that instance. (By the way, I approved the design choice - nobody’s perfect!)

Now admittedly, that would not have been a task for Oracle, and we were not using Oracle. We had much more control over where the data resided and how the indexes were formed and handled. As well, we had a genius doing the database optimization. The machines were not terribly fast by today’s standards, either.

In your example, 70K transactions a minute is very respectable. I know the answer already, but you did not do anything like that with Xojo, did you? That boggles my imagination if you did!

I believe it is not sensible to make judgements about design patterns based on one bad experience, then extend that into some generalized rule that does not necessarily describe the general case - such as never using natural data for a key, or never using synthetic / surrogate keys. Neither position is one I would suggest to anyone.

By all means, if you have a natural key, use it. If you allow critical data that you know is false, such as duplicate SSNs, to be ingested, well, to me that is a less than optimal system design. Not in my shop. We have people on hand to investigate and resolve issues like that. Until they do, the data does not get into the system.

I think you’ve made our point. You have ‘people’ to investigate and resolve issues so bad data does not get into your database. In the world most of us Xojo developers live in, our clients don’t have the resources to do that. Most of our clients want the software to run, run right, keep their data safe, and talk to us as little as possible.

I’m not trying to pick a fight. You’re just inhabiting a completely different world than what most of us deal with. Our clients are small and don’t have ‘people’. They call us when they have a problem, and even then reluctantly. Heck, most of my clients don’t even have an IT staff. It’s the boss that has some technical knowledge, and that’s it.

@Paul Raulerson (the collective) we are just trying to help educate you on the experiences of our (collective) past, and to make it easier for you to avoid the traps that we have fallen into.

I wrote a side data capture & playback system in RB.
We’d catch several minutes of data and play things back to the expert system so we could see that the changes we were making to it were doing the analysis we expected. Hard to do that if you can’t play things back repeatably.
Did that before we had a chance to rewrite it into something that would be around long term, etc. :stuck_out_tongue:
We used Oracle and had a decent system (a Sun E3500).

Like I said, the bigger issue with “natural” keys is the embedding of business rules in the DB in ways that can be hard to extend or change.
Imagine you start out writing an HR system for a US-based employer, so you use SSN as a natural key for employees.
Seems sensible.
So you use the 9 raw digits, which avoids certain problems.
Then you expand to Canada, where we also use 9-digit Social Insurance Numbers (basically the same).
Now the chance you have a duplicate is decently high. And BOTH are legitimate, real, correct data.
There’s no “people” needed to fix it, because the data IS right.
So your primary key is quite possibly useless, AND possibly replicated as a foreign key through other tables.
You want to tell your employers they cannot expand to Canada because “the systems” can’t handle the data?
Or that they MUST set up a Canadian subsidiary - because “the systems” can’t handle the data, but handling it through a subsidiary so you don’t have to mix USA & Canadian data would make it work?
Your HR folks would probably go nuts because they can’t get at every employee in the entire company to see who’s a top performer and should be moved “to head office” in the USA.
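A sketch of one common way out (hypothetical names, PostgreSQL-flavored): keep a surrogate key for the foreign keys, and make the natural identifier unique only in combination with the country that issued it:

```sql
CREATE TABLE employee (
    id          BIGSERIAL    PRIMARY KEY,  -- surrogate, used by all foreign keys
    country     CHAR(2)      NOT NULL,     -- 'US' or 'CA'
    national_id CHAR(9)      NOT NULL,     -- SSN or SIN, 9 raw digits
    full_name   VARCHAR(100) NOT NULL,
    UNIQUE (country, national_id)          -- same 9 digits can exist in both countries
);
```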

On a few-hundred-million-dollar merger, that was a fun one to try & explain, and about a year to fix.
You should have heard the yelling from two VERY pissed-off CEOs when they wanted to know how well this was all going to work :slight_smile:

They ran a separate wholly owned sub for a year.

Now, with the tax laws here & the way they are there, they might just have the Canadian operations buy the USA ones out to make it a Canadian company with a wholly owned USA subsidiary :stuck_out_tongue:

That one may never heal. Whips & chains & lots of late nights on that one.

I work here now because in “corporate land” I got tired of getting beat up for other people’s past sins.
Too many late, late nights where I still had to drive home, shower, find something to eat, & repeat it all the next day.
Now at least I can get up from my office desk, literally walk about 4 steps, & fall into bed :slight_smile:

[quote=194784:@Bob Keeney]I think you’ve made our point. You have ‘people’ to investigate and resolve issues so bad data does not get into your database. In the world most of us Xojo developers live in, our clients don’t have the resources to do that. Most of our clients want the software to run, run right, keep their data safe, and talk to us as little as possible.

I’m not trying to pick a fight. You’re just inhabiting a completely different world than what most of us deal with. Our clients are small and don’t have ‘people’. They call us when they have a problem, and even then reluctantly. Heck, most of my clients don’t even have an IT staff. It’s the boss that has some technical knowledge, and that’s it.[/quote]

Different strokes for different folks. I do a lot of work for charities that are close to my heart, and believe me, you don’t get cheaper than charity. Ever had a bunch of little old grey-haired aunties talk you into making something work “just one more year - think of the people we could feed with that money!…”? Well-beloved, emotional-arm-twisting grey panthers… (sigh)

Every organization, even a one-person software shop, has “people.” In the end, it is a person who sits down and decides if something is right or wrong - even if that happens to be the same person who wrote the code.

In any case, I am not chopping at your thinking and opinions; I reckon most all of them are from experience and represent opinions I respect, even when I don’t agree with them. :slight_smile:

Apologies if I seem to have come on a little strong. I work with a bunch of very bright people, all of whom I like immensely, and who are always willing to defend their opinions. Kinda like the group of folks here, I think.

[quote=194792:@Norman Palardy]I wrote a side data capture & playback system in RB.
[…]
Imagine you start out writing an HR system for a US-based employer, so you use SSN as a natural key for employees. […] Then you expand to Canada, where we also use 9-digit Social Insurance Numbers (basically the same). Now the chance you have a duplicate is decently high. And BOTH are legitimate, real, correct data.
[…]
That one may never heal. Whips & chains & lots of late nights on that one.[/quote]

Ouch.

Never really tangled with offloading taxable income between countries like you describe, but I bet it is a bear. I would probably have avoided duplicates by federating the tables and keeping the CA and US information with possible conflicts separate. But that, of course, would have depended on a database that could do that… Better you than me. :slight_smile:

I do rather think in terms of putting data-dependent processing as close to the data as possible - or, in other words, embedding business rules in stored procedures, triggers, and so on. That’s a little data-centric, and drives the MVC folks half wild, but I have had a lot of success with that. Heck, it’s even object-oriented, no? :slight_smile:

That’s cool that you were able to write a capture/playback system with RB. That’s the kind of thing I envision using Xojo for more and more.

I feel like one day someone will find my poor old dried-out bones & examine me like they did Ötzi, and remark, “Hey look, he got that scar in the great BP Amoco merger of ’98!”
Seems that all I managed to do for a bunch of years was move from one energy company merger to another.
Paid well, but stressful work.

Working with / for accountants in a big company that did this kind of thing over 20 years ago was VERY educational.
Payroll was handled internationally from one source, so you don’t have some choices there.
As was HR - mostly where SSNs bit us - even federated, you can’t get around the fact that if you have a duplicate key, you have a duplicate :frowning:
We could mask it for the short term, but that would not fly long term.
It had to be outright solved so all the normal employee retention, promotion, etc. could be handled internationally and correctly.
That turned out to be immensely useful when they started subsidiaries in Vietnam, Peru, and elsewhere.

I started out that way many, many, many moons ago - 1993/4, with Sybase & stored procedures.
I moved away from it.
Once you merge a few companies & get mandated, not asked, to “be an XXXX shop now” (Oracle, Sybase, MS, pick one), you have zero choice & it will cost you.
I’ve had to do that kind of migration exactly once, from Sybase to Oracle, with payroll / accounting.

If you put a lot of business logic in stored procedures, now all your data processing is tied not only to a language (PL/SQL, Transact-SQL, etc.) but to the schema as well. And using the vendor’s dialect seduces you into using all those fun little extensions that lock you into their product & make it hard to actually move OR scale up to N-tier (because now you DO need all this logic in some other language / server / whatever so you can scale it up).
Yes, some folks perceive the procedures as that middle tier; they’re not. Really. You might as well talk straight to the DB.

It can put you in a deeper hole in the very long run.

Three-tier is overkill for anything really small, like a single-user database.
BUT if you do it that way, growing & moving to a new, more capable DB WILL be a lot easier, as all the business rules are in the middle tier & the DB is a repository, good at what repositories are for. It makes it easy to scale.
And the DB doesn’t have any business smarts in it.

[quote=194800:@Paul Raulerson]
That’s cool that you were able to write a capture/playback system with RB. That’s the kind of thing I envision using Xojo for more and more.[/quote]
This was … 1997/1998, running on my little old PowerBook.
It wasn’t 70,000 rows a minute or anything - much smaller - but it was something I could crank out fast, and we could play with specs & have something in an hour.
Once we had the spec solidified, it did eventually get rewritten into a Java app, as it was all back-end anyway.