How do I count records???

Rick_Araujo · July 24, 2014, 4:42am

Never saw a count(8), first time. Only old engines had advantage of count(1) over count(). Maybe some MS SQL 2000 or older. Currently I’ve saw one example of an Oracle having an advantage of 1 millisecond for 1 million rows. Peanuts. So the current verdict is : There isn’t advantages anymore. I used to use count(1) but don’t care to use count() nowadays.

Norman_P · July 24, 2014, 4:58am

[quote=114439:@Kem Tekinay]Dave, I tried your suggestion in a large Postgres table (actually, two tables joined, just to increase the level of difficulty), and the “explain” output as presented by Valentina Studio looked identical to me. This suggests that some databases already optimize COUNT(*). Do you agree, or am I missing something?

[code]
SELECT count(*) FROM provider_nppe, provider_nppe_identifier
WHERE provider_nppe_identifier.provider_nppe_id = provider_nppe.id

23:10:04 select count(*) from provider_nppe, provider_nppe_identifier …

Aggregate (cost=450245.69…450245.70 rows=1 width=0)
Output: count(*)
-> Hash Join (cost=214849.62…438059.02 rows=4874669 width=0)
Hash Cond: (provider_nppe_identifier.provider_nppe_id = provider_nppe.id)
-> Seq Scan on public.provider_nppe_identifier (cost=0.00…83143.69 rows=4874669 width=4)
Output: provider_nppe_identifier.provider_nppe_id
-> Hash (cost=144814.83…144814.83 rows=4268783 width=4)
Output: provider_nppe.id
-> Seq Scan on public.provider_nppe (cost=0.00…144814.83 rows=4268783 width=4)
Output: provider_nppe.id

SELECT count(8) FROM provider_nppe, provider_nppe_identifier
WHERE provider_nppe_identifier.provider_nppe_id = provider_nppe.id

23:10:48 select count(8) from provider_nppe, provider_nppe_identifier …

Aggregate (cost=450245.69…450245.70 rows=1 width=0)
Output: count(8)
-> Hash Join (cost=214849.62…438059.02 rows=4874669 width=0)
Hash Cond: (provider_nppe_identifier.provider_nppe_id = provider_nppe.id)
-> Seq Scan on public.provider_nppe_identifier (cost=0.00…83143.69 rows=4874669 width=4)
Output: provider_nppe_identifier.provider_nppe_id
-> Hash (cost=144814.83…144814.83 rows=4268783 width=4)
Output: provider_nppe.id
-> Seq Scan on public.provider_nppe (cost=0.00…144814.83 rows=4268783 width=4)
Output: provider_nppe.id
[/code][/quote]

The output is different but the cost of getting that is the same
two nested “seq scans”
which is about what I’d expect
However note that the Hash Join cost for count(*) is significantly higher than that for count(8)
Meaning count(80 is more efficient as all data for all rows does NOT have to be accessed

Markus_Winter · July 24, 2014, 5:39am

[quote=114446:@Karen Atkocius]Marcus,

You always seem to be working on apps for doing protein ID by MS.

What I am curious about is why, as most Mass Spec vendors have very capable software…

We have a Waters QToF with MassLynx along with Protein Lynx Global Server and BioPharmaLynx etc etc… and by your naming of objects/variables it appears that the code you post always seems to be related to things that type of software does…

Out of curiosity what does your code do that the commercial packages don’t?

Karen[/quote]

My last lab was the Krijgsveld Group with the Proteomics Core Facility at EMBL. Not that I know much about Proteomics, I was doing the Molecular Biology and Cell Biology experiments to produce the samples for Mass Spec analysis that also established a new method (we got a Nature paper out of it last year).

They have lots of software too - MaxQuant, Proteome Discoverer, etc. - and the specialists that know how to use them.

However there are some questions in Research that those apps can’t answer.

So one of my ex-colleagues (he is one of those specialists but not a programmer) told me about what he was trying to do and I thought it would be a good training exercise to get into databases - so I offered to write “a little app” for that. Oh the naivity! I thought I’d be done in 2-4 weeks but it has pretty much been a full-time job for over 4 months, working till past midnight most of the time.

So for that app I was creating lots of modules and methods which I hope to be able to re-use for my own apps.

It is unpaid work, but what the heck, it’s good for Science and I learn a lot.

Btw, yesterday another ex-colleague approached me with her problem. This time there is the possibility of a cake at the end

Eli_Ott · July 24, 2014, 6:20am

Sometimes I work like an “Infinite monkey theorem” programmer - and then I have you guys look at my code to find out if I have produced code on a shakespearean level.

Kem_Tekinay · July 24, 2014, 6:25am

[quote=114452:@Norman Palardy]However note that the Hash Join cost for count(*) is significantly higher than that for count(8)
Meaning count(80 is more efficient as all data for all rows does NOT have to be accessed[/quote]

Might you have misread the output or am I missing something? I just diff’d each section in BBEdit and, other than the “*” vs. “8”, the output is identical.

DaveS · July 24, 2014, 1:06pm

I’m gonna chat with my DBA at work… but I’m not sure that EQUAL COST in the explain plan is the same as EQUAL execution time… it just means the approach will be the same… not that the amount of data move or analyzed will be

DaveS · July 24, 2014, 4:57pm

My DBA said most high end database engines in fact optimize COUNT(*) to be equiv to COUNT(8), but not all… He said ORACLE only did this after version 9I

Kem_Tekinay · July 24, 2014, 5:17pm

The conclusion I draw is that, depending on the engine, there may be no difference, but be on the safe side an use COUNT(8).

Thanks.

BTW, a neat trick for counting records for a complex statement is to enclose it in parens with an alias. For example, suppose you had a statement like (pseudo-code):

SELECT a.Field1, b.Field2, c.Field3, c.Field4, c.Field5 FROM a JOIN b JOIN c WHERE complex_conditions GROUP BY fields

You can count that easily with:

SELECT COUNT(8) FROM (
SELECT a.Field1, b.Field2, c.Field3, c.Field4, c.Field5 FROM a JOIN b JOIN c WHERE complex_conditions GROUP BY fields
) AS counter

That lends itself easily to a generic db.Count(sql) function.

DaveS · July 24, 2014, 5:28pm


SELECT COUNT(8) FROM
 (
SELECT 8
 FROM a ,b,c 
WHERE complex_conditions 
GROUP BY fields
) AS counter

Again… why load data that you don’t need.

The only time I would use the above is


SELECT COUNT(8) FROM
 (
SELECT DISTINCT a.Field1, b.Field2, c.Field3, c.Field4, c.Field5
 FROM a ,b,c 
WHERE complex_conditions 
GROUP BY fields
) AS counter

which can also be expressed as

SELECT COUNT(DISTINCT a.Field1 ||  b.Field2 ||  c.Field3 ||  c.Field4 ||c.Field5) AS counter
 FROM a ,b,c 
WHERE complex_conditions

Vanessa_Gregory · July 24, 2014, 5:41pm

I would never use a where clause to join tables. I always use left join (or similar), it gives you better control over the way the join works and clearer logic when you come back to the code later.

SELECT COUNT(DISTINCT a.Field1 || b.Field2 || c.Field3 || c.Field4 ||c.Field5) AS counter
FROM a ,b,c
WHERE complex_conditions

becomes:

SELECT COUNT(DISTINCT a.Field1 || b.Field2 || c.Field3 || c.Field4 ||c.Field5) AS counter
FROM a
left join b on a/b condition
left join c on a/c condition (or b/c etc)

where would only be used if you wished to filter the result set. You’re not limited to left joins either, you can use other joins. The upshot is you are in control.

Norman_P · July 24, 2014, 5:58pm

[quote=114623:@Kem Tekinay]
BTW, a neat trick for counting records for a complex statement is to enclose it in parens with an alias. For example, suppose you had a statement like (pseudo-code):

SELECT a.Field1, b.Field2, c.Field3, c.Field4, c.Field5 FROM a JOIN b JOIN c WHERE complex_conditions GROUP BY fields

You can count that easily with:

SELECT COUNT(8) FROM (
SELECT a.Field1, b.Field2, c.Field3, c.Field4, c.Field5 FROM a JOIN b JOIN c WHERE complex_conditions GROUP BY fields
) AS counter

That lends itself easily to a generic db.Count(sql) function.[/quote]

This still grabs all those fields - i.e. lost of work on the db side to do it
And it won’t necessarily work in all db’s
In fact I’d expect the count(8) version to be slower since it will run the query then the count as well
You’d be better off to do
SELECT COUNT(8) FROM a JOIN b JOIN c WHERE complex_conditions GROUP BY fields

regardless of how the joins are done - which I don’t always agree that the LEFT JOIN syntax is clearer
In fact I’ve found cases where it ONLY works correctly with the non LEFT JOIN syntax (that was fun to figure out)

Vanessa_Gregory · July 24, 2014, 6:08pm

I did say that a Left join wasn’t the only available option. The difference is you state the type of join you wish to achieve, left, right, inner, outer etc. Putting it in the where clause allows the engine to make decisions on how it’s going to join the tables. That is never a good idea, especially as you could move to another engine / db and everything could change.

I find it very difficult to believe that it only works if you don’t specify the join. If the answer was wrong you likely didn’t use the same join as the engine did and that caused the discrepancy. Left join only includes records that exist in the left (main) table, a where clause join includes all records from both tables, even where they don’t match. You can still do that with the “from a join b on xxx” syntax if required.

The other benefit of specifying your joins is that you can see what is there to specify the join on the tables and what is there to filter the result set returned. Surely a good thing for code readability.

DaveS · July 24, 2014, 6:08pm

Ian… on that I will agree to disagree… I have been an Oracle PL/SQL programmer for years… and find using the JOIN keyword to have its uses… but only in a LEFT or RIGHT (mostly LEFT) syntax where I wish to keep ALL the Left records and only data from the RIGHT records the meet the ON condition.

using SELECT from table1 a,table2 b where a.xx=b.xx is no more or less effective, but produces the result of ONLY records the meet the condtion (ie. not all from table1 unless they match table2)… where LEFT indicates ALL from table1 regardless of matching

Vanessa_Gregory · July 24, 2014, 6:13pm

Equally I’ve been an SQL programmer since MySQL and MS SQL came into existence. I’ve never met a query where a specified join of one type or another wouldn’t achieve the required result. As I said Left isn’t the only specified join available.

DaveS · July 24, 2014, 6:18pm

I agree to disagree… and will leave it at that.

Vanessa_Gregory · July 24, 2014, 6:22pm

Fair enough.

Norman_P · July 24, 2014, 6:27pm

I’ve been around and doing it longer than that - with both those engines as well as many others (including Oracle, DB2 and bunch of others) - but that’s not the point.

I’ve run into several cases on different engines with all the variants of LEFT RIGHT FULL INNER and OUTER and CROSS
See sqlite’s docs on SELECT and Side note: Special handling of CROSS JOIN. and 2. WHERE clause filtering.
You’ll get differences in behavior using one vs the other
And I’ve encountered that in most engines

Kem_Tekinay · July 24, 2014, 6:55pm

[quote=114627:@Dave S]Again… why load data that you don’t need.

The only time I would use the above is[/quote]

To be clear, I’m not recommending this over other options, just offering it as a potential shortcut.

Vanessa_Gregory · July 24, 2014, 7:40pm

I never claimed it was, I was merely responding to Dave S similer statement.

[quote=114650:@Norman Palardy]I’ve run into several cases on different engines with all the variants of LEFT RIGHT FULL INNER and OUTER and CROSS
See sqlite’s docs on SELECT and Side note: Special handling of CROSS JOIN. and 2. WHERE clause filtering.
You’ll get differences in behavior using one vs the other
And I’ve encountered that in most engines[/quote]
There are many many odd things about sqllite so I can quite expect issues. As I said earlier I’m happy to agree to disagree…

Norman_P · July 24, 2014, 9:07pm

Oh its not JUST sqlite
One I happened to recall off the top of my head that was documented