I have a simple aggregation query that works as expected:
SELECT report_date, SUM(col1) AS sum_col1
FROM my_table
GROUP BY report_date
ORDER BY report_date;
However, when I add another aggregation like this, the result of sum_col1
changes:
SELECT report_date, SUM(col1) AS sum_col1, COUNT(DISTINCT col2) AS cnt_col2
FROM my_table
GROUP BY report_date
ORDER BY report_date;
For some of the result rows, sum_col1
is now greater, for others it is smaller. (For most, it is the same.) This is reproducible: if I remove the COUNT(DISTINCT)
, I get the values from before. If I re-add it, I get the changed values.
This happens on a somewhat large dataset (thousands of rows); I didn't succeed to reproduce this in a small toy example that I could post here.
AIUI, the addition of an aggregation function shouldn't change the values of the other aggregation functions. The only possible anomaly is that col1
contains NULL
values, but I can't see why this would change the result from one query to the other.
Does someone have an explanation for this? Does the COUNT(DISTINCT ...)
somehow affect the way the grouping works for all aggregations?
Thanks.
Edit: Here are a few of the things I tried while getting to a minimal table that reproduces the problem.
REINDEX TABLE my_table
, which rebuilds PostgreSQL's metadata. The problem persists.(By "slicing" I mean applying OFFSET
and/or LIMIT
and creating a new table with the result of the query.)
Edit 2: Re rounding error: Summing up 1700 values results in a difference like 17395708 vs 17395696. That's a difference of 12. The types are real
. This got me thinking... could this be a rounding error due to different ordering of the summing? Weird is the exact repeatability of two different values and the switching between these exact two values in all scenarios I have tried.
I have a simple aggregation query that works as expected:
SELECT report_date, SUM(col1) AS sum_col1
FROM my_table
GROUP BY report_date
ORDER BY report_date;
However, when I add another aggregation like this, the result of sum_col1
changes:
SELECT report_date, SUM(col1) AS sum_col1, COUNT(DISTINCT col2) AS cnt_col2
FROM my_table
GROUP BY report_date
ORDER BY report_date;
For some of the result rows, sum_col1
is now greater, for others it is smaller. (For most, it is the same.) This is reproducible: if I remove the COUNT(DISTINCT)
, I get the values from before. If I re-add it, I get the changed values.
This happens on a somewhat large dataset (thousands of rows); I didn't succeed to reproduce this in a small toy example that I could post here.
AIUI, the addition of an aggregation function shouldn't change the values of the other aggregation functions. The only possible anomaly is that col1
contains NULL
values, but I can't see why this would change the result from one query to the other.
Does someone have an explanation for this? Does the COUNT(DISTINCT ...)
somehow affect the way the grouping works for all aggregations?
Thanks.
Edit: Here are a few of the things I tried while getting to a minimal table that reproduces the problem.
REINDEX TABLE my_table
, which rebuilds PostgreSQL's metadata. The problem persists.(By "slicing" I mean applying OFFSET
and/or LIMIT
and creating a new table with the result of the query.)
Edit 2: Re rounding error: Summing up 1700 values results in a difference like 17395708 vs 17395696. That's a difference of 12. The types are real
. This got me thinking... could this be a rounding error due to different ordering of the summing? Weird is the exact repeatability of two different values and the switching between these exact two values in all scenarios I have tried.
Kudos to @GuillaumeOutters for bringing up rounding errors.
It seems that different orderings of summing results in (exactly?) two different results, but the difference is something like 17395708 vs 17395696 -- that's a difference of 12, and already way outside the significant digits of the real
type. I shouldn't even have considered this difference as significant.
I confirmed this by casting the values to double precision
before summing, and the problem went away.
Even though it feels strange that the result switches between these exact two values (for one group; the other groups have different values but with similar characteristics), that's likely just up to how the real
type behaves.
Thanks to everybody who chimed in (in the comments to the question).
SELECT
clause should not affect each other. Do note since you do not useORDER BY
, both queries can return different order of rows. – Parfait Commented Mar 6 at 0:36report_date
s)? The former could be explained by the fact that PostgreSQL sums up as soon as it reads values, in theGROUP BY
phase, before theORDER BY
, so if the additional changes the reading order the sum could result a bit differently. – Guillaume Outters Commented Mar 6 at 6:19