knowledge-database (beta)

Current group: pgsql.hackers

Re: Much Ado About COUNT(*)

Re: Much Ado About COUNT(*)  
Mark Cave-Ayland
 Re: Much Ado About COUNT(*)  
Jeff Davis
 Re: Much Ado About COUNT(*)  
Mark Cave-Ayland
 Re: Much Ado About COUNT(*)  
D'Arcy J.M. Cain
 Re: Much Ado About COUNT(*)  
Richard Huxton
 Re: Much Ado About COUNT(*)  
Mark Cave-Ayland
 Re: Much Ado About COUNT(*)  
Richard Huxton
 Re: Much Ado About COUNT(*)  
Alvaro Herrera
 Re: Much Ado About COUNT(*)  
Bruno Wolff III
 Re: Much Ado About COUNT(*)  
Alvaro Herrera
From:Mark Cave-Ayland
Subject:Re: Much Ado About COUNT(*)
Date:Wed, 19 Jan 2005 14:59:17 -0000
> Date: Wed, 12 Jan 2005 18:45:09 -0800
> From: Jeff Davis
> To: Alvaro Herrera
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: Much Ado About COUNT(*)
> Message-ID: <1105584309.2886.410.camel@jeff>

(cut)

> Thanks for the link. It looks like it breaks it up into chunks of about
2KB. I think the
> conversation was mostly assuming the tables were somewhat closer to the
size of an
> index. If you have more than 2KB per tuple, pretty much anything you do
with an index
> would be faster I would think.


Hi Jeff/Alvaro,

I'm considering an application at the moment whereby I would need to do lots
of COUNT(*) on lots of separate tables without a WHERE clause. Would
something like the following help speed up the COUNT(*) by reducing the
tuple size being used for the count?


CREATE SEQUENCE id_seq;

CREATE TABLE person_count (
id int8
);

CREATE TABLE person (
id int8 DEFAULT nextval('id_seq');
first_name text,
surname text,
age int,
address1 text,
address2 text,
address3 text,
address4 text,
postcode text
tel text
);

For each insert:

BEGIN;
INSERT INTO person (first_name, .... Tel) VALUES ('Fred', ....
'12345');
INSERT INTO person_count(id) VALUES (currval('id_seq'));
COMMIT;


So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
know the current number of person records. How much quicker would a COUNT(*)
be if visibility were included in the indices as opposed to a "hacked"
approach like this?


Many thanks,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
From:Jeff Davis
Subject:Re: Much Ado About COUNT(*)
Date:Wed, 19 Jan 2005 13:32:47 -0800

To fill in some details I think what he's saying is this:

=> create table foo(...);
=> create table foo_count(num int);
=> insert into foo_count values(0);
=> create table foo_change(num int);

then create a trigger "after delete on foo" that does "insert into
foo_change values(-1)" and a trigger "after insert on foo" that inserts
a +1 into foo_change.

Periodically, do:
=> begin;
=> set transaction isolation level serializable;
=> update foo_count set num=num+(select sum(num) from foo_change);
=> delete from foo_change;
=> commit;
=> VACUUM;

And then any time you need the correct count(*) value, do instead:
=> select sum(num) from (select num from foo_count union select num from
foo_change);

And that should work. I haven't tested this exact example, so I may have
overlooked something.

Hope that helps. That way, you don't have huge waste from the second
table, and also triggers maintain it for you and you don't need to think
about it.

Regards,
Jeff Davis

On Wed, 2005-01-19 at 17:40 -0300, Alvaro Herrera wrote:
> On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote:
> > On Wed, Jan 19, 2005 at 14:59:17 -0000,
> > Mark Cave-Ayland wrote:
> >
> > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
> > > know the current number of person records. How much quicker would a COUNT(*)
> > > be if visibility were included in the indices as opposed to a "hacked"
> > > approach like this?
> >
> > You are only going to get a constant factor speed up unless the space savings
> > allows much better use of cache. You probably want to look at using
> > triggers to maintain counts in another table.
>
> I'd try using a "start value" and a differences list. So the
> differences list would be initially empty and the start value would be
> 0. On insert or delete, you create a new difference (with +1 or
> whatever). Periodically, the differences would be added to the start
> value and the records deleted. Thus the time to calculate the total
> count is much smaller, and it follows MVCC rules. Of course there are
> lots of minor details not mentioned here.
>
> Not sure if I'd model this with a single table or two.
>


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
From:Mark Cave-Ayland
Subject:Re: Much Ado About COUNT(*)
Date:Thu, 20 Jan 2005 10:12:17 -0000

> -----Original Message-----
> From: Jeff Davis [mailto:jdavis-pgsql@empires.org]
> Sent: 19 January 2005 21:33
> To: Alvaro Herrera
> Cc: Mark Cave-Ayland; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Much Ado About COUNT(*)
>
>
>
> To fill in some details I think what he's saying is this:
>
> => create table foo(...);
> => create table foo_count(num int);
> => insert into foo_count values(0);
> => create table foo_change(num int);
>
> then create a trigger "after delete on foo" that does "insert
> into foo_change values(-1)" and a trigger "after insert on
> foo" that inserts a +1 into foo_change.
>
> Periodically, do:
> => begin;
> => set transaction isolation level serializable;
> => update foo_count set num=num+(select sum(num) from
> foo_change); => delete from foo_change; => commit; => VACUUM;
>
> And then any time you need the correct count(*) value, do
> instead: => select sum(num) from (select num from foo_count
> union select num from foo_change);
>
> And that should work. I haven't tested this exact example, so
> I may have overlooked something.
>
> Hope that helps. That way, you don't have huge waste from the
> second table, and also triggers maintain it for you and you
> don't need to think about it.
>
> Regards,
> Jeff Davis


Hi Jeff,

Thanks for the information. I seem to remember something similar to this
being discussed last year in a similar thread. My only real issue I can see
with this approach is that the trigger is fired for every row, and it is
likely that the database I am planning will have large inserts of several
hundred thousand records. Normally the impact of these is minimised by
inserting the entire set in one transaction. Is there any way that your
trigger can be modified to fire once per transaction with the number of
modified rows as a parameter?


Many thanks,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
From:D'Arcy J.M. Cain
Subject:Re: Much Ado About COUNT(*)
Date:Thu, 20 Jan 2005 06:55:00 -0500
On Thu, 20 Jan 2005 10:12:17 -0000
"Mark Cave-Ayland" wrote:
> Thanks for the information. I seem to remember something similar to
> this being discussed last year in a similar thread. My only real issue
> I can see with this approach is that the trigger is fired for every
> row, and it is likely that the database I am planning will have large
> inserts of several hundred thousand records. Normally the impact of
> these is minimised by inserting the entire set in one transaction. Is
> there any way that your trigger can be modified to fire once per
> transaction with the number of modified rows as a parameter?

I don't believe that such a facility exists but before dismissing it you
should test it out. I think that you will find that disk buffering (the
system's as well as PostgreSQL's) will effectively handle this for you
anyway.

--
D'Arcy J.M. Cain | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html
From:Richard Huxton
Subject:Re: Much Ado About COUNT(*)
Date:Thu, 20 Jan 2005 12:44:36 +0000
D'Arcy J.M. Cain wrote:
> On Thu, 20 Jan 2005 10:12:17 -0000
> "Mark Cave-Ayland" wrote:
>
>>Thanks for the information. I seem to remember something similar to
>>this being discussed last year in a similar thread. My only real issue
>>I can see with this approach is that the trigger is fired for every
>>row, and it is likely that the database I am planning will have large
>>inserts of several hundred thousand records. Normally the impact of
>>these is minimised by inserting the entire set in one transaction. Is
>>there any way that your trigger can be modified to fire once per
>>transaction with the number of modified rows as a parameter?
>
>
> I don't believe that such a facility exists but before dismissing it you
> should test it out. I think that you will find that disk buffering (the
> system's as well as PostgreSQL's) will effectively handle this for you
> anyway.

Well, it looks like ROW_COUNT isn't set in a statement-level trigger
function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame, otherwise
it would be easy to handle. It should be possible to expose this
information though, since it gets reported at the command conclusion.

--
Richard Huxton
Archonet Ltd

-- stmt_trig_test.sql --
BEGIN;

CREATE TABLE trigtest (
a int4 NOT NULL,
b text,
PRIMARY KEY (a)
);

CREATE FUNCTION tt_test_fn() RETURNS TRIGGER AS '
DECLARE
nr integer;
ro integer;
nr2 integer;
BEGIN
GET DIAGNOSTICS nr = ROW_COUNT;
GET DIAGNOSTICS ro = RESULT_OID;
SELECT count(*) INTO nr2 FROM trigtest;

RAISE NOTICE ''nr = % / ro = % / nr2 = %'',nr,ro,nr2;

RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER tt_test AFTER INSERT OR UPDATE ON trigtest
FOR EACH STATEMENT
EXECUTE PROCEDURE tt_test_fn();

INSERT INTO trigtest VALUES (1,'a');
INSERT INTO trigtest VALUES (2,'b');
UPDATE trigtest SET b = 'x';

ROLLBACK;

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
From:Mark Cave-Ayland
Subject:Re: Much Ado About COUNT(*)
Date:Thu, 20 Jan 2005 13:33:10 -0000


> -----Original Message-----
> From: Richard Huxton [mailto:dev@archonet.com]
> Sent: 20 January 2005 12:45
> To: D'Arcy J.M. Cain
> Cc: Mark Cave-Ayland; jdavis-pgsql@empires.org;
> alvherre@dcc.uchile.cl; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Much Ado About COUNT(*)
>
>
> D'Arcy J.M. Cain wrote:
> > On Thu, 20 Jan 2005 10:12:17 -0000
> > "Mark Cave-Ayland" wrote:
> >
> >>Thanks for the information. I seem to remember something similar to
> >>this being discussed last year in a similar thread. My only
> real issue
> >>I can see with this approach is that the trigger is fired for every
> >>row, and it is likely that the database I am planning will
> have large
> >>inserts of several hundred thousand records. Normally the impact of
> >>these is minimised by inserting the entire set in one
> transaction. Is
> >>there any way that your trigger can be modified to fire once per
> >>transaction with the number of modified rows as a parameter?
> >
> >
> > I don't believe that such a facility exists but before
> dismissing it
> > you should test it out. I think that you will find that disk
> > buffering (the system's as well as PostgreSQL's) will effectively
> > handle this for you anyway.
>
> Well, it looks like ROW_COUNT isn't set in a statement-level trigger
> function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame,
> otherwise
> it would be easy to handle. It should be possible to expose this
> information though, since it gets reported at the command conclusion.


Hi Richard,

This is more the sort of approach I would be looking for. However I think
even in a transaction with ROW_COUNT defined, the trigger will still be
called once per insert. I think something like this would require a new
syntax like below, and some supporting code that would keep track of the
tables touched by a transaction :(

CREATE TRIGGER tt_test AFTER TRANSACTION ON trigtest
FOR EACH TRANSACTION
EXECUTE PROCEDURE tt_test_fn();

I am sure that Jeff's approach will work, however it just seems like writing
out one table entry per row is going to slow large bulk inserts right down.


Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk



---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
From:Richard Huxton
Subject:Re: Much Ado About COUNT(*)
Date:Thu, 20 Jan 2005 15:09:24 +0000
Mark Cave-Ayland wrote:
>
>
>
>>-----Original Message-----
>>From: Richard Huxton [mailto:dev@archonet.com]
>>Sent: 20 January 2005 12:45
>>To: D'Arcy J.M. Cain
>>Cc: Mark Cave-Ayland; jdavis-pgsql@empires.org;
>>alvherre@dcc.uchile.cl; pgsql-hackers@postgresql.org
>>Subject: Re: [HACKERS] Much Ado About COUNT(*)
>>
>>
>>D'Arcy J.M. Cain wrote:
>>
>>>On Thu, 20 Jan 2005 10:12:17 -0000
>>>"Mark Cave-Ayland" wrote:
>>>
>>>
>>>>Thanks for the information. I seem to remember something similar to
>>>>this being discussed last year in a similar thread. My only
>>
>>real issue
>>
>>>>I can see with this approach is that the trigger is fired for every
>>>>row, and it is likely that the database I am planning will
>>
>>have large
>>
>>>>inserts of several hundred thousand records. Normally the impact of
>>>>these is minimised by inserting the entire set in one
>>
>>transaction. Is
>>
>>>>there any way that your trigger can be modified to fire once per
>>>>transaction with the number of modified rows as a parameter?
>>>
>>>
>>>I don't believe that such a facility exists but before
>>
>>dismissing it
>>
>>>you should test it out. I think that you will find that disk
>>>buffering (the system's as well as PostgreSQL's) will effectively
>>>handle this for you anyway.
>>
>>Well, it looks like ROW_COUNT isn't set in a statement-level trigger
>>function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame,
>>otherwise
>>it would be easy to handle. It should be possible to expose this
>>information though, since it gets reported at the command conclusion.
>
>
>
> Hi Richard,
>
> This is more the sort of approach I would be looking for. However I think
> even in a transaction with ROW_COUNT defined, the trigger will still be
> called once per insert. I think something like this would require a new
> syntax like below, and some supporting code that would keep track of the
> tables touched by a transaction :(

Well, a statement-level trigger would be called once per statement,
which can be much less than per row.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
From:Alvaro Herrera
Subject:Re: Much Ado About COUNT(*)
Date:Thu, 20 Jan 2005 10:44:40 -0300
On Thu, Jan 20, 2005 at 01:33:10PM -0000, Mark Cave-Ayland wrote:

> I am sure that Jeff's approach will work, however it just seems like writing
> out one table entry per row is going to slow large bulk inserts right down.

I don't see how it is any slower than the approach of inserting one
entry per row in the narrow table the OP wanted to use. And it will be
faster for the scans.

--
Alvaro Herrera ()
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
From:Bruno Wolff III
Subject:Re: Much Ado About COUNT(*)
Date:Wed, 19 Jan 2005 10:16:38 -0600
On Wed, Jan 19, 2005 at 14:59:17 -0000,
Mark Cave-Ayland wrote:
> BEGIN;
> INSERT INTO person (first_name, .... Tel) VALUES ('Fred', ....
> '12345');
> INSERT INTO person_count(id) VALUES (currval('id_seq'));
> COMMIT;
>
>
> So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
> know the current number of person records. How much quicker would a COUNT(*)
> be if visibility were included in the indices as opposed to a "hacked"
> approach like this?

You are only going to get a constant factor speed up unless the space savings
allows much better use of cache. You probably want to look at using
triggers to maintain counts in another table.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
From:Alvaro Herrera
Subject:Re: Much Ado About COUNT(*)
Date:Wed, 19 Jan 2005 17:40:53 -0300
On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote:
> On Wed, Jan 19, 2005 at 14:59:17 -0000,
> Mark Cave-Ayland wrote:
>
> > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
> > know the current number of person records. How much quicker would a COUNT(*)
> > be if visibility were included in the indices as opposed to a "hacked"
> > approach like this?
>
> You are only going to get a constant factor speed up unless the space savings
> allows much better use of cache. You probably want to look at using
> triggers to maintain counts in another table.

I'd try using a "start value" and a differences list. So the
differences list would be initially empty and the start value would be
0. On insert or delete, you create a new difference (with +1 or
whatever). Periodically, the differences would be added to the start
value and the records deleted. Thus the time to calculate the total
count is much smaller, and it follows MVCC rules. Of course there are
lots of minor details not mentioned here.

Not sure if I'd model this with a single table or two.

--
Alvaro Herrera ()
"I would rather have GNU than GNOT." (ccchips, lwn.net/Articles/37595/)

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
   

Copyright © 2006 knowledge-database   -   All rights reserved