|
|
 | | From: | Mark Cave-Ayland | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Wed, 19 Jan 2005 14:59:17 -0000 |
|
|
 | > Date: Wed, 12 Jan 2005 18:45:09 -0800 > From: Jeff Davis > To: Alvaro Herrera > Cc: pgsql-hackers@postgresql.org > Subject: Re: Much Ado About COUNT(*) > Message-ID: <1105584309.2886.410.camel@jeff>
(cut)
> Thanks for the link. It looks like it breaks it up into chunks of about 2KB. I think the > conversation was mostly assuming the tables were somewhat closer to the size of an > index. If you have more than 2KB per tuple, pretty much anything you do with an index > would be faster I would think.
Hi Jeff/Alvaro,
I'm considering an application at the moment whereby I would need to do lots of COUNT(*) on lots of separate tables without a WHERE clause. Would something like the following help speed up the COUNT(*) by reducing the tuple size being used for the count?
CREATE SEQUENCE id_seq;
CREATE TABLE person_count ( id int8 );
CREATE TABLE person ( id int8 DEFAULT nextval('id_seq'); first_name text, surname text, age int, address1 text, address2 text, address3 text, address4 text, postcode text tel text );
For each insert:
BEGIN; INSERT INTO person (first_name, .... Tel) VALUES ('Fred', .... '12345'); INSERT INTO person_count(id) VALUES (currval('id_seq')); COMMIT;
So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to know the current number of person records. How much quicker would a COUNT(*) be if visibility were included in the indices as opposed to a "hacked" approach like this?
Many thanks,
Mark.
------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT
T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend
|
|
 | | From: | Jeff Davis | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Wed, 19 Jan 2005 13:32:47 -0800 |
|
|
 | To fill in some details I think what he's saying is this:
=> create table foo(...); => create table foo_count(num int); => insert into foo_count values(0); => create table foo_change(num int);
then create a trigger "after delete on foo" that does "insert into foo_change values(-1)" and a trigger "after insert on foo" that inserts a +1 into foo_change.
Periodically, do: => begin; => set transaction isolation level serializable; => update foo_count set num=num+(select sum(num) from foo_change); => delete from foo_change; => commit; => VACUUM;
And then any time you need the correct count(*) value, do instead: => select sum(num) from (select num from foo_count union select num from foo_change);
And that should work. I haven't tested this exact example, so I may have overlooked something.
Hope that helps. That way, you don't have huge waste from the second table, and also triggers maintain it for you and you don't need to think about it.
Regards, Jeff Davis
On Wed, 2005-01-19 at 17:40 -0300, Alvaro Herrera wrote: > On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote: > > On Wed, Jan 19, 2005 at 14:59:17 -0000, > > Mark Cave-Ayland wrote: > > > > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to > > > know the current number of person records. How much quicker would a COUNT(*) > > > be if visibility were included in the indices as opposed to a "hacked" > > > approach like this? > > > > You are only going to get a constant factor speed up unless the space savings > > allows much better use of cache. You probably want to look at using > > triggers to maintain counts in another table. > > I'd try using a "start value" and a differences list. So the > differences list would be initially empty and the start value would be > 0. On insert or delete, you create a new difference (with +1 or > whatever). Periodically, the differences would be added to the start > value and the records deleted. Thus the time to calculate the total > count is much smaller, and it follows MVCC rules. Of course there are > lots of minor details not mentioned here. > > Not sure if I'd model this with a single table or two. >
---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster
|
|
 | | From: | Mark Cave-Ayland | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Thu, 20 Jan 2005 10:12:17 -0000 |
|
|
 | > -----Original Message----- > From: Jeff Davis [mailto:jdavis-pgsql@empires.org] > Sent: 19 January 2005 21:33 > To: Alvaro Herrera > Cc: Mark Cave-Ayland; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Much Ado About COUNT(*) > > > > To fill in some details I think what he's saying is this: > > => create table foo(...); > => create table foo_count(num int); > => insert into foo_count values(0); > => create table foo_change(num int); > > then create a trigger "after delete on foo" that does "insert > into foo_change values(-1)" and a trigger "after insert on > foo" that inserts a +1 into foo_change. > > Periodically, do: > => begin; > => set transaction isolation level serializable; > => update foo_count set num=num+(select sum(num) from > foo_change); => delete from foo_change; => commit; => VACUUM; > > And then any time you need the correct count(*) value, do > instead: => select sum(num) from (select num from foo_count > union select num from foo_change); > > And that should work. I haven't tested this exact example, so > I may have overlooked something. > > Hope that helps. That way, you don't have huge waste from the > second table, and also triggers maintain it for you and you > don't need to think about it. > > Regards, > Jeff Davis
Hi Jeff,
Thanks for the information. I seem to remember something similar to this being discussed last year in a similar thread. My only real issue I can see with this approach is that the trigger is fired for every row, and it is likely that the database I am planning will have large inserts of several hundred thousand records. Normally the impact of these is minimised by inserting the entire set in one transaction. Is there any way that your trigger can be modified to fire once per transaction with the number of modified rows as a parameter?
Many thanks,
Mark.
------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT
T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings
|
|
 | | From: | D'Arcy J.M. Cain | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Thu, 20 Jan 2005 06:55:00 -0500 |
|
|
 | On Thu, 20 Jan 2005 10:12:17 -0000 "Mark Cave-Ayland" wrote: > Thanks for the information. I seem to remember something similar to > this being discussed last year in a similar thread. My only real issue > I can see with this approach is that the trigger is fired for every > row, and it is likely that the database I am planning will have large > inserts of several hundred thousand records. Normally the impact of > these is minimised by inserting the entire set in one transaction. Is > there any way that your trigger can be modified to fire once per > transaction with the number of modified rows as a parameter?
I don't believe that such a facility exists but before dismissing it you should test it out. I think that you will find that disk buffering (the system's as well as PostgreSQL's) will effectively handle this for you anyway.
-- D'Arcy J.M. Cain | Democracy is three wolves http://www.druid.net/darcy/ | and a sheep voting on +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
|
|
 | | From: | Richard Huxton | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Thu, 20 Jan 2005 12:44:36 +0000 |
|
|
 | D'Arcy J.M. Cain wrote: > On Thu, 20 Jan 2005 10:12:17 -0000 > "Mark Cave-Ayland" wrote: > >>Thanks for the information. I seem to remember something similar to >>this being discussed last year in a similar thread. My only real issue >>I can see with this approach is that the trigger is fired for every >>row, and it is likely that the database I am planning will have large >>inserts of several hundred thousand records. Normally the impact of >>these is minimised by inserting the entire set in one transaction. Is >>there any way that your trigger can be modified to fire once per >>transaction with the number of modified rows as a parameter? > > > I don't believe that such a facility exists but before dismissing it you > should test it out. I think that you will find that disk buffering (the > system's as well as PostgreSQL's) will effectively handle this for you > anyway.
Well, it looks like ROW_COUNT isn't set in a statement-level trigger function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame, otherwise it would be easy to handle. It should be possible to expose this information though, since it gets reported at the command conclusion.
-- Richard Huxton Archonet Ltd
-- stmt_trig_test.sql -- BEGIN;
CREATE TABLE trigtest ( a int4 NOT NULL, b text, PRIMARY KEY (a) );
CREATE FUNCTION tt_test_fn() RETURNS TRIGGER AS ' DECLARE nr integer; ro integer; nr2 integer; BEGIN GET DIAGNOSTICS nr = ROW_COUNT; GET DIAGNOSTICS ro = RESULT_OID; SELECT count(*) INTO nr2 FROM trigtest;
RAISE NOTICE ''nr = % / ro = % / nr2 = %'',nr,ro,nr2;
RETURN NULL; END; ' LANGUAGE plpgsql;
CREATE TRIGGER tt_test AFTER INSERT OR UPDATE ON trigtest FOR EACH STATEMENT EXECUTE PROCEDURE tt_test_fn();
INSERT INTO trigtest VALUES (1,'a'); INSERT INTO trigtest VALUES (2,'b'); UPDATE trigtest SET b = 'x';
ROLLBACK;
---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings
|
|
 | | From: | Mark Cave-Ayland | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Thu, 20 Jan 2005 13:33:10 -0000 |
|
|
 |
> -----Original Message----- > From: Richard Huxton [mailto:dev@archonet.com] > Sent: 20 January 2005 12:45 > To: D'Arcy J.M. Cain > Cc: Mark Cave-Ayland; jdavis-pgsql@empires.org; > alvherre@dcc.uchile.cl; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Much Ado About COUNT(*) > > > D'Arcy J.M. Cain wrote: > > On Thu, 20 Jan 2005 10:12:17 -0000 > > "Mark Cave-Ayland" wrote: > > > >>Thanks for the information. I seem to remember something similar to > >>this being discussed last year in a similar thread. My only > real issue > >>I can see with this approach is that the trigger is fired for every > >>row, and it is likely that the database I am planning will > have large > >>inserts of several hundred thousand records. Normally the impact of > >>these is minimised by inserting the entire set in one > transaction. Is > >>there any way that your trigger can be modified to fire once per > >>transaction with the number of modified rows as a parameter? > > > > > > I don't believe that such a facility exists but before > dismissing it > > you should test it out. I think that you will find that disk > > buffering (the system's as well as PostgreSQL's) will effectively > > handle this for you anyway. > > Well, it looks like ROW_COUNT isn't set in a statement-level trigger > function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame, > otherwise > it would be easy to handle. It should be possible to expose this > information though, since it gets reported at the command conclusion.
Hi Richard,
This is more the sort of approach I would be looking for. However I think even in a transaction with ROW_COUNT defined, the trigger will still be called once per insert. I think something like this would require a new syntax like below, and some supporting code that would keep track of the tables touched by a transaction :(
CREATE TRIGGER tt_test AFTER TRANSACTION ON trigtest FOR EACH TRANSACTION EXECUTE PROCEDURE tt_test_fn();
I am sure that Jeff's approach will work, however it just seems like writing out one table entry per row is going to slow large bulk inserts right down.
Kind regards,
Mark.
------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT
T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk
---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster
|
|
 | | From: | Richard Huxton | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Thu, 20 Jan 2005 15:09:24 +0000 |
|
|
 | Mark Cave-Ayland wrote: > > > >>-----Original Message----- >>From: Richard Huxton [mailto:dev@archonet.com] >>Sent: 20 January 2005 12:45 >>To: D'Arcy J.M. Cain >>Cc: Mark Cave-Ayland; jdavis-pgsql@empires.org; >>alvherre@dcc.uchile.cl; pgsql-hackers@postgresql.org >>Subject: Re: [HACKERS] Much Ado About COUNT(*) >> >> >>D'Arcy J.M. Cain wrote: >> >>>On Thu, 20 Jan 2005 10:12:17 -0000 >>>"Mark Cave-Ayland" wrote: >>> >>> >>>>Thanks for the information. I seem to remember something similar to >>>>this being discussed last year in a similar thread. My only >> >>real issue >> >>>>I can see with this approach is that the trigger is fired for every >>>>row, and it is likely that the database I am planning will >> >>have large >> >>>>inserts of several hundred thousand records. Normally the impact of >>>>these is minimised by inserting the entire set in one >> >>transaction. Is >> >>>>there any way that your trigger can be modified to fire once per >>>>transaction with the number of modified rows as a parameter? >>> >>> >>>I don't believe that such a facility exists but before >> >>dismissing it >> >>>you should test it out. I think that you will find that disk >>>buffering (the system's as well as PostgreSQL's) will effectively >>>handle this for you anyway. >> >>Well, it looks like ROW_COUNT isn't set in a statement-level trigger >>function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame, >>otherwise >>it would be easy to handle. It should be possible to expose this >>information though, since it gets reported at the command conclusion. > > > > Hi Richard, > > This is more the sort of approach I would be looking for. However I think > even in a transaction with ROW_COUNT defined, the trigger will still be > called once per insert. I think something like this would require a new > syntax like below, and some supporting code that would keep track of the > tables touched by a transaction :(
Well, a statement-level trigger would be called once per statement, which can be much less than per row.
-- Richard Huxton Archonet Ltd
---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend
|
|
 | | From: | Alvaro Herrera | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Thu, 20 Jan 2005 10:44:40 -0300 |
|
|
 | On Thu, Jan 20, 2005 at 01:33:10PM -0000, Mark Cave-Ayland wrote:
> I am sure that Jeff's approach will work, however it just seems like writing > out one table entry per row is going to slow large bulk inserts right down.
I don't see how it is any slower than the approach of inserting one entry per row in the narrow table the OP wanted to use. And it will be faster for the scans.
-- Alvaro Herrera () Officer Krupke, what are we to do? Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")
---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
|
|
 | | From: | Bruno Wolff III | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Wed, 19 Jan 2005 10:16:38 -0600 |
|
|
 | On Wed, Jan 19, 2005 at 14:59:17 -0000, Mark Cave-Ayland wrote: > BEGIN; > INSERT INTO person (first_name, .... Tel) VALUES ('Fred', .... > '12345'); > INSERT INTO person_count(id) VALUES (currval('id_seq')); > COMMIT; > > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to > know the current number of person records. How much quicker would a COUNT(*) > be if visibility were included in the indices as opposed to a "hacked" > approach like this?
You are only going to get a constant factor speed up unless the space savings allows much better use of cache. You probably want to look at using triggers to maintain counts in another table.
---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
|
|
 | | From: | Alvaro Herrera | | Subject: | Re: Much Ado About COUNT(*) | | Date: | Wed, 19 Jan 2005 17:40:53 -0300 |
|
|
 | On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote: > On Wed, Jan 19, 2005 at 14:59:17 -0000, > Mark Cave-Ayland wrote: > > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to > > know the current number of person records. How much quicker would a COUNT(*) > > be if visibility were included in the indices as opposed to a "hacked" > > approach like this? > > You are only going to get a constant factor speed up unless the space savings > allows much better use of cache. You probably want to look at using > triggers to maintain counts in another table.
I'd try using a "start value" and a differences list. So the differences list would be initially empty and the start value would be 0. On insert or delete, you create a new difference (with +1 or whatever). Periodically, the differences would be added to the start value and the records deleted. Thus the time to calculate the total count is much smaller, and it follows MVCC rules. Of course there are lots of minor details not mentioned here.
Not sure if I'd model this with a single table or two.
-- Alvaro Herrera () "I would rather have GNU than GNOT." (ccchips, lwn.net/Articles/37595/)
---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend
|
|
|