knowledge-database (beta)

Current group: pgsql.hackers

Locale agnostic unicode text

Locale agnostic unicode text  
Dawid Kuroczko
 Re: Locale agnostic unicode text  
Dawid Kuroczko
 Re: Locale agnostic unicode text  
Tom Lane
From:Dawid Kuroczko
Subject:Locale agnostic unicode text
Date:Fri, 21 Jan 2005 23:14:55 +0100
------=_Part_1035_12329331.1106345695178
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hello!

One of least liked by me features of PostgreSQL is a need to specify
LC_CTYPE an LC_COLLATE at initdb time. Especially if you intend to
put into DB texts in
different languages (say, Polish, French, German and Russian) and use
functions like lower() or ORDER BY these texts. :)

I guess the need to select these two locales at initdb time is to
avoid problems with corrupted indexes (column first indexed
with lower() function when setlocale('C'), then accessed when
setlocale('ru_RU')... oops. etc.). Probably there are more of
those. :)

To solve this I thought about creating function lower(text, locale), say:
lower ('Name', 'pl_PL.utf8');
Simple enough, I used plperl (plperlu actually) to make it happen and
while doing so I've noticed that perl does unicode-lowercasing/uppercasing on
its own accord, doesn't need locales to it, and does it pretty well.
So the resulting function is:

CREATE DOMAIN unitext text;
CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
utf8::decode($_[0]);
return lc($_[0]);
$$ LANGUAGE plperlu IMMUTABLE;

And it seems to work fine regardless of locale set.


So... I thoght, why not use this unitext to sort texts?
So I've created functions, operators and operator class,
This time setlocale() was needed to get the behaviour
I needed (database initdb'ed to 'C', my order set to 'pl_PL',
or whatever locale I need at given moment).

I've attached a 'draft' of unitext,sql, which should create
unitext datatype capable of sorting according Polish locale.
It does not work as it should and I don't know how to make
it work. For example:

SELECT * FROM uni_tab ORDER BY uni_column;

....sorts according to 'C' (initdb's) locale.

I can force my way by issuing:

SELECT * FROM uni_tab ORDER BY uni_column USING <;

....but I would like to force ORDER BY using operators
provided by me without this 'USING <' clause.

Any hints how to do it?

Regards,
Dawid

PS: I like perl's lc() and uc() behaviour in utf8 mode. I'm thinking about
trying to "port" it from perl source as a C-language extension for PostgreSQL.
What do you think about it?

------=_Part_1035_12329331.1106345695178
Content-Type: text/x-sql; name="unitext.sql"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="unitext.sql"

SET search_path =3D public;
CREATE DOMAIN unitext text;
BEGIN;

CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
utf8::decode($_[0]);
return lc($_[0]);
$$ LANGUAGE plperlu IMMUTABLE;

CREATE OR REPLACE FUNCTION upper(unitext) RETURNS unitext AS $$
utf8::decode($_[0]);
return uc($_[0]);
$$ LANGUAGE plperlu IMMUTABLE;


CREATE OR REPLACE FUNCTION unitext_lt(unitext,unitext) RETURNS boolean AS $$
utf8::decode($_[0]); utf8::decode($_[1]);

use POSIX qw(setlocale LC_ALL);
my $loc =3D setlocale(LC_ALL);
setlocale(LC_ALL, 'pl_PL.utf8');
use locale;

my $ret =3D ($_[0] lt $_[1]) ? 't' : 'f';

setlocale(LC_ALL, $loc);

return $ret;
$$ LANGUAGE plperlu;

CREATE OR REPLACE FUNCTION unitext_le(unitext,unitext) RETURNS boolean AS $$
utf8::decode($_[0]); utf8::decode($_[1]);

use POSIX qw(setlocale LC_ALL);
my $loc =3D setlocale(LC_ALL);
setlocale(LC_ALL, 'pl_PL.utf8');
use locale;

my $ret =3D ($_[0] le $_[1]) ? 't' : 'f';

setlocale(LC_ALL, $loc);

return $ret;
$$ LANGUAGE plperlu STABLE;

CREATE OR REPLACE FUNCTION unitext_gt(unitext,unitext) RETURNS boolean AS $$
utf8::decode($_[0]); utf8::decode($_[1]);

use POSIX qw(setlocale LC_ALL);
my $loc =3D setlocale(LC_ALL);
setlocale(LC_ALL, 'pl_PL.utf8');
use locale;

my $ret =3D ($_[0] gt $_[1]) ? 't' : 'f';

setlocale(LC_ALL, $loc);

return $ret;
$$ LANGUAGE plperlu STABLE;

CREATE OR REPLACE FUNCTION unitext_ge(unitext,unitext) RETURNS boolean AS $$
utf8::decode($_[0]); utf8::decode($_[1]);

use POSIX qw(setlocale LC_ALL);
my $loc =3D setlocale(LC_ALL);
setlocale(LC_ALL, 'pl_PL.utf8');
use locale;

my $ret =3D ($_[0] ge $_[1]) ? 't' : 'f';

setlocale(LC_ALL, $loc);

return $ret;
$$ LANGUAGE plperlu STABLE;

CREATE OR REPLACE FUNCTION unitext_eq(unitext,unitext) RETURNS boolean AS $$
utf8::decode($_[0]); utf8::decode($_[1]);

use POSIX qw(setlocale LC_ALL);
my $loc =3D setlocale(LC_ALL);
setlocale(LC_ALL, 'pl_PL.utf8');
use locale;

my $ret =3D ($_[0] eq $_[1]) ? 't' : 'f';

setlocale(LC_ALL, $loc);

return $ret;
$$ LANGUAGE plperlu STABLE;

CREATE OR REPLACE FUNCTION unitext_ne(unitext,unitext) RETURNS boolean AS $$
utf8::decode($_[0]); utf8::decode($_[1]);

use POSIX qw(setlocale LC_ALL);
my $loc =3D setlocale(LC_ALL);
setlocale(LC_ALL, 'pl_PL.utf8');
use locale;

my $ret =3D ($_[0] ne $_[1]) ? 't' : 'f';

setlocale(LC_ALL, $loc);

return $ret;
$$ LANGUAGE plperlu;

CREATE OR REPLACE FUNCTION unitext_cmp(unitext,unitext) RETURNS integer AS =
$$
utf8::decode($_[0]); utf8::decode($_[1]);

use POSIX qw(setlocale LC_ALL);
my $loc =3D setlocale(LC_ALL);
setlocale(LC_ALL, 'pl_PL.utf8');
use locale;

my $ret =3D $_[0] cmp $_[1];

setlocale(LC_ALL, $loc);

return $ret;
$$ LANGUAGE plperlu;


COMMIT;

CREATE OPERATOR < (
LEFTARG =3D unitext,
RIGHTARG =3D unitext,
-- COMMUTATOR =3D >,
-- NEGATOR =3D >=3D,
PROCEDURE =3D unitext_lt,
RESTRICT =3D scalarltsel,
JOIN =3D scalarltjoinsel
);

CREATE OPERATOR <=3D (
LEFTARG =3D unitext,
RIGHTARG =3D unitext,
-- COMMUTATOR =3D >=3D,
-- NEGATOR =3D >,
PROCEDURE =3D unitext_le,
RESTRICT =3D scalarltsel,
JOIN =3D scalarltjoinsel
);

CREATE OPERATOR >=3D (
LEFTARG =3D unitext,
RIGHTARG =3D unitext,
COMMUTATOR =3D <=3D,
NEGATOR =3D <,
PROCEDURE =3D unitext_ge,
RESTRICT =3D scalargtsel,
JOIN =3D scalargtjoinsel
);

CREATE OPERATOR > (
LEFTARG =3D unitext,
RIGHTARG =3D unitext,
COMMUTATOR =3D <,
NEGATOR =3D <=3D,
PROCEDURE =3D unitext_gt,
RESTRICT =3D scalargtsel,
JOIN =3D scalargtjoinsel
);

CREATE OPERATOR <> (
LEFTARG =3D unitext,
RIGHTARG =3D unitext,
COMMUTATOR =3D <>,
-- NEGATOR =3D =3D,
PROCEDURE =3D unitext_ne,
RESTRICT =3D neqsel,
JOIN =3D neqjoinsel
);
CREATE OPERATOR =3D (
LEFTARG =3D unitext,
RIGHTARG =3D unitext,
COMMUTATOR =3D =3D,
NEGATOR =3D <>,
HASHES, MERGES,
RESTRICT =3D eqsel,
JOIN =3D eqjoinsel,
SORT1 =3D <,
SORT2 =3D <,
LTCMP =3D <,
GTCMP =3D >,
PROCEDURE =3D unitext_eq,
RESTRICT =3D eqsel,
JOIN =3D eqjoinsel
);


CREATE OPERATOR CLASS unitext_ops
DEFAULT FOR TYPE unitext USING btree=20
AS
OPERATOR 1 < ,
OPERATOR 2 <=3D ,
OPERATOR 3 =3D ,
OPERATOR 4 >=3D ,
OPERATOR 5 > ,
FUNCTION 1 unitext_cmp(unitext, unitext);



------=_Part_1035_12329331.1106345695178
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

------=_Part_1035_12329331.1106345695178--
From:Dawid Kuroczko
Subject:Re: Locale agnostic unicode text
Date:Sun, 23 Jan 2005 16:54:22 +0100
On Sat, 22 Jan 2005 17:09:42 -0500, Tom Lane wrote:
> > This time setlocale() was needed to get the behaviour
> > I needed (database initdb'ed to 'C', my order set to 'pl_PL',
> > or whatever locale I need at given moment).
> I would imagine that the performance is spectacularly awful :-(.
> Have you benchmarked it? A large sort on a unitext column,
> for instance, would be revealing.

True. Yet it would be still better than nothing ("C"). Actually
I was thinking that maybe functional indexes could be
used to boost the speed (at least for ordering).

> > ...but I would like to force ORDER BY using operators
> > provided by me without this 'USING <' clause.
> Hmm, the existence of the default btree operator class should be
> sufficient.

If You (or anyone) could try that SQL file and try to find
missing clause... :)

I guess that the case is that DOMAIN unitext is not quite
another type, so text's default operators sometimes take
precedence over unitext's own. :)

> > CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
> > utf8::decode($_[0]);
> > return lc($_[0]);
> > $$ LANGUAGE plperlu IMMUTABLE;
>
> AFAIK upper/lower cannot be considered to be locale-independent
> (see Turkish I/i business for a counterexample).

I imagine it is not possible to make 'one size fits all' lower(),
yet perl's uc()/lc() in my opinion for some cases is still
better than choosing one locale or using "C" locale.

Regards,
Dawid

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq
From:Tom Lane
Subject:Re: Locale agnostic unicode text
Date:Sat, 22 Jan 2005 17:09:42 -0500
Dawid Kuroczko writes:
> So... I thoght, why not use this unitext to sort texts?
> So I've created functions, operators and operator class,
> This time setlocale() was needed to get the behaviour
> I needed (database initdb'ed to 'C', my order set to 'pl_PL',
> or whatever locale I need at given moment).

I would imagine that the performance is spectacularly awful :-(.
Have you benchmarked it? A large sort on a unitext column,
for instance, would be revealing.

> ...but I would like to force ORDER BY using operators
> provided by me without this 'USING <' clause.

Hmm, the existence of the default btree operator class should be
sufficient.

> CREATE OR REPLACE FUNCTION lower(unitext) RETURNS unitext AS $$
> utf8::decode($_[0]);
> return lc($_[0]);
> $$ LANGUAGE plperlu IMMUTABLE;

AFAIK upper/lower cannot be considered to be locale-independent
(see Turkish I/i business for a counterexample).

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
   

Copyright © 2006 knowledge-database   -   All rights reserved