 | Paul,
While I agree with you, that a hash solution is much the best, I cannot agree that it can be done in one pass. I tested your code with
data t1 ; input Policy $ Tel_no $ Address$; cards; P1 T1 A1 P2 T1 A2 P3 T2 A2 P4 T2 A3 P5 T3 A3 P6 T4 A4 P7 T4 A1 run;
Note that the correct ID for P6 is not known until P7 is read. Here is your output showing P6 ad P7 have the same telephone, but not the same ID.
Obs Policy Tel_no Address id 1 P1 T1 A1 1 2 P2 T1 A2 1 3 P3 T2 A2 2 4 P4 T2 A3 2 5 P5 T3 A3 3 6 P6 T4 A4 4 7 P7 T4 A1 1
Your code is easily fixed by updating the hash tables as you get new information with the minimum possible ID, but then a second pass in the same step is needed to write the records.
Now for a practical test of whether hashing yields the best answer, consider that no non-hashing solution presented in this discussion works, and it cannot be fixed without completely scrapping the plan. On the other hand, any hashing solution presented either works or can be easily fixed. That seems to be a better test of technique than the somewhat arbitrary - cleverness criterion.
As for SQL solutions, I do not believe that one can do it without much looping. I did give an SQL/macro solution to a similar problem about ten years ago on SAS-L and in a Coder's corner paper for SESUG.
I feel sufficiently secure in the difficulty of the problem to offer the challenge to anyone interested to find any solution that does not involve hashing. Of course, one could use a linear lookup in an array instead of a hash lookup and get pretty much the same kind of code as the hash lookup at a much slower speed and more complexity because the lookup is via a loop in the consumer's code instead of hidden magic in the hash FIND function.
Ian Whitlock ======================= Date: Sun, 16 Jan 2005 20:19:21 -0500 Reply-To: sashole@bellsouth.net Sender: "SAS(r) Discussion" From: "Paul M. Dorfman" Organization: Sashole of Florida Subject: Re: Help on SAS Programming Comments: To: Alan Churchill In-Reply-To: Content-Type: text/plain; charset="us-ascii" Alan, New times not only deliver inexpensive disk space and faster controllers, but also new SAS features and vast cheap memory. Because of the latter two, this "problem" can be addressed in a single pass without any need to update the hardware, unless the memory on the machine in question is uncapable of swallowing the phones and addresses from the whole file: data id (drop = _:) ; if _n_ = 1 then do ; if 0 then set t1 ; dcl hash a(); a.definekey('tel_no' ); a.definedata('id'); a.definedone(); dcl hash t(); t.definekey('address'); t.definedata('id'); t.definedone(); end ; set t1 ; if a.find() = 0 or t.find() = 0 then return ; _new_id ++ 1 ; id = _new_id ; a.replace() ; t.replace() ; run ; Before V9, code based on this idea would probably fail your anti-cleverness criterion (with the validity of which I, sinful as I am, agree), but the above ought to pass it, for it does not get any simpler than this: 1. read the next record 2. look at the phone table: if the phone is there, dump ID from the table into the ID host variable [.find() does it automatically if the search is successful] 3. repeat #2 for the address table 4. otherwise if neither table already contains its key from the current record, increment ID and stick it into both tables 5. output the record 6. go to #1 Methinks (in fact, I assert) that for this class of problems, using a dynamic data dictionary (implemented above in the form of built-in hash/AVL combo) in a single pass is actually much simpler both conceptually and implementation- wise than any multi-pass approach based on comparing consecutive keys. Here it is exacerbated by the necessity of checking each record against both phone and address, so there will be at least two sorts involved, and even then the identifying algorithm is immediately transparent. I confess that even without the availability of the canned hash object, I would hand-code hash, rather than waste time thinking of such a scheme, especially if the comparison criteria involved more than two keys. Kind regards, ---------------- Paul M. Dorfman Jacksonville, FL ---------------- > -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > Behalf Of Alan Churchill > Sent: Sunday, January 16, 2005 3:27 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Help on SAS Programming > > Toby, > > I agree with you on this issue and would also say that this > should be viewed as a multipass type of problem. I think too > often cleverness makes for no maintainability and sacrifices > time on development and maintenance. Since a development > machine is so inexpensive nowadays and I can get a TB of fast > disk space for under $1,000 with a lightening quick RAID > controller, not sure what a single pass buys. It did in the > day but time is worth a lot for a person versus the > inexpensive machine parts. > > Another option would be to look at data cleansing software > which have highly optimized algorithms for doing this sort of > stuff. SAS sells DataFlux as an example. Not sure of the > other tools out there so I would google data cleansing. > > Thanks, > Alan > Savian > "Bridging SAS and Microsoft Technologies" > ? > > nntp://news.qwest.net/comp.soft-sys.sas/ > Zhu, et. all...... > > Using a retain statement or lag function for this given > problem is just bad design. If one could say the data is > perfect or even well behaved all the time then using a lag > function would be okay. But since data is rarely if ever well > behaved or perfect we cannot assume it will be. To that > extent any solution which does not some how encompass looking > at every value of telephone number or address to determine > which family it should go too will result in code that will > work some times and wont at other times. > > Consider the case where the the seventh observation is > > P7 T1 A7 > > Your code would not correctly put it in family F1. > > Here the lag function breaks, however a retain statement > possibly could be used but would require considerable more > code and system resources. > > A hash function would probrably be the bet and possibly a SQL > solution would do the trick. > > > Toby Dunn > > > > > From: Jianping Zhu > Reply-To: Jianping Zhu > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Help on SAS Programming > Date: Sun, 16 Jan 2005 14:35:35 -0500 > data t1; > input Policy $ Tel_no $ Address$; > cards; > P1 T1 A1 > P2 T1 A2 > P3 T2 A2 > P4 T2 A3 > P5 T3 A3 > P6 T4 A4 > run; > data t2; > set t1; > retain family 0; > if (lag(Tel_no) ne Tel_no) and (lag(Address) ne Address) then > family=family+1; run; > > > ---- Original message ---- > >Date: Sat, 15 Jan 2005 09:23:10 -0800 > >From: hee_haa@YAHOO.COM > >Subject: Help on SAS Programming > >To: SAS-L@LISTSERV.UGA.EDU > > > >Hi All, > > > >I have a little problem in assigning a family ID for a > group of data. > >According to my database, I would like to assign a unique > family_ID for > >data with either same tel_no or same address. > > > >For example, data as follow: > >Policy Tel_no Address > >P1 T1 A1 > >P2 T1 A2 > >P3 T2 A2 > >P4 T2 A3 > >P5 T3 A3 > >P6 T4 A4 > > > >I would like to identify P1 to P5 are actually from the > same family. So > >the output as follow: > > > >Policy Tel_no Address Family_ID > >P1 T1 A1 F1 > >P2 T1 A2 F1 > >P3 T2 A2 F1 > >P4 T2 A3 F1 > >P5 T3 A3 F1 > >P6 T4 A4 F2 > >Would anyone kindly help me on this ...Thanks a lot!!! > > > >holly > > [comp.soft-sys.sas] >
--------------------------------------------------------------------------- ----- Back to: Top of message | Previous page | Main SAS-L page --------------------------------------------------------------------------- ----- listserv.uga.edu Enterprise Information Technology Services The University of Georgia listhelp@uga.edu
|
|