* In response to posting . * Date: Tue, 20 Jul 2004 19:47:15 -0700 . * Reply-To: sw wang . * Sender: "SPSSX(r) Discussion" . * From: sw wang . * Subject: data manipulation question . * "There are two variables in my data file, one is GROUP and the . * other is ID. The value of ID is supposed to be unique in my . * final result, which means you cannot find same value of ID in . * two GROUPs. If there is any value in two or more groups, we . * will merge them together and create a new group number. This . * process will keep doing until there is no same ID in different . * groups." . * Or, restated by WRR (I hope correctly): . * "If the same ID occurs in two groups, the two are to be . * combined to form a single group. (The duplicate IDs may then . * be eliminated from the combined groups, but that's a detail.)" . * Data input section. Probably do not post to the list. . * ---------------------------------------------------- . FILE HANDLE TEST_INP /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-TEST_INP.sav'. FILE HANDLE DESIRED /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-DESIRED.sav'. PRESERVE. SET MXWARNS=0. DATA LIST LIST SKIP=1 /GROUP (F2) ID (A1) REMARKS(A35). BEGIN DATA. GROUP ID REMARKS 1 A 'From sw wang' 1 B 2 A 2 N 3 D 3 E 4 E 4 M 5 F 5 G 6 C 6 N 10 P 'Added by W.R.Ristow' 10 Q 10 R 10 S 12 Q 'Direct to 10' 12 R 'Direct to 10' 14 S 'Direct to 10' 14 T 16 T 'To 14, then 10' 13 U 13 V 15 W 17 V 'Direct to 13' 19 W 'Direct to 13' 20 Z 21 Z 21 Y 22 Y 22 X 23 X 23 A 'All the way back to 1' 40 T 'Many direct pairings' 41 T 42 T 43 T 44 T 45 T 50 N 'More many direct pairings' 51 N 52 N 53 N 54 N 55 N 60 G 'Long iterative chain' 60 H 61 H 61 I 62 I 62 J 63 J 63 K 64 K 65 M END DATA. RESTORE. /*-- LIST /*-*/. SAVE OUTFILE=TEST_INP. DATA LIST LIST SKIP=1 /GROUP (F2) ID (A1). BEGIN DATA. Group id 1 A 1 B 1 C 1 N 2 D 2 E 2 M 3 F 3 G END DATA. SAVE OUTFILE=DESIRED. * Test contents of "DESIRED" . GET FILE=DESIRED. /*-- LIST /*-*/. * File handles for output and scratch files . FILE HANDLE SORT_INP /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-SORT_INP.sav'. FILE HANDLE DRCT_PR /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-DRCT_PR.sav'. FILE HANDLE COMB_PR /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-PR_1REC.sav'. FILE HANDLE ALL_PR /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-ALL_PR.sav'. FILE HANDLE IMPL_PR /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-IMPL_PR.sav'. * Begin code and output for posting . * ---------------------------------------------------- . * File handles: . * Files containing input or other data . * TEST_INP Test data, from the posting . * DESIRED Desired output, from the posting . * Scratch and output files . * SORT_INP Test data, sorted by GROUP and ID. . * DRCT_PR "Direct" pairs, i.e. pairs of groups that share . * at least one member . * PR_1REC KLUDGE: for each existing group, all HIGHER- . * NUMBERED groups with which it will be combined, . * in a single SPSS rcord. . * ALL_PR Work dataset, adding pairs one pass at a time, . * eventually holding all pairs of groups that will . * be merged into a single group . * IMPL_PR Pairs implied by the current set of known pairs . * plus the set of direct pairs, under the rule . * if A pairs with B, and B pairs with C, then . * A pairs with C. . * I. Load and print the test data . GET FILE=TEST_INP. /**/ LIST /*-*/. * II. Identify "direct pairs", i.e. groups that share a member . * A. Group data by member, rather than group. Within each . * member, sort by ascending group number, to list the . * the lower-numbered group first, in each pair sharing a . * common member. . SORT CASES BY ID GROUP. MATCH FILES /FILE=*/BY ID/FIRST=ID_FRST. * B. The "Low" member of each pair sharing a common ID is the . * "High" member of the preceding pair. The first member of . * the set for a single member does not produce a pair. . * CHANGE LOGIC 02 Aug 2004: pair each member with the first,. * lowest-numbered, group in a set sharing a member, rather . * than with the preceding member of the set. Either is . * correct and should work, but the change makes the records . * in PR1_REC, III.A.1 below, "wider", i.e. using more . * elements of vector HI_GRP. In return, it takes fewer . * iterations for transitive closure (III.B & C) . NUMERIC LO_GROUP HI_GROUP (F2). * Old code: Pair each group with its immediate predecessor . *--DO IF (ID_FRST). *--. COMPUTE LO_GROUP = $SYSMIS. *--. COMPUTE HI_GROUP = GROUP. *--ELSE. *--. COMPUTE LO_GROUP = LAG(HI_GROUP). *--. COMPUTE HI_GROUP = GROUP. *--END IF. * New code: Pair each group with its earliest predecessor . * (02 Aug 2004. Hmmmm. Much simpler code, too.) . . LEAVE LO_GROUP. . IF (ID_FRST) LO_GROUP = GROUP. . COMPUTE HI_GROUP = GROUP. /**/ LIST ID ID_FRST LO_GROUP HI_GROUP GROUP ID /*-*/. * C. Drop non-pairs, i.e. the first record (which may be the . * only record) for each ID. . SELECT IF (ID_FRST = 0). * D. Put back in order by group numbers (important!). . SORT CASES BY LO_GROUP HI_GROUP ID. /**/ LIST LO_GROUP HI_GROUP ID /*-*/. * E. Two groups may share several IDs, and so be recorded as . * several pairs. Keep only one instance of each pair. . MATCH FILES /FILE=*/BY LO_GROUP HI_GROUP/FIRST=NEW_PAIR. SELECT IF (NEW_PAIR = 1). /**/ LIST LO_GROUP HI_GROUP /*-*/. * F. Save the file of "direct pairs", keeping only the . * variables characterizing the pairs. . SAVE OUTFILE=DRCT_PR /KEEP = LO_GROUP HI_GROUP. * III. "Transitive closure": if groups i and j are combined, and . * j and k are combined, i and k are combined even if they . * have no common member. . * THIS IS THE MOST COMPLICATED AND DIFFICULT PART. . * IMPLEMENTATION NOTES: . * B.: Using From the set of all pairs found to date, add . * the pairs implied by those and the "direct pairs". This . * is a many-to-many join, which may be expressed naturally . * in SQL, something like, . * CREATE TABLE NEW_PAIR . * AS SELECT /* The new pair consists of*/. * prv.LO_GROUP as LO_GROUP, /* left pair's LO group */. * new.HI_GROUP as HI_GROUP, /* & right pair's HI group */. * FROM * ALL_PR as prv, /* The last step's output */. * DRCT_PRJ as new /* The "direct pairs" */. * WHERE * prv.HI_GROUP /* Left pair's HI group */. * =new.LO_GROUP /* same as right pair's LO */. * ORDER BY /* pair dataset order */. * LO_GROUP, HI_GROUP /* for the later merge */. * . * SPSS doesn't have SQL or any other "many-to-many" merge . * facility. To simulate it, use a kludge: file PR_1REC, in . * which all "direct pairs" with the same LO_GROUP are a . * *single* record, containing the HI_GROUP values of all . * these pairs. (See A.1, which creates PR_1REC.) . * WARNING: This means setting a limit on the number of . * other groups a group can be "directly paired" with. . * That limit can be changed; I'm using 50 to illustrate. . * . * C. Repeat the above until no new pairs are found. This is . * best done with some sort of test for "number of pairs did . * not change". No such test is available in SPSS syntax or . * macros (it is in SAS macros). It probably could be done . * with an SPSS script to drive the syntax and make the test.. * I'm working around it by simply doing 10 iterations. . * (The limit's in a macro call, and can readily be changed.). * WARNING: This will take some unnecessary time in most . * cases, and may fail to find all pairs in others. There's . * a "SHOW N" at the end of each step. If it's the same for . * the last two steps, all pairs have been found. . * A. Set up files to prepare for iteration . * 1. Preparation: Create a file in which all pairs (up to 50) . * for each LO_GROUP are in a single record . MATCH FILES /FILE=DRCT_PR /BY LO_GROUP /FIRST=NEW_LO /LAST =LO_END. NUMERIC NUMPAIRS (F3). VAR LABEL NUMPAIRS 'No. of higher-numbered groups LO_GROUP is directly paired with'. LEAVE NUMPAIRS. VECTOR HI_GRP(50,F2). LEAVE HI_GRP1 TO HI_GRP50. DO IF NEW_LO. . COMPUTE NUMPAIRS = 0. . LOOP #I = 1 TO 50. . COMPUTE HI_GRP(#I) = $SYSMIS. . END LOOP. END IF. COMPUTE NUMPAIRS = NUMPAIRS + 1. IF (NUMPAIRS LE 50) HI_GRP(NUMPAIRS) = HI_GROUP. SELECT IF (LO_END = 1). SAVE OUTFILE=PR_1REC/DROP=NEW_LO LO_END HI_GROUP. /**/ LIST LO_GROUP NUMPAIRS HI_GRP1 TO HI_GRP6 /*-*/. * 2. Preparation: Initialize -- the "pairs known so far" are . * the "direct pairs". Make this set the working file AND . * save it to disk; both copies are needed. . GET FILE=DRCT_PR. SAVE OUTFILE=ALL_PR. SHOW N. * B. Single transitive-closure step: . * NOTE THAT THIS IS WRAPPED IN MACRO "TC_Step" . DEFINE !TC_Step() * 1. First portion of step: Find all pairs implied by the . * current known pairs and the "direct pairs". . * (See IMPLEMENTATION NOTES, above.) . SORT CASES BY HI_GROUP LO_GROUP. MATCH FILES /FILE=* /RENAME=(HI_GROUP=BRIDGE) /TABLE=PR_1REC/RENAME=(LO_GROUP=BRIDGE) /BY BRIDGE. VECTOR HI_GRP = HI_GRP1 TO HI_GRP50. * If there's no match with the direct pairs, there are no . * output records to write. . * And you're limited to 50 output records, no matter how many . * groups did match the BRIDGE group. . RECODE NUMPAIRS (MISSING = 0) (50 THRU HI = 50). * NOTE: Output file "IMPL_PR" will not, in general, be sorted. . NUMERIC HI_GROUP (F3). LOOP #OUT_REC = 1 TO NUMPAIRS. . COMPUTE HI_GROUP = HI_GRP(#OUT_REC). . XSAVE OUTFILE = IMPL_PR/KEEP=LO_GROUP HI_GROUP. END LOOP. EXECUTE. * 2. Second portion of step: Combine the newly-found implied . * pairs with the old list of known pairs. Drop duplicates, . * save, and report the new count of known pairs. . GET FILE=IMPL_PR. SORT CASES BY LO_GROUP HI_GROUP. ADD FILES /FILE=ALL_PR /FILE=* /BY LO_GROUP HI_GROUP /FIRST=KEEP_IT. SELECT IF (KEEP_IT). * Get rid of KEEP_IT; the name must be free for the next . * iteration. . ADD FILES /FILE=* /KEEP = LO_GROUP HI_GROUP. SAVE OUTFILE=ALL_PR. SHOW N. !ENDDEFINE. DISPLAY MACROS. *. * End of macro TC_Step, which encompasses all the working . * code of step III.B. . * C. Repeat the steps to get the complete transitive closure. . * Since no test for "completion" is available, arbitrarily . * repeat 10 times. (Change that limit in the macro call as . * desired.) If "SHOW N" is the same on any two successive . * iteractions, all pairs have been found. . * (See IMPLEMENTATION NOTES, above.) . DEFINE !MacRept ( Repeat =!NOEXPAND !TOKENS(1) /Times =!DEFAULT(1) !TOKENS(1)) !DO !COUNT = 1!TO !Times * Repetition !COUNT . !Repeat. !DOEND !ENDDEFINE. PRESERVE. /*-- SET MPRINT ON /*-*/. !MacRept Repeat=!TC_Step Times=10 RESTORE. /**/ LIST /*-*/. * D. Pair each group only with the LOWEST group it is . * paired with. . AGGREGATE OUTFILE=* /BREAK=HI_GROUP /LO_GROUP = MIN(LO_GROUP). SORT CASES BY LO_GROUP HI_GROUP. SAVE OUTFILE=ALL_PR/KEEP=LO_GROUP HI_GROUP. /**/ LIST LO_GROUP HI_GROUP /*-*/. * IV. From the complete list of pairings, complete the job. . * A. Assign group members to their new groups. Re-number the . * combined groups, starting from 1. . * 1. Sort input by group and ID -- it was not necessarily that . * way in the beginning. . GET FILE=TEST_INP. /* Remember this one from the beginning? */ SORT CASES BY GROUP ID. SAVE OUTFILE=SORT_INP. * 2. The file of pairs, re-ordered by higher-numbered group . * (to be re-assigned), rather than lower-numbered (which . * are the reassignment destingations). . GET FILE=ALL_PR /RENAME=(LO_GROUP HI_GROUP =NEW_GRP OLD_GRP) /KEEP =OLD_GRP NEW_GRP. * This variable isn't useful now, but it will be in the next . * step. . NUMERIC GROUP(F2). COMPUTE GROUP=0. SORT CASES BY OLD_GRP. * 3. Assign members to their new groups, based on the pairings,. * and compute the new numbers. . MATCH FILES /FILE=SORT_INP /RENAME=(GROUP=OLD_GRP) /TABLE=* /BY OLD_GRP /KEEP = GROUP ID NEW_GRP OLD_GRP. * Unpaired groups keep their old numbers . IF (MISSING(NEW_GRP)) NEW_GRP = OLD_GRP. * Put members into the new order . SORT CASES BY NEW_GRP ID OLD_GRP. * Assign the new group numbers . DO IF ($CASENUM = 1). . COMPUTE GROUP = 1. ELSE IF (NEW_GRP > LAG(NEW_GRP)). . COMPUTE GROUP = LAG(GROUP) + 1. ELSE. . COMPUTE GROUP = LAG(GROUP). END IF. /**/ LIST /*-*/. * B. Simplify to just new group number, and member. Compare . * with the orginal "desired" output. . * 1. Simplify: Keep only one copy of each member. Drop . * NEW_GRP and OLD_GRP, which are 'history' variables -- . * indicate where members came from. . * (This does leave variable "FRST_ID" in the file) . ADD FILES /FILE=*/BY GROUP ID /KEEP=GROUP ID /FIRST=FRST_ID. SELECT IF (FRST_ID = 1). * 2. Combine and compare . MATCH FILES /FILE = * /IN=AS_IS /FILE = DESIRED /IN=WANTED /BY GROUP ID. STRING QUALITY (A8). DO IF (AS_IS & WANTED). . COMPUTE QUALITY = 'YES'. ELSE IF (WANTED). . COMPUTE QUALITY = 'PROBLEM'. ELSE. . COMPUTE QUALITY = 'New'. END IF. /**/ LIST GROUP ID QUALITY /*-*/.