*  In response to posting                                         .
*  Date:         Tue, 20 Jul 2004 19:47:15 -0700                  .
*  Reply-To:     sw wang <swwangjj@hotmail.com>                   .
*  Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU> .
*  From:         sw wang <swwangjj@hotmail.com>                   .
*  Subject:      data manipulation question                       .

*  "There are two variables in my data file, one is GROUP and the .
*  other is ID. The value of ID is supposed to be unique in my    .
*  final result, which means you cannot find same value of ID in  .
*  two GROUPs. If there is any value in two or more groups, we    .
*  will merge them together and create a new group number. This   .
*  process will keep doing until there is no same ID in different .
*  groups."                                                       .

*  Or, restated by WRR (I hope correctly):                        .
*  "If the same ID occurs in two groups, the two are to be        .
*  combined to form a single group. (The duplicate IDs may then   .
*  be eliminated from the combined groups, but that's a detail.)" .


*  Data input section. Probably do not post to the list.          .
*  ----------------------------------------------------           .
FILE HANDLE TEST_INP
     /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-TEST_INP.sav'.
FILE HANDLE DESIRED
     /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-DESIRED.sav'.

PRESERVE.
SET MXWARNS=0.
DATA LIST LIST SKIP=1 /GROUP (F2) ID (A1) REMARKS(A35).
BEGIN DATA.
     GROUP  ID     REMARKS
       1    A      'From sw wang'
       1    B
       2    A
       2    N
       3    D
       3    E
       4    E
       4    M
       5    F
       5    G
       6    C
       6    N
      10    P      'Added by W.R.Ristow'
      10    Q
      10    R
      10    S
      12    Q      'Direct to 10'
      12    R      'Direct to 10'
      14    S      'Direct to 10'
      14    T
      16    T      'To 14, then 10'
      13    U
      13    V
      15    W
      17    V      'Direct to 13'
      19    W      'Direct to 13'
      20    Z
      21    Z
      21    Y
      22    Y
      22    X
      23    X
      23    A      'All the way back to 1'
      40    T      'Many direct pairings'
      41    T
      42    T
      43    T
      44    T
      45    T
      50    N      'More many direct pairings'
      51    N
      52    N
      53    N
      54    N
      55    N
      60    G      'Long iterative chain'
      60    H
      61    H
      61    I
      62    I
      62    J
      63    J
      63    K
      64    K
      65    M
END DATA.
RESTORE.
/*--    LIST                                                 /*-*/.

SAVE OUTFILE=TEST_INP.

DATA LIST LIST SKIP=1 /GROUP (F2) ID (A1).
BEGIN DATA.
      Group id
       1     A
       1     B
       1     C
       1     N
       2     D
       2     E
       2     M
       3     F
       3     G
END DATA.
SAVE OUTFILE=DESIRED.

*  Test contents of "DESIRED"                                     .
GET FILE=DESIRED.
/*--    LIST                                                 /*-*/.

*  File handles for output and scratch files                      .
FILE HANDLE SORT_INP
   /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-SORT_INP.sav'.
FILE HANDLE DRCT_PR
   /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-DRCT_PR.sav'.
FILE HANDLE COMB_PR
   /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-PR_1REC.sav'.
FILE HANDLE ALL_PR
   /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-ALL_PR.sav'.
FILE HANDLE IMPL_PR
   /NAME='c:\TempStor\SPSS\2004-07-20 wang-Combine groups-IMPL_PR.sav'.


*  Begin code and output for posting                              .
*  ----------------------------------------------------           .
*  File handles:                                                  .
*  Files containing input or other data                           .
*     TEST_INP  Test data, from the posting                       .
*     DESIRED   Desired output, from the posting                  .
*  Scratch and output files                                       .
*     SORT_INP  Test data, sorted by GROUP and ID.                .
*     DRCT_PR   "Direct" pairs, i.e. pairs of groups that share   .
*               at least one member                               .
*     PR_1REC   KLUDGE: for each existing group, all HIGHER-      .
*               NUMBERED groups with which it will be combined,   .
*               in a single SPSS rcord.                           .
*     ALL_PR    Work dataset, adding pairs one pass at a time,    .
*               eventually holding all pairs of groups that will  .
*               be merged into a single group                     .
*     IMPL_PR   Pairs implied by the current set of known pairs   .
*               plus the set of direct pairs, under the rule      .
*               if A pairs with B, and B pairs with C, then       .
*               A pairs with C.                                   .

*  I.   Load and print the test data                              .
GET FILE=TEST_INP.
/**/    LIST                                                 /*-*/.

*  II.  Identify "direct pairs", i.e. groups that share a member  .

*   A.  Group data by member, rather than group. Within each      .
*       member, sort by ascending group number, to list the       .
*       the lower-numbered group first, in each pair sharing a    .
*       common member.                                            .

SORT CASES BY ID GROUP.
MATCH FILES /FILE=*/BY ID/FIRST=ID_FRST.

*   B.  The "Low" member of each pair sharing a common ID is the  .
*       "High" member of the preceding pair. The first member of  .
*       the set for a single member does not produce a pair.      .
*       CHANGE LOGIC 02 Aug 2004: pair each member with the first,.
*       lowest-numbered, group in a set sharing a member, rather  .
*       than with the preceding member of the set. Either is      .
*       correct and should work, but the change makes the records .
*       in PR1_REC, III.A.1 below, "wider", i.e. using more       .
*       elements of vector HI_GRP. In return, it takes fewer      .
*       iterations for transitive closure (III.B & C)             .

NUMERIC LO_GROUP HI_GROUP (F2).

*  Old code: Pair each group with its immediate predecessor       .
*--DO IF   (ID_FRST).
*--.  COMPUTE LO_GROUP = $SYSMIS.
*--.  COMPUTE HI_GROUP = GROUP.
*--ELSE.
*--.  COMPUTE LO_GROUP = LAG(HI_GROUP).
*--.  COMPUTE HI_GROUP = GROUP.
*--END IF.

*  New code: Pair each group with its earliest  predecessor       .
*  (02 Aug 2004. Hmmmm. Much simpler code, too.)                  .
.  LEAVE LO_GROUP.
.  IF      (ID_FRST) LO_GROUP = GROUP.
.  COMPUTE           HI_GROUP = GROUP.

/**/    LIST ID ID_FRST LO_GROUP HI_GROUP GROUP ID           /*-*/.

*   C.  Drop non-pairs, i.e. the first record (which may be the   .
*       only record) for each ID.                                 .

SELECT  IF (ID_FRST = 0).

*   D.  Put back in order by group numbers (important!).          .

SORT CASES BY LO_GROUP HI_GROUP ID.
/**/    LIST LO_GROUP HI_GROUP ID                            /*-*/.

*   E.  Two groups may share several IDs, and so be recorded as   .
*       several pairs. Keep only one instance of each pair.       .

MATCH FILES /FILE=*/BY LO_GROUP HI_GROUP/FIRST=NEW_PAIR.
SELECT   IF (NEW_PAIR = 1).
/**/    LIST LO_GROUP HI_GROUP                               /*-*/.

*   F.  Save the file of "direct pairs", keeping only the         .
*       variables characterizing the pairs.                       .

SAVE OUTFILE=DRCT_PR /KEEP = LO_GROUP HI_GROUP.


*  III. "Transitive closure": if groups i and j are combined, and .
*       j and k are combined, i and k are combined even if they   .
*       have no common member.                                    .
*       THIS IS THE MOST COMPLICATED AND DIFFICULT PART.          .

*       IMPLEMENTATION NOTES:                                     .
*       B.: Using From the set of all pairs found to date, add    .
*       the pairs implied by those and the "direct pairs". This   .
*       is a many-to-many join, which may be expressed naturally  .
*       in SQL, something like,                                   .
*       CREATE TABLE NEW_PAIR                                     .
*       AS SELECT                    /* The new pair consists of*/.
*         prv.LO_GROUP as LO_GROUP,  /*   left  pair's LO group */.
*         new.HI_GROUP as HI_GROUP,  /* & right pair's HI group */.
*        FROM
*         ALL_PR   as prv,           /* The last step's output  */.
*         DRCT_PRJ as new            /* The "direct pairs"      */.
*        WHERE
*           prv.HI_GROUP             /* Left pair's HI group    */.
*          =new.LO_GROUP             /* same as right pair's LO */.
*        ORDER BY                    /* pair dataset order      */.
*           LO_GROUP, HI_GROUP       /* for the later merge     */.
*                                                                 .
*       SPSS doesn't have SQL or any other "many-to-many" merge   .
*       facility. To simulate it, use a kludge: file PR_1REC, in  .
*       which all "direct pairs" with the same LO_GROUP are a     .
*       *single* record, containing the HI_GROUP values of all    .
*       these pairs. (See A.1, which creates PR_1REC.)            .
*       WARNING: This means setting a limit on the number of      .
*       other groups a group can be "directly paired" with.       .
*       That limit can be changed; I'm using 50 to illustrate.    .
*                                                                 .
*       C. Repeat the above until no new pairs are found. This is .
*       best done with some sort of test for "number of pairs did .
*       not change". No such test is available in SPSS syntax or  .
*       macros (it is in SAS macros). It probably could be done   .
*       with an SPSS script to drive the syntax and make the test..
*       I'm working around it by simply doing 10 iterations.      .
*       (The limit's in a macro call, and can readily be changed.).
*       WARNING: This will take some unnecessary time in most     .
*       cases, and may fail to find all pairs in others. There's  .
*       a "SHOW N" at the end of each step. If it's the same for  .
*       the last two steps, all pairs have been found.            .

*  A.   Set up files to prepare for iteration                     .

*  1.   Preparation: Create a file in which all pairs (up to 50)  .
*       for each LO_GROUP are in a single record                  .

MATCH FILES
       /FILE=DRCT_PR
       /BY LO_GROUP
       /FIRST=NEW_LO
       /LAST =LO_END.

NUMERIC    NUMPAIRS (F3).
VAR LABEL  NUMPAIRS
  'No. of higher-numbered groups LO_GROUP is directly paired with'.
LEAVE      NUMPAIRS.

VECTOR HI_GRP(50,F2).
LEAVE  HI_GRP1 TO HI_GRP50.

DO IF    NEW_LO.
.  COMPUTE NUMPAIRS = 0.
.  LOOP #I = 1 TO 50.
.     COMPUTE HI_GRP(#I) = $SYSMIS.
.  END LOOP.
END IF.

COMPUTE  NUMPAIRS = NUMPAIRS + 1.
IF      (NUMPAIRS LE 50) HI_GRP(NUMPAIRS) = HI_GROUP.

SELECT IF (LO_END = 1).

SAVE OUTFILE=PR_1REC/DROP=NEW_LO LO_END HI_GROUP.

/**/    LIST LO_GROUP NUMPAIRS HI_GRP1 TO HI_GRP6            /*-*/.

*  2.   Preparation:  Initialize -- the "pairs known so far" are  .
*       the "direct pairs". Make this set the working file AND    .
*       save it to disk; both copies are needed.                  .

GET     FILE=DRCT_PR.
SAVE OUTFILE=ALL_PR.
SHOW N.

*  B.   Single transitive-closure step:                           .
*       NOTE THAT THIS IS WRAPPED IN MACRO "<!>TC_Step"           .

DEFINE !TC_Step()

*  1.   First portion of step: Find all pairs implied by the      .
*       current known pairs and the "direct pairs".               .
*       (See IMPLEMENTATION NOTES, above.)                        .

SORT CASES BY HI_GROUP LO_GROUP.
MATCH FILES
   /FILE=*       /RENAME=(HI_GROUP=BRIDGE)
   /TABLE=PR_1REC/RENAME=(LO_GROUP=BRIDGE)
   /BY BRIDGE.

VECTOR HI_GRP = HI_GRP1 TO HI_GRP50.

*  If there's no match with the direct pairs, there are no        .
*  output records to write.                                       .
*  And you're limited to 50 output records, no matter how many    .
*  groups did match the BRIDGE group.                             .

RECODE NUMPAIRS (MISSING = 0) (50 THRU HI = 50).

*  NOTE: Output file "IMPL_PR" will not, in general, be sorted.  .

NUMERIC HI_GROUP (F3).
LOOP #OUT_REC = 1 TO NUMPAIRS.
.  COMPUTE HI_GROUP = HI_GRP(#OUT_REC).
.  XSAVE OUTFILE = IMPL_PR/KEEP=LO_GROUP HI_GROUP.
END LOOP.
EXECUTE.

*  2.   Second portion of step: Combine the newly-found implied   .
*       pairs with the old list of known pairs. Drop duplicates,  .
*       save, and report the new count of known pairs.            .

GET     FILE=IMPL_PR.
SORT CASES BY LO_GROUP HI_GROUP.

ADD FILES
   /FILE=ALL_PR
   /FILE=*
   /BY LO_GROUP HI_GROUP
   /FIRST=KEEP_IT.

SELECT IF (KEEP_IT).

*  Get rid of KEEP_IT; the name must be free for the next         .
*  iteration.                                                     .

ADD FILES
   /FILE=*
   /KEEP = LO_GROUP HI_GROUP.

SAVE OUTFILE=ALL_PR.
SHOW N.

!ENDDEFINE.
DISPLAY MACROS.
*.
*  End of macro <!>TC_Step, which encompasses all the working     .
*  code of step III.B.                                            .

*  C.   Repeat the steps to get the complete transitive closure.  .
*       Since no test for "completion" is available, arbitrarily  .
*       repeat 10 times. (Change that limit in the macro call as  .
*       desired.) If "SHOW N" is the same on any two successive   .
*       iteractions, all pairs have been found.                   .
*       (See IMPLEMENTATION NOTES, above.)                        .

DEFINE !MacRept ( Repeat =!NOEXPAND           !TOKENS(1)
                 /Times  =!DEFAULT(1)         !TOKENS(1))
!DO !COUNT = 1!TO !Times
*  Repetition !COUNT .
!Repeat.
!DOEND
!ENDDEFINE.

PRESERVE.
/*--    SET MPRINT ON                                        /*-*/.
!MacRept Repeat=!TC_Step Times=10
RESTORE.

/**/    LIST                                                 /*-*/.

*  D.   Pair each group only with the LOWEST group it is          .
*       paired with.                                              .

AGGREGATE OUTFILE=*
   /BREAK=HI_GROUP
   /LO_GROUP = MIN(LO_GROUP).
SORT CASES BY LO_GROUP HI_GROUP.
SAVE OUTFILE=ALL_PR/KEEP=LO_GROUP HI_GROUP.
/**/    LIST LO_GROUP HI_GROUP                               /*-*/.


*  IV.  From the complete list of pairings, complete the job.     .

*  A.   Assign group members to their new groups. Re-number the   .
*       combined groups, starting from 1.                         .

*  1.   Sort input by group and ID -- it was not necessarily that .
*       way in the beginning.                                     .

GET     FILE=TEST_INP.  /* Remember this one from the beginning? */
SORT CASES BY GROUP ID.
SAVE OUTFILE=SORT_INP.

*  2.   The file of pairs, re-ordered by higher-numbered group    .
*       (to be re-assigned), rather than lower-numbered (which    .
*       are the reassignment destingations).                      .

GET     FILE=ALL_PR
      /RENAME=(LO_GROUP HI_GROUP
              =NEW_GRP  OLD_GRP)
      /KEEP   =OLD_GRP  NEW_GRP.
*  This variable isn't useful now, but it will be in the next    .
*  step.                                                         .
NUMERIC GROUP(F2).
COMPUTE GROUP=0.
SORT CASES BY OLD_GRP.

*  3.   Assign members to their new groups, based on the pairings,.
*       and compute the new numbers.                              .

MATCH FILES
   /FILE=SORT_INP
      /RENAME=(GROUP=OLD_GRP)
   /TABLE=*
   /BY OLD_GRP
   /KEEP = GROUP ID NEW_GRP OLD_GRP.

*  Unpaired groups keep their old numbers                        .
IF (MISSING(NEW_GRP)) NEW_GRP = OLD_GRP.
*  Put members into the new order                                .
SORT CASES BY NEW_GRP ID OLD_GRP.
*  Assign the new group numbers                                  .
DO IF   ($CASENUM = 1).
.  COMPUTE GROUP = 1.
ELSE IF (NEW_GRP > LAG(NEW_GRP)).
.  COMPUTE GROUP = LAG(GROUP) + 1.
ELSE.
.  COMPUTE GROUP = LAG(GROUP).
END IF.

/**/    LIST                                                /*-*/.

*  B.   Simplify to just new group number, and member. Compare    .
*       with the orginal "desired" output.                        .

*  1.   Simplify: Keep only one copy of each member. Drop         .
*       NEW_GRP and OLD_GRP, which are 'history' variables --     .
*       indicate where members came from.                         .
*       (This does leave variable "FRST_ID" in the file)          .

ADD FILES /FILE=*/BY GROUP ID
    /KEEP=GROUP ID
    /FIRST=FRST_ID.
SELECT IF (FRST_ID = 1).

*  2.   Combine and compare                                       .

MATCH FILES
  /FILE = *       /IN=AS_IS
  /FILE = DESIRED /IN=WANTED
  /BY GROUP ID.

STRING QUALITY (A8).
DO IF   (AS_IS & WANTED).
.  COMPUTE QUALITY = 'YES'.
ELSE IF (WANTED).
.  COMPUTE QUALITY = 'PROBLEM'.
ELSE.
.  COMPUTE QUALITY = 'New'.
END IF.

/**/    LIST GROUP ID QUALITY                               /*-*/.