Algorithm Descriptions

#### Jim Gray, Alex Szalay, Robert Lupton, Jeff Munn

##### May 2003, revised January, May, June, July, December 2004

The SDSS data can be used for temporal studies of objects that are re-observed at different times. The SDSS survey observes about 10% of the Northern survey area 2 or more times, and observes the Southern stripe more than a dozen times.

The match table is intended to make temporal queries easy by providing a precomputed list of all objects that were observed multiple times. More formally,

 Match = { (ObjID1,ObjID2) | Objid1 and ObjID2 are both from different runs (i.e. different observations). And they are within 1 arcsecond of one another And are both good (star or galaxy or unknown) And are both fully deblended (no children) And they are primary or secondary (not family or outside) }

But, as always, there are complications.

Green, Yellow, Red: What if ObjID2 in Run2 is missing?
It could be missing because it was not seen or because it was masked. What about the "edge cases" at the edge of run1 or of run2. Perhaps it is an edge object. We color-code these edge cases as "yellow" objects and the masked objects as "red" objects. And of course, we flag the missing but matched objects as "green".

Surrogate: When an object is missing in Run2, what do we put in the match table?
We could fabricate ObjID2, or we could find the closest ObjID2 in Run2 and just flag it and record the distance to it (which will be more than 1 arcsecond).

### Computing the Match table

The Match table is computed by using the Neighbors table and has a very similar schema (the Neighbors table only stores mode (1,2) (aka primary/secondary) and type (3,5,6) (aka galaxy, unknown, star) objects.

```CREATE TABLE Match (
objID1 bigint not null, objID2 bigint not null, -- object pair
run1 smallint not null, run2 smallint not null, -- their run numbers
type1 tinyint not null, type2 tinyint not null, -- star, galaxy,...
mode1 tinyint not null, mode2 tinyint not null, -- primary, secondary,...
distance float not null,			       -- in arcminutes
miss char not null,			       -- " " no miss, RGY:red,green,yellow
matchHead bigint not null,		       -- see below.
primary key (objID1, ObjID2)
) ON [Neighbors]
-- now populate the table
INSERT Match
SELECT objID as objID1, neighborObjID as objID2,
(objID & 0x0000FFFF00000000)/power(cast(2 as bigint),32) as run1,
(NeighborObjID & 0x0000FFFF00000000)/power(cast(2 as bigint),32) run2,
type as type1, neighborType as type2,
mode as mode1, neighborMode as mode2,
distance,  ' ' as miss, 0 as matchHead
FROM Neighbors
WHERE distance < 1.0/60.0	-- within 1 arcsecond of one another
```

One arcsecond is a large error in Sloan Positioning - the vast majority is within 0.5 arcsecond (95%). But a particular cluster may not form a complete graph (all members connected to all others). To make the graph fully transitive, we repeatedly execute the query to add the "curved" arcs in Figure 1. Notice that that figure shows two objects observed in four runs, and that the two objects are observed only once in the middle two runs. The whole collection is closed to make a "bundle" that will have a matchHead object (the smallest objID of the bundle).
 Figure 1: Six objects from 4 runs forming a bundle.

```declare @Trip table (
objid1 bigint,ObjID2 bigint,
run1 smallint, run2 smallint,
type1 tinyint, type2 tinyint,
mode1 tinyInt, mode2 tinyInt,
primary key (objID1,ObjID2)
)

WHILE (1=1)
BEGIN
delete @Trip	-- empty the triples table
insert @Trip	-- compute new pairs based on triples
select distinct a.objID1, b.ObjID2,
a.run1, b.run2,
a.type1, b.type2,
a.mode1, b.mode2
from Match a join Match b on a.ObjID2 = b.objID1
where a.objID1 != b.ObjID2   -- different objects
and not exists (
select objID1 -- and pair not already in Match
from match c
where c.objID1 = a.objID1
and c.objID2 = b.objID2)

IF @@rowcount = 0 BREAK	    -- stop if none found.
select "adding " + cast(count(*) as varchar(20)) + " triples."
from @Trip
-- now add these into Match and repeat till no more rows.
insert Match	 -- else add pair to Match.
select T.*,N.distance, 0,0
from @Trip T join Neighbors N
on t.objID1 = N.objID and t.ObjID2 = N.NeighborObjID
END
```

Now each cluster of objects in the Match table is fully connected. We can name the clusters in the Match table by the minimum (non zero) objID in the cluster and can compute the MatchHead table that describes the global properties of the cluster: its name, its average RA and DEC and the variance in RA, DEC.

```-- build a table of cluster IDs (minimum object ID of each cluster).

objID bigint not null primary key,	-- id of match head object
averageRa float not null default 0,	-- average right ascension
averageDec float not null default 0,	-- average declination
varRa float not null default 0,		-- variance in RA
varDec float not null default 0,		-- variance in DEC
matchCount tinyInt not null default 0,	-- real number in cluster
missCounttinyInt not null default 0	-- fake objects in cluster
) ON [Neighbors]
```

 Figure 2: Green area of A must match B, yellow area may match B. Red areas of B (masks) have a good reason for the A object to have a missing object. In these missing cases, we pick the "nearest" object in B to be a surrogate for A in the match table.

```-- compute the minimum object IDs.

declare @MinID table (
objID bigint
primary key
)

insert @MinID
select distinct objID1
from Match MinId
where 0 = (
select count(*)
from Match m
where
MinId.objID1 = m.objID1
and
MinId.objID1 > m.ObjID2
)

-- compute all pairs of objIDs in a
-- cluster (including x,x for headID)

declare @pairs table (
objID1 bigint not null,
objID2 bigint not null,
primary key(objID1,objID2)
)

insert @pairs
select minID.objID, m.ObjID2
from @MinID minID join Match m on
minID.objID = m.objID1

insert @pairs select objID, objID from @MinID

-- now populate the MatchHead table with minObjID and statistics

select MinID.objID,		-- for each MatchHead ID
avg(ra), avg(dec),	-- get avg ra,dec of group
coalesce(stdev(ra),0),
coalesce(stdev(dec),0),	-- get stddev of group
count(distinct		 -- count runs
(p.objid2 & 0x0000FFFF00000000)),
0			-- count misses later
From @MinID as MinID
join @pairs as p on MinID.objID = p.objID1
join PhotoObjAll as o on p.objID2 = o.objID
group by MinID.objID

-- now set MatchHead on each object

UPDATE Match
FROM Match M join @Pairs P on M.objID1 = P.ObjID2
```
 Figure 3: A green surrogate for the right 4 objects and a red (masked) surrogate for the leftmost one.
The number missing from the cluster is computed in the next section.

### Matching the Missing Objects

There may be an object in camcol A that should have matching objects in an overlapping camcol B (see figure 2). In particular, any object in the green part of A should have a matching object in B (in Figure 2). Objects in A that are near the edge of B (10 pixels ~4 arc seconds = the yellow part of B) may have matching objects in B.In some cases the B area is masked (red) and that explains why there is not a match.

If a "green" A object does not match a B object then either the object is moving or variable or masked. We can check the masks to see if the (A.ra, A.dec) is masked in B. If not, we assume that A is just "missing."

Similarly, if a "yellow" A object does not match a B object, then either the object is moving or variable or masked or the edge effects caused the object to be missing.In these edge cases we check to see if (A.ra, A.dec) is masked in B, if not we call the object missing-edge.

So, missing objects come in 3 varieties:
 Object Color Flag Hit 0 Missing Green 1 Missing Edge Yellow 2 Masked Red 3

In each of these cases we create a match object as the closest object in B to A and Match.flag is set to Green, Red or Yellow.These "fake" objects do not contribute to the cluster average or variance or centroid.

We add this object to A's cluster (along with all the edges), and we increment the cluster miss count by the number of records we add to the cluster.
 Figure 4: Graph showing distance vs frequency of missies of various colors. This is data from a small sample of the BestDR1 database.

The logic for computing missing objects is as follows.

For each RunA in the Regions table.
For each RunB overlapping RunA other than RunA
Let R_ABYellow be the Region RunAÇ((runB+ ε))
Find (x,y) where x and y in R_ABYellow and run.x = RunA and run.y = RunB
And x not in match and y the closest object in RunB.
If the position of x is in runB - ε we mark the pair as green.
If the position of x is masked in RunB then we mark the pair as red.
We now add the (x,y) pairs to the match table and
If x is not in match, we add x to matchhead with a miss count of 1
If x is in match, we propagate x.matchHead to this match entry
and increment the matchHead miss count

The actual code is a little more complex (about 700 lines of SQL). In the personal SkyServerDR1 there are about 20,000 matches and 10,000 object misses, so it seems that the misses will make an interesting study.
Figure 5: Statistics from full DR1 dataset showing distribution of miss distances. Red objects are not being computed. On BestDR1, the statistics are:
 Match 12,431,518 Green 5,446,930 Yellow 912,571 Red ??

The results of this are that a bundle can have dangling pointers to these surrogate objects.Figure 3 shows the diagram of Figure 1 where a fifth overlapping run has been added. The leftmost object is masked in this new run and so we find a surrogate "red" object for it. The other objects are also have no match in this run but are not masked and are closest to the green (right) object in the figure.

It takes 4 minutes to compute on the personal SkyServer DR1, It will take a bit longer on the thousand times larger Dr2, but ...

As per Robert's request, surrogate match objects are found rather than invented. Sometimes we have to look far away for them (500 arcseconds in some cases).

Misses are painted Yellow (near the edge), Red (masked), and Green (well inside the overlap). Most misses are Green.
 color count Y 3,516 R 58 G 15,264

The graphs of distances are shown in Figure 4.