Computing the Match table
Jim Gray, Alex Szalay, Robert Lupton, Jeff Munn, Ani Thakar
Aug 20, 2003
The SDSS data can be used for temporal studies of objects that are re-observed at different times. The SDSS survey observes about 10% of the Northern survey area 2 or more times, and observes the Southern stripe more than a dozen times.
The match table is intended to make temporal queries easy by providing a precomputed list of all objects that were observed multiple times. More formally,
Match = { (ObjID1,ObjID2) | Objid1 and ObjID2 are both from different runs (==observations)
And they are within 1 arcsecond of one another
And are both good (star or galaxy or unknown)
And are both fully deblended (no children)
And they are primary or secondary (not
family or outside)
The following count from the DR1 dataset says gives.
Mode | Total | nChild=0 |
primary | 52,525,576 | 52,525,576 |
secondary | 14,596,931 | 14,596,931 |
family | 17,074,000 | 6,153,714 |
outside | 126,819 | 126,819 |
And here are the flag counts for DR1
Dr1 Count | Flag | Description |
72,926,906 |
SET |
Object's status has been set in reference to its own run |
72,926,906 |
GOOD |
Object is good as determined by its object flags. Absence implies bad. |
10,186,591 |
DUPLICATE |
Object has one or more duplicate detections in an adjacent field of the same Frames Pipeline Run. |
67,029,849 |
OK_RUN |
Object is usable, it is located within the primary range of rows for this field. |
66,894,914 |
RESOLVED |
Object has been resolved against other runs. |
66,839,376 |
PSEGMENT |
Object Belongs to a PRIMARY segment. This does not imply that this is a primary object. |
387,964 |
FIRST_FIELD |
Object belongs to the first field in its segment. Used to distinguish objects in fields shared by two segments. |
62,728,244 |
OK_SCANLINE |
Object lies within valid nu range for its scanline. |
53,60,3453 |
OK_STRIPE |
Object lies within valid eta range for its stripe. |
Computing the Match table
The Match table is computed by using the Neighbors table and has a very similar schema (the Neighbors table only stores mode (1,2) (aka primary/secondary) and type (3,5,6) (aka galaxy, unknown, star) objects;
Create table Match (objID bigint not null,
matchObjID bigint not null,
distance float not null,
type tinyint not null,
matchType tinyint not null,
Mode tinyint not null,
matchMode tinyint not null,
primary key (objID, matchObjID)
) ON [Neighbors]
-- now populate the table
insert Match
select N.*
from (Neighbors N join PhotoObj P1 on N.objID = P1.objID)
join PhotoObj P2 on N.NeighborObjID = P2.objID
where ((N.objID ^ N.neighborObjID) & 0x0000FFFF00000000) != 0 -- dif runs
and distance < 1.0/60.0 -- within 1 arcsecond of one another
One arcsecond is a large error in Sloan Positioning - the vast majority
(95%) are within 0.5 arcsecond. But a particular cluster may not form a
complete graph (all members connected to all others). To make the graph fully
transitive, we repeatedly execute the query to add the "curved" arcs in the figure below.
-- compute triples
create table ##Trip(objid bigint, matchObjID bigint, distance float,
type tinyint, neighborType tinyint,
mode tinyInt, matchMode tinyInt,
primary key (objID, matchObjID))
again: truncate table ##trip
-- compute triples
insert ##trip
select distinct a.objID, b.matchObjID, 0,
a.type, b.matchType, a.mode, b.matchMode
from Match a join Match b on a.matchObjID = b.objID
where a.objID != b.matchObjID
and (a.objid & 0x0000FFFF00000000)!=
(b.matchObjID& 0x0000FFFF00000000) -- Different runs
-- now delete the pairs we already have in Match
delete ##trip
where 0 != (
select count(*)
from Match p
where p.objID = ##trip.objID and p.matchObjID = ##trip.matchObjID
)
-- compute the distance between the remaining tripples
select 'adding ' + cast(count(*) as varchar(20)) + ' tripples.'
update ##trip
set distance =
(select min(N.distance)
from ##trip t join Neighbors N
on t.objID = N.objID and t.matchObjID = N.NeighborObjID)
-- now add these into Match and repeat till no more rows.
insert Match select * from ##trip
if @@rowcount > 0 goto again
drop table ##trip
Computing the MatchHead table
Now each cluster of objects in the Match table is fully connected. We can name the clusters in the Match table by the minimum objID in the cluster. We can compute the MatchHead table that describes the global properties of the cluster: its name, its average RA and DEC and the variance in RA, DEC.
-- build a table of cluster IDs (minimum object ID of each cluster).
Create table MatchHead (
objID bigint not null primary key,
averageRa float not null default 0,
averageDec float not null default 0,
varRa float not null default 0, -- variance in RA
varDec float not null default 0, -- variance in DEC
matchCount tinyInt not null default 0, -- number in cluster
missCount tinyInt not null default 0 -- runs missing from cluster
) ON [Neighbors]
-- compute the minimum object IDs.
Create table ##MinID (objID bigint primary key)
Insert ##MinID
select distinct objID
from Match MinId
where 0 = ( select count(*)
from Match m
where MinId.objID = m.objID
and MinId.objID > m.matchObjID)
-- compute all pairs of objIDs in a cluster (including x,x for the headID)
create table ##pairs (objID bigint not null,
matchObjID bigint not null
primary key(objID, matchObjID))
insert ##pairs
select h.objID, m.matchObjID
from ##MinID h join Match m on h.objID = m.objID
insert ##pairs select objID, objID from ##MinID
-- now populate the MatchHead table with minObjID and statistics
Insert MatchHead
Select MinID.objID, avg(ra), avg(dec),
coalesce(stdev(ra),0), coalesce(stdev(dec),0),
count(m.objid & 0x0000FFFF00000000), -- count runs
0 -- count misses later
from ##MinID as MinID,
##pairs as m,
PhotoObj as o
where MinID.objID = m.objID
and m.matchObjID = o.objID
group by MinID.objID
order by MinID.objID
-- cleanup
Drop table ##MinID
Drop table ##pairs
The number missing from the cluster is computed in the next section.
Computing the MatchMiss table
It is also of interest to have a list of objects that are in areas that were observed multiple times but that were only observed once. To do this we need:
a description of each multiple-observation region.
A count of how many times it was observed.
An efficient way to test if a point is in a region
Alex will provide 1 and 2, jim will provide 3 (right?).
We will create a table of "dropouts", places where a match cluster
should have an object but does not.
Create table MatchMiss (objID bigint not null, --- the unique ID of the cluster
Run int not null, -- the run that is missing a member of this cluster.
Primary key (objID, Run)
)
Logic:
From Match find all pairs of runs that overlap
Form the domain that is the union of the intersection of these pairs.
Now build T, a list of all objects primary/secondary type (3,5, 6) objects that are in this domain.
Subtract from T all objects that appear in Match
Add these objects and the missing run number(s) to MatchMiss
For each object in MatchHead, count the number of overlaps it is a member of. (MatchHead, runs)
If this is equals the number of runs the match list then
Performance
Building Match and MatcHead takes about an hour on SdssDr1 with the Best database of 85M objects. The cardinalities of each step are:
Match |
12,294,016 |
add from triples |
19,040 |
add from triples |
322 |
add from triples |
16 |
add from triples |
2 |
add from triples |
0 |
MinID |
5,545,446 |
Mirror Pairs |
5,849,459 |
Paris from match |
5,545,446 |
MatchHead |
5,545,446 |
|