| a | b | a AND b | a OR b |
|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE | TRUE |
| TRUE | NULL | NULL | TRUE |
| FALSE | FALSE | FALSE | FALSE |
| FALSE | NULL | FALSE | NULL |
| NULL | NULL | NULL | NULL |
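The truth table above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption; the SkyServer runs SQL Server), writes TRUE and FALSE as 1 and 0, and relies on NULL coming back to Python as None.

```python
import sqlite3

# Toy check of three-valued logic. SQLite is only a stand-in engine here;
# 1 and 0 play the roles of TRUE and FALSE, and NULL is returned as None.
conn = sqlite3.connect(":memory:")

def evaluate(expr):
    """Evaluate a constant SQL expression and return its single value."""
    return conn.execute("SELECT " + expr).fetchone()[0]

results = {expr: evaluate(expr) for expr in (
    "1 AND NULL", "0 AND NULL", "NULL AND NULL",
    "1 OR NULL", "0 OR NULL", "NULL OR NULL",
)}

for expr, value in results.items():
    print(expr, "->", value)
```

Only the cases where the known operand already settles the answer (FALSE AND NULL, TRUE OR NULL) give a definite result; every other combination with NULL stays NULL.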
When comparing values, you will use the COMPARISON operators:
| Operator | Description |
|---|---|
| < | less than |
| > | greater than |
| <= | less than or equal to |
| >= | greater than or equal to |
| = | equal |
| <> or != | not equal |
In addition to the comparison operators, the special BETWEEN construct is available.
a BETWEEN x AND y is equivalent to
a >= x AND a <= y
Similarly,
a NOT BETWEEN x AND y is equivalent to
a < x OR a > y
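As a quick illustration of these equivalences, the sketch below runs both forms against a throwaway table. The table name toy and its single column a are hypothetical, and SQLite (through Python's sqlite3 module) stands in for the real database.

```python
import sqlite3

# Hypothetical one-column table holding the integers 0..9.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (a INTEGER)")
conn.executemany("INSERT INTO toy VALUES (?)", [(v,) for v in range(10)])

def values(where):
    """Return the sorted values of a that satisfy the given WHERE clause."""
    rows = conn.execute("SELECT a FROM toy WHERE " + where + " ORDER BY a")
    return [r[0] for r in rows]

between = values("a BETWEEN 3 AND 6")
explicit = values("a >= 3 AND a <= 6")
not_between = values("a NOT BETWEEN 3 AND 6")
explicit_not = values("a < 3 OR a > 6")

print(between, not_between)
```

Note that BETWEEN includes both end points, so 3 and 6 are part of the first result.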
Finally, the MATHEMATICAL operators (both numeric and bitwise) are:
| Name | Description | Example | Result |
|---|---|---|---|
| + | Addition | 2 + 3 | 5 |
| - | Subtraction | 2 - 3 | -1 |
| * | Multiplication | 2 * 3 | 6 |
| / | Division | 4 / 2 | 2 |
| % | Modulo (remainder) | 5 % 4 | 1 |
| POWER | Exponentiation | POWER (2.0,3.0) | 8.0 |
| SQRT | Square root | SQRT (25.0) | 5.0 |
| ABS | Absolute value | ABS (-5.0) | 5.0 |
| & | Bitwise AND | 91 & 15 (01011011 & 00001111) | 11 (00001011) |
| \| | Bitwise OR | 32 \| 3 (00100000 \| 00000011) | 35 (00100011) |
| ^ | Bitwise XOR | 17 ^ 5 (00010001 ^ 00000101) | 20 (00010100) |
| ~ | Bitwise NOT | ~1 | -2 |
| AVG | Average | AVG(ModelMag_r) | |
| MIN | Minimum | MIN(ModelMag_r) | |
| MAX | Maximum | MAX(ModelMag_r) | |
In addition, the usual mathematical and trigonometric functions are available in SQL, such as COS, SIN, TAN, ACOS, etc.
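The arithmetic and bitwise rows of the table can be reproduced with a short sketch. It uses SQLite through Python's sqlite3 module as a stand-in (an assumption, not the SkyServer engine). SQLite has no XOR operator, so the XOR row is computed as (a | b) - (a & b), which gives the same result for integers.

```python
import sqlite3

# Evaluate the table's integer examples in one SELECT.
# SQLite is a stand-in engine; XOR is emulated as (a | b) - (a & b).
conn = sqlite3.connect(":memory:")
row = conn.execute(
    "SELECT 2 + 3, 2 - 3, 2 * 3, 4 / 2, 5 % 4, "
    "91 & 15, 32 | 3, ~1, (17 | 5) - (17 & 5)"
).fetchone()

labels = ["2 + 3", "2 - 3", "2 * 3", "4 / 2", "5 % 4",
          "91 & 15", "32 | 3", "~1", "17 XOR 5"]
for label, value in zip(labels, row):
    print(label, "=", value)
```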
You may wish to obtain quantities from multiple tables, or place constraints on quantities in one table while obtaining measurements from another. For instance, you may want magnitudes (from PhotoObj) for all objects spectroscopically identified (SpecObj) as galaxies. To perform these types of queries, you must use a join. You can join any two (or more) tables in the databases as long as they have some quantity in common (typically an object or field ID). To actually perform the join, you must have a constraint in the WHERE clause of your query forcing the common quantity to be equal in the two tables. Here is an example, getting the g magnitudes for stars in fields where the PSF fitting worked well:
SELECT s.psfMag_g FROM Star s, Field f WHERE s.fieldID = f.fieldID and s.psfMag_g < 20 and f.pspStatus = 2
Notice how we define abbreviations for the table names in the FROM clause; this is not necessary but makes for a lot less typing. Also, you do not have to ask for quantities to be returned from all the tables. You must specify all the tables on which you place constraints (including the join) in the FROM clause, but you can use any subset of these tables in the SELECT. If you use more than two tables, they do not all need to be joined on the same quantity. For instance, this three way join is perfectly acceptable:
SELECT p.objID,f.field,g.run FROM PhotoObj p, Field f, Segment g WHERE f.fieldid = p.fieldid and f.segmentid = g.segmentid
The types of joins shown above are called inner joins. In the above examples, we only return those objects that are matched between the tables. If we want to include all rows of one of the tables, regardless of whether or not they are matched in another table, we must perform an outer join. One example is to get photometric data for all objects, while getting the spectroscopic data only for those objects that have spectroscopy.
In the example below, we perform a left outer join, which means that we will get all entries (regardless of matching) from the table on the left side of the join. Here the join is on P.objID = S.BestObjID; therefore, we will get all photometric (P) objects, with data from the spectroscopy if it exists. If there is no spectroscopic data for an object, we'll still get the photometric measurements but have nulls for the corresponding spectroscopy.
select P.objID, P.ra, P.dec, S.SpecObjId, S.ra, S.dec from PhotoObj as P left outer join SpecObjAll as S on P.objID = S.BestObjID
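To see the difference between an inner and a left outer join on something small, here is a minimal sketch. The two toy tables (photo and spec), their columns, and their contents are invented for illustration, and SQLite via Python's sqlite3 module stands in for the real database.

```python
import sqlite3

# Hypothetical miniature tables: three photometric objects, two with spectra.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photo (objID INTEGER, ra REAL)")
conn.execute("CREATE TABLE spec (specObjID INTEGER, bestObjID INTEGER)")
conn.executemany("INSERT INTO photo VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])
conn.executemany("INSERT INTO spec VALUES (?, ?)", [(100, 1), (101, 3)])

# Inner join: only objects present in both tables are returned.
inner = conn.execute(
    "SELECT p.objID, s.specObjID FROM photo p, spec s "
    "WHERE p.objID = s.bestObjID ORDER BY p.objID"
).fetchall()

# Left outer join: every photo row is kept; missing spectra become NULL (None).
outer = conn.execute(
    "SELECT p.objID, s.specObjID FROM photo AS p "
    "LEFT OUTER JOIN spec AS s ON p.objID = s.bestObjID ORDER BY p.objID"
).fetchall()

print(inner)
print(outer)
```

Object 2 has no spectrum, so it is missing from the inner join but appears in the outer join with a null in the spectroscopic column.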
You can join across more than one table, as long as every pair you are joining has a quantity in common; not all tables need be joined on the same quantity. For example:
SELECT TOP 1000 g.run, f.field, p.objID FROM photoObj p, field f, segment g WHERE f.fieldid = p.fieldid and f.segmentid = g.segmentid and f.psfWidth_r > 1.2 and p.colc > 400.0
Note how the Field and PhotoObj are joined on the fieldID, while the join between Field and Segment uses segmentID.
When using table-valued functions, you must do the join explicitly (rather than using "=" in the WHERE clause). To do this, we use the syntax
SELECT quantities
FROM table1
JOIN table2 on table1.quantity = table2.quantity
WHERE constraints
For instance, in the example below, we use the function dbo.fGetNearbyObjEq to get all objects within a given radius (in this case, 1') of a specified coordinate. This is a table-valued function, so it returns a table containing the ObjIDs and distances of nearby objects. We want to get further photometric parameters on the returned objects, so we must join the output table with PhotoObj (here, through its Galaxy view):
SELECT G.objID, GN.distance
FROM Galaxy as G
JOIN dbo.fGetNearbyObjEq(115.,32.5, 1) as GN
on G.objID = GN.objID
WHERE (G.flags & dbo.fPhotoFlags('saturated')) = 0
SQL provides a number of ways to reorder, group, or otherwise arrange the output of your queries. Some of these options are:
SELECT count(r) FROM Galaxy
SELECT distinct run FROM Field
SELECT top 100 r FROM Star
SELECT top 1000 u,g,r FROM Star order by g,r desc
SELECT min(r),max(r) FROM PhotoPrimary group by type
You can use group by to count how many objects of each type are loaded as primary photometric objects, for instance:
SELECT count(r) FROM PhotoPrimary group by type
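A small sketch of grouping on a toy table may help here. The table prim, its columns, and the type codes 3 and 6 are placeholder values chosen for illustration, and SQLite via Python's sqlite3 module stands in for the real database. Unlike the query above, the sketch also selects the type column so that each count can be matched to its group.

```python
import sqlite3

# Hypothetical table of primary objects with a type code and an r magnitude.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prim (type INTEGER, r REAL)")
conn.executemany("INSERT INTO prim VALUES (?, ?)", [
    (3, 17.0), (3, 18.5),
    (6, 15.0), (6, 16.0), (6, 19.0),
])

# One output row per type: how many objects, and their r magnitude range.
summary = conn.execute(
    "SELECT type, count(r), min(r), max(r) FROM prim "
    "GROUP BY type ORDER BY type"
).fetchall()

for row in summary:
    print(row)
```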
It is easy to construct very complex queries which can take a long time to execute. When writing queries, one can often rewrite them to run faster. This is called optimization.
The first, and most trivial, optimization trick is to use the minimal Table or View for your query. For instance, if all you care about are galaxies, use the Galaxy view in your FROM clause, instead of PhotoObj. We have also created a 'miniature' version of PhotoObjAll, called PhotoTag. This miniature contains all the objects in PhotoObjAll, but only a subset of the measured quantities. Using the PhotoTag table to speed up the query only makes sense if you do NOT want parameters that are only available in the full PhotoObjAll.
It is extremely useful to think about how a database handles queries, rather than trying to write a plain, sequential list of constraints. NOT every query that is syntactically correct will necessarily be efficient; the built-in query optimizer is not perfect! Thus, writing queries such that they use the tricks below can produce significant speed improvements.
Here is a staggering example of the importance of optimization:
Find the positions and magnitudes of photometric objects that have been targeted for spectroscopy as possible QSOs.
A user's first instinct would be to get the desired objects from the PhotoObj table within the TARGDR1 database (which contains the information, including targeting decisions, for objects when they were targeted (chosen) for spectroscopy). So, this query might look like:
SELECT p.ra, p.dec, p.modelMag_i, p.extinction_i FROM TARGDR1..PhotoObjAll p WHERE (p.primtarget & 0x00000002 > 0) or (p.primtarget & 0x00000004 > 0)
That's really simple - all you are doing is checking if the primary target flags (primtarget) are set for the two types of QSO targets. This query can take hours, because a sequential scan of every object in the photometric database is required!
One quick change that makes a difference is to simplify the WHERE clause: get rid of the or by masking out everything except the two QSO target bits (values 0x2 and 0x4, combined as 0x6) and checking whether the result is nonzero. This changes the WHERE clause to:
WHERE (primtarget & 0x00000006) > 0
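That the single mask selects exactly the same objects as the two separate tests can be checked in a few lines of plain Python. This is a sketch over small non-negative integers standing in for primtarget values, not a query against the database.

```python
# The combined mask is the bitwise OR of the two individual QSO target bits.
QSO_MASK = 0x00000002 | 0x00000004  # equals 0x00000006

def two_tests(primtarget):
    """Original form: test each target bit separately and OR the results."""
    return ((primtarget & 0x00000002) > 0) or ((primtarget & 0x00000004) > 0)

def one_mask(primtarget):
    """Simplified form: test both bits at once with the combined mask."""
    return (primtarget & QSO_MASK) > 0

# Compare the two forms on every value from 0 to 255.
agree = all(two_tests(v) == one_mask(v) for v in range(256))
print(hex(QSO_MASK), agree)
```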
This helps a little, but not much: we are still scanning the entire PhotoObj table. We can make our lives a lot better by realizing that the database developers have anticipated that people will be interested in targeting information, and created a smaller table, TargetInfo, that contains only the targeted objects, which are a small subset of the entire photometric database! Using this table, we can rewrite our query as:
SELECT p.ra, p.dec, p.i, p.extinction_i
FROM TargetInfo t, PhotoObjAll p
WHERE (t.primtarget & 0x00000006>0)
and p.objid=t.targetobjid
Note how most of the WHERE clause is performed using the TargetInfo table; the SQL optimizer immediately recognizes that this table is much smaller than PhotoObj, and does this part of the search first. The query now runs in 33 seconds, and returns 32931 rows. That is an improvement of two orders of magnitude over the initial method!
Finally, we can recognize that all the quantities of interest are also in the PhotoTag table, which contains all the objects in PhotoObjAll, but not all measured quantities. The query will be:
SELECT p.ra, p.dec, p.ModelMag_i, p.extinction_i
FROM TargetInfo t, PhotoTag p
WHERE (t.primtarget & 0x00000006>0)
and p.objid=t.targetobjid
This runs in 18 seconds, and returns the same 32931 rows. Another factor of two in speed! Note how PhotoTag does not contain the simplified i magnitude, so we must use ModelMag_i instead.
Another of the simplest ways to make queries faster is to first perform a query using only indexed quantities, and then select those parameters from the returned subset of objects. An indexed quantity is one where a look-up table has effectively been calculated, so that the database software does not have to do a time-consuming sequential search through all the objects in the table. For instance, sky coordinates cx,cy,cz are indexed using a Hierarchical Triangular Mesh (HTM). So, you can make a query faster by rewriting it such that it is nested; the inner query grabs the entire row for objects of interest based on the indexed quantities, while the outer query then gets the specific quantities desired.
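The shape of such a nested query is sketched below on a toy table. The table obj, its columns, the index on ra, and the data are all hypothetical, and SQLite via Python's sqlite3 module stands in for the real database, so the sketch shows only the structure of the query and not the speed-up you would see on the SkyServer.

```python
import sqlite3

# Hypothetical table with an index on ra only; r is not indexed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (objID INTEGER, ra REAL, r REAL)")
conn.execute("CREATE INDEX idx_obj_ra ON obj (ra)")
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", [
    (1, 10.0, 17.0), (2, 10.5, 21.0), (3, 50.0, 16.0), (4, 10.2, 18.0),
])

# Inner query: select whole rows using only the indexed quantity (ra).
# Outer query: apply the non-indexed constraint and pick the columns wanted.
nested = conn.execute(
    "SELECT sub.objID, sub.r "
    "FROM (SELECT * FROM obj WHERE ra BETWEEN 10.0 AND 11.0) AS sub "
    "WHERE sub.r < 20 ORDER BY sub.objID"
).fetchall()

print(nested)
```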
Finally, a caution about using function calls in queries. If your query is going to match a large number of objects (a million or more), using a function call in the WHERE clause, especially one that operates on a constant or literal, is not a good idea, because the function will be called once per matching row in that table, resulting in a significant performance hit. Here is an example of this:
SELECT ...
FROM PhotoObj
WHERE
flags & dbo.fPhotoFlags('BLENDED') > 0
In this case, it would be better to first do the pre-query:
SELECT dbo.fPhotoFlags('BLENDED')
to get the bitmask value for that flag, and then rewrite the above query as:
SELECT ...
FROM PhotoObj
WHERE
flags & 8 > 0
This will avoid the wastefully repeated function call for each and every object in the PhotoObj table.
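The same two-step pattern can be scripted. In the minimal sketch below, SQLite via Python's sqlite3 module stands in for the real database, a Python function named fPhotoFlags imitates dbo.fPhotoFlags for the single BLENDED flag (value 8, as in the query above), and the toy table photo with its contents is invented for illustration.

```python
import sqlite3

# Stand-in for dbo.fPhotoFlags; only the BLENDED value comes from the text.
FLAG_BITS = {"BLENDED": 8}

conn = sqlite3.connect(":memory:")
conn.create_function("fPhotoFlags", 1, lambda name: FLAG_BITS[name])

# Hypothetical table of objects and their flag words.
conn.execute("CREATE TABLE photo (objID INTEGER, flags INTEGER)")
conn.executemany("INSERT INTO photo VALUES (?, ?)",
                 [(1, 8), (2, 0), (3, 12), (4, 3)])

# Step 1: the pre-query, run once, fetches the bitmask for the flag.
mask = conn.execute("SELECT fPhotoFlags('BLENDED')").fetchone()[0]

# Step 2: the main query uses the plain number, so no function runs per row.
blended = conn.execute(
    "SELECT objID FROM photo WHERE (flags & ?) > 0 ORDER BY objID", (mask,)
).fetchall()

print(mask, blended)
```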
Performance is usually only an issue when the PhotoObjAll table (and associated views) is involved in a query, either directly or with a join. We have built in some features to enhance performance for queries on this table. The first and foremost, and the most effective performance enhancer, is the Hierarchical Triangular Mesh (HTM) spatial index that we have developed at JHU and incorporated into each of the SDSS databases. This is a multi-dimensional index that speeds up searches by spatial decomposition of the sky.
In addition to the HTM, there are several indices built in the database on columns of the various tables, including primary key, foreign key and other indices that group frequently used columns.
PhotoTag is a 10% subset of PhotoObjAll that has the 60 most "popular" fields.
Both PhotoObj and PhotoTag are indexed and those indices are each a 2% subset of PhotoObj.
The nice thing about the indices is that they get picked for you automatically and they run 50x faster than reading the whole PhotoObj table and 5x faster than reading the PhotoTag table.
The next version of the SQL Server database product will allow us to eliminate PhotoTag (it will be an automatically selected index). But for now, cognoscenti will have to use it if they can (if their question is covered by that 10% of the most popular fields).
In an ideal world you would not have to know about indices. Unfortunately we do not live in an ideal world (yet).
The strategy for selecting a few (less than 10,000) objects in a certain part of the sky using the dbo.fGetObjFromRect() function works very well. But, when the patch gets LARGE (more than 10,000 objects) then your ra-dec limit predicate is probably going to be more efficient because it will be a linear scan over the data.
The Stars/Galaxy/PhotoPrimary/... Views all benefit from the indices on the base tables. You should feel free to use them.
At current disk speeds (~ 400 MB/s peak), it should take about 10 minutes to do a sequential scan of the entire BESTDR1 database (200+ GB), and about 20 minutes for BESTDR2 (400+ GB), on an unloaded server. So even queries that scan the entire photoobj table should run in about half an hour if they are not requesting a very large number of rows (in which case it takes a long time to get the results back over the network).
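The scan times quoted above follow from dividing the database size by the disk rate. The sketch below redoes that arithmetic, assuming decimal units (1 GB = 1000 MB) and the peak rate, so it gives lower bounds that round up to the "about 10" and "about 20" minutes quoted.

```python
def scan_minutes(size_gb, rate_mb_per_s=400.0):
    """Minimum time, in minutes, to read size_gb sequentially at the given rate.

    Assumes 1 GB = 1000 MB and that the disk sustains its peak rate.
    """
    seconds = size_gb * 1000.0 / rate_mb_per_s
    return seconds / 60.0

best_dr1 = scan_minutes(200)  # BESTDR1, 200+ GB
best_dr2 = scan_minutes(400)  # BESTDR2, 400+ GB

print(round(best_dr1, 1), round(best_dr2, 1))
```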
Sometimes queries can run much slower than normal (5-10 times slower) if the server is loaded down, so you should always try a slow query at a few different times.
If after applying the advice given above and trying your best to optimize your query, you find that it still runs very slowly (no output returned in more than 20 minutes or so), you may have run into the dreaded bookmark lookup bug of the SQL Server query optimizer. Basically, this means that the optimizer has chosen the incorrect plan for executing the query.
While there is no reliable way to predict what causes the bookmark bug to be invoked, usually it happens when there are several constraints on non-indexed quantities in a given table. For example, in the query
SELECT objID FROM PhotoObj
WHERE
(flags & 0x40006) = 0
AND
rowv*rowv + colv*colv > 4*(rowvErr*rowvErr + colvErr*colvErr)
if you include only one of the two constraints separated by the "and", the optimizer will choose the correct plan, but if you include both, the optimizer goofs big-time and opts to do a random search of the PhotoObj table instead of a sequential scan. It decides that it will use the PhotoObj primary key index and, for each entry in the index, follow the link to the data (the "bookmark") and find the flags, rowv, and colv fields from the data page. This means a random disk access for each object in the PhotoObj table. Naturally, this will be excruciatingly slow, since random access is several times slower than sequential access on disks.
If instead the optimizer picked a sequential scan of the whole PhotoObj table, the query could be completed within half an hour (assuming the server isn't badly loaded down). But with the chosen plan, it will take hours if not days!
Unfortunately, we are stuck with the bookmark bug for the time being. If you suspect that you are running into it, one way around it is to force the optimizer to ignore all the indices defined on that table. For example, you would rewrite the above query as follows:
SELECT objID FROM PhotoObj WITH (index(0)) -- ensure no index is used
WHERE
(flags & 0x40006) = 0
AND
rowv*rowv + colv*colv > 4*(rowvErr*rowvErr + colvErr*colvErr)
The bookmark bug is probably not going to be fixed in the next SQL Server release (Yukon), but hopefully the one after that.
Roy Gal, Ani Thakar, Jim Gray, Alex Szalay
Last updated Jan 13, 2004.