|Cooking with Sloan|
|Contact Help Desk|
|SQL in SkyServer|
|Sample SQL Queries|
|About the Database|
SQL is the Structured Query Language, a standard means of asking for data from databases, and is used to query the Catalog Archive Server (CAS). This page provides a brief overview of SQL. Query examples are also available, with comments, as well as a page of links to more detailed off-site documentation.
The CAS provides access to SDSS-III catalog data that is stored in a relational databse managment system (DBMS) for data integrity and access speed. The data is organized into tables in a relational database. The SkyServer is the Web portal to the CAS databases and allows you to submit SQL queries to extract the data that you need from these databases. However, you do not usually need to specify which database your query is run on, since the SkyServer is configured by default to submit your queries to a particular database. This site is configured to submit your queries to the BESTDR13 database, which contains the best data and most recent processing for the entire released sky area.
The BESTDR13 database contains a large number of tables, some of which contain photometric measurements (such as PhotoObj), spectroscopic measurements (such as SpecObj), or information about the observing conditions (Field) or survey geometry(TileBoundary). See the data model page for more details.
In addition to the tables, we have defined Views, which can be thought of as virtual tables and are subsets or combinations of the data stored in the tables. Views are queried the same way Tables are; they exist just to make your life easier. For instance, the view Galaxy can be used to get photometric data on objects we classify as galaxies, without having to specify the classification in your query.
Both the Skyserver and CasJobs interfaces have a Schema Browser. It shows you all of the available databases, the tables in each database, and the quantities stored in each column of the tables.
Finally, we have created a variety of functions and stored procedures which let you easily perform some common operations. Usually, their names are prefixed by f or sp, like in fPhotoStatus or spGetFiberList. The full list of functions and store procedures is found in the Schema Browser. Note that some functions are scalar-valued, meaning that they return a single value, while others (such as the commonly used dbo.fGetNearbyObjEq, are table-valued; they actually return a table of data, and not a single number. This is important when interpreting the returned data and performing joins.Please note the caution about using function calls as noted in the Optimizing Queries section when attempting queries over that return a large number of objects.
Now that you have an overview of the database structure, how can you actually get data out? You can either used one of the webform-based tools
on the Data/Search page, or you can choose to write your own query using SQL
using the SQL Search tool. Running a SQL query is the most direct and powerful way to
interact with the database. The following is a brief introduction to writing SkyServer SQL queries. You can view other help pages
like Cooking With Sloan and SQL Tutorial for additional help
on writing SQL queries in SkyServer.
The WHERE clause is not necessary if you want to retrieve parameters of all objects in a specified table, but this typically will be an overwhelming amount of data!
Note that the query language is insensitive to splitting the query over many lines. It is also not case sensitive. To make queries more readable, it is common practice to write the distinct query clauses on separate lines. The Sample Queries button on the CasJobs Query page provides a variety of samples, ordered in complexity. For instance, to obtain the list of unique Fields that have been loaded into the database, we use:
SELECT FieldID FROM Field
If we want to retrieve multiple parameters from the database, we separate them with commas:
SELECT ra,dec FROM Galaxy
Of course, the parameters you request must be included in the table(s) you are querying! Now, let's say we want magnitudes of all bright galaxies. We will need to specify a magnitude range to do this:
SELECT u,g,r,i,z FROM Galaxy WHERE r<12 and r>0
Here, we have used the WHERE clause to provide a magnitude range. The and operator is used to require that multiple limits be met. This leads us to...
Not only can we place limits on individual parameters, we can place multiple limits using logical operators, as well as place limits on the results of mathematical operations on multiple parameters. We may also retrieve results that are logical joins of multiple queries. Here we list the logical, comparison, and mathematical operators.
The LOGICAL operators are AND,OR,NOT; they work as follows:
When comparing values, you will use the COMPARISON operators:
In addition to the comparison operators, the special BETWEEN construct is available.
Finally, the MATHEMATICAL operators (both numeric and bitwise) are:
In addition, the usual mathematical and trigonometric functions are available in SQL, such as COS, SIN, TAN, ACOS, etc..
Several SDSS tables contain bit-encoded flags to indicate various types of information about the object or quantity in question (e.g., the PhotoObjAll table and the PhotoTag view each have the flags column, SpecObj has zWarning and various targeting flags etc.).
One of the most important uses of bit flags is to indicate why an object was targeted for spectroscopy. A list of spectroscopic target flags is available on the Spectroscopic Target Flags page.This section describes how you can test for flag values in your query. For sample queries that demonstrate the use of flags, see the Using flags, Diameter limited sample, LRG sample, Clean photometry with flags - Stars, and Clean photometry with flags - Galaxies sample queries for examples on how to use flags.
To return rows for which the flag is set, the basic syntax for the constraint is:
and to return rows for which the flag is not set:
where bitmask is the binary value in which the bit corresponding to the flag is 1 and all other bits are 0. You can use the flag functions provided by the SkyServer (listed in Schema Browser) to get the bitmask for a given flag, but if you are scanning a large fraction of a large table like PhotoObj, you are better off not making a function call for each row, and in that case you should first get the actual binary value of the bitmask first and substitute that instead. This is described in the Using dbo functions in your query subsection of the Optimizing Queries section below.
For example, to select objects for which the BLENDED flag is set in PhotoTag, you would use a query like:
and to select only objects for which the flag is NOT set, use
To select objects for which all of several flags are set, generate the combined bitmask by adding the individual flag bitmasks, then compare the result of ANDing the combined bitmask with the flag column with the combined bitmask itself, e.g.,
To select objects for which at least one of several flags is set, you just need to check that ANDing the combined bitmask with the flag column returns a non-zero result, e.g.,
To select objects for which none of several flags is set, the result of ANDing the flag column with the combined bitmask must be 0, e.g.,
As mentioned above, if you are running a query that is expected to match a large number of rows (millions), it is better
to first obtain the binary bitmask resulting from the multiple flag arithmetic and using that single bitmask instead of
repeated function calls to the flag functions, as described in the Using dbo
functions in your query subsection of the Optimizing Queries section below.
The SDSS photo pipeline sets a number of flags that indicate the quality of the photometry for a given object in the catalog. If you desire objects with only clean photometry for science, you should be aware that you need to filter out unwanted objects yourself in your query. This is not done automatically for you (e.g. with a view of the PhotoObjAll table). The main reason is that the flag constraints that are required for this filtering often impose a significant performance penalty on your query .
There is a single up or down flag that is available in the PhotoObjAll table (and its views) called "clean" that is
set to 1 if the photometry meets our definition of good photometry, and the use of this shorthand flag is illustrated
in the Clean Photometry sample query. This is meant to provide a simple way to select
objects with clean photometry. However, if you do not trust this or you want to be more specific and use the
individual photo flags to select objects that meet your criteria for "clean photometry", this is illustrated in two other
sample queries: the Clean photometry with flags - Stars, and
the Clean photometry with flags - Galaxies sample queries.
As mentioned in the EDR Paper, the database designates quantities that are not calculated for a particular object in a table with special values, as follows:
Use the STR(column,n,d) SQL construct (where n is the total number of digits and d is the number of decimal places) to set the precision of the column that your query requests. The SkyServer returns values with a default precision that is set for each data type, and this may not be enough for columns like ra, dec etc. See the Selected neighbors in run or the Uniform Quasar Sample sample queries for examples of how to use STR.
You may wish to obtain quantities from multiple tables, or place constraints on quantities in one table while obtaining measurements from another. For instance, you may want magnitudes (from PhotoObj) from all objects spectroscopically identified (SpecObj) as galaxies. To perform these types of queries, you must use a join. You can join any two (or more) tables in the databases as long as they have some quantity in common (typically an object or field ID). To actually perform the join, you must have a JOIN subclause in the FROM clause of your query that specifies the common quantity to be equal in the two tables. Here is an example, getting the g magnitudes for stars in fields where the PSF fitting worked well:
Notice how we define abbreviations for the table names in the FROM clause; this is not necessary but makes for a lot less typing. Also, you do not have to ask for quantities to be returned from all the tables. You must specify all the tables on which you place constraints (including the join) in the FROM clause, but you can use any subset of these tables in the SELECT. If you use more than two tables, they do not all need to be joined on the same quantity. For instance, this three way join is perfectly acceptable:
The type of joins shown above are called inner joins. In the above examples, we only return those objects which are matched between the multiple tables. If we want to include all rows of one of the tables, regardless of whether or not they are matched to another table, we must perform an outer join. One example is to get photometric data for all objects, while getting the spectroscopic data for those objects that have spectroscopy.
In the example below, we perform a left outer join, which means that we will get all entries (regardless of matching) from the table on the left side of the join. In the example below, the join is on P.objID = s.BestObjID; therefore, we will get all photometric (P) objects, with data from the spectroscopy if it exists. If there is no spectroscopic data for an object, we'll still get the photometric measurements but have nulls for the corresponding cpectroscopy.
When using table valued functions, you must do the join explicitly (rather than using "="). To do this, we use the syntax
For instance, in the example below, we use the function dbo.fGetNearbyObjEq to get all objects within a given radius (in this case, 1') of a specified coordinate. This is a table-valued, so it returns a table, containing the ObjIDs and distances of nearby objects. We want to get further photometric parameters on the returned objects, so we must join the output table with PhotoObj.:
SQL provides a number of ways to reorder, group, or otherwise arrange the output of your queries. Some of these options are:
It is easy to construct very complex queries which can take a long time to execute. When writing queries, one can often rewrite them to run faster. This is called optimization.
The first, and most trivial, optimization trick is to use the minimal Table or View for your query. For instance, if all you care about are galaxies, use the Galaxy view in your FROM clause, instead of PhotoObj. We have also created a 'thin' version of PhotoObjAll, called PhotoTag. This vertical subset contains all the objects in PhotoObjAll, but only a subset of the measured quantities. Using the PhotoTag view to speed up the query only makes sense if you do NOT want parameters that are only available in the full PhotoObjAll.
It is extremely useful to think about how a database handles queries, rather than trying to write a plain, sequential list of constraints. NOT every query that is syntactically correct will necessarily be efficient; the built-in query optimizer is not perfect! Thus, writing queries such that they use the tricks below can produce significant speed improvements.
Another simple way to make queries faster is to use indexed quantities to search on. There are two types of indices in the CAS, indices built into the database, and an external spatial index that we have added to make spatial searches much faster, called the Hierarchical Triangular Mesh.. The latter is explicitly invoked by using the built-in spatial search functions like fGetNearbyObjEq (does a radial search in equatorial coordinates), or fGetObjFromRectEq (searches in a rectangular area). Database indices are invoked automatically when you include columns in your search (in the WHERE clause) that have indices built on them. There are 3 types of database indices:
If you must search on non-indexed columns in addition to the indexed ones, you can still benefit by first performing a query using only the indexed quantities, and then select those parameters from the returned subset of objects. An indexed quantity is one where a look-up table has effectively been calculated, so that the database software does not have to do a time-consuming sequential search through all the objects in the table. For instance, sky coordinates cx,cy,cz are indexed using a Hierarchical Triangular Mesh (HTM). So, you can make a query faster by rewriting it such that it is nested; the inner query grabs the entire row for objects of interest based on the indexed quantities, while the outer query then gets the specific quantities desired.
In this case, it would be better to first do the pre-query:
to get the bitmask value for that flag, and then rewrite the above query as:
This will avoid the wastefully repeated function call for each and every photobj in the table.This is even more important when you are using multiple flags and you can reduce the comparison to a single bitmask using flag arithmetic. In the final example above in the Querying Bit Flags section, you can replace the original query:
SELECT top 10 objid, flags FROM PhotoTag WHERE ( flags & (dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2')) ) = 0with a more efficient version by first running the following pre-query:
SELECT (dbo.fPhotoFlags('NODEBLEND') + dbo.fPhotoFlags('BINNED1') + dbo.fPhotoFlags('BINNED2'))which returns the bitmask value 805306432, which can in turn be substituted back in the original query as follows:
SELECT top 10 objid, flags FROM PhotoTag WHERE ( flags & 805306432 ) = 0so as to save 3 function calls and make the query significantly more efficient. (In this particular example it does not matter because we are only asking for 10 rows, but if the "TOP 10" were to be removed and the query was run on millions of rows, it would make a difference).
Roy Gal, Ani Thakar, Jim Gray, Alex Szalay
Last updated Apr 8, 2004.