SQL is the Structured Query Language, a standard means of asking for data from databases, and is used to query the Catalog Archive Server (CAS). This page provides a brief overview of SQL. Query examples are also available, with comments, as well as a page of links to more detailed off-site documentation.
Database FundamentalsThe CAS provides access to SDSS-III catalog data that is stored in a relational databse managment system (DBMS) for data integrity and access speed. The data is organized into tables in a relational database. The SkyServer is the Web portal to the CAS databases and allows you to submit SQL queries to extract the data that you need from these databases. However, you do not usually need to specify which database your query is run on, since the SkyServer is configured by default to submit your queries to a particular database. This site is configured to submit your queries to the BESTDR10 database, which contains the best data and most recent processings for the entire released sky area. The BESTDR10 database contains a large number of tables, some of which contain photometric measurements (such as PhotoObj), spectroscopic measurements (such as SpecObj), or information about the observing conditions (Field) or survey geometry(TileBoundary). See the data model page for more details. In addition to the tables, we have defined Views, which are subsets or combinations of the data stored in the tables. Views are queried the same way Tables are; they exist just to make your life easier. For instance, the view Galaxy can be used to get photometric data on objects we classify as galaxies, without having to specify the classification in your query. Both the Skyserver and CasJobs interfaces have a Schema Browser. It shows you all of the available databases, the tables in each database, and the quantities stored in each column of the tables. Finally, we have created a variety of functions and stored procedures which let you easily perform some common operations. Usually, their names are prefixed by f or sp, like in fPhotoStatus or spGetFiberList. The full list of functions and store procedures is found in the Schema Browser. Note that some functions are scalar-valued, meaning that they return a single value, while others (such as the commonly used dbo.fGetNearbyObjEq, are table-valued; they actually return a table of data, and not a single number. This is important when interpreting the returned data and performing joins. Please note the caution about using function calls as noted in the Optimizing Queries section when attempting queries over that return a large number of objects.Query FundamentalsNow that we have an overview of the database structure, how do we actually get data out? You will have to write a query using SQL. The most basic query consists of three parts:
The WHERE clause is not necessary if you want to retrieve parameters of all objects in a specified table, but this typically will be an overwhelming amount of data! Note that the query language is insensitive to splitting the query over many lines. It is also not case sensitive. To make queries more readable, it is common practice to write the distinct query clauses on separate lines. The Sample Queries button on the CasJobs Query page provides a variety of samples, ordered in complexity. For instance, to obtain the list of unique Fields that have been loaded into the database, we use: SELECT FieldID FROM Field You can just copy and paste this (or any other) query into the SQL Search window of SkyServer, and press submit, or into the CasJobs query window, and press the submit button. If we want to retrieve multiple parameters from the database, we separate them with commas: SELECT ra,dec FROM Galaxy Of course, the parameters you request must be included in the table(s) you are querying! Now, let's say we want magnitudes of all bright galaxies. We will need to specify a magnitude range to do this: SELECT u,g,r,i,z FROM Galaxy WHERE r<12 and r>0 Here, we have used the WHERE clause to provide a magnitude range. The and operator is used to require that multiple limits be met. This leads us to... Simple Logical and Mathematical OperatorsNot only can we place limits on individual parameters, we can place multiple limits using logical operators, as well as place limits on the results of mathematical operations on multiple parameters. We may also retrieve results that are logical joins of multiple queries. Here we list the logical, comparison, and mathematical operators. The LOGICAL operators are AND,OR,NOT; they work as follows:
When comparing values, you will use the COMPARISON operators:
In addition to the comparison operators, the special BETWEEN construct is available. Similarly, Finally, the MATHEMATICAL operators (both numeric and bitwise) are:
In addition, the usual mathematical and trigonometric functions are available in SQL, such as COS, SIN, TAN, ACOS, etc.. Querying Bit FlagsSeveral SDSS tables contain bit-encoded flags to indicate various types of information about the object or quantity in question (e.g., the PhotoObjAll table and the PhotoTag view each have the flags field, SpecObj has zWarning flags etc.). One of the most important uses of bit flags is to indicate why an object was targeted for spectroscopy. A list of spectroscopic target flags is available on the DR9 Spectroscopic Target Flags page. This section describes how you can test for flag values in your query. For sample queries that demonstrate the use of flags, see the Errors using flags, Elliptical galaxies with model fits, Diameter limited sample, LRG sample, and Clean photometry with flags sample queries for examples on how to use flags.Checking a single flagTo return rows for which the flag is set, the basic syntax for the constraint is: and to return rows for which the flag is not set: where bitmask is the binary value in which the bit corresponding to the flag is 1 and all other bits are 0. You can use the flag functions provided by the SkyServer (listed in Schema Browser) to get the bitmask for a given flag, but if you are scanning a large fraction of a large table like PhotoObj, you are better off not making a function call for each row, and in that case you should first get the actual binary value of the bitmask first and substitute that instead. This is described in the Using dbo functions in your query subsection of the Optimizing Queries section below. For example, to select objects for which the BLENDED flag is set in PhotoTag, you would use a query like:
and to select only objects for which the flag is NOT set, use
Checking multiple flagsTo test if multiple flags are set, you can combine the values by adding them and then testing the result.To select objects for which all of several flags are set, generate the combined bitmask by adding the individual flag bitmasks, then compare the result of ANDing the combined bitmask with the flag column with the combined bitmask itself, e.g.,
To select objects for which at least one of several flags is set, you just need to check that ANDing the combined bitmask with the flag column returns a non-zero result, e.g.,
To select objects for which none of several flags is set, the result of ANDing the flag column with the combined bitmask must be 0, e.g.,
Clean PhotometryThe SDSS photo pipeline sets a number of flags that indicate the quality of the photometry for a given object in the catalog. If you desire objects with only clean photometry for science, you should be aware that you need to filter out unwanted objects yourself in your query. This is not done automatically for you (e.g. with a view of the PhotoObjAll table). The main reason is that the flag constraints that are required for this filtering often impose a significant performance penalty on your query .
Please see the Clean Photometry sample query for help on how to use the photometry
flags to select only objects with clean photometry.
Excluding Invalid Data ValuesAs mentioned in the EDR Paper, the database designates quantities that are not calculated for a particular object in a table with special values, as follows:
Changing Precision of Query OutputUse the STR(column,n,d) SQL construct (where n is the total number of digits and d is the number of decimal places) to set the precision of the column that your query requests. The SkyServer returns values with a default precision that is set for each data type, and this may not be enough for columns like ra, dec etc. See the Selected neighbors in run or the Uniform Quasar Sample sample queries for examples of how to use STR. Joins: Querying With Multiple TablesYou may wish to obtain quantities from multiple tables, or place constraints on quantities in one table while obtaining measurements from another. For instance, you may want magnitudes (from PhotoObj) from all objects spectroscopically identified (SpecObj) as galaxies. To perform these types of queries, you must use a join. You can join any two (or more) tables in the databases as long as they have some quantity in common (typically an object or field ID). To actually perform the join, you must have a JOIN subclause in the FROM clause of your query that specifies the common quantity to be equal in the two tables. Here is an example, getting the g magnitudes for stars in fields where the PSF fitting worked well:
Notice how we define abbreviations for the table names in the FROM clause; this is not necessary but makes for a lot less typing. Also, you do not have to ask for quantities to be returned from all the tables. You must specify all the tables on which you place constraints (including the join) in the FROM clause, but you can use any subset of these tables in the SELECT. If you use more than two tables, they do not all need to be joined on the same quantity. For instance, this three way join is perfectly acceptable:
The type of joins shown above are called inner joins. In the above examples, we only return those objects which are matched between the multiple tables. If we want to include all rows of one of the tables, regardless of whether or not they are matched to another table, we must perform an outer join. One example is to get photometric data for all objects, while getting the spectroscopic data for those objects that have spectroscopy. In the example below, we perform a left outer join, which means that we will get all entries (regardless of matching) from the table on the left side of the join. In the example below, the join is on P.objID = s.BestObjID; therefore, we will get all photometric (P) objects, with data from the spectroscopy if it exists. If there is no spectroscopic data for an object, we'll still get the photometric measurements but have nulls for the corresponding cpectroscopy.
When using table valued functions, you must do the join explicitly (rather than using "="). To do this, we use the syntax For instance, in the example below, we use the function dbo.fGetNearbyObjEq to get all objects within a given radius (in this case, 1') of a specified coordinate. This is a table-valued, so it returns a table, containing the ObjIDs and distances of nearby objects. We want to get further photometric parameters on the returned objects, so we must join the output table with PhotoObj.:
Manipulating Query OutputSQL provides a number of ways to reorder, group, or otherwise arrange the output of your queries. Some of these options are:
Optimizing QueriesIt is easy to construct very complex queries which can take a long time to execute. When writing queries, one can often rewrite them to run faster. This is called optimization. The first, and most trivial, optimization trick is to use the minimal Table or View for your query. For instance, if all you care about are galaxies, use the Galaxy view in your FROM clause, instead of PhotoObj. We have also created a 'thin' version of PhotoObjAll, called PhotoTag. This vertical subset contains all the objects in PhotoObjAll, but only a subset of the measured quantities. Using the PhotoTag view to speed up the query only makes sense if you do NOT want parameters that are only available in the full PhotoObjAll. It is extremely useful to think about how a database handles queries, rather than trying to write a plain, sequential list of constraints. NOT every query that is syntactically correct will necessarily be efficient; the built-in query optimizer is not perfect! Thus, writing queries such that they use the tricks below can produce significant speed improvements. Using indices in your queryAnother simple way to make queries faster is to use indexed quantities to search on. There are two types of indices in the CAS, indices built into the database, and an external spatial index that we have added to make spatial searches much faster, called the Hierarchical Triangular Mesh.. The latter is explicitly invoked by using the built-in spatial search functions like fGetNearbyObjEq (does a radial search in equatorial coordinates), or fGetObjFromRectEq (searches in a rectangular area). Database indices are invoked automatically when you include columns in your search (in the WHERE clause) that have indices built on them. There are 3 types of database indices:
If you must search on non-indexed columns in addition to the indexed ones, you can still benefit by first performing a query using only the indexed quantities, and then select those parameters from the returned subset of objects. An indexed quantity is one where a look-up table has effectively been calculated, so that the database software does not have to do a time-consuming sequential search through all the objects in the table. For instance, sky coordinates cx,cy,cz are indexed using a Hierarchical Triangular Mesh (HTM). So, you can make a query faster by rewriting it such that it is nested; the inner query grabs the entire row for objects of interest based on the indexed quantities, while the outer query then gets the specific quantities desired. Using dbo functions in your queryFinally, a word of caution about using function calls in queries. If your query is going to match a large number of objects (million or more), using a function call, especially one that operates on a constant or literal, in the WHERE clause is not a good idea, because the function will be called once per matching row in that table, resulting in a significant performance hit. Here is an example of this:
In this case, it would be better to first do the pre-query:
to get the bitmask value for that flag, and then rewrite the above query as:
This will avoid the wastefully repeated function call for each and every photobj in the table.
Roy Gal, Ani Thakar, Jim Gray, Alex Szalay Last updated Apr 8, 2004. |