Sloan Digital Sky Survey
Data Distribution to the Astronomy Committee
 
Revision 1
September 8, 2000
 

 1.       Background
The NSF Astronomy Division recognized the value that the SDSS data would have to the astronomy community when it began its support of the construction and commissioning of the SDSS in 1994. The Astrophysical Research Consortium (ARC), which manages the SDSS, was required to provide the NSF with an acceptable plan for the distribution of the data to the astronomy community. The SDSS management prepared a public data distribution plan and submitted it to the Program Manager for Advanced Technologies and Instrumentation (ATI) in the late fall of 1998. The plan was subsequently peer reviewed by six astronomers, under the Program Manager’s direction, and their comments were incorporated into the plan. After several iterations between the reviewers and the Program Manager, the SDSS management produced the plan presented in this document, which we will refer to as the 1999 NSF-SDSS Plan. The ATI Program Manager approved the plan in April 1999, and it has been the foundation of our schedule to release the data to the astronomy community. We propose to augment this schedule with an early release in mid-2001, which will precede the first scheduled release in January 2003 as specified in the 1999 NSF-SDSS Plan. This document describes the 1999 NSF-SDSS Plan, including the schedule and the data products, and then describes the early data release.

The SDSS data products vary in complexity and size. The full volume of data is beyond the capability of most users to store or utilize effectively. Nevertheless, a single user can undertake a significant research project with just the positions and redshifts of “only” one million galaxies. Some research projects do not require data from the complete SDSS area and hence can be accomplished before the observing phase is complete by using data that are processed and calibrated as the survey progresses. Some require significantly more information about each object than the simple redshift catalog provides, such as the measured photometry of objects and corrected image frames. Some projects may choose to use different calibration procedures or even different processing algorithms. These latter projects require the type of computing facilities that only major computing centers possess. This data distribution plan takes these various possibilities into consideration.

There is also a trade-off between the prompt availability of the data to users during the survey and the integrity of the calibrations. Recalling data is like recalling Firestone tires: it damages the credibility of the entire Survey. It is in everyone’s best interests to ensure that the released data are of high quality and have reliable calibrations.

The schedule for the release to the astronomy community, as defined in the 1999 NSF-SDSS Plan, is shown in the milestone chart in Figure 1. Each milestone corresponds to a specific date and the release of a well-defined percentage of the final data sample. The size of the final data sample is determined by the five-year baseline survey.

 
2.       Natural Time-scale: Point of No Return
The interplay between the photometric survey and the spectroscopic survey defines the SDSS observing strategy by imposing two well-defined “points of no return” on the data processing. The first occurs when a set of imaging data is determined to be good enough to allow target selection. The second occurs when the spectroscopic reductions are good enough to complete a particular “tile” on the sky. The first event is a particularly hard boundary: once we drill plates with hole positions fixed for individual targets and obtain their spectra, it would be very costly to discover that we had made an erroneous selection of objects and be obliged to re-tile, re-drill, and re-observe. Thus it is essential for the efficiency of the survey that we have a clearly defined "point of no return" for target selection. In effect, the scientific requirements related to the homogeneity of the spectroscopic survey define the timing and other procedures for acceptance of the photometric data. The schedule for the data distribution is referenced to these two points of no return.
 
3.       Quantized Data Release
In order to provide a statistically meaningful version of the data archive, we will release the data in yearly quanta. The complexity of the SDSS data and the need for time-consuming, repeated verifications of the calibrations create a latency: the interval between the time a quantum of data is processed and calibrated and the time its quality is determined to have met survey requirements. The latency also includes the time it takes to package the data for distribution. This latency, which is similar to the one adopted by COBE, was expected to be eighteen months at the time of the first release. We expect that it could be gradually decreased to one year by the fifth year of the survey. As noted later, bringing the calibration up to survey requirements has taken longer than we had planned when the NSF approved the plan.
 
4.       The Schedule for the 1999 NSF-SDSS Plan
Figure 1 shows the milestones for the original 1999 plan. They determine the date of each data release and the fraction of the final data sample released. The triangles in Figure 1 define the critical milestones. These are the beginning of the survey, the dates that define the last observation to be included within each yearly quantum of data that will be released to the astronomy community, and the end of survey observations. We show these for the two main data components, the photometric catalog and the spectroscopic sample. The intermediate dates were chosen to be July 1, since the survey’s primary focus is the North Galactic Cap. The planning assumptions that determined the schedule are: observations of the northern sky are made during the first two quarters of every year, the third quarter is largely lost to the monsoon season, and the northern sky can be observed for about one month during the fourth quarter. The gray line shows the accumulation of imaging data. The black line shows the accumulation of processed and calibrated imaging data. This line shows when plates can be drilled, thus enabling the spectroscopic survey. The dashed line shows the accumulation of spectroscopic data. The data obtained prior to the “points of no return” are quantized by the mid-year milestones and will be released at the times shown by the tips of the arrows. The time to process the spectroscopic data is included in the length of the gray arrow. The first release specified in the 1999 NSF-SDSS Plan will occur in January 2003, eighteen months after the July 2001 milestone. The vertical positions of the arrows define the total percentage of the final data sample available to the astronomy community at the time of the data release. The initial latency was intended to provide extra time to completely revise the calibration procedure if a problem were to be discovered during the first year of the survey.
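The relation between a mid-year milestone and its release date is simple date arithmetic: release date = milestone date + latency. The following is a minimal sketch of that calculation, not part of the SDSS planning software; the helper name add_months is hypothetical, and the only inputs taken from the plan are the July 2001 milestone and the eighteen-month latency.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the first day of the month that lies `months` after start's month.

    Hypothetical helper for illustration; day-of-month is always set to 1
    because the plan's milestones and releases fall on the first of a month.
    """
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


# Values stated in the 1999 NSF-SDSS Plan.
first_milestone = date(2001, 7, 1)   # last observation included in the first quantum
initial_latency_months = 18          # latency expected at the time of the first release

first_release = add_months(first_milestone, initial_latency_months)
print(first_release.isoformat())
```

Under these assumptions the computed date is 1 January 2003, which matches the first release date given in the plan. The same helper can be applied to later milestones with a shorter latency as the latency is reduced.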

 
Figure 1. Milestones and data fractions for the release of the SDSS data
as specified in the original data release plan.
 
 5.       Data Products
Table 1.  Data Products
 Product                                         Size     Form
 1. Complete Redshift Catalog                    2 GB     CD-ROM, ftp
 2. Compact Photometric Catalog                  60 GB    CD-ROM, ftp
 3. Survey Description (Status, Calibrations)    1 GB     CD-ROM, www
 4. Full Photometric Catalog                     400 GB   On-line, SX
 5. Atlas Images                                 1.5 TB   On-line, SX
 6. Compressed Sky Map                           300 GB   On-line, ftp
 7. 1-D Spectra                                  60 GB    On-line, SX
 8. Calibrations                                 5 GB     On-line, SX, ftp
 9. Corrected Imaging Frames                     15 GB    On-line, ftp
 
  1. Complete Redshift Catalog. The objects targeted for spectroscopy include galaxies, quasars, stars of various properties, and sources from the ROSAT and FIRST catalogs. The catalog will contain all relevant photometric information, as defined in the Compact Photometric Catalog, and redshifts for each object.
  2. Compact Photometric Catalog. This product contains most of the scientifically useful photometric information for all objects, in a particularly compact form, to facilitate easy distribution. The number of attributes is kept to a minimum (id, position, magnitudes, colors, size, ellipticity, position angle, errors, classification, flags), a total of about 400 bytes/object. It contains only one (the primary) observation for each object even if there are multiple epoch detections. This data set will be released on CD-ROM/DVD/ftp.
  3. Survey Description and Status. The description of the survey is a document. Most of it is already contained on the SDSS Sampler#1 CD-ROM, also available at http://www.sdss.org/cdrom1/index.htm. Updates will be available on-line, as well as on CD-ROM. The survey status will be an ensemble of many individual data products. These include a list of the stripes/strips observed to date, the status of data processing and targeting, the status of spectroscopic data processing, the status of the instruments, weather logs, and instrument logs. Once the reporting system has been developed, it will be on-line via the SDSS web site. This proposal requests funds to speed the completion of the website and the status reports and to make them accessible to the astronomy community.
  4. Full Photometric Catalog. Parameters including positions, magnitudes, radial profiles and shape parameters for more than 100 million objects in 5 bands to the detection limit of the survey. The biggest difference between the compact and full catalogs is that the full catalog contains several different kinds of magnitudes, radial profiles in logarithmic bins and their errors, survey coordinates, pixel coordinates, detailed calibration parameters and their versions, various instrumental records, and a large set of flags set during processing. All observations of the objects are contained in this catalog, not just the primaries. Additionally, mask files define in detail the sections of an image frame that could not be processed and the various reasons. This catalog and the mask files will be placed on-line in a searchable database. This proposal requests funds to produce a user-friendly method of accessing and searching this database.
  5. Atlas Images. Cutouts of the images of detected objects from the full image frames in 5 colors, 1 billion images in total. These are ready after the final photometric processing of the data. Due to their size they will be available on-line and accessible by their locations, organized into fields. This proposal requests funds to put this data product on-line in a form that can be used by the astronomy community.
  6. Compressed Sky Map. A 4x4 compressed version of the image frames after removing objects. This map, along with the atlas images, can be used to approximately reconstruct the original image frames. This proposal requests funds to develop the software and distribution tools for a compressed sky map that could be included with the annual data release.
  7. 1-D Spectra. These are extracted from the 2-D images, and the blue and red halves have been merged together. They are created during the spectroscopic reduction process and will be released on-line (ftp/www), in sync with the redshift catalog.
  8. Calibrations. This product contains astrometric (position) and photometric (flux) calibration coefficients for the full image frames, along with miscellaneous information such as the seeing and sky background. These will be versioned, so that the actual calibrations used for target selection will still be available even after additional work produces better calibrations. This proposal requests funds to put this data product on-line in a form that can be used by the astronomy community.
  9. Corrected Imaging Frames. Until recently we did not envisage wide distribution of the flat-fielded, corrected imaging frames, which were used for object detection, even to the SDSS collaboration. They are currently stored at Fermilab, in an Enstore tape robot with a capacity of 50 TB. The astronomy community could access the tape robot once we have developed the software to enable the type of access that astronomers want. The recent sharp drop in the price/GB for hard disks may enable us to keep this data product on disk, thereby making it available on-line. This proposal requests funds for the development of the code that will make this data product accessible from the Internet by the astronomy community.
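The size quoted for the Compact Photometric Catalog can be checked against the per-object figure given in item 2. The sketch below is illustrative arithmetic only. It assumes decimal gigabytes (1 GB = 10^9 bytes) and uses the "more than 100 million objects" figure from item 4 as a lower bound on the object count; neither assumption is stated for the compact catalog itself, and the function name is hypothetical.

```python
BYTES_PER_OBJECT = 400        # "about 400 bytes/object" (item 2)
MIN_OBJECTS = 100_000_000     # "more than 100 million objects" (item 4), used as a lower bound
GB = 10**9                    # assumption: decimal gigabytes
TABLE_SIZE_GB = 60            # Compact Photometric Catalog size listed in Table 1


def catalog_size_gb(n_objects: int, bytes_per_object: int = BYTES_PER_OBJECT) -> float:
    """Estimate catalog size in GB for a given object count, ignoring any file overhead."""
    return n_objects * bytes_per_object / GB


# Lower bound on the catalog size from the stated minimum object count.
lower_bound_gb = catalog_size_gb(MIN_OBJECTS)

# Object count that would fill the tabulated size at 400 bytes each.
implied_objects = TABLE_SIZE_GB * GB // BYTES_PER_OBJECT

print(f"lower bound: {lower_bound_gb:.0f} GB")
print(f"objects implied by {TABLE_SIZE_GB} GB: {implied_objects:,}")
```

Under these assumptions the lower bound is 40 GB, and the 60 GB listed in Table 1 would correspond to roughly 150 million objects at 400 bytes each, which is consistent with "more than 100 million".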
6.       The Updated Data Distribution Schedule
The survey started three months later than we had forecast eighteen months ago. We are also finding that photometric calibrations with a precision of 2% are very difficult and are taking much more time than we had planned. All of these factors would be expected to delay the date of the first release of data. While the observing phase of the survey did not begin until April of this year, science quality data was obtained during the commissioning period, late 1998 and the first three quarters of 1999. This data sample has yielded impressive results and we have concluded that the data, once properly processed and calibrated, is worthy of distributing to the astronomy community. We expect to apply survey quality calibrations to the commissioning data before March of 2001. For these reasons, we now plan to release a statistically significant sample of the commissioning data during the second quarter of 2001, more than a year ahead of the date of the initial release specified in the 1999 NSF-SDSS Plan. Moreover, we intend to maintain the intermediate milestones for data release specified in the 1999 NSF-SDSS Plan, albeit with smaller fractions of the total data sample. We will maintain the date for the final data release even though observations will continue until the end of March 2005.

We propose a new data distribution schedule because the long commissioning period allowed us to accumulate about 400 square degrees of scientifically valuable data before we resumed observations after repairing the secondary mirror. The chart in Figure 2 shows the milestones for the new schedule for releasing the data. Somewhat arbitrarily, we made January 1, 2000 the effective starting date of the survey in Figure 2, because it simplifies the graphical presentation. Nevertheless, we plan to use that date as the starting point for the schedule for the release of data to the astronomy community. Regular accumulation of raw photometric data began in April 2000, and will end in the summer of 2004, except for some limited opportunities at the end of 2004. As the observing phase draws to a close, the opportunity to take imaging data, carry out target selection, drill plates and take spectra of the associated portion of the sky vanishes. Observations for the spectroscopy are now expected to begin in the last month of 2000 and are scheduled to continue to the end of March 2005. In order to meet the final milestone for the release of the complete sample we will have to decrease the latency to nine months. 

Table 2. Dates for SDSS Data Release

                  Release date    Photometry    Spectroscopy
 Early release    1-July-2001     5%            0%
 Release 1        1-Jan-2003      15%           7%
 Release 2        1-Jan-2004      47%           33%
 Release 3        1-Oct-2004      68%           60%
 Release 4        1-July-2005     88%           85%
 Final            1-July-2006     100%          100%
 
Figure 2.  Milestones and data fractions for the new SDSS Data Distribution Plan.
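Because the percentages in Table 2 are cumulative, the amount of new data in each release is the difference between successive rows. The following minimal sketch computes those increments from the values in Table 2; the function and variable names are hypothetical and serve only as illustration.

```python
# Cumulative percentages copied from Table 2: (release, photometry %, spectroscopy %).
RELEASES = [
    ("Early release", 5, 0),
    ("Release 1", 15, 7),
    ("Release 2", 47, 33),
    ("Release 3", 68, 60),
    ("Release 4", 88, 85),
    ("Final", 100, 100),
]


def increments(cumulative):
    """Convert a list of cumulative percentages into per-release increments."""
    previous = [0] + cumulative[:-1]
    return [now - before for before, now in zip(previous, cumulative)]


photometry_steps = increments([row[1] for row in RELEASES])
spectroscopy_steps = increments([row[2] for row in RELEASES])

for (name, _, _), phot, spec in zip(RELEASES, photometry_steps, spectroscopy_steps):
    print(f"{name:<14} photometry +{phot:>2}%  spectroscopy +{spec:>2}%")
```

With the tabulated values, the largest single photometric increment is the 32% added in Release 2, and the spectroscopic increments are between 25% and 27% for Releases 2 through 4.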

 6.1.    Details of the Early Data Release
The early data release will contain nearly 400 square degrees of area on the equator, in both the Northern and Southern skies. There will also be a small selected area of about 5 square degrees in the Northern Galactic Cap, which we observed in Spring 2000, to support the First Look Survey (FLS) of the SIRTF program. It will also contain the early spectra that were obtained from the same areas of the sky. We propose to use two processes to make the data available to the astronomy community: open access and controlled access. In the former case we will put all of the products, except the Atlas Images, the Science Data Base (SX), and the Corrected Frames, on the SDSS website. The latter (controlled access) will contain SX and the Atlas Images, as shown in Figure 3. These are exactly the same services as the ones provided for the SDSS Collaboration. We feel that the step from supporting the approximately 200 users in the Collaboration to supporting the whole astronomy community of several thousand is a major one, so we need to proceed carefully: many of the Fermilab resources are shared with the entire experimental program at Fermilab.


Open Access will consist of a web-based interface containing a database of gif images of the data, with clickable access to the catalog information and a simple search engine for the full photometric catalog. This will serve as a finding chart (hereafter: chart). At the same time we will also provide on-line ftp access to the Compact Photometric Catalog and Calibration information, and to Status Information via the SDSS web site. These services will be built in collaboration with Jim Gray (Microsoft), and the necessary hardware will be provided by a grant from Microsoft Research. The ftp site will also contain the corrected frames for all the internal files for the 5 square degree FLS area of the sky, with the documentation required to use the system.


Controlled Access will provide the same data products as will be found on the web-based interface. In addition, it will provide access to the high-performance search engine (SDSS Science Archive – SX) built for the survey. This will enable much larger and more sophisticated queries. We propose to create an atlas image server that will be accessible to the astronomy community. It will not only supply the individual atlas images, but it will be able to reconstruct a corrected frame in FITS format from the atlas images.
Each user will be requested to apply for an account to use the Fermilab Computing Resources, just as all members of the Collaboration have done. If we find that the resources are easily managed within the available personnel and funds, we will make SX accessible through our website. The funds to implement the SX and create the Atlas Image Server (not yet implemented) for the astronomy community were requested in the August 2000 proposal to the NSF.
 6.2.    Services of the Early Data Release
Figure 3 shows the services the SDSS public data distribution system will provide. The Open Access components are the Finding Chart, the FTP Server and the WWW server. The Controlled Access side will provide access to the Science Archive (SX) and later to the Atlas Image Server. As we understand the usage patterns better, we may consider moving the Atlas Image service to the Open Access side. The facilities for these early services should be considered as a beta test of our ultimate Data Distribution System. These will be continued until the final Data Distribution System is in place.
 



1: This document was extracted from NSF Proposal 0096900 (submitted August 31, 2000) and has been corrected for typographical errors.
