Although those in the second group are also in the province of the providers,
following simple rules can make a lot of difference as to how easily the
required observations can be found by the VO and supplied to the scientist.
As much data as practical should be made available.
From an analysis standpoint, a "regular cadence" with a minimum of several observations
per hour (6+) is desirable; this makes it possible to track the general evolution of phenomena,
although rapid changes would still be missed.
Access Method:
The protocol used for the interface into the archive is not critical –
a virtual observatory should be able to handle whatever protocol the data provider adopts.
Standard options include FTP, HTTP, Web Service, etc.
– potentially the first two require the least effort from the provider.
In relation to this,
EGSO has developed the concept of resource-rich and resource-poor providers:
- Resource-rich providers – e.g. data centres – should be able to respond to requests through a simple interface.
For resource-rich providers, how the data are stored is an internal issue; catalogues can be used to determine the exact access path.
- For resource-poor providers, if the VO needs to find the data itself, then logically named files
within a hierarchical directory structure are desirable – see below.
Simple access through FTP or HTTP is the easiest to use.
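For simple HTTP access, retrieving a file from an archive needs very little client code. The following Python sketch illustrates this; the archive URL in the comment is hypothetical.

```python
from urllib.request import urlopen

def fetch_file(url: str, destination: str) -> int:
    """Download a single data file over HTTP/HTTPS (or any
    urllib-supported scheme) and return the number of bytes written."""
    with urlopen(url) as response, open(destination, "wb") as out:
        data = response.read()
        out.write(data)
    return len(data)

# Example (hypothetical archive URL and file name):
# fetch_file("https://archive.example.org/2009/03/01/obs_20090301.fits",
#            "obs_20090301.fits")
```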
File Format:
A virtual observatory should be able to accommodate the use of data in any file format.
For quick-look purposes simple image files are adequate – e.g. JPEG, PNG, GIF, etc.
However, the lack of metadata associated with such formats makes it difficult
to use this type of file for serious research.
If the objective is to compare data from different instruments
then files with formats such as FITS, CDF or equivalent are strongly preferred;
these should contain fully formed metadata – see below.
Processing of the data in the file need not be to a high level,
but appropriate software and calibration information must be provided if the data need
to be "manipulated" before use.
As the volume of data available increases, and the number of data sets grows,
it is becoming increasingly important that the data be ready for use – i.e. calibrated – although this is by no means obligatory.
File Names & Metadata:
There are no hard and fast rules on file names, but the
name needs to be sufficiently unique that:
- The type and origin of the file can easily be identified, and
- It can exist without causing confusion when removed from the context in which it is normally
stored (on the source archive system).
Ideally the name should identify the "date/time" that the observations were made
and the "observatory/instrument" that made them –
an indication of the type of observation can also be useful.
The "date/time" need not be a full specification; some kind of
sequential numbering might be sufficient.
However, if file naming is not based on time, a catalogue or simple
translation table is needed to allow the VO to select the appropriate file.
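A naming scheme in this spirit – observatory, instrument, date/time and observation type – can be sketched as follows. The field order, separator and extension here are assumptions chosen for illustration, not the SOHO convention.

```python
from datetime import datetime

def build_filename(observatory: str, instrument: str,
                   obs_time: datetime, obs_type: str) -> str:
    """Build a name that remains unambiguous even when the file
    is removed from its home archive (hypothetical convention)."""
    return "{}_{}_{}_{}.fits".format(
        observatory.lower(), instrument.lower(),
        obs_time.strftime("%Y%m%d_%H%M%S"), obs_type.lower())

def parse_filename(name: str):
    """Recover observatory, instrument, time and type from the name."""
    stem = name.rsplit(".", 1)[0]
    obs, inst, date, time, obs_type = stem.split("_")
    return obs, inst, datetime.strptime(date + time, "%Y%m%d%H%M%S"), obs_type
```

A name such as soho_eit_20090301_123000_195.fits can then be generated and decoded without consulting the archive it came from.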
The SOHO mission developed a "convention" for the names of files in its summary and synoptic databases
– see Naming Convention for Files (SOHO with BBSO extensions).
A simpler convention might be sufficient, but this provides a gold standard for how things can be done.
Note that the file name on its own is not enough when the data are to be used for analysis.
It is essential that all files contain good metadata describing in detail how the observations were made;
if the metadata are not properly formed, it may be impossible to use the data in some circumstances.
Again a "convention" was established during the time of SOHO
– see Solarsoft Standard.
Directory Structure within the Archive:
A hierarchical structure to the data directories makes it easier to find files and is strongly preferred.
This is essential for resource-poor providers and is also beneficial for a data centre.
Ideally the directory structure should be a tree based on date (and time?):
yyyy/mm/dd
yyyy/mm
yyyy_week
yyyy
...
The number of directory levels depends on the number of files generated by the instrument.
If only one file is produced per day, the number of levels of subdirectories can be reduced.
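The choice of tree depth can be sketched in Python; the file-count thresholds below are illustrative assumptions, not prescribed values.

```python
from datetime import date

def archive_path(root: str, obs_date: date, files_per_day: int) -> str:
    """Pick a date-based subdirectory whose depth matches the data volume:
    many files per day -> yyyy/mm/dd, a few -> yyyy/mm, very few -> yyyy.
    The thresholds are illustrative only."""
    if files_per_day >= 10:
        sub = obs_date.strftime("%Y/%m/%d")
    elif files_per_day >= 1:
        sub = obs_date.strftime("%Y/%m")
    else:
        sub = obs_date.strftime("%Y")
    return "{}/{}".format(root, sub)
```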
On Unix-based archives, if the directory structure is different to the one suggested above,
it is possible to map to a more compliant structure using symbolic links without having to
reorder the data themselves. The mapped directory structure can then be presented to the
external interface.
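One way to build such a mapping is sketched below for a Unix system; the obs_date_of callable, which extracts a date from a file name, is an assumption about the provider's naming scheme.

```python
import os

def map_to_date_tree(source_dir: str, target_root: str, obs_date_of) -> None:
    """Expose an existing archive through a yyyy/mm/dd tree of symbolic
    links, leaving the stored files untouched (Unix only).

    obs_date_of: callable mapping a file name to a datetime.date,
    or to None if the file cannot be dated (such files are skipped).
    """
    for name in sorted(os.listdir(source_dir)):
        d = obs_date_of(name)
        if d is None:
            continue
        day_dir = os.path.join(target_root, d.strftime("%Y/%m/%d"))
        os.makedirs(day_dir, exist_ok=True)
        link = os.path.join(day_dir, name)
        if not os.path.islink(link):
            os.symlink(os.path.join(source_dir, name), link)
```

The target_root tree can then be presented to the external interface while the original layout is kept internally.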
Note: If it is not possible to make all data available on-line, it is desirable to provide a
catalogue that contains information on other data holdings.
This route (via catalogues) could also be used to advertise proprietary data
so other users at least know that the observations exist!
R.D. Bentley, UCL-MSSL
Revised March 2009