Attending were John Caron (Unidata), Mike Folk (NCSA), James Gallagher (OPeNDAP), Robert McGrath (NCSA), and Russ Rew (Unidata). Participants via telephone included Quincey Koziol (NCSA), Peter Cao (NCSA), and Kent Yang (NCSA).
The goals of the meeting were to agree on the creation of mappings among the data models for HDF5, netCDF-3, OPeNDAP-2, netCDF-4, and OPeNDAP-4. The latter two models are under development, so this is an ideal time for a "merger" of the data models for HDF5/netCDF/OPeNDAP to make interoperability possible.
John Caron proposed three levels of specification:
We decided we should definitely try to do 1 as a goal for this first meeting. The proposed product would be a document for developers of HDF, netCDF, and OPeNDAP. Ultimately, a specification of a Common Data Model might be a candidate for submission to the new NASA Earth Science Data Systems Standards Process as an ESE Community Standard.
Given the three hours allotted for the meeting, we also agreed it would be best to identify issues without trying to resolve them. The meeting time would be spent brainstorming ideas, clarifications, and issues related to a Common Data Model, to start a conversation we could continue on the Common Data Model Wiki site and via email on the data-models@unidata.ucar.edu mailing list.
John suggested the following topics for discussion, initially allotting about 10 minutes for an overview of the issues in each area to make sure we cover them all:
Bob noted that OPeNDAP is a read-only protocol, and wondered whether this distinction from the read/write netCDF and HDF interfaces was important to keep in mind while discussing the three models. He also suggested that we might concentrate on OPeNDAP issues to take advantage of having James here. We decided to treat OPeNDAP, netCDF, and HDF as comparable data models and to focus on the next generation of data models needed for netCDF-4 and OPeNDAP-4 rather than the current implementations.
Unidata has decided to support HDF5 groups in netCDF-4 to provide scopes for names. HDF5 uses a directed graph for groups, but in netCDF-4, we want to restrict groups to a tree, so that each non-root Group has a unique parent Group. In HDF5, cycles are permitted in Group graphs and a Group may have multiple "parents". HDF5 groups provide power similar to multiple inheritance, with similar complexities.
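The tree restriction can be checked mechanically. The following is a minimal sketch in Python, assuming a hypothetical representation of a group graph as a dictionary that maps each group name to the list of its child group names. Neither this representation nor the function name `is_tree` comes from HDF5 or netCDF-4.

```python
def is_tree(children, root="/"):
    """Return True if every group is reachable from root and each
    non-root group has exactly one parent (which also rules out cycles)."""
    seen = {root}
    stack = [root]
    while stack:
        group = stack.pop()
        for child in children.get(group, []):
            if child in seen:
                # A second parent, or a link back to an ancestor (a cycle).
                return False
            seen.add(child)
            stack.append(child)
    # Every declared group must be reachable from the root.
    return seen >= set(children)


# Illustrative graphs (hypothetical group names).
tree = {"/": ["a", "b"], "a": ["c"], "b": [], "c": []}
shared = {"/": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}  # "c" has two parents
cycle = {"/": ["a"], "a": ["/"]}                             # "a" links back to root
```

Here `is_tree(tree)` holds, while `shared` and `cycle` are the kinds of graphs HDF5 permits but the proposed netCDF-4 restriction would exclude.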
If both netCDF-4 and HDF5 support groups, OPeNDAP should also provide some meaning for Group objects. Should OPeNDAP model groups with structures or introduce groups as a new object?
In both HDF5 and netCDF-4, structs map to compound objects with efficient access. That is, users expect that the elements of a struct are stored close together. Groups are used as containers to aggregate multiple objects which are not necessarily stored close together. Representing groups with structs in OPeNDAP is using one concept for two different kinds of hierarchy, or two kinds of containers.
James pointed out that Lists and Functions had originally been part of the OPeNDAP data model, but were later removed, because no one used them. But having Lists and Functions in the DAP specification created a lot of work for those implementing the specification.
Do we need two kinds of containers, represented by structs and groups? Are there other workarounds instead of adding groups to OPeNDAP? We need a nice clean default. It's easy for servers to ignore groups, but a server writer who needs them (e.g. HDF5 server) would like to use them. Groups could be used effectively in a GRIB server, putting multiple projections of the same variables in separate groups. Clients can't ignore groups, so they probably have to map them into something else (e.g. "flatten" them). If clients just ignore groups, they couldn't access any data from some datasets.
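One way a client might "flatten" groups is to fold each variable's group path into its name. The following is a minimal sketch, assuming a hypothetical nested-dictionary representation in which a dictionary value is a group and any other value is a variable. The "/" separator and the example names are also assumptions.

```python
def flatten(group, prefix=""):
    """Map a nested group dictionary to a flat {path_name: variable} dictionary."""
    flat = {}
    for name, item in group.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(item, dict):
            flat.update(flatten(item, path))  # descend into a sub-group
        else:
            flat[path] = item                 # a leaf variable
    return flat


# Two projections of the same variable kept in separate groups.
dataset = {
    "projection_a": {"temperature": [1.0, 2.0]},
    "projection_b": {"temperature": [3.0, 4.0]},
    "time": [0, 1],
}
flat = flatten(dataset)
```

After flattening, the two `temperature` variables remain distinguishable as `projection_a/temperature` and `projection_b/temperature`, so a client without a notion of groups can still reach all the data.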
In OPeNDAP, attributes are atomic types and vectors of atomic types, modeled after netCDF attributes. Should map vectors of an OPeNDAP Grid be attributes? John thought this would be wrong, because they should be allowed to be multidimensional arrays. In HDF5, attributes may have the same type as variables, including compound types.
We should also discuss the operations that apply to objects, for an access model rather than a data model. In netCDF, attributes are intended for metadata that is all read into memory when a netCDF file is opened. Thus accessing a netCDF attribute does not require a disk or network access. For OPeNDAP, this access model is even more important to let users know what is expensive. Browsing attributes should be cheap. HDF-EOS products sometimes have huge attributes [is this true?]. Maybe a client should have the option to request that all attributes be read on open or not. The model in OPeNDAP-2 is to open a remote source and get all the attributes when you ask for them.
Unidata has requested that HDF5 support defining attributes for structure members. Quincey said he's not fond of this, because it's hard to come up with a good interface. John described the use case for a station observation model, where measurements for temperature and pressure are represented as a structure, and the units of measurement values are represented as attributes. Attributes of structure members were added to FITS and HDF4. Quincey said that implementing the general case is difficult, especially with nested structures. This may be a case where "the perfect is the enemy of the good".
NetCDF uses shared dimensions for coordinates and to indicate that different variables are defined on the same grid. HDF5 does not have shared dimensions yet. OPeNDAP has map vectors for grids that serve the same purpose. John thinks shared dimensions are so simple and "cool" that he would like to see explicit support for a shared dimension object type in both HDF5 and OPeNDAP-4. OPeNDAP has an aliasing scheme that allows for reduction of repetition. A suggestion was to get rid of the OPeNDAP grid data type and use dimensions instead. Incidental name collisions might cause a problem.
The HDF5 developers' group decided against trying to include coordinate systems in the HDF5 library, but instead to implement them above the library using "dimension scales". A dimension is part of the Dataspace, and a dimension scale is a dataset with an optional name and an attribute that indicates it's a dimension scale. Each dimension can have one or more dimension scales, sharable by multiple dimensions. The relationship between a dataset dimension and its scale is not maintained by the library, because you could delete a dimension without deleting its scale. The dimension scale proposal allows for multidimensional dimension scales. The set of functions that deals with dimension scales is proposed as a new high-level API. Mike has made available a summary of the dimension scales proposal for HDF5 and associated slides. The proposal is quite general, and should be adequate for supporting netCDF-4. It includes using a start value and offset pair to represent an equally-spaced dimension scale.
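An equally-spaced dimension scale stored as a pair of numbers can be expanded on demand. The following is a minimal sketch, assuming that the "offset" is the constant spacing between consecutive values. That reading, the function name `regular_scale`, and the sample numbers are assumptions, not part of the dimension scales proposal.

```python
def regular_scale(start, offset, length):
    """Expand a (start, offset) pair into explicit scale values."""
    return [start + i * offset for i in range(length)]


# A dimension of length 4 starting at 10.0 with spacing 2.5.
scale = regular_scale(10.0, 2.5, 4)
```

Storing only the pair avoids writing out a dataset whose values are fully determined by two numbers and the dimension length.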
James said he would like to get rid of the grid datatype in OPeNDAP, but there are various pressures to keep it.
John explained his support for coordinate systems in the Java netCDF interface (which is also how we plan to support them in the other netCDF interfaces) with this example:
float var(time, z, x, y)
var:_coordinates = "lat lon lev time"
float lat(x, y)
float lon(x, y)
float lev(time, z, x, y)
int time(time)
In this example, lat, lon, lev, and time are coordinate variables, generalizing the one-dimensional coordinate variables in current netCDF conventions.
The set of dimensions used by the coordinates must be a subset of the set of dimensions that the variable uses. We want to restrict the coordinate variables to be scalar type, with no extra dimensions.
Recently we set up an OPeNDAP server using ESML and level 2 radar with a complex OPeNDAP representation, but using this notion of coordinate systems simplified the representation so it was easy to understand, using "range elevation azimuth" as a coordinate system for time, lat, and lon dimensions.
James said this is also currently in the OPeNDAP-4 proposal, and that grids do this. GML (geographic markup language) uses a restricted form of this. HDF5 has some use cases in the dimension scales proposal that are more general. See Quincey's use cases and RFC for some controversial issues that arise. The netCDF-4 prototype can implement this on top of HDF5, but we could solidify this as a best practice by implementing an API for coordinate systems.
The HDF5 primitive types about which there are issues include:
The opaque type should be thought of as a "lump o' bytes" along with a name that gives a hint about its use, such as "JPEG2000". It has a size and a tag, and can be used much like a user-defined type. In HDF5 these are atomic, in that it doesn't make sense to access pieces of an opaque value. Consider it like a TCP packet, that is indivisible. It behaves differently from an array of bytes.
Enums convey additional meaning. ASN and EML have enums. Boolean could be an enum.
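As an illustration of Boolean expressed as an enum, the following is a minimal sketch using Python's standard `enum` module. The member names and integer values are assumptions chosen for the example.

```python
from enum import Enum


class Boolean(Enum):
    """A Boolean modeled as a two-member enumeration."""
    FALSE = 0
    TRUE = 1


# A stored integer is interpreted through the enum, which carries the meaning.
flag = Boolean(1)
```

The stored value is still a small integer, but the enum attaches the name `TRUE` to it, which is the additional meaning an enum type conveys.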
A reference points to an object or to a selection within a dataset. With references, you can support an array of pointers to objects. NPOESS uses references to point to slices of data. The astronomy community wants general references to pieces. NetCDF-4 could use names for the first kind of references, but general sections in HDF5 are complex and you could leave them out of netCDF-4. In general they can be used to support scatter/gather, like a bit mask to identify pieces that are of interest. James pointed out that "references" might be a misnomer, since they are not really like pointers or references in a programming language when used for the selection or subsetting function.
John pointed out that there were two things going on that he wanted to separate: HDF5's vlen data type, used for arrays of any type whose length is variable, and OPeNDAP sequences, which also represent 1D arrays of structures whose length is variable. OPeNDAP sequences can be passed a constraint expression to select particular elements.
In HDF5, a vlen is atomic. A ragged array of vlens is like an array of strings in C, with each element having its own length.
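The ragged-array picture can be sketched in a few lines. This is a minimal illustration, assuming each vlen value is modeled as an immutable tuple. The names `ragged` and `read_element` are hypothetical and do not correspond to HDF5 calls.

```python
# Three vlen elements, each with its own length (one is empty).
ragged = [
    (1, 2, 3),
    (),
    (4,),
]

lengths = [len(row) for row in ragged]


def read_element(array, index):
    """Return one whole vlen value; the sketch offers no access to
    part of an element, mirroring the atomic behavior."""
    return array[index]
```

The only unit of access in the sketch is a complete element, which is the sense in which a vlen is atomic.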
John asked about the following:
struct {
int a[*] // type vlen int
float f[*] // type vlen float
} s[213]
James agreed that OPeNDAP does not support vlens, but instead only has sequences and nested sequences. A JGOFS data set is an example. With sequences in real use for JGOFS, a relational expression is needed. The List data type that was dropped from OPeNDAP for lack of use was exactly a vlen, but without constraint expressions, no one had any use for it.
The HDF5 experience is that vlens are used in the DNA community and are exactly what is needed by Boeing for modeling their flight testing data. Could list be just a special case of sequence, so that one abstraction could unify these? What about adding relational constraints to HDF5 so they could be used to model OPeNDAP sequences with constraint expressions? That would make servers a lot more complicated. But if a server can't implement a sequence, it could just make it a list.
How do you decide which struct member can be used in constraint expressions? A feature request for OPeNDAP-4 is to provide a way of identifying which fields (structure members) can be used in constraint expressions.
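The idea of declaring which fields may appear in a constraint can be sketched as follows. This is a minimal illustration in Python, not OPeNDAP constraint-expression syntax. The function `select`, the `constrainable` set, and the sample records are all hypothetical.

```python
import operator

# Relational operators a constraint may use.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "=": operator.eq, "!=": operator.ne,
}


def select(sequence, constrainable, field, op, value):
    """Return the records of a sequence for which `field op value` holds,
    refusing fields that were not declared as constrainable."""
    if field not in constrainable:
        raise ValueError(f"field {field!r} cannot be used in a constraint")
    compare = OPS[op]
    return [rec for rec in sequence if compare(rec[field], value)]


# A sequence modeled as a list of records (made-up values).
casts = [
    {"station": "A", "depth": 5.0, "temp": 18.2},
    {"station": "A", "depth": 50.0, "temp": 12.1},
    {"station": "B", "depth": 10.0, "temp": 17.5},
]
shallow = select(casts, {"station", "depth"}, "depth", "<", 20.0)
```

A server that advertises the `constrainable` set lets a client discover in advance that `temp` cannot be used for selection here, rather than learning it from a failed request.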
An HDF5 vlen is like an OPeNDAP list in some ways, but it's atomic, in that you always get all of a row of a ragged matrix. (HDF5 also has an array type of fixed length that's atomic.)
John asked whether you can get just small fields in a sequence without reading everything else. There are subtleties and efficiency issues. The goal would be to serialize an OPeNDAP stream to HDF5 without a loss of information. Boeing does relational operations on tables, and would like indexes of certain fields of their tables for fast access. This capability would also be useful for creating multiple views. If someone were to write a proposal to add indexes to HDF5 in the form of another high-level datatype for creating indexes, sorting, etc., it might get funded for a year's worth of work.
Russ cautioned against adding indexes, views, query optimization, and similar support for database operations. If someone needs a relational database, they should just use a relational database, not a scientific data model on which database functionality has been grafted. John argued that there are simple cases that we should provide in a common data model to cover a lot of the functionality.
According to James, JGOFS and FreeForm are what people mostly use for sequence data, as well as HDF4. John thought the model for station data should be a list of structures with vlens, not a sequence. The station model use case would be ideal for MADIS mesonet data, to make it accessible from the IDV, and to provide what would look like servers for individual stations. James reminded us that no one ever implemented lists in their OPeNDAP servers, so they were dropped from the OPeNDAP-2 documentation submitted as a NASA standards specification. The function datatype was also eliminated for the same reason. OPeNDAP client writers have a much harder task than server writers, because they have to deal with multiple kinds of servers.
James thought enums and time would be useful datatypes to add, perhaps as a subclass of string. Quincey expressed scepticism about representing time in a general enough way for most uses. Russ noted that Java had gone through two iterations and now had a pretty useful abstraction that we could borrow from. John thought it would be sufficient to document a set of best practices for time and use the ISO string as a standard for the interface. There was a consensus to build a higher-level best practice for time, since Quincey objected to storing it as a "low-level" file type. We can, if we choose, still put it in the CDM by specializing an existing data type like string.
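A best practice built on ISO strings might look like the following minimal sketch, which uses Python's standard `datetime` module. The helper names `parse_time` and `format_time`, the choice to normalize to UTC, and the sample timestamp are assumptions made for illustration.

```python
from datetime import datetime, timezone


def parse_time(text):
    """Parse an ISO 8601 string with an explicit UTC offset."""
    return datetime.fromisoformat(text)


def format_time(moment):
    """Write a time as an ISO 8601 string normalized to UTC."""
    return moment.astimezone(timezone.utc).isoformat()


# An arbitrary timestamp given with a local offset.
local = parse_time("2004-10-21T08:30:00-06:00")
as_utc = format_time(local)
```

In this arrangement the stored type is an ordinary string, and the interpretation as time lives in a convention above the file format.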
Instead of adding groups to OPeNDAP, James wondered if it would be sufficient to externalize each group as a separate data source. However, in netCDF-4, we have in mind that groups will share things factored out into parent groups. Also, groups may be useful to organize the same data in different ways, such as by time and by parameter, although this requires that groups be non-hierarchical, as in HDF5. James expressed worry about adding groups to OPeNDAP-4, wondering what old clients would do with new servers. Would this break all existing clients?
The problem now is that people use OPeNDAP structures for both structures and groups. To prevent the confusion this causes, groups should be first-class objects. James expressed reluctance to add groups because he wants a simple spec so servers will be written. Adding groups to OPeNDAP-4 may be more of a deployment issue than a software issue. Mike pointed out that both Boeing and the DNA project implemented groups as nested sequences.
James asked how THREDDS catalogs relate to groups, since they could both be thought of as containers. The THREDDS hierarchy eventually gets down to single files. John wanted to think about this more.
Bob wondered how a large XML dataset could be served by OPeNDAP. James needs to talk more with John about making server front ends in Java rather than C++, and wondered how much of the Java server stuff could be used.
Russ will write up these meeting notes and make them available from the Wiki, and Mike will also make his notes available. James will see if he can re-invigorate the moribund OPeNDAP-4 specification development, but currently everyone is overcommitted.
James will revisit the OPeNDAP-4 spec on the web. He will also look at the Wiki and bring it up to date.
John will consider marketing ideas from this work, perhaps to get funding from the fact that we are working together on a common data model. Russ suggested a possibility of joining with an SEIII proposal he's thinking of writing to NSF (December 15 deadline).
Bob wants specific documents to result from this work. Perhaps we can use the Wiki to collaboratively develop a common data model document and a proposal document. The ESIP federation would certainly be interested in this work.
John is interested in development using the Wiki. Seamless interoperability among netCDF, HDF, and OPeNDAP is a long-term goal that may take years. The ESIP meeting is the first week in January, which might be a good time to tell people about this work.
James provided an appropriate epigram to end the meeting: "The reality is that in the end, running code wins."