Developing Conventions for NetCDF-4

Russ Rew, UCAR Unidata Program
September 2008

  1. Developing Conventions for NetCDF-4
    1. Background: NetCDF and Conventions
      1. NetCDF Data Models
    2. Use of NetCDF-4 Model Features
      1. Uses for Groups
      2. User-Defined Types
      3. Uses for Compound Types
      4. Convention for Assigning Attributes to Members of Compound Type
      5. Use of Variable-Length Types
      6. Use of Enumerations
      7. Cautions Regarding Use of Unsigned Integers
      8. Use of Strings
      9. Use of the NetCDF Classic Model with the NetCDF-4 Format
      10. Draft Recommendations for Using netCDF-4 Features
      11. Best Practices and CF Conventions for netCDF-4?
      12. Conclusion

Background: NetCDF and Conventions

Since 1988, netCDF user documentation has recommended use of conventions for representing meaning in data and for encouraging interoperability between data providers, application developers, and data users.  User Guide Conventions refer to recommendations that appear in the netCDF User Guide.  User Guide Conventions are intended to be general enough to apply to any kind of data represented in netCDF form, including data that is not earth-referenced, for example.

User Guide Conventions include recommendations such as:


User Guide Conventions are intended to be independent of scientific discipline.  Some are also independent of particular human languages, such as the identity of variable and dimension names to identify coordinate variables.  User Guide Conventions are intended to provide general solutions that anticipate needs of providers of data, developers of applications, and purveyors of data services.  They deal with issues that need a published standard to support interoperability.  If practical, netCDF applications should interpret User Guide Conventions properly.

User Guide Conventions are not intended to be comprehensive: more specific conventions may be required for particular projects, disciplines, or communities.  The climate and forecast community recognized that the simple set of conventions in the User Guide was not sufficient for writing generic analysis and visualization applications for their data sets.  This led to the development and publication of the Climate and Forecast (CF) Conventions (www.cfconventions.org), designed to support self-description, to be easy to use for both data writers and readers, to be effectively understandable by both humans and programs, and to minimize redundancy in representation.  CF Conventions make use of and build upon User Guide Conventions as well as earlier gridded data conventions.

Development of the CF Conventions provides an example of how a community can agree on standard ways to represent quantities and coordinate systems within the simple framework provided by netCDF-3, using only dimensions, variables, attributes, and a limited set of six primitive types.  In particular, the "standard_name" attribute is now used to represent a wide variety of observed and modeled quantities in a widely used collection of over 1000 standard names.  In addition to coordinate systems and standard names for quantities, CF provides standard representations for grid cell bounds and grid cell measures.

With the recent establishment of governance and conventions committees, the CF Conventions have achieved primacy among netCDF conventions that are more comprehensive than the User Guide Conventions.  Many earth-science datasets use CF-compliance as a brand of interoperability, and handling CF-compliant data has become an important requirement for servers, clients, and applications.  Recently, development of a new open source library, libcf, has begun, with the objective of making it easier for data providers to create and for readers to access CF-compliant data.

The CF Conventions focus on gridded model outputs but also deal with some simple observational data sets.  A more comprehensive set of conventions for observational data for point, trajectory, station, and profile data is implemented in the Unidata Observation Dataset Conventions for netCDF-3 (UnidataObsConvention.html), and is currently supported by the netCDF Java interface.  A recent proposal specifies modifications to integrate these observational data conventions with the CF conventions.

With the development and release of netCDF-4.0, an enhanced but more complex data model is available.  Some of the new features in netCDF-4 provide better ways to represent observational data, new ways to represent metadata, and ways to make data more self-describing.  This document suggests a few new conventions, but also recommends proceeding slowly, because the most useful conventions evolve over time from experience of data providers, application developers, and users of the data.  Each new convention adopted potentially adds work for developers of compliant applications and for providers of compliant data.

Achieving a balance between making a less-than-perfect convention available quickly and taking the time required to achieve a consensus on a complete, comprehensive, and well-tested convention is sometimes necessary. A successful transition to a new set of netCDF-4 conventions will require careful attention to compatibility concerns as well as realization that delay to achieve perfect consensus on a comprehensive and general convention is not always practical.  Delay may result in the undesirable use of incompatible conventions and loss of interoperability by data providers who have a timely need to make data available.

NetCDF Data Models

Data formats are low-level, implementing data conventions by mapping the abstractions that are the subject of conventions to their representation on storage media.  Data conventions make use of data formats, adding higher-level abstractions and data objects such as coordinate systems.  Data models are the most abstract and simplest conceptual layer, and may make use of data conventions in representing the intent of a data provider.  Data formats typically implement a data model directly when no conventions are used, but conventions may improve the data model by supporting additional abstractions or simplifications.

Two important data models for netCDF are the classic data model, used by netCDF-3, and the enhanced data model, used by netCDF-4.

Use of CF Conventions version 1.x supports the classic netCDF data model with coordinate systems and other useful abstractions.

The classic netCDF model represents data sets using named variables, dimensions, and attributes.  A variable is a multidimensional array whose elements are all of the same type.  A variable may also have attributes, which are associated named values. Each variable has a shape, specified by its dimensions, named axes that have a length.  Variables may share dimensions, indicating a common grid.  One dimension may be of unlimited length, so data may be efficiently appended to variables along that dimension.  Variables and attributes have one of six primitive data types: char, byte, short, int, float, or double.

A Unified Modeling Language (UML) diagram of the classic netCDF Data Model shows its simplicity:

[Figure: UML diagram of the classic netCDF data model]

Although the netCDF-3 data model has the virtue of simplicity, it also has significant limitations.  There is little support for data structures other than multidimensional arrays and lists.  In particular, nested structures and ragged arrays are not easily represented.  Only one shared unlimited dimension per file means some datasets must use multiple files.  A flat name space for dimensions and variables limits scalability.  Character arrays can represent strings, but require the user to explicitly deal with their length. Lack of unsigned types and 64-bit integer types precludes some applications.  The associated netCDF classic format does not support compression of individual variables.  Additions to file schema, such as adding new variables and dimensions, can be very inefficient, causing the data to be recopied.  Finally, the classic data format has a bias toward big-endian systems, requiring more byte-swapping conversions for accessing data on little-endian platforms.

The netCDF-4 data model, implemented using an HDF5-based storage layer, deals with all these limitations.  In this enhanced data model, a file has a top-level unnamed group.  Each group may contain one or more named variables, dimensions, attributes, groups, and types.  A variable is still a multidimensional array whose elements are all of the same type, each variable may have attributes, and each variable's shape is specified by its dimensions, which may be shared.  However, in the enhanced data model, one or more dimensions may be of unlimited length, so data may be efficiently appended to variables along any of those dimensions.  Variables and attributes have one of twelve primitive data types or one of four kinds of user-defined types.

A UML diagram of the enhanced netCDF data model used for netCDF-4 shows (in red) what it adds to the classic netCDF data model:

[Figure: UML diagram of the enhanced netCDF data model]

Because preserving access to archived data for future generations is very important, the netCDF-4 data model, data format, and software are designed to provide compatibility with and continued support for netCDF-3 data and applications.  Read and write access are provided for classic format data, and existing programs merely require recompiling.

Use of NetCDF-4 Model Features

Below, we discuss benefits that may be obtained by just using the classic netCDF data model with no netCDF-4 features.  This is recommended practice for existing data archives and for projects relying on current conventions or on data management software or visualization and analysis packages that assume the classic netCDF data model.  However, for new projects that lack legacy issues or constraints of interoperability with current systems, use of some netCDF-4 features may be more appropriate.  For such cases, we offer some early recommendations on uses for groups and user-defined types.

For the examples presented, we use the netCDF-4 Common Data Language (CDL) notation to show the structure of the data and metadata, as produced by the netCDF-4 ncdump utility and interpreted by the netCDF-4 ncgen utility (to be available in release 4.1). The examples presented are for illustrative purposes, so are not complete.  Some of these examples illustrate potential issues for new netCDF-4 conventions, and discussion of such issues appears interspersed with the examples.

Uses for Groups

Groups provide nested scopes for names, similar to directories in a file system.  Just as files in different directories may have the same names, variables in different groups may also have the same names.  A netCDF group is analogous to a netCDF file, with its own set of named dimensions, variables, attributes, types, and subgroups.  Names for objects in groups may be specified using a "/" separator to identify their location in the group hierarchy, just as with file systems.

Here is an example use of groups to organize data by a named property, in this case geographical regions:

group UnitedStates {
  dimensions: time = unlimited;
  variables: float average_temperature(time);

  group Washington {
    dimensions: time = unlimited, stations = 47;
    variables: float temperature(time, stations);
  }
  group Oregon {
    dimensions: time = unlimited, stations = 61;
    variables: float temperature(time, stations);
  }
  group California {
    dimensions: time = unlimited, stations = 53;
    variables: float temperature(time, stations);
  }
  …
}


In the above example, each inner group has its own variable named "temperature", its own dimension named "stations", and its own unlimited dimension named "time".  (It is also possible to have multiple unlimited dimensions within a single group or without using groups.)
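The scoping rule can be illustrated with a small model in plain Python.  This is only a sketch, not a use of any netCDF interface: nested dictionaries stand in for groups, a length of None stands in for an unlimited dimension, and the resolve helper is a hypothetical name introduced here.

```python
# Plain-Python model of nested groups; this is not the netCDF API.
# Each variable maps its dimension names to lengths (None = unlimited).
root = {
    "variables": {},
    "groups": {
        "UnitedStates": {
            "variables": {"average_temperature": {"time": None}},
            "groups": {
                "Washington": {
                    "variables": {"temperature": {"time": None, "stations": 47}},
                    "groups": {},
                },
                "Oregon": {
                    "variables": {"temperature": {"time": None, "stations": 61}},
                    "groups": {},
                },
            },
        }
    },
}


def resolve(group, path):
    """Return the variable named by a '/'-separated path (hypothetical helper)."""
    *group_names, var_name = path.strip("/").split("/")
    for name in group_names:
        group = group["groups"][name]
    return group["variables"][var_name]


# The same variable name refers to different objects in different groups.
washington = resolve(root, "/UnitedStates/Washington/temperature")
oregon = resolve(root, "/UnitedStates/Oregon/temperature")
```

Here the two "temperature" variables are distinct objects with different "stations" lengths, which is the point of giving each group its own name scope.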

Potential uses for groups include applications that require:

User-Defined Types

The netCDF-4 data model makes available several kinds of user-defined types: compound types, enumerations, variable-length types, and opaque types.  Each type has a name and a definition.  Named types are contained in groups, but may be referenced in type definitions in other groups.  Both variables and attributes may be declared to be of user-defined types, which allows a natural extension of conventions that require some variable attributes to be of the same type as the variable, for example _FillValue.

Types exist independently of variables or attributes that use them, so it is possible for a type to be contained in a netCDF group even though no variables or attributes are declared to be of that type.  This may be useful for pre-declaring types to be used for data to be added later or as templates for derived data objects.

Since each type requires a name, proliferation of names suggests it may be useful to have a convention for type names to easily distinguish them from variable names.  In the examples below, we add the "_t" suffix to type names to make them easier to identify.

Uses for Compound Types

Compound types are like C structures, grouping together named fields (also called "members"), that may be of different types, into a structure that may be accessed as a unit.  For example:

types:
  compound wind_vector_t {
    float eastward ;
    float northward ;
    }
dimensions:
    lat = 18 ;
    lon = 36 ;
    pres = 15 ;
    time = 4 ;
variables:
    wind_vector_t wind(time, pres, lat, lon) ;
       wind:standard_name = "geostrophic_wind_vector" ;
data:
    wind = {0, 0}, {10, 20}, {20, 10}, {15, 15}, {20, -5.5}, ...;

defines a wind vector type with two members, eastward and northward.  The standard_name attribute is given the value "geostrophic_wind_vector", which is not an actual standard name but a plausible substitute for the current standard names "geostrophic_eastward_wind" and "geostrophic_northward_wind" that identify non-vector quantities.  If data values will be accessed together, it may make sense to package them into a compound type and create a variable that is an array of that type.


Note there is a new CF Conventions issue with using compound types (or any user-defined type).  Does use of a particular standard name also imply use of a standard type for the associated compound type, for example should a quantity whose standard name includes "_wind_vector" be of a compound type equivalent to the wind_vector_t type defined in the example?  If so, are the member names of the compound type also part of the convention for this quantity?


As another compound type example, consider this representation of point observation data:


types:
  compound wind_vector_t {
    float eastward ;
    float northward ;
    }
  compound ob_t {
      int station_id ;
      double time ;
      float temperature ;
      float pressure ;
      wind_vector_t wind ;
  }
dimensions:
    stations = unlimited ;
variables:
    ob_t obs(stations) ;
data:
    obs = {42, 0.0, 20.5, 950.0, {2.5, 3.5}}, … ;

Compound types may be nested, as the above example shows with the use of a wind member of type wind_vector_t.


Potential uses for compound types include


Member fields of a type have a name and a type, but are not netCDF variables.  In particular, there is no way to directly assign variable attributes to them, but we next propose a convention to handle this.

Convention for Assigning Attributes to Members of Compound Type

Although the netCDF-4 data model does not support assigning attributes directly to individual fields of a compound type, it is possible to assign compound type attributes to a variable of compound type.  This leads to a natural convention for associating the values of fields of a compound type attribute with fields of a variable of compound type that have the same name.  An example may help to make this clearer:

types:
  compound wind_vector_t {
    float eastward ;
    float northward ;
    }
  compound wind_vector_units_t {
    string eastward ;
    string northward ;
    }
dimensions:
    station = 5 ;
variables:
    wind_vector_t wind(station) ;
       wind_vector_units_t wind:units = {"m/s", "m/s"} ;
       wind_vector_t wind:_FillValue = {-9999, -9999} ;
data:
    wind = {0, 0}, {10, 20}, {20, 10}, {15, 15}, {20, -5.5};

Note that the order of field names in the above compound types is not what determines the assignment of units.  Rather, the identity of field names associates the value of the field eastward of string type in the variable attribute units with the variable field eastward of type float in the wind variable.
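The name-matching rule can be sketched in plain Python.  This is an illustration of the proposed convention only, not a library interface: the units_for helper is hypothetical, and the two unit strings are deliberately different (unlike the CDL example) so that the effect of matching by name rather than by position is visible.

```python
# Plain-Python model of the name-matching convention; not the netCDF API.
# Members of the variable's compound type, in declaration order.
wind_fields = ["eastward", "northward"]

# Value of a compound-typed "units" attribute, declared in the opposite
# order.  The unit strings are made up for illustration.
units_attr = {"northward": "knots", "eastward": "m/s"}


def units_for(fields, attr):
    """Pair each variable field with the same-named attribute field, if any."""
    return {name: attr.get(name) for name in fields}


wind_units = units_for(wind_fields, units_attr)
```

A field with no same-named attribute member simply gets no value (None in this sketch); how a real convention should treat that case is left open here.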

As can be seen in the above example, use of this convention for assigning attribute values to members of compound types can lead to a proliferation of types and type names.  The netCDF-4 implementation inherits this problem from HDF5, which does not permit assigning attributes directly to compound type members.  In the future it may be possible to program around this by defining a library-level convention, either in netCDF-4 or libcf.

Use of Variable-Length Types

Named variable-length types may be created for any netCDF-4 base type, to represent one-dimensional arrays of variable length.  In netCDF-4, these currently must be read atomically; that is, the entire one-dimensional array must be accessed with one function call that returns both its length and its data values.

Here is an example of using nested variable-length types to represent marine data:

types:
  compound obs_t {                // type for a single observation
    float pressure ;
    float temperature ;
    float salinity ;
  }
  obs_t some_obs_t(*) ;           // type for some observations
  compound profile_t {            // type for a single profile
    float latitude ;
    float longitude ;
    int time ;
    some_obs_t obs ;
  }
  profile_t some_profiles_t(*) ;  // type for some profiles
  compound track_t {              // type for a single track
    string id ;
    string description ;
    some_profiles_t profiles ;
  }

dimensions:
  tracks = 42 ;

variables:
  track_t cruise(tracks) ;         // this cruise had 42 tracks

The above defines 42 tracks, each of which is of compound type containing an id, a description, and a variable number of profiles.  Each profile comprises a location, time, and a variable number of observations.  Each observation is a compound structure of pressure, temperature, and salinity.


Potential uses for variable-length types include ragged arrays and in situ observational data typical of soundings, profiles, and time series.  For a variable-length type, any base type may be used, including a compound type or another variable-length type.  There is no associated shared dimension, and the value of a variable-length type is currently accessed all at once, for example a whole row of a ragged array.  Access to one base value at a time of variable length types may be supported soon in some language interfaces by iterators.
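The nesting in the marine example can be mirrored in plain Python to show what "ragged" means in practice.  This is a sketch under stated assumptions: the dataclasses below only model the CDL types (they do not read or write netCDF data), and all data values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Obs:                      # models compound obs_t
    pressure: float
    temperature: float
    salinity: float


@dataclass
class Profile:                  # models compound profile_t
    latitude: float
    longitude: float
    time: int
    obs: List[Obs] = field(default_factory=list)           # models some_obs_t


@dataclass
class Track:                    # models compound track_t
    id: str
    description: str
    profiles: List[Profile] = field(default_factory=list)  # models some_profiles_t


# Made-up values, for illustration only.
track = Track("T1", "example track", [
    Profile(10.0, 20.0, 0, [Obs(1000.0, 15.0, 35.0), Obs(900.0, 12.0, 35.1)]),
    Profile(10.5, 20.5, 60, [Obs(1000.0, 15.2, 35.0)]),
])

# Each profile carries its own number of observations (a ragged array).
obs_counts = [len(p.obs) for p in track.profiles]
```

Note that no shared dimension describes the inner lengths; each list knows its own length, just as each variable-length value does.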


It may be useful to distinguish variable length types with a prefix such as "some_" or "list_of_" as in the example above.  A convention for names of variable-length types might enhance interoperability with other data models and make declarations of complex types more easily understood.  However, such a convention may be too English-centric for international use.

Use of Enumerations

Enumerated types may be used to represent a small number of named values more concisely than strings, because small numeric values are stored even though the corresponding text symbols are displayed when the data is dumped.  For example:

types:
  byte enum cloud_t {
    Clear = 0, Cumulonimbus = 1, Stratus = 2, Stratocumulus = 3,
    Cumulus = 4, Altostratus = 5, Nimbostratus = 6, Altocumulus = 7,
    Cirrostratus = 8, Cirrocumulus = 9, Cirrus = 10, Missing = 127
  } ;
dimensions:
    time = unlimited ;
variables:
    cloud_t primary_cloud(time) ;
        cloud_t primary_cloud:_FillValue = Missing ;
data:
    primary_cloud = Clear, Stratus, Clear, Cumulonimbus, Missing ;

Each data value of the variable primary_cloud in the example above only requires one byte of storage.  Using an array of fixed-length strings instead, as required in netCDF-3, would use at least 13 bytes of storage for each value, in order to reserve enough space to store the longest string "Stratocumulus".
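The storage arithmetic can be checked with a short plain-Python sketch.  It models the enumeration with Python's IntEnum and packs values as signed bytes with the struct module; this approximates the one-byte-per-value layout described above and does not reflect the exact on-disk format.

```python
import struct
from enum import IntEnum


class Cloud(IntEnum):
    """Python model of the byte enum cloud_t from the CDL example."""
    Clear = 0
    Cumulonimbus = 1
    Stratus = 2
    Stratocumulus = 3
    Cumulus = 4
    Altostratus = 5
    Nimbostratus = 6
    Altocumulus = 7
    Cirrostratus = 8
    Cirrocumulus = 9
    Cirrus = 10
    Missing = 127


data = [Cloud.Clear, Cloud.Stratus, Cloud.Clear, Cloud.Cumulonimbus, Cloud.Missing]

# One signed byte per value when stored as an enumeration.
enum_bytes = len(struct.pack(f"{len(data)}b", *data))

# Fixed-length strings must reserve room for the longest symbol.
string_width = max(len(member.name) for member in Cloud)
string_bytes = string_width * len(data)
```

For the five values in the example, the enumeration needs 5 bytes, while fixed-length strings need 13 bytes each, or 65 bytes in total.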

Enumeration types can improve self-description while keeping data compact.  They provide a better alternative to using strings for flags for such purposes as data quality indicators, soil type, cloud type, and similar situations where a small fixed set of non-numeric values is appropriate.  In the CF Conventions, the attributes flag_meanings and flag_values are used for this purpose, but using an enumeration type may be somewhat simpler.

A potential conventions issue is whether the gain in simplicity is worth the cost of a new convention.  If enumerations are used, would the enumeration symbols also be standardized, as the names for standard quantities are in the standard names table?  Would a convention be needed to associate a more descriptive string for each enumeration symbol?

Cautions Regarding Use of Unsigned Integers

Unsigned integers are not a supported type in some programming languages, such as Fortran and Java.  In these languages, n-bit unsigned integer values may have to be read into signed integers with more bits to ensure values are preserved.  For example, unsigned 16-bit shorts might need to be read into 32-bit signed integers.  This is especially problematic for the unsigned 64-bit type, for which no integer type may be available that is wide enough to hold large unsigned values.  As a general recommendation, avoid using the unsigned 64-bit integer type for data that someone might want to read using Fortran, Java, or other languages not supporting this type.
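The widening problem can be demonstrated with Python's struct module, which lets the same bytes be interpreted as signed or unsigned.  This is a sketch only: little-endian byte order is assumed for the example bytes, and widen_u16 is a hypothetical helper name.

```python
import struct

# Two bytes holding the largest unsigned 16-bit value
# (little-endian byte order is an assumption of this sketch).
raw = struct.pack("<H", 65535)

# A language with only signed 16-bit shorts would misread those bytes.
as_signed_16 = struct.unpack("<h", raw)[0]


def widen_u16(signed_value):
    """Recover the unsigned value by widening into a larger signed integer."""
    return signed_value & 0xFFFF


INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1

# Every unsigned 16-bit value fits in a signed 32-bit integer ...
recovered = widen_u16(as_signed_16)
fits_in_int32 = recovered <= INT32_MAX

# ... but the largest unsigned 64-bit value does not fit in a signed 64-bit one.
largest_u64_fits_in_int64 = UINT64_MAX <= INT64_MAX
```

The last comparison is false, which is why no commonly available signed integer type can stand in for the unsigned 64-bit type.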

Use of Strings

The primitive type "string" is available in the netCDF-4 data model for variable-length strings.  Arrays of strings are useful for representing multiple lines of text, lists of variable-length text values, and similar applications.  However, string is a new primitive type that is not available in the netCDF-3 C and Fortran APIs, and it is not compatible with netCDF-3 applications.  Data providers must weigh the convenience of using the string primitive type against the adaptation that will be required for software to access string data.

Currently, long multi-line strings used for metadata such as  "history", "source", or "references" global attributes, must use embedded newline characters "\n" to separate lines.  With an attribute of string type, arrays of lines may be represented without "\n" separators, and the values of such attributes will be displayed with one string value per line.
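The two representations carry the same information, as a small plain-Python sketch shows.  The history text is made up for illustration, and the code only models the attribute values; it does not call any netCDF interface.

```python
# Plain-Python model of the two attribute representations; not the netCDF API.
# A character attribute holds all lines in one value with embedded newlines.
history_char = "created by model run\nregridded to 1 degree\ncompressed"

# With an attribute of string type, each line can be its own array element.
history_strings = history_char.split("\n")

# Joining the elements again recovers the original character value.
round_trip = "\n".join(history_strings)
```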

Use of the NetCDF Classic Model with the NetCDF-4 Format

Data providers writing new netCDF data  have a choice among two obvious alternatives and a third less obvious choice:

  1. Continue using netCDF-3 software, data model, and associated format for maximum compatibility.
  2. Make use of netCDF-4 software and the netCDF-4 (HDF5-based) format for its new data model and performance features.
  3. Use the classic netCDF data model with the netCDF-4 format.

This third choice is supported by the netCDF-4 software, by using the  NC_CLASSIC_MODEL flag (in the C interface) when creating a file, which enforces rules on what functions may be called to store data in the file, to make sure its data can be read by netCDF-3 applications (when relinked to the new netCDF-4 library).

Use of the NC_CLASSIC_MODEL flag for writing new data files provides several significant benefits for both writers and readers, without breaking backwards compatibility for applications that read netCDF data.  Data written in this mode may follow current conventions for netCDF-3 files, even though the data will be written as netCDF-4/HDF5 files.  Benefits of using this mode for data writers include:


Some of these benefits are also available to netCDF-3 programs reading netCDF-4 NC_CLASSIC_MODEL data:


Data providers that use the NC_CLASSIC_MODEL when creating a file are prevented by the interface from making use of certain new features in the netCDF-4 data model that cannot be interpreted by netCDF-3 programs, including groups, new primitive types, and user-defined types.  The new primitive types are 8-, 16-, and 32-bit unsigned integers, 64-bit signed and unsigned integers, and strings.  User-defined types include compound structures, variable-length types, and enumerated types.

Draft Recommendations for Using netCDF-4 Features

Before using features of the netCDF-4 data model, consider the implications:


Nevertheless, there are cases when use of new features from the netCDF-4 data model may be the right choice.  For example, on a new project that lacks legacy issues or constraints from need for interoperability with existing applications, experimentation with the new data model may be desirable and practical.  Consider netCDF-4 if:


With these considerations in mind, some recommendations for data providers can be made during the transition from predominant use of netCDF-3 and the classic data model to wider use of the features of the enhanced data model supported in netCDF-4:


Some features of the new data model may be adopted and supported earlier in applications than other features.  It is possible that some features may not be widely used or supported by third-party software.  For example, support for groups and strings is easier than supporting arbitrarily nested user-defined types.

Benchmarks and user performance tests will help with guidance on compression, chunking parameters, and use of user-defined types, all of which have performance implications.

Best Practices and CF Conventions for netCDF-4?

NetCDF-4 has only recently been made available, so there are currently few identified conventions issues.  One of the principles for the CF Conventions is:

  "Conventions [should be] developed only for known issues.  Instead of trying to foresee the future, features are added as required."

Analogously, development of CF Conventions for netCDF-4 is still somewhat premature.  We may be able to foresee some issues and make modest recommendations to simplify the tasks of providing data and developing new services and applications, but developing comprehensive conventions for use in climate models and observational data, for example, will require more experience.

In general, it seems wise to avoid replacing an existing adequate convention with a better alternative convention that uses netCDF-4 features unless there is some overwhelming advantage to the new convention, because applications will have to accept both the old and new conventions.

Conclusion

There is still little experience with representing model outputs in the new data model.  Most of the experience so far is with use of the netCDF-4 format and the classic data model.  Gaining the experience needed to provide guidance for use of new features is better done by data providers and users than by the netCDF-4 developers.

The current release of netCDF-4 is not mature enough that we can recommend a comprehensive set of new conventions.  Nevertheless, we have tried to identify some features of netCDF-4 that are candidates for use in evolving CF conventions.  Application developers are likely to delay supporting netCDF-4 features until it's clear which features will prove useful for representing the next generation of model output archives and observational datasets.

With data providers, application developers, and conventions creators, we are confronted with something like a three-stage chicken-and-egg problem.  Data providers are unlikely to be the first to use features not supported by applications or standardized by conventions.  Application developers are unlikely to expend the effort needed to support features that are not being used by data providers and that are not standardized as published conventions.  Those drafting and maintaining standard conventions must wait until data providers identify needs for new conventions and must consider issues applications developers will confront if they decide to support the new conventions.

Best practices that become crystallized into new conventions develop out of the experience of data providers, application developers, and users. These considerations mean a definitive set of user guide and CF conventions for netCDF-4 may take a while to develop and mature.  More usage examples and draft proposals from users, developers, and data providers will advance the process.

This document is a draft and we welcome feedback.