Developing Conventions for NetCDF-4
Russ Rew, UCAR Unidata Program
September 2008
Background: NetCDF and Conventions
Since 1988, netCDF user documentation has recommended use of conventions for
representing meaning in data and for encouraging interoperability between data
providers, application developers, and data users. User Guide
Conventions refer to recommendations that appear in the netCDF User
Guide. User Guide Conventions are intended to be general enough to apply
to any kind of data represented in netCDF form, including data that is not
earth-referenced, for example.
User Guide Conventions include recommendations such as:
- Use of the same name for a variable as for a dimension to represent simple
  coordinates
- A "units" attribute to store a string representation of units of measurement
- "scale_factor" and "add_offset" attributes for simple packing of
  floating-point values into smaller 8- or 16-bit integers
- A "_FillValue" attribute to represent data that has not been written or that
  is missing
- A "Conventions" attribute to declare which discipline-specific,
  project-specific, or community-specific conventions a data file complies
  with
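As a rough illustration of the packing convention listed above, the following Python sketch packs floating-point values into 16-bit integers and recovers them with unpacked = packed * scale_factor + add_offset. The function names, the value range, and the decision to leave the most negative code free for a fill value are assumptions made for this example; they are not part of any netCDF library.

```python
def compute_packing(vmin, vmax, nbits=16):
    """Return (scale_factor, add_offset) mapping [vmin, vmax] onto signed integers.

    Assumption for this sketch: the most negative code is left unused so that
    it could serve as a _FillValue.
    """
    nsteps = 2 ** nbits - 2            # steps between smallest and largest used code
    scale_factor = (vmax - vmin) / nsteps
    add_offset = (vmax + vmin) / 2.0
    return scale_factor, add_offset


def pack(value, scale_factor, add_offset):
    """Convert a floating-point value to its packed integer code."""
    return int(round((value - add_offset) / scale_factor))


def unpack(packed, scale_factor, add_offset):
    """Recover an approximate floating-point value from a packed code."""
    return packed * scale_factor + add_offset


# Hypothetical temperature range in degrees Celsius.
scale_factor, add_offset = compute_packing(-50.0, 50.0)
packed = pack(20.5, scale_factor, add_offset)
restored = unpack(packed, scale_factor, add_offset)
```

The restored value differs from the original by at most half of scale_factor, which is the precision given up in exchange for the smaller storage type.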
User Guide Conventions are intended to be independent of scientific
discipline. Some are also independent of particular human languages, such
as the identity of variable and dimension names to identify coordinate
variables. User Guide Conventions are intended to provide general
solutions that anticipate needs of providers of data, developers of
applications, and purveyors of data services. They deal with issues that
need a published standard to support interoperability. If practical,
netCDF applications should interpret User Guide Conventions properly.
User Guide Conventions are not intended to be comprehensive: more specific
conventions may be required for particular projects, disciplines, or
communities. The climate and forecast community recognized that the simple
set of conventions in the User Guide was not sufficient for writing generic
analysis and visualization applications for their data sets. This led to the
development and publication of the Climate and Forecast (CF) Conventions
(www.cfconventions.org), designed to support self-description, to be easy to
use for both data writers and readers, to be effectively understandable by both
humans and programs, and to minimize redundancy in representation. CF
Conventions make use of and build upon User Guide Conventions as well as
earlier gridded data conventions.
Development of the CF Conventions provides an example of how a community can
agree on standard ways to represent quantities and coordinate systems within the
simple framework provided by netCDF-3, using only dimensions, variables,
attributes, and a limited set of six primitive types. In particular, the
"standard_name" attribute is now used to represent a wide variety of observed
and modeled quantities in a widely used collection of over 1000 standard
names. In addition to coordinate systems and standard names for
quantities, CF provides standard representations for grid cell bounds and grid
cell measures.
With the recent establishment of governance and conventions committees, the CF
Conventions have achieved primacy among netCDF conventions more comprehensive
than the User Guide Conventions. Many earth-science datasets use
CF-compliance as a brand of interoperability, and handling CF-compliant data has
become an important requirement for servers, clients, and applications.
Recently, development of a new open source library, libcf, has begun, with the
objective of making it easier for data providers to create and for readers to
access CF-compliant data.
The CF Conventions focus on gridded model outputs but also deal with some simple
observational data sets. A more comprehensive set of conventions for
observational data for point, trajectory, station, and profile data is
implemented in the Unidata Observation Dataset Conventions for netCDF-3
(UnidataObsConvention.html), and is currently supported by the netCDF Java
interface. A recent proposal specifies modifications to integrate these
observational data conventions with the CF conventions.
With the development and release of netCDF-4.0, an enhanced but more complex
data model is available. Some of the new features in netCDF-4 provide
better ways to represent observational data, new ways to represent metadata, and
ways to make data more self-describing. This document suggests a few new
conventions, but also recommends proceeding slowly, because the most useful
conventions evolve over time from experience of data providers, application
developers, and users of the data. Each new convention adopted potentially
adds work for developers of compliant applications and for providers of
compliant data.
Achieving a balance between making a less-than-perfect convention available
quickly and taking the time required to achieve a consensus on a complete,
comprehensive, and well-tested convention is sometimes necessary. A successful
transition to a new set of netCDF-4 conventions will require careful attention
to compatibility concerns as well as realization that delay to achieve perfect
consensus on a comprehensive and general convention is not always
practical. Delay may result in the undesirable use of incompatible
conventions and loss of interoperability by data providers who have a timely
need to make data available.
NetCDF Data Models
Data formats are low-level, implementing data conventions by
mapping the abstractions that are the subject of conventions to their
representation on storage media. Data conventions make use
of data formats, adding higher-level abstractions and data objects such as
coordinate systems. Data models are the most abstract
and simplest conceptual layer, and may make use of data conventions in
representing the intent of a data provider. Data formats typically
implement a data model directly when no conventions are used, but conventions
may improve the data model by supporting additional abstractions or
simplifications.
Two important data models for netCDF are
- the "classic netCDF model", used for netCDF-3 and earlier versions
- an enhanced data model referred to as the Common Data Model (CDM), used for
  netCDF-4 and later versions
Use of CF Conventions version 1.x supports the classic netCDF data model with
coordinate systems and other useful abstractions.
The classic netCDF model represents data sets using named variables,
dimensions, and attributes. A variable is a multidimensional
array whose elements are all of the same type. A variable may also have
attributes, which are associated named values. Each variable has a shape,
specified by its dimensions, which are named axes that each have a length. Variables may
share dimensions, indicating a common grid. One dimension may be of
unlimited length, so data may be efficiently appended to variables along that
dimension. Variables and attributes have one of six primitive data types:
char, byte, short, int, float, or double.
A Unified Modeling Language (UML) diagram of the classic netCDF Data Model shows
its simplicity:
Although the netCDF-3 data model has the virtue of simplicity, it also has
significant limitations. There is little support for data structures other
than multidimensional arrays and lists. In particular, nested structures
and ragged arrays are not easily represented. Only one shared unlimited
dimension per file means some datasets must use multiple files. A flat
name space for dimensions and variables limits scalability. Character
arrays can represent strings, but require the user to explicitly deal with their
length. Lack of unsigned types and 64-bit integer types precludes some
applications. The associated netCDF classic format does not support
compression of individual variables. Additions to file schema, such as
adding new variables and dimensions, can be very inefficient, causing the data
to be recopied. Finally, the classic data format has a bias toward
big-endian systems, requiring more byte-swapping conversions for accessing data
on little-endian platforms.
The netCDF-4 data model, implemented using an HDF5-based storage layer, deals
with all these limitations. In this enhanced data model, a file has a
top-level unnamed group. Each group may contain one or more named
variables, dimensions, attributes, groups, and types. A variable is still
a multidimensional array whose elements are all of the same type, each variable
may have attributes, and each variable's shape is specified by its dimensions,
which may be shared. However, in the enhanced data model, one or
more dimensions may be of unlimited length, so data may be efficiently
appended to variables along any of those dimensions. Variables and
attributes have one of twelve primitive data types or one of four kinds of
user-defined types.
A UML diagram of the enhanced netCDF data model used for netCDF-4 shows (in red)
what it adds to the classic netCDF data model:
Because preserving access to archived data for future generations is very
important, the netCDF-4 data model, data format, and software are designed to
provide compatibility with and continued support for netCDF-3 data and
applications. Read and write access are provided for classic format data,
and existing programs merely require recompiling.
Use of NetCDF-4 Model Features
Below, we discuss benefits that may be obtained by just using the classic netCDF
data model with no netCDF-4 features. This is recommended practice for
existing data archives and projects relying on current conventions or on data
management software or visualization and analysis packages that assume the
classic netCDF data model. However, for new projects that lack legacy
issues or constraints of interoperability with current systems, use of some
netCDF-4 features may be more appropriate. For such cases, we offer some
early recommendations for uses for groups and user-defined types.
For the examples presented, we use the netCDF-4 Common Data Language (CDL)
notation to show the structure of the data and metadata, as produced by the
netCDF-4 ncdump utility and interpreted by the netCDF-4 ncgen utility (to be
available in release 4.1). The examples presented are for illustrative purposes,
so are not complete. Some of these examples illustrate potential issues
for new netCDF-4 conventions, and discussion of such issues appears interspersed
with the examples.
Uses for Groups
Groups provide nested scopes for names, similar to directories in a file
system. Just as files in different directories may have the same names,
variables in different groups may also have the same names. A netCDF group
is analogous to a netCDF file, with its own set of named dimensions, variables,
attributes, types, and subgroups. Names for objects in groups may be
specified using a "/" separator to identify their location in the group
hierarchy, just as with file systems.
Here is an example use of groups to organize data by a named property, in this
case geographical regions:
group UnitedStates {
  dimensions: time = unlimited;
  variables: float average_temperature(time);
  group Washington {
    dimensions: time = unlimited, stations = 47;
    variables: float temperature(time, stations);
  }
  group Oregon {
    dimensions: time = unlimited, stations = 61;
    variables: float temperature(time, stations);
  }
  group California {
    dimensions: time = unlimited, stations = 53;
    variables: float temperature(time, stations);
  }
  …
}
In the above example, each inner group has its own variable named "temperature",
its own dimension named "stations", and its own unlimited dimension named
"time". (It is also possible to have multiple unlimited dimensions within
a single group or without using groups.)
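To make the "/" naming of objects in nested groups concrete, here is a small Python sketch that models groups as nested containers and resolves a path to a variable. The Group class and the lookup function are hypothetical stand-ins for illustration only, not a netCDF API, and variables are represented by simple descriptive strings.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical in-memory stand-in for a netCDF-4 group."""
    variables: dict = field(default_factory=dict)
    groups: dict = field(default_factory=dict)


def lookup(root, path):
    """Resolve a "/"-separated path such as "/UnitedStates/Oregon/temperature"."""
    *group_names, var_name = path.strip("/").split("/")
    group = root
    for name in group_names:
        group = group.groups[name]      # descend one level, like a directory
    return group.variables[var_name]


# The unnamed top-level group contains the UnitedStates group from the example.
root = Group(groups={
    "UnitedStates": Group(
        variables={"average_temperature": "float(time)"},
        groups={
            "Washington": Group(variables={"temperature": "float(time, stations=47)"}),
            "Oregon": Group(variables={"temperature": "float(time, stations=61)"}),
        },
    )
})

oregon_temperature = lookup(root, "/UnitedStates/Oregon/temperature")
```

Both inner groups hold a variable called temperature, yet the full path distinguishes them, which is the separate-name-scope property described above.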
Potential uses for groups include applications that require:
- containers to "factor out" common information, such as for regions, grids,
  or model ensembles
- hierarchies to organize a large number of variables
- separate name scopes, so that multiple sets of data may use the same names
  for dimensions, variables, and group-level attributes
- sequences of analysis steps, from raw data to derived products
- containers for storing closely coupled data, such as instrument calibration
  parameters and instrument sensor descriptions
User-Defined Types
The netCDF-4 data model makes available several kinds of user-defined types:
compound types, enumerations, variable-length types, and opaque types.
Each type has a name and a definition. Named types are contained in
groups, but may be referenced in type definitions in other groups. Both
variables and attributes may be declared to be of user-defined types, which
allows a natural extension of conventions that require some variable attributes
to be of the same type as the variable, for example _FillValue.
Types exist independently of variables or attributes that use them, so it is
possible for a type to be contained in a netCDF group even though no variables
or attributes are declared to be of that type. This may be useful for
pre-declaring types to be used for data to be added later or as templates for
derived data objects.
Since each type requires a name, proliferation of names suggests it may be
useful to have a convention for type names to easily distinguish them from
variable names. In the examples below, we add the "_t" suffix to type
names to make them easier to identify.
Uses for Compound Types
Compound types are like C structures, grouping together named fields (also
called "members"), that may be of different types, into a structure that may be
accessed as a unit. For example:
types:
  compound wind_vector_t {
    float eastward ;
    float northward ;
  }
dimensions:
  lat = 18 ;
  lon = 36 ;
  pres = 15 ;
  time = 4 ;
variables:
  wind_vector_t wind(time, pres, lat, lon) ;
    wind:standard_name = "geostrophic_wind_vector" ;
data:
  wind = {0, 0}, {10, 20}, {20, 10}, {15, 15}, {20, -5.5}, ...;
This example defines a wind vector type with two members, eastward and
northward. The standard_name attribute is given the value
"geostrophic_wind_vector", which is not an actual standard name but a plausible
substitute for the current standard names "geostrophic_eastward_wind" and
"geostrophic_northward_wind" that identify non-vector quantities. If data
values will be accessed together, it may make sense to package them into a
compound type and create a variable that is an array of that type.
Note that using compound types (or any user-defined type) raises a new CF
Conventions issue. Does use of a particular standard name also imply use of a
standard type for the associated compound type? For example, should a quantity
whose standard name includes "_wind_vector" be of a compound type equivalent to
the wind_vector_t type defined in the example? If so, are the member names of
the compound type also part of the convention for this quantity?
As another compound type example, consider this representation of point
observation data:
types:
  compound wind_vector_t {
    float eastward ;
    float northward ;
  }
  compound ob_t {
    int station_id ;
    double time ;
    float temperature ;
    float pressure ;
    wind_vector_t wind ;
  }
dimensions:
  stations = unlimited ;
variables:
  ob_t obs(stations) ;
data:
  obs = {42, 0.0, 20.5, 950.0, {2.5, 3.5}}, … ;
Compound types may be nested, as the above example shows with the use of a
wind member of type wind_vector_t.
Potential uses for compound types include
- Representing vector quantities like velocities
- Modeling relational database tuples
- Representing objects with components
- Bundling multiple in situ observations together (profiles, soundings)
- Representing C structures in a portable form
Member fields of a type have a name and a type, but are not
netCDF variables. In particular, there is no way to directly assign
variable attributes to them, but we next propose a convention to handle this.
Convention for Assigning Attributes to Members of Compound Type
Although the netCDF-4 data model does not support assigning attributes directly
to individual fields of a compound type, it is possible to assign compound type
attributes to a variable of compound type. This leads to a natural
convention for associating the values of fields of a compound type attribute
with fields of a variable of compound type that have the same name. An
example may help to make this clearer:
types:
  compound wind_vector_t {
    float eastward ;
    float northward ;
  }
  compound wind_vector_units_t {
    string eastward ;
    string northward ;
  }
dimensions:
  station = 5 ;
variables:
  wind_vector_t wind(station) ;
    wind_vector_units_t wind:units = {"m/s", "m/s"} ;
    wind_vector_t wind:_FillValue = {-9999, -9999} ;
data:
  wind = {0, 0}, {10, 20}, {20, 10}, {15, 15}, {20, -5.5};
Note that the order of field names in the above compound types is not what
determines the assignment of units. Rather, the identity of field names
associates the value of the string field eastward in the units attribute with
the float field eastward in the wind variable.
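The name-based association proposed by this convention can be sketched in a few lines of Python. Compound attribute values are modeled here as dictionaries keyed by member name; the member_attributes helper is a hypothetical name introduced for this illustration and does not correspond to any netCDF or libcf function.

```python
# Members of the wind_vector_t type, in declaration order.
wind_fields = ("eastward", "northward")

# Compound attribute values keyed by member name. The units dictionary is
# deliberately written in a different order to show that order is irrelevant.
units_attr = {"northward": "m/s", "eastward": "m/s"}
fill_attr = {"eastward": -9999.0, "northward": -9999.0}


def member_attributes(fields, **compound_attrs):
    """Collect, for each member, the attribute values whose field has the same name."""
    result = {}
    for name in fields:
        result[name] = {
            attr: value[name]
            for attr, value in compound_attrs.items()
            if name in value
        }
    return result


attrs = member_attributes(wind_fields, units=units_attr, _FillValue=fill_attr)
eastward_units = attrs["eastward"]["units"]
```

A reader following the convention would perform a lookup of this kind for each compound-typed attribute attached to a compound-typed variable.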
As can be seen in the above example, use of this convention for assigning
attribute values to members of compound types can lead to a proliferation of
types and type names. The netCDF-4 implementation inherits this problem
from HDF5, which does not permit assigning attributes directly to compound type
members. In the future it may be possible to program around this by
defining a library-level convention, either in netCDF-4 or libcf.
Use of Variable-Length Types
Named variable-length types may be created for any netCDF-4 base type, to
represent one-dimensional arrays of variable length. In netCDF-4, these
currently must be read atomically; that is, the entire one-dimensional array
must be accessed with one function call that returns both its length and the
values of the data.
Here is an example of using nested variable-length types to represent marine
data:
types:
  compound obs_t {                  // type for a single observation
    float pressure ;
    float temperature ;
    float salinity ;
  }
  obs_t some_obs_t(*) ;             // type for some observations
  compound profile_t {              // type for a single profile
    float latitude ;
    float longitude ;
    int time ;
    some_obs_t obs ;
  }
  profile_t some_profiles_t(*) ;    // type for some profiles
  compound track_t {                // type for a single track
    string id ;
    string description ;
    some_profiles_t profiles ;
  }
dimensions:
  tracks = 42 ;
variables:
  track_t cruise(tracks) ;          // this cruise had 42 tracks
The above defines 42 tracks, each of which is of compound type containing an
id, a description, and a variable number of profiles. Each profile
comprises a location, time, and a variable number of observations. Each
observation is a compound structure of pressure, temperature, and salinity.
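For readers more familiar with general-purpose languages, the nesting described above resembles the following Python sketch, in which each variable-length member becomes a list. The class names mirror the CDL type names, but the sample values and identifiers are invented for illustration, and only two tracks are shown rather than 42.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Obs:
    """One observation (corresponds to obs_t)."""
    pressure: float
    temperature: float
    salinity: float


@dataclass
class Profile:
    """One profile with a variable number of observations (profile_t)."""
    latitude: float
    longitude: float
    time: int
    obs: List[Obs] = field(default_factory=list)


@dataclass
class Track:
    """One track with a variable number of profiles (track_t)."""
    id: str
    description: str
    profiles: List[Profile] = field(default_factory=list)


# Hypothetical sample data: a ragged structure with differing lengths.
cruise = [
    Track("track-1", "hypothetical track", [
        Profile(10.0, -150.0, 0, [Obs(1.0, 20.0, 35.0), Obs(10.0, 18.5, 35.1)]),
        Profile(10.5, -150.0, 3600, [Obs(1.0, 19.8, 35.0)]),
    ]),
    Track("track-2", "another hypothetical track", [
        Profile(11.0, -149.5, 7200,
                [Obs(1.0, 19.5, 34.9), Obs(10.0, 18.0, 35.0), Obs(50.0, 12.0, 35.2)]),
    ]),
]

# Number of observations in each profile of each track.
obs_counts = [[len(p.obs) for p in t.profiles] for t in cruise]
```

The differing lengths in obs_counts are what make the structure ragged; no shared dimension describes them.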
Potential uses for variable-length types include ragged arrays and
in situ observational data typical of soundings,
profiles, and time series. For a variable-length type, any base type may
be used, including a compound type or another variable-length type.
There is no associated shared dimension, and the value of a variable-length
type is currently accessed all at once, for example, a whole row of a ragged
array. Access to one base value at a time of variable-length types may
be supported soon in some language interfaces by iterators.
It may be useful to distinguish variable-length types with a prefix such as
"some_" or "list_of_" as in the example above. A convention for names of
variable-length types might enhance interoperability with other data models
and make declarations of complex types more easily understood. However,
such a convention may be too English-centric for international use.
Use of Enumerations
Enumerated types may be used to represent a small number of named values more
concisely than strings, because small numeric values are stored even
though the corresponding text symbols are displayed when the data is
dumped. For example:
types:
  byte enum cloud_t {
    Clear = 0, Cumulonimbus = 1, Stratus = 2, Stratocumulus = 3,
    Cumulus = 4, Altostratus = 5, Nimbostratus = 6, Altocumulus = 7,
    Cirrostratus = 8, Cirrocumulus = 9, Cirrus = 10, Missing = 127
  } ;
dimensions:
  time = unlimited ;
variables:
  cloud_t primary_cloud(time) ;
    cloud_t primary_cloud:_FillValue = Missing ;
data:
  primary_cloud = Clear, Stratus, Clear, Cumulonimbus, Missing ;
Each data value of the variable primary_cloud in the example above only
requires one byte of storage. Using an array of fixed-length strings instead,
as required in netCDF-3, would use at least 13 bytes of storage for each value,
in order to reserve enough space to store the longest string "Stratocumulus".
Enumeration types can improve self-description while keeping data compact.
They provide a better alternative to using strings for flags for such purposes
as data quality indicators, soil type, cloud type, and similar situations where
a small fixed set of non-numeric values is appropriate. In the CF
Conventions, the attributes flag_meanings and flag_values are used for this
purpose, but using an enumeration type may be somewhat simpler.
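The storage comparison for the cloud_t example can be checked with a short Python sketch. It uses only the standard struct module to mimic one-byte enumeration storage against fixed-length character storage; it is an illustration of the arithmetic, not a description of how the netCDF library lays out data on disk.

```python
import struct

# Enumeration symbols and values from the cloud_t example.
cloud_t = {
    "Clear": 0, "Cumulonimbus": 1, "Stratus": 2, "Stratocumulus": 3,
    "Cumulus": 4, "Altostratus": 5, "Nimbostratus": 6, "Altocumulus": 7,
    "Cirrostratus": 8, "Cirrocumulus": 9, "Cirrus": 10, "Missing": 127,
}
symbols = {value: name for name, value in cloud_t.items()}

primary_cloud = ["Clear", "Stratus", "Clear", "Cumulonimbus", "Missing"]

# One signed byte per value, as with a byte-based enumeration.
as_enum = struct.pack(f"{len(primary_cloud)}b", *(cloud_t[s] for s in primary_cloud))

# Fixed-length character storage wide enough for the longest symbol.
width = max(len(name) for name in cloud_t)
as_strings = b"".join(s.encode("ascii").ljust(width, b"\0") for s in primary_cloud)

# Decoding the bytes recovers the symbols, as ncdump does when displaying data.
decoded = [symbols[code] for code in struct.unpack(f"{len(as_enum)}b", as_enum)]
```

For these five values the enumeration needs 5 bytes, while the fixed-length strings need 5 x 13 = 65 bytes.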
A potential conventions issue is whether the gain in simplicity is worth the
cost of a new convention. If enumerations are used, would the enumeration
symbols also be standardized, as the names for standard quantities are in the
standard names table? Would a convention be needed to associate a more
descriptive string for each enumeration symbol?
Cautions Regarding Use of Unsigned Integers
Unsigned integers are not a supported type in some programming languages, such
as Fortran and Java. In these languages, n-bit unsigned integer values may
have to be read into signed integers with more bits to ensure values are
preserved. For example, unsigned 16-bit shorts might need to be read into
32-bit signed integers. This is especially problematic for the unsigned
64-bit type, for which no integer type may be available wide enough to hold
large unsigned values. As a general recommendation, avoid using the
unsigned 64-bit integer type for data that someone might want to read using
Fortran, Java, or other languages not supporting this type.
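The widening issue can be demonstrated with Python's standard struct module, which lets the same bytes be interpreted as signed or unsigned. This is only an illustration of the arithmetic that a Fortran or Java reader would face, not code from any netCDF interface.

```python
import struct

# The largest unsigned 16-bit value, stored little-endian in two bytes.
raw = struct.pack("<H", 65535)

# Reading the same two bytes as a signed 16-bit integer loses the value.
as_signed_16 = struct.unpack("<h", raw)[0]

# Widening: mask into a larger signed integer to recover the intended value.
widened = as_signed_16 & 0xFFFF

# For unsigned 64-bit data there may be no wider signed integer type to use:
# the largest unsigned 64-bit value exceeds the largest signed 64-bit value.
max_uint64 = 2 ** 64 - 1
max_int64 = 2 ** 63 - 1
fits_in_int64 = max_uint64 <= max_int64
```

Here as_signed_16 is -1 and widened is 65535, while fits_in_int64 is False, which is why the unsigned 64-bit type is singled out above.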
Use of Strings
The primitive type "string" is available in the netCDF-4 data model for
variable-length strings. Arrays of strings are useful for representing
multiple lines of text, lists of variable-length text values, and similar
applications. However, string is a new primitive type, not available to
netCDF-3 C and Fortran APIs. It is not compatible with netCDF-3
applications. Data providers must weigh the convenience of using the
string primitive type against the adaptation that will be required for software
to access string data.
Currently, long multi-line strings used for metadata, such as the "history",
"source", or "references" global attributes, must use embedded newline
characters "\n" to separate lines. With an attribute of string type,
arrays of lines may be represented without "\n" separators, and the values of
such attributes will be displayed with one string value per line.
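The difference between the two representations can be shown with plain Python values. The history text below is invented for illustration; the first form corresponds to a single character attribute with embedded newlines, the second to an attribute holding an array of strings.

```python
# Classic style: one character string with embedded newline separators.
history_char = (
    "2008-09-01 created\n"
    "2008-09-02 regridded\n"
    "2008-09-03 compressed"
)

# String-typed attribute: one string value per line, no separators needed.
history_strings = history_char.split("\n")

# Joining the lines again reproduces the classic form, so no information is lost.
round_trip = "\n".join(history_strings)
```

Software that must handle both styles can convert between them in this way.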
Use of the NetCDF Classic Model with the NetCDF-4 Format
Data providers writing new netCDF data have a choice among two obvious
alternatives and a third less obvious choice:
- Continue using netCDF-3 software, data model, and associated format for
  maximum compatibility.
- Make use of netCDF-4 software and the netCDF-4 (HDF5-based) format for its
  new data model and performance features.
- Use the classic netCDF data model with the netCDF-4 format.
This third choice is supported by the netCDF-4 software, by using the
NC_CLASSIC_MODEL flag (in the C interface) when creating a file, which enforces
rules on what functions may be called to store data in the file, to make sure
its data can be read by netCDF-3 applications (when relinked to the new netCDF-4
library).
Use of the NC_CLASSIC_MODEL flag for writing new data files provides several
significant benefits for both writers and readers, without breaking backwards
compatibility for applications that read netCDF data. Data written in this
mode may follow current conventions for netCDF-3 files, even though the data
will be written as netCDF-4/HDF5 files. Benefits of using this mode for
data writers include:
- ability to specify per-variable compression, with associated I/O performance
  benefits
- ability to use per-variable chunking, with efficiencies for
  writing data along multiple axes
- elimination of netCDF-3 variable size restrictions (4 GiB for fixed-size
  variables, 4 GiB/record for record variables)
- "reader makes right" I/O, with associated performance benefits for accessing
  data on little-endian platforms
- efficient dynamic schema changes, so new variables, dimensions, and
  attributes may be added without copying data
- potential to use parallel I/O
Some of these benefits are also available to netCDF-3 programs reading netCDF-4
NC_CLASSIC_MODEL data:
- reading compressed variables transparently
- reading chunked variables transparently and efficiently along multiple axes
- ability to read variables larger than netCDF-3 allows
- "reader makes right" I/O, with performance benefits when reading in same
  endian-ness as written
- ability to use HDF5 tools on file, since netCDF-4 files are easily readable
  by tools such as HDFView and h5dump
Data providers that use the NC_CLASSIC_MODEL flag when creating a file are prevented
by the interface from making use of certain new features in the netCDF-4 data
model that cannot be interpreted by netCDF-3 programs, including groups, new
primitive types, and user-defined types. The new primitive types are 8-,
16-, and 32-bit unsigned integers, 64-bit signed and unsigned integers, and
strings. User-defined types include compound structures, variable-length
types, and enumerated types.
Draft Recommendations for Using netCDF-4 Features
Before using features of the netCDF-4 data model, consider the implications:
- As of this writing, C-based netCDF-4 is recently released and Java-based
  netCDF-4 software is still under development.
- Few utilities or applications have been adapted to access, visualize, or
  analyze netCDF-4 data yet. The ncdump utility has been adapted to
  netCDF-4 features, but ncgen is not yet fully adapted.
- Adapting generic utilities to support all netCDF-4 features requires
  significant effort.
- The Parallel netCDF library from Argonne and Northwestern is based on the
  classic netCDF data model (although netCDF-4 also supports a different
  parallel I/O interface).
- Significant performance improvements are available now by using the classic
  netCDF data model with the netCDF-4 format.
Nevertheless, there are cases when use of new features from the netCDF-4 data
model may be the right choice. For example, on a new project that lacks
legacy issues or constraints from need for interoperability with existing
applications, experimentation with the new data model may be desirable and
practical. Consider netCDF-4 if:
- a netCDF-4-specific primitive type is required, such as 64-bit integers
- the compound type is needed to provide arrays of portable structures
- multiple unlimited dimensions are required
- variable-length types are needed for structures such as ragged arrays
- groups are required for multiple name spaces
- enumerated types are needed for better self-description
- nested combinations of user-defined types are needed
With these considerations in mind, some recommendations for data providers can
be made during the transition from predominant use of netCDF-3 and the classic
data model to wider use of the features of the enhanced data model supported in
netCDF-4:
- Continue using classic data model and format, if suitable
- Evaluate practicality and benefits of classic model with netCDF-4 format
  using the NC_CLASSIC_MODEL (or language equivalent) creation flag
- Incrementally test and explore uses of extended netCDF-4 data model features
- Help evolve netCDF-4 conventions and best practices based on experience
Some features of the new data model may be adopted and supported earlier in
applications than other features. It is possible that some features may
not be widely used or supported by third-party software. For example,
support for groups and strings is easier than supporting arbitrarily nested
user-defined types.
Benchmarks and user performance tests will help with guidance on compression,
chunking parameters, and use of user-defined types, all of which have
performance implications.
Best Practices and CF Conventions for netCDF-4?
NetCDF-4 has only recently been made available, so there are currently few
identified conventions issues. One of the principles for the CF
Conventions is
Conventions [should be] developed only for known issues.
Instead of trying to foresee the future, features are added as required.
Analogously, development of CF Conventions for netCDF-4 is still somewhat
premature. We may be able to foresee some issues and make modest
recommendations to simplify the tasks of providing data and developing new
services and applications, but developing comprehensive conventions for use in
climate models and observational data, for example, will require more
experience.
In general, it seems wise to avoid replacing an existing adequate convention
with a better alternative convention that uses netCDF-4 features unless there is
some overwhelming advantage to the new convention, because applications will
have to accept both the old and new conventions.
Conclusion
There is still little experience with representing model outputs in the new data
model. Most of the experience so far is with use of the netCDF-4 format
and the classic data model. Gaining the experience needed to provide
guidance for use of new features is better done by data providers and users than
by the netCDF-4 developers.
The current release of netCDF-4 is not mature enough that we can recommend a
comprehensive set of new conventions. Nevertheless, we have tried to
identify some features of netCDF-4 that are candidates for use in evolving CF
conventions. Application developers are likely to delay supporting
netCDF-4 features until it's clear which features will prove useful for
representing the next generation of model output archives and observational
datasets.
With data providers, application developers, and conventions creators, we are
confronted with something like a three-stage chicken-and-egg problem. Data
providers are unlikely to be the first to use features not supported by
applications or standardized by conventions. Application developers are
unlikely to expend the effort needed to support features that are not being used
by data providers and that are not standardized as published conventions.
Those drafting and maintaining standard conventions must wait until data
providers identify needs for new conventions and must consider issues
applications developers will confront if they decide to support the new
conventions.
Best practices that become crystallized into new conventions develop out of the
experience of data providers, application developers, and users. These
considerations mean a definitive set of user guide and CF conventions for
netCDF-4 may take a while to develop and mature. More usage examples and
draft proposals from users, developers, and data providers will advance the
process.
This document is a draft and we welcome feedback.