08: Ancillary variables

In science you often don’t just want to publish your data variables. You might want to include extra or secondary variables that are related or provide further context to your primary data variables.

For example, you might have sea water chlorophyll A data taken from water samples at different depths. You might also want to publish:

  • the volume of your water sample

  • other values you have measured in order to compute the chlorophyll A values

  • quality flags

In the CF conventions, these variables are referred to as ancillary data, and this section of the CF conventions is dedicated to them: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#ancillary-data

In this tutorial, we will look at how you can include ancillary data in a CF-NetCDF file and encode how the variables relate to each other in a machine-understandable way.

Basic example without ancillary data

import xarray as xr

depths = [10,20,30,40,50]
chlorophyll_a = [0.411,0.152,0.067,0.017,0.014]

xrds = xr.Dataset(
    coords={
        'depth': depths
    },
    data_vars={
        'Chlorophyll_A': ('depth', chlorophyll_a)
    } 
)

xrds['Chlorophyll_A'].attrs = {
    'standard_name': 'mass_concentration_of_chlorophyll_a_in_sea_water',
    'long_name': 'Mass concentration of chlorophyll a in sea water derived from water samples from Niskin bottles',
    'units': 'μg L-1',
    'coverage_content_type': 'physicalMeasurement'
}
xrds['depth'].attrs = {
    'standard_name': 'depth',
    'long_name': 'Sea water depth',
    'units': 'meters',
    'coverage_content_type': 'coordinate',
    'positive': 'down'
}
xrds
<xarray.Dataset>
Dimensions:        (depth: 5)
Coordinates:
  * depth          (depth) int64 10 20 30 40 50
Data variables:
    Chlorophyll_A  (depth) float64 0.411 0.152 0.067 0.017 0.014

Assigning quality or status flags

Quality or status flags tell the user about the quality of the data. You can read about how to encode them in this section of the CF conventions (see examples 3.4 to 3.8): https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#ancillary-data

We need to create a new variable for the flags.

Flags are stored as numbers, and the meaning of each number is stored in a variable attribute. The terms in flag_meanings are separated by spaces, so don’t include spaces within any of the terms you use! The number of flag values and the number of terms in flag_meanings should be the same.

chla_flag_possible_values = [0,1,2,3,4,5,6,7,8,9]
chla_flag_meanings = "no_qc_performed good_data probably_good_data bad_data_that_are_potentially_correctable bad_data value_changed value_below_detection nominal_value interpolated_value missing_value"

So for example, a value of 2 means probably_good_data.
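The pairing between numbers and meanings can be made explicit with a small lookup table. The following is a minimal sketch using the values from this example; the names flag_lookup and meanings are introduced here purely for illustration.

```python
# Values and meanings copied from the example above (OceanSITES-style flags).
flag_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
flag_meanings = (
    "no_qc_performed good_data probably_good_data "
    "bad_data_that_are_potentially_correctable bad_data value_changed "
    "value_below_detection nominal_value interpolated_value missing_value"
)

# flag_meanings is one space-separated string, so split() recovers the terms.
meanings = flag_meanings.split()

# The i-th value pairs with the i-th meaning, so both must have the same length.
if len(flag_values) != len(meanings):
    raise ValueError("flag_values and flag_meanings have different lengths")

# Hypothetical helper table: flag number -> meaning.
flag_lookup = dict(zip(flag_values, meanings))

print(flag_lookup[2])  # probably_good_data
```

A term containing a space would be split into two meanings and trip the length check, which is why the terms use underscores.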

You might wonder which conventions these quality flag values and meanings adhere to. In this case, we are following the OceanSITES Manual v 1.4. http://www.oceansites.org/docs/oceansites_data_format_reference_manual.pdf

However, other quality flag conventions exist.

Now let’s create a variable for the quality flags.

chla_flags = [1,1,1,2,1] # Same length as Chlorophyll_A variable

xrds['Chlorophyll_A_quality_flags'] = ('depth', chla_flags)

xrds
<xarray.Dataset>
Dimensions:                      (depth: 5)
Coordinates:
  * depth                        (depth) int64 10 20 30 40 50
Data variables:
    Chlorophyll_A                (depth) float64 0.411 0.152 0.067 0.017 0.014
    Chlorophyll_A_quality_flags  (depth) int64 1 1 1 2 1

Now we need to state that the new Chlorophyll_A_quality_flags variable is related to the Chlorophyll_A variable. We do this by listing it in the ancillary_variables attribute of Chlorophyll_A.

xrds['Chlorophyll_A'].attrs['ancillary_variables'] = "Chlorophyll_A_quality_flags"
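Because ancillary_variables is a plain string of variable names separated by spaces, software reading the file can follow it to find the related variables. Here is a rough sketch of that lookup; the function name ancillary_names is hypothetical, and plain dictionaries stand in for the attrs of each variable.

```python
def ancillary_names(attrs):
    """Return the variable names listed in an 'ancillary_variables' attribute.

    Returns an empty list if the attribute is absent.
    """
    return attrs.get("ancillary_variables", "").split()


# Plain dicts standing in for the .attrs of two variables (illustrative only).
chla_attrs = {"ancillary_variables": "Chlorophyll_A_quality_flags"}
depth_attrs = {"standard_name": "depth"}

print(ancillary_names(chla_attrs))   # ['Chlorophyll_A_quality_flags']
print(ancillary_names(depth_attrs))  # []
```

Since the attrs of an xarray variable behave like a dictionary, calling ancillary_names(xrds['Chlorophyll_A'].attrs) on the dataset above should return the same list.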

Finally, we need to add metadata to the ancillary variable to describe it. There are many standard names for different types of flags. Search for "flag" here to find a suitable standard_name: https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html

The CF conventions also have standardised variable attributes for flags: flag_values and flag_meanings. You often see the valid_range attribute used here too, to show explicitly that any values outside that range are invalid. You could use valid_min and valid_max instead.

# Metadata for the 'Chlorophyll_A_quality_flags' variable
xrds['Chlorophyll_A_quality_flags'].attrs = {
    'long_name': 'Chlorophyll A quality flag',
    'standard_name': 'quality_flag',
    'flag_values': chla_flag_possible_values,
    'flag_meanings': chla_flag_meanings,
    'valid_range': [0,9],
    'coverage_content_type': 'qualityInformation',
    '_FillValue': -127
}

xrds['Chlorophyll_A_quality_flags']
<xarray.DataArray 'Chlorophyll_A_quality_flags' (depth: 5)>
array([1, 1, 1, 2, 1])
Coordinates:
  * depth    (depth) int64 10 20 30 40 50
Attributes:
    long_name:              Chlorophyll A quality flag
    standard_name:          quality_flag
    flag_values:            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    flag_meanings:          no_qc_performed good_data probably_good_data bad_...
    valid_range:            [0, 9]
    coverage_content_type:  qualityInformation
    _FillValue:             -127
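Before publishing, it can be worth checking that the flag attributes agree with each other and with the data. The sketch below is one possible check, not part of CF or xarray: check_flag_attrs is a hypothetical helper, and a plain dictionary and list stand in for the variable's attrs and values.

```python
def check_flag_attrs(attrs, data):
    """Return a list of problems found in CF-style flag attributes (empty if none)."""
    problems = []
    values = list(attrs["flag_values"])
    meanings = attrs["flag_meanings"].split()
    if len(values) != len(meanings):
        problems.append("flag_values and flag_meanings differ in length")

    low, high = attrs["valid_range"]
    fill = attrs.get("_FillValue")
    for v in data:
        if v == fill:
            continue  # fill values mark missing data, so they are not checked
        if not low <= v <= high:
            problems.append(f"value {v} is outside valid_range")
        elif v not in values:
            problems.append(f"value {v} is not listed in flag_values")
    return problems


# Attributes copied from the example above.
attrs = {
    "flag_values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "flag_meanings": (
        "no_qc_performed good_data probably_good_data "
        "bad_data_that_are_potentially_correctable bad_data value_changed "
        "value_below_detection nominal_value interpolated_value missing_value"
    ),
    "valid_range": [0, 9],
    "_FillValue": -127,
}

problems_ok = check_flag_attrs(attrs, [1, 1, 1, 2, 1])
problems_bad = check_flag_attrs(attrs, [1, 12, -127])  # 12 is deliberately invalid

print(problems_ok)   # []
print(problems_bad)  # ['value 12 is outside valid_range']
```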

Make sure you refer to the conventions you are following for your quality flags in your Conventions global attribute, for example:

xrds.attrs['Conventions'] = 'CF-1.8, ACDD-1.3, OceanSITES Manual 1.4'
xrds
<xarray.Dataset>
Dimensions:                      (depth: 5)
Coordinates:
  * depth                        (depth) int64 10 20 30 40 50
Data variables:
    Chlorophyll_A                (depth) float64 0.411 0.152 0.067 0.017 0.014
    Chlorophyll_A_quality_flags  (depth) int64 1 1 1 2 1
Attributes:
    Conventions:  CF-1.8, ACDD-1.3, OceanSITES Manual 1.4

Retrieving only good_quality data

Suppose we want to retrieve only the good_quality Chlorophyll_A data, where Chlorophyll_A_quality_flags = 1. Values that do not meet the condition are replaced with NaN:

good_quality_chlorophyll_a = xrds['Chlorophyll_A'].where(xrds['Chlorophyll_A_quality_flags'] == 1)
good_quality_chlorophyll_a
<xarray.DataArray 'Chlorophyll_A' (depth: 5)>
array([0.411, 0.152, 0.067,   nan, 0.014])
Coordinates:
  * depth    (depth) int64 10 20 30 40 50
Attributes:
    standard_name:          mass_concentration_of_chlorophyll_a_in_sea_water
    long_name:              Mass concentration of chlorophyll a in sea water ...
    units:                  μg L-1
    coverage_content_type:  physicalMeasurement
    ancillary_variables:    Chlorophyll_A_quality_flags

To drop the NaNs as well, pass drop=True:

good_quality_chlorophyll_a = xrds['Chlorophyll_A'].where(xrds['Chlorophyll_A_quality_flags'] == 1, drop=True)
good_quality_chlorophyll_a
<xarray.DataArray 'Chlorophyll_A' (depth: 4)>
array([0.411, 0.152, 0.067, 0.014])
Coordinates:
  * depth    (depth) int64 10 20 30 50
Attributes:
    standard_name:          mass_concentration_of_chlorophyll_a_in_sea_water
    long_name:              Mass concentration of chlorophyll a in sea water ...
    units:                  μg L-1
    coverage_content_type:  physicalMeasurement
    ancillary_variables:    Chlorophyll_A_quality_flags
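Sometimes more than one flag value is acceptable, for example both good_data (1) and probably_good_data (2). The masking logic is sketched below in plain Python lists so that it runs without any extra libraries; mask_by_flags is a hypothetical helper, and in xarray the equivalent condition could likely be built with the isin method of the flag variable.

```python
import math


def mask_by_flags(values, flags, accepted):
    """Keep values whose flag is in `accepted`; replace the others with NaN."""
    return [v if f in accepted else math.nan for v, f in zip(values, flags)]


# Data copied from the example above.
chlorophyll_a = [0.411, 0.152, 0.067, 0.017, 0.014]
chla_flags = [1, 1, 1, 2, 1]

good_only = mask_by_flags(chlorophyll_a, chla_flags, {1})
good_or_probably_good = mask_by_flags(chlorophyll_a, chla_flags, {1, 2})

print(good_only)              # the fourth value becomes nan
print(good_or_probably_good)  # all five values are kept
```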

Other ancillary data

We can write other ancillary variables in a similar way. For example:

filtered_volumes = [0.8,1.2,0.7,0.8,1.0]
xrds['Filtered_volume'] = ('depth', filtered_volumes)

# Multiple ancillary variables separated by spaces
xrds['Chlorophyll_A'].attrs['ancillary_variables'] = "Chlorophyll_A_quality_flags Filtered_volume" 

xrds['Filtered_volume'].attrs = {
    'long_name': 'Volume of sea water filtered to measure the Chlorophyll A values',
    'units': 'L',
    'coverage_content_type': 'auxiliaryInformation',
    '_FillValue': -1
}

xrds
<xarray.Dataset>
Dimensions:                      (depth: 5)
Coordinates:
  * depth                        (depth) int64 10 20 30 40 50
Data variables:
    Chlorophyll_A                (depth) float64 0.411 0.152 0.067 0.017 0.014
    Chlorophyll_A_quality_flags  (depth) int64 1 1 1 2 1
    Filtered_volume              (depth) float64 0.8 1.2 0.7 0.8 1.0
Attributes:
    Conventions:  CF-1.8, ACDD-1.3, OceanSITES Manual 1.4

More work needs to be done to expand the CF conventions to standardise ancillary data. At the time of writing, a standard_name for the volume of sea water filtered does not exist.

This is where the scientific community can help!

New standard names can be suggested by raising an issue in this GitHub repository: cf-convention/discuss#issues

Follow these guidelines for constructing standard names: https://cfconventions.org/Data/cf-standard-names/docs/guidelines.html

How to cite this course

If you think this course contributed to the work you are doing, consider citing it in your list of references. Here is a recommended citation:

Marsden, L. (2024, April 19). NetCDF in Python - from beginner to pro. Zenodo. https://doi.org/10.5281/zenodo.10997447

You can also navigate to the publication via the DOI link above and export the citation in different styles and formats.