03: Extracting data to different formats#
from IPython.display import YouTubeVideo
YouTubeVideo('rgBqsqwvvcc')
In this notebook, we will look at how we can export data from a NetCDF file into different formats. This will include:
Exporting data to a numpy array
Exporting data to a Pandas dataframe that we can write to a CSV or XLSX file
Let’s first revist this example from tutorial #01, importing a depth profile
import xarray as xr
import numpy as np
import datetime as dt
netcdf_file = 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708/CTD_station_P1_NLEG01-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc'
xrds = xr.open_dataset(netcdf_file)
xrds
<xarray.Dataset> Dimensions: (PRES: 320) Coordinates: * PRES (PRES) float32 1.0 2.0 3.0 4.0 ... 317.0 318.0 319.0 320.0 Data variables: (12/33) PRES_QC (PRES) float32 ... TEMP (PRES) float32 ... PSAL (PRES) float32 ... FLU2 (PRES) float32 ... CNDC (PRES) float32 ... DENS (PRES) float32 ... ... ... OXYOCPVL-1_QC (PRES) float32 ... SPAR_QC (PRES) float32 ... PAR_QC (PRES) float32 ... PSAL-2_QC (PRES) float32 ... TEMP-2_QC (PRES) float32 ... ATTNZS01_QC (PRES) float32 ... Attributes: (12/73) qc_manual: Recommendations for in-situ data Near Re... contact: datahjelp@hi.no distribution_statement: These data are public and free of charge... naming_authority: no.unis license: https://creativecommons.org/licenses/by/... data_assembly_center: IMR ... ... station_name: P1 (NLEG01) _NCProperties: version=2,netcdf=4.6.3,hdf5=1.10.5 date_created: 2022-08-08T12:44:34Z doi: 10.21335/NMDC-2085836005-P1_NLEG01-1 title: CTD_station_P1_NLEG01-1_-_Nansen_Legacy_... metadata_link: https://doi.org/10.21335/NMDC-2085836005...
Variable with 1 dimension to numpy array#
By default when we access the data from a variable, the values are extracted as a NumPy array. Note that we have not needed to import numpy to do this.
myarray = xrds['TEMP'].values
myarray
array([3.735, 3.738, 3.739, 3.741, 3.736, 3.737, 3.736, 3.742, 3.736,
3.738, 3.783, 3.833, 3.838, 3.837, 3.83 , 3.806, 3.792, 3.79 ,
3.814, 3.846, 3.858, 3.833, 3.815, 3.817, 3.814, 3.812, 3.792,
3.709, 3.704, 3.642, 3.582, 3.546, 3.512, 3.484, 3.368, 3.297,
3.165, 3.106, 3.094, 3.094, 3.102, 3.079, 3.094, 3.076, 3.035,
3.009, 2.992, 2.981, 2.954, 2.933, 2.903, 2.874, 2.85 , 2.824,
2.801, 2.787, 2.764, 2.713, 2.711, 2.705, 2.701, 2.696, 2.682,
2.665, 2.651, 2.647, 2.642, 2.635, 2.624, 2.588, 2.604, 2.608,
2.593, 2.507, 2.472, 2.468, 2.46 , 2.453, 2.441, 2.43 , 2.431,
2.43 , 2.431, 2.433, 2.43 , 2.417, 2.394, 2.382, 2.361, 2.335,
2.32 , 2.303, 2.275, 2.267, 2.263, 2.277, 2.27 , 2.265, 2.262,
2.256, 2.241, 2.216, 2.201, 2.191, 2.176, 2.135, 2.11 , 2.091,
2.071, 2.064, 2.103, 2.106, 2.099, 2.095, 2.092, 2.088, 2.082,
2.085, 2.093, 2.09 , 2.089, 2.085, 2.054, 2.022, 2.019, 2.021,
2.026, 2.027, 2.036, 2.041, 2.06 , 2.082, 2.084, 2.087, 2.093,
2.097, 2.099, 2.098, 2.1 , 2.101, 2.099, 2.095, 2.094, 2.094,
2.102, 2.101, 2.095, 2.082, 2.078, 2.076, 2.076, 2.072, 2.066,
2.064, 2.063, 2.065, 2.065, 2.063, 2.058, 2.041, 2.032, 2.028,
2.027, 2.026, 2.022, 2.013, 2.01 , 2.009, 2.007, 2.002, 1.998,
1.994, 1.99 , 1.986, 1.979, 1.974, 1.971, 1.968, 1.966, 1.962,
1.957, 1.945, 1.933, 1.924, 1.919, 1.916, 1.907, 1.9 , 1.893,
1.887, 1.883, 1.881, 1.879, 1.876, 1.873, 1.868, 1.866, 1.863,
1.858, 1.853, 1.851, 1.848, 1.847, 1.847, 1.846, 1.845, 1.845,
1.844, 1.844, 1.843, 1.843, 1.842, 1.842, 1.84 , 1.837, 1.834,
1.832, 1.83 , 1.823, 1.82 , 1.818, 1.815, 1.808, 1.804, 1.798,
1.794, 1.792, 1.79 , 1.789, 1.788, 1.787, 1.783, 1.779, 1.776,
1.772, 1.77 , 1.767, 1.763, 1.761, 1.757, 1.751, 1.744, 1.74 ,
1.739, 1.736, 1.731, 1.727, 1.723, 1.72 , 1.715, 1.702, 1.693,
1.694, 1.693, 1.691, 1.685, 1.673, 1.667, 1.663, 1.662, 1.662,
1.661, 1.657, 1.648, 1.644, 1.642, 1.628, 1.622, 1.619, 1.613,
1.61 , 1.597, 1.59 , 1.587, 1.585, 1.581, 1.576, 1.567, 1.564,
1.558, 1.54 , 1.531, 1.525, 1.523, 1.516, 1.512, 1.507, 1.504,
1.503, 1.496, 1.469, 1.442, 1.431, 1.408, 1.385, 1.361, 1.332,
1.319, 1.303, 1.297, 1.295, 1.29 , 1.286, 1.287, 1.286, 1.286,
1.286, 1.287, 1.287, 1.286, 1.287, 1.287, 1.288, 1.288, 1.287,
1.287, 1.287, 1.287, 1.287, 1.287], dtype=float32)
type(myarray)
numpy.ndarray
Variables with 1 dimension to pandas dataframe#
Alternatively, you might want to export the data into a pandas dataframe. Dataframes are essentially tables of data and you can export them easily to CSV or XLSX files.
xrds['TEMP'].to_dataframe()
TEMP | |
---|---|
PRES | |
1.0 | 3.735 |
2.0 | 3.738 |
3.0 | 3.739 |
4.0 | 3.741 |
5.0 | 3.736 |
... | ... |
316.0 | 1.287 |
317.0 | 1.287 |
318.0 | 1.287 |
319.0 | 1.287 |
320.0 | 1.287 |
320 rows × 1 columns
See how the relevant coordinate variable has been extracted along with it as the index to the dataframe?
Now let’s try several variables together.
xrds[['TEMP', 'PSAL', 'DENS', 'SVEL']].to_dataframe()
TEMP | PSAL | DENS | SVEL | |
---|---|---|---|---|
PRES | ||||
1.0 | 3.735 | 34.254002 | 27.207001 | 1464.550049 |
2.0 | 3.738 | 34.259998 | 27.209999 | 1464.579956 |
3.0 | 3.739 | 34.261002 | 27.211000 | 1464.599976 |
4.0 | 3.741 | 34.266998 | 27.216000 | 1464.630005 |
5.0 | 3.736 | 34.258999 | 27.209999 | 1464.619995 |
... | ... | ... | ... | ... |
316.0 | 1.287 | 34.916000 | 27.945000 | 1459.969971 |
317.0 | 1.287 | 34.916000 | 27.945000 | 1459.989990 |
318.0 | 1.287 | 34.916000 | 27.945000 | 1460.010010 |
319.0 | 1.287 | 34.916000 | 27.945000 | 1460.020020 |
320.0 | 1.287 | 34.916000 | 27.945000 | 1460.040039 |
320 rows × 4 columns
Data with multiple dimensions#
Let’s first revist these data from tutorial #02, a global surface temperature anomalies through time.
H.-M. Zhang, B. Huang, J. H. Lawrimore, M. J. Menne, and T. M. Smith (2019): NOAA Global Surface Temperature Dataset (NOAAGlobalTemp), Version 5.0. NOAA National Centers for Environmental Information. doi:10.25921/9qth-2p70 Accessed 2024-01-09.
url = 'https://www.ncei.noaa.gov/thredds/dodsC/noaa-global-temp-v5/NOAAGlobalTemp_v5.0.0_gridded_s188001_e202212_c20230108T133308.nc'
xrds = xr.open_dataset(url)
xrds
<xarray.Dataset> Dimensions: (time: 1716, lat: 36, lon: 72, z: 1) Coordinates: * time (time) datetime64[ns] 1880-01-01 1880-02-01 ... 2022-12-01 * lat (lat) float32 -87.5 -82.5 -77.5 -72.5 -67.5 ... 72.5 77.5 82.5 87.5 * lon (lon) float32 2.5 7.5 12.5 17.5 22.5 ... 342.5 347.5 352.5 357.5 * z (z) float32 0.0 Data variables: anom (time, z, lat, lon) float32 ... Attributes: (12/66) Conventions: CF-1.6, ACDD-1.3 title: NOAA Merged Land Ocean Global Surface Te... summary: NOAAGlobalTemp is a merged land-ocean su... institution: DOC/NOAA/NESDIS/National Centers for Env... id: gov.noaa.ncdc:C00934 naming_authority: gov.noaa.ncei ... ... time_coverage_duration: P143Y0M references: Vose, R. S., et al., 2012: NOAAs merged ... climatology: Climatology is based on 1971-2000 monthl... acknowledgment: The NOAA Global Surface Temperature Data... date_modified: 2023-01-08T18:33:09Z date_issued: 2023-01-08T18:33:09Z
Let’s create a dataframe that includes all the data and all variables.
df = xrds.to_dataframe()
#df.to_excel('data/exported_from_notebooks/03_temperature_anomalies_all_variables.xlsx') # Too large to write to Excel
#df.to_csv('data/exported_from_notebooks/03_temperature_anomalies_all_variables.csv')
df
anom | ||||
---|---|---|---|---|
time | lat | lon | z | |
1880-01-01 | -87.5 | 2.5 | 0.0 | NaN |
7.5 | 0.0 | NaN | ||
12.5 | 0.0 | NaN | ||
17.5 | 0.0 | NaN | ||
22.5 | 0.0 | NaN | ||
... | ... | ... | ... | ... |
2022-12-01 | 87.5 | 337.5 | 0.0 | NaN |
342.5 | 0.0 | NaN | ||
347.5 | 0.0 | NaN | ||
352.5 | 0.0 | NaN | ||
357.5 | 0.0 | NaN |
4447872 rows × 1 columns
What about just isolating data for a certain time? We can create a new xarray object that is just a selection of the original.
timeslice = xrds.sel(time=dt.datetime(1970, 1, 1))
timeslice
<xarray.Dataset> Dimensions: (lat: 36, lon: 72, z: 1) Coordinates: time datetime64[ns] 1970-01-01 * lat (lat) float32 -87.5 -82.5 -77.5 -72.5 -67.5 ... 72.5 77.5 82.5 87.5 * lon (lon) float32 2.5 7.5 12.5 17.5 22.5 ... 342.5 347.5 352.5 357.5 * z (z) float32 0.0 Data variables: anom (z, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan nan Attributes: (12/66) Conventions: CF-1.6, ACDD-1.3 title: NOAA Merged Land Ocean Global Surface Te... summary: NOAAGlobalTemp is a merged land-ocean su... institution: DOC/NOAA/NESDIS/National Centers for Env... id: gov.noaa.ncdc:C00934 naming_authority: gov.noaa.ncei ... ... time_coverage_duration: P143Y0M references: Vose, R. S., et al., 2012: NOAAs merged ... climatology: Climatology is based on 1971-2000 monthl... acknowledgment: The NOAA Global Surface Temperature Data... date_modified: 2023-01-08T18:33:09Z date_issued: 2023-01-08T18:33:09Z
You can see that the xarray object above no longer has a time dimension, but maintains a time coordinate variable that contains a single value. Sometimes we don’t know if data exist for our selection, so we can access on the nearest data.
timeslice = xrds.sel(time=dt.datetime(1970, 1, 12), method='nearest')
timeslice
<xarray.Dataset> Dimensions: (lat: 36, lon: 72, z: 1) Coordinates: time datetime64[ns] 1970-01-01 * lat (lat) float32 -87.5 -82.5 -77.5 -72.5 -67.5 ... 72.5 77.5 82.5 87.5 * lon (lon) float32 2.5 7.5 12.5 17.5 22.5 ... 342.5 347.5 352.5 357.5 * z (z) float32 0.0 Data variables: anom (z, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan nan Attributes: (12/66) Conventions: CF-1.6, ACDD-1.3 title: NOAA Merged Land Ocean Global Surface Te... summary: NOAAGlobalTemp is a merged land-ocean su... institution: DOC/NOAA/NESDIS/National Centers for Env... id: gov.noaa.ncdc:C00934 naming_authority: gov.noaa.ncei ... ... time_coverage_duration: P143Y0M references: Vose, R. S., et al., 2012: NOAAs merged ... climatology: Climatology is based on 1971-2000 monthl... acknowledgment: The NOAA Global Surface Temperature Data... date_modified: 2023-01-08T18:33:09Z date_issued: 2023-01-08T18:33:09Z
Notice that the time selected is the beginning of the month we tried to select.
What about selecting by multiple coordinates?
one_location = xrds.sel(lat=23.41,lon=-121.89, method='nearest')
one_location
<xarray.Dataset> Dimensions: (time: 1716, z: 1) Coordinates: * time (time) datetime64[ns] 1880-01-01 1880-02-01 ... 2022-12-01 lat float32 22.5 lon float32 2.5 * z (z) float32 0.0 Data variables: anom (time, z) float32 nan nan nan nan nan ... 1.114 0.695 0.5151 3.575 Attributes: (12/66) Conventions: CF-1.6, ACDD-1.3 title: NOAA Merged Land Ocean Global Surface Te... summary: NOAAGlobalTemp is a merged land-ocean su... institution: DOC/NOAA/NESDIS/National Centers for Env... id: gov.noaa.ncdc:C00934 naming_authority: gov.noaa.ncei ... ... time_coverage_duration: P143Y0M references: Vose, R. S., et al., 2012: NOAAs merged ... climatology: Climatology is based on 1971-2000 monthl... acknowledgment: The NOAA Global Surface Temperature Data... date_modified: 2023-01-08T18:33:09Z date_issued: 2023-01-08T18:33:09Z
This is now a time series of values. Let’s write that to a dataframe.
df = one_location.to_dataframe()
df
lat | lon | anom | ||
---|---|---|---|---|
time | z | |||
1880-01-01 | 0.0 | 22.5 | 2.5 | NaN |
1880-02-01 | 0.0 | 22.5 | 2.5 | NaN |
1880-03-01 | 0.0 | 22.5 | 2.5 | NaN |
1880-04-01 | 0.0 | 22.5 | 2.5 | NaN |
1880-05-01 | 0.0 | 22.5 | 2.5 | NaN |
... | ... | ... | ... | ... |
2022-08-01 | 0.0 | 22.5 | 2.5 | 0.376479 |
2022-09-01 | 0.0 | 22.5 | 2.5 | 1.114245 |
2022-10-01 | 0.0 | 22.5 | 2.5 | 0.695008 |
2022-11-01 | 0.0 | 22.5 | 2.5 | 0.515076 |
2022-12-01 | 0.0 | 22.5 | 2.5 | 3.575243 |
1716 rows × 3 columns
CTD data published cruise by cruise#
What about these CTD data that are published cruise by cruise?
Angelika Renner (2022) CTD data from Nansen Legacy Cruise - JC3 Winter gaps cruise https://doi.org/10.21335/NMDC-675177809
xrds = xr.open_dataset('https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy/NMDC_Nansen-Legacy_PR_CT_58US_2022702.nc')
xrds
<xarray.Dataset> Dimensions: (TIME: 29, LATITUDE: 29, LONGITUDE: 29, POSITION: 29, DEPTH: 3555) Coordinates: * TIME (TIME) datetime64[ns] 2022-02-22T17:47:32 ... 2022-03-11T1... * LATITUDE (LATITUDE) float32 76.0 80.68 80.6 ... 80.39 79.14 78.34 * LONGITUDE (LONGITUDE) float32 31.22 33.81 33.56 ... 29.77 8.917 14.47 Dimensions without coordinates: POSITION, DEPTH Data variables: (12/37) POSITION_QC (POSITION) float32 ... DIRECTION object ... PRES (TIME, DEPTH) float32 ... PRES_QC (TIME, DEPTH) float32 ... TEMP (TIME, DEPTH) float64 ... PSAL (TIME, DEPTH) float64 ... ... ... SPAR_QC (TIME, DEPTH) float32 ... PAR_QC (TIME, DEPTH) float32 ... PSAL-2_QC (TIME, DEPTH) float32 ... TEMP-2_QC (TIME, DEPTH) float32 ... ATTNZS01_QC (TIME, DEPTH) float32 ... TIME_QC (TIME) float32 ... Attributes: (12/67) _NCProperties: version=2,netcdf=4.7.4,hdf5=1.12.0, qc_manual: Recommendations for in-situ data Near Rea... contact: datahjelp@hi.no distribution_statement: These data are public and free of charge.... naming_authority: Norwegian Marine Datacenter license: https://creativecommons.org/licenses/by/4... ... ... creator_name: Angelika Renner creator_type: person creator_email: angelika.renner@hi.no creator_url: https://www.hi.no/hi publisher: The Norwegian Marine Data Centre (NMD) publisher_url: https://www.nmdc.no/
The file contains many depth profiles. Therefore, most of the data variables have two dimensions; time and depth.
You might notice that there is no depth coordinate variable. Why not? The depth profiles are binned into 1 dbar bins. The density of sea water is a function of sea water and salinity, and so is not constant. So you can’t use a single depth variable for all the profiles.
They could have included pressure as a dimension instead of depth and also included pressure as a coordinate variable.
They could have also published all the depth profiles in separate files. Then each file could include its own depth coordinate variable. It is good practice to publish data with finer granularity because each file can be simpler and thus easier to create, understand, and build services upon. Granularity of data and working with multiple NetCDF files will be focus of tutorial #05.
Let’s see what happens now when we try to access a specific depth profile using the method we followed in the section above.
one_profile = xrds.sel(TIME=dt.datetime(2022, 3, 1,9), method='nearest')
one_profile
<xarray.Dataset> Dimensions: (LATITUDE: 29, LONGITUDE: 29, POSITION: 29, DEPTH: 3555) Coordinates: TIME datetime64[ns] 2022-03-01T06:29:59 * LATITUDE (LATITUDE) float32 76.0 80.68 80.6 ... 80.39 79.14 78.34 * LONGITUDE (LONGITUDE) float32 31.22 33.81 33.56 ... 29.77 8.917 14.47 Dimensions without coordinates: POSITION, DEPTH Data variables: (12/37) POSITION_QC (POSITION) float32 ... DIRECTION object ... PRES (DEPTH) float32 ... PRES_QC (DEPTH) float32 ... TEMP (DEPTH) float64 ... PSAL (DEPTH) float64 ... ... ... SPAR_QC (DEPTH) float32 ... PAR_QC (DEPTH) float32 ... PSAL-2_QC (DEPTH) float32 ... TEMP-2_QC (DEPTH) float32 ... ATTNZS01_QC (DEPTH) float32 ... TIME_QC float32 ... Attributes: (12/67) _NCProperties: version=2,netcdf=4.7.4,hdf5=1.12.0, qc_manual: Recommendations for in-situ data Near Rea... contact: datahjelp@hi.no distribution_statement: These data are public and free of charge.... naming_authority: Norwegian Marine Datacenter license: https://creativecommons.org/licenses/by/4... ... ... creator_name: Angelika Renner creator_type: person creator_email: angelika.renner@hi.no creator_url: https://www.hi.no/hi publisher: The Norwegian Marine Data Centre (NMD) publisher_url: https://www.nmdc.no/
Above you can see that the latitude and longitude dimenions are still equal to 29. This is strange, because a single depth profile should only have 1 latitude and 1 longitude associated with it.
The problem here is that the time latitude and longitude are not explicitely linked to the time. Ideally, there should only be 2 dimensions (time and depth/pressure). The latitude and longitude variables should have a dimension of time.
The data variables look okay - they now only have one dimension, DEPTH, where previously they also had a second dimension, TIME.
Nevertheless, you might still want to find out what the latitude and longitude of the profile you have accessed is. Here is another method. Let’s start by extracting the 5th profile in the dataset.
profile5 = xrds['TEMP'][5].values
print(profile5)
print('time: ',xrds['TIME'][5].values)
print('latitude: ',xrds['LATITUDE'][5].values)
print('longitude: ',xrds['LONGITUDE'][5].values)
[-2.403 3.934 12.479 ... nan nan nan]
time: 2022-03-01T20:43:21.000000000
latitude: 82.0375
longitude: 29.7788
However, often we won’t know which number the profile we want to access is.
We can do this to extract the data closest to a desired time.
desired_time = np.datetime64(dt.datetime(2022, 3, 5, 11))
time_var = xrds['TIME'].values # Saving values to a numpy array
time_diff = np.abs(time_var - desired_time) # Difference between each time and our desired time
nearest_index = int(time_diff.argmin().item()) # Find index of time closest to desired time
temperature = xrds['TEMP'][nearest_index].values
latitude = xrds['LATITUDE'][nearest_index].values
longitude = xrds['LONGITUDE'][nearest_index].values
time = xrds['TIME'][nearest_index].values
print(temperature)
print('time: ',time)
print('latitude: ',latitude)
print('longitude: ',longitude)
[-1.68 -1.666 -1.661 ... nan nan nan]
time: 2022-03-05T10:49:14.000000000
latitude: 81.5868
longitude: 30.7572
Now let’s instead export multiple variables and multiple depth profiles to a single pandas dataframe.
df = xrds[['TEMP','PSAL', 'PRES', 'SVEL']].to_dataframe()
df.to_excel('../data/exported_from_notebooks/03_ctd_data.xlsx')
df.to_csv('../data/exported_from_notebooks/03_ctd_data.csv')
df
TEMP | PSAL | PRES | SVEL | ||
---|---|---|---|---|---|
TIME | DEPTH | ||||
2022-02-22 17:47:32 | 0 | 2.044 | 34.953 | 5.0 | 1458.29 |
1 | 2.033 | 34.953 | 6.0 | 1458.25 | |
2 | 1.990 | 34.955 | 7.0 | 1458.08 | |
3 | 2.007 | 34.954 | 8.0 | 1458.17 | |
4 | 1.987 | 34.954 | 9.0 | 1458.10 | |
... | ... | ... | ... | ... | ... |
2022-03-11 18:18:43 | 3550 | NaN | NaN | NaN | NaN |
3551 | NaN | NaN | NaN | NaN | |
3552 | NaN | NaN | NaN | NaN | |
3553 | NaN | NaN | NaN | NaN | |
3554 | NaN | NaN | NaN | NaN |
103095 rows × 4 columns
Notice that in this case both coordinate variables are plotted as individual columns. The data variables are also long columns of data.
How to cite this course#
If you think this course contributed to the work you are doing, consider citing it in your list of references. Here is a recommended citation:
Marsden, L. (2024, April 19). NetCDF in Python - from beginner to pro. Zenodo. https://doi.org/10.5281/zenodo.10997447
And you can navigate to the publication and export the citation in different styles and formats by clicking the icon below.