Reading NetCDF Data with Python

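In this tutorial, we will learn how to read NetCDF data with the Python programming language.

Before we begin, let us briefly look at what NetCDF is.

Understanding NetCDF

Network Common Data Form (NetCDF) is commonly used to store multidimensional geographic data, such as precipitation, temperature, and wind speed. Variables stored in NetCDF are often measured several times a day over large (continental) regions. With several measurements per day, data values accumulate quickly and become hard to handle, and managing them gets more complicated still when every value is also tied to a geographic location. NetCDF offers a solution to these challenges. This tutorial walks you through reading data from NetCDF files using a Python module.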

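Installation

Several Python modules can read NetCDF files. Well-known ones include netCDF4 and gdal. In this tutorial, we focus on the netCDF4 module.

Installing the module is straightforward. We can use either the pip installer or the Anaconda Python distribution. The command for installing the netCDF4 module with pip is shown below:

Syntax: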

$ pip install netCDF4

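We can also use the Anaconda Python distribution to avoid the confusion that dependency and version problems can cause. The command for installing the module with Anaconda (conda) is shown below:

Syntax: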

$ conda install netCDF4

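Verifying the installation

Once the installation is complete, we can verify it by creating an empty Python program file and writing the following import statement:

File: verify.py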

import netCDF4 as nc

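Now save the file and run it from a terminal with the following command:

Syntax:

$ python verify.py

If the program raises no error, the module is installed correctly. If an exception is raised, try reinstalling the module, and consult the module's official documentation.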
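
As an alternative to a bare import, the check can also report which version was found. The following is a minimal sketch that uses only the standard library; the helper name module_version is hypothetical and is not part of netCDF4.

```python
import importlib


def module_version(name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


version = module_version("netCDF4")
if version is None:
    print("netCDF4 is not installed")
else:
    print("netCDF4 version:", version)
```

Returning None instead of raising keeps the script usable as a quick diagnostic on machines where the module is missing.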

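Loading a NetCDF dataset

Loading a dataset is simple. We only need to pass the path of the NetCDF file to the netCDF4.Dataset() function. In this tutorial, we use a file containing climate data.

Consider the following code snippet, which demonstrates this:

Example: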

# importing the required module
import netCDF4 as nc

# defining the path to file
filePath = 'sample.nc'

# using the Dataset() function
dSet = nc.Dataset(filePath)

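Output:

(The snippet prints nothing.)

Explanation:

In the above code snippet, we imported the required module and specified the path to the NetCDF file. We then used the Dataset() function to open the file, which returns a Dataset object that gives access to its contents.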

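General file structure

The netCDF4 module gives us access to both the metadata and the data of a NetCDF file. A NetCDF file has three basic parts: metadata, dimensions, and variables. Variables hold both metadata and data.

Accessing metadata

Printing the dataset gives information about the variables stored in the file and their dimensions.

Consider the following code snippet, which demonstrates this.

Example: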

# printing the dataset
print(dSet)

Output:

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    start_year: 2021
    month: 01
    source: Daymet Software Version 4.0
    Version_software: Daymet Software Version 4.0
    Version_data: Daymet Data Version 4.0
    Conventions: CF-1.6
    citation: Please see http://daymet.ornl.gov/ for current Daymet data citation information
    references: Please see http://daymet.ornl.gov/ for current information on Daymet references
    dimensions(sizes): x(284), y(584), time(31), nv(2)
    variables(dimensions): float32 x(x), float32 y(y), float32 lat(y, x), float32 lon(y, x), float32 time(time), int16 yearday(time), float32 time_bnds(time, nv), int16 lambert_conformal_conic(), float32 prcp(time, y, x)
    groups:

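Explanation:

In the above code snippet, we used the print() function to print the dataset. As we can see, the information includes the file format, data source, data version, citation, dimensions, and variables. The variables of interest here are lat, lon, time, and prcp (precipitation). With these variables, we can find the precipitation at a particular location at a given time. The file contains only 31 time steps (the time dimension has size 31).

We can also access the metadata as a Python dictionary, which is more useful. Consider the following example.

Example: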

# importing the required module
import netCDF4 as nc

# defining the path to file
filePath = 'sample.nc'

# using the Dataset() function
dSet = nc.Dataset(filePath)

# printing the metadata as a dictionary
print(dSet.__dict__)

Output:

{'start_year': 2021, 'month': '01', 'source': 'Daymet Software Version 4.0', 'Version_software': 'Daymet Software Version 4.0', 'Version_data': 'Daymet Data Version 4.0', 'Conventions': 'CF-1.6', 'citation': 'Please see http://daymet.ornl.gov/ for current Daymet data citation information', 'references': 'Please see http://daymet.ornl.gov/ for current information on Daymet references'}

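Explanation:

In the above code snippet, we imported the required module and defined the path to the NetCDF file. We then used the Dataset() function to open the file as a dataset. Finally, we read the dataset's __dict__ attribute, which returns the global attributes as a dictionary, and printed it.

We can then access any metadata element by its key. Let us look at an example:

Example: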

# importing the required module
import netCDF4 as nc

# defining the path to file
filePath = 'sample.nc'

# using the Dataset() function
dSet = nc.Dataset(filePath)

# printing a single metadata value by its key
print(dSet.__dict__['start_year'])

Output:

2021

Explanation:

In the above code snippet, we specified the key and printed the corresponding value.

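Dimensions

Accessing dimensions is similar to accessing the file metadata. Each dimension is stored as an instance of the Dimension class, which holds the relevant information. We can access the metadata of all dimensions by looping over them. Consider the following code snippet.

Example: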

# importing the required module
import netCDF4 as nc

# defining the path to file
filePath = 'sample.nc'

# using the Dataset() function
dSet = nc.Dataset(filePath)

# printing the dimensions of the dataset
for dimension in dSet.dimensions.values():
    print(dimension)

Output:

<class 'netCDF4._netCDF4.Dimension'>: name = 'x', size = 284
<class 'netCDF4._netCDF4.Dimension'>: name = 'y', size = 584
<class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 31
<class 'netCDF4._netCDF4.Dimension'>: name = 'nv', size = 2

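Explanation:

In the above code snippet, we imported the required module and defined the path to the NetCDF file. We then used the Dataset() function to open the file as a dataset. Finally, we used a for loop to iterate over every dimension in the dataset and printed each one, showing its name and size.

We can also access a single dimension like this: dSet.dimensions['x'].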

Variable metadata

We can access variable metadata in much the same way as dimensions. Consider the following code snippet.

Example:

# importing the required module
import netCDF4 as nc

# defining the path to file
filePath = 'sample.nc'

# using the Dataset() function
dSet = nc.Dataset(filePath)

# printing the variables of the dataset
for variable in dSet.variables.values():
    print(variable)

Output:

<class 'netCDF4._netCDF4.Variable'>
float32 x(x)
    units: m
    long_name: x coordinate of projection
    standard_name: projection_x_coordinate
unlimited dimensions:
current shape = (284,)
filling on, default _FillValue of 9.969209968386869e+36 used
<class 'netCDF4._netCDF4.Variable'>
float32 y(y)
    units: m
    long_name: y coordinate of projection
    standard_name: projection_y_coordinate
unlimited dimensions:
current shape = (584,)
filling on, default _FillValue of 9.969209968386869e+36 used
<class 'netCDF4._netCDF4.Variable'>
float32 lat(y, x)
    units: degrees_north
    long_name: latitude coordinate
    standard_name: latitude
unlimited dimensions:
current shape = (584, 284)
filling on, default _FillValue of 9.969209968386869e+36 used
<class 'netCDF4._netCDF4.Variable'>
float32 lon(y, x)
    units: degrees_east
    long_name: longitude coordinate
    standard_name: longitude
unlimited dimensions:
current shape = (584, 284)
filling on, default _FillValue of 9.969209968386869e+36 used
<class 'netCDF4._netCDF4.Variable'>
float32 time(time)
    standard_name: time
    calendar: standard
    units: days since 1950-01-01 00:00:00
    bounds: time_bnds
    long_name: 24-hour day based on local time
unlimited dimensions: time
current shape = (31,)
filling on, default _FillValue of 9.969209968386869e+36 used
<class 'netCDF4._netCDF4.Variable'>
int16 yearday(time)
    long_name: day of year (DOY) starting with day 1 on January 1st
unlimited dimensions: time
current shape = (31,)
filling on, default _FillValue of -32767 used
<class 'netCDF4._netCDF4.Variable'>
float32 time_bnds(time, nv)
unlimited dimensions: time
current shape = (31, 2)
filling on, default _FillValue of 9.969209968386869e+36 used
<class 'netCDF4._netCDF4.Variable'>
int16 lambert_conformal_conic()
    grid_mapping_name: lambert_conformal_conic
    longitude_of_central_meridian: -100.0
    latitude_of_projection_origin: 42.5
    false_easting: 0.0
    false_northing: 0.0
    standard_parallel: [25. 60.]
    semi_major_axis: 6378137.0
    inverse_flattening: 298.257223563
unlimited dimensions:
current shape = ()
filling on, default _FillValue of -32767 used
<class 'netCDF4._netCDF4.Variable'>
float32 prcp(time, y, x)
    _FillValue: -9999.0
    long_name: daily total precipitation
    units: mm/day
    missing_value: -9999.0
    coordinates: lat lon
    grid_mapping: lambert_conformal_conic
    cell_methods: area: mean time: sum
unlimited dimensions: time
current shape = (31, 584, 284)
filling on

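Explanation:

In the above code snippet, we imported the required module and defined the path to the NetCDF file. We then used the Dataset() function to open the file as a dataset. Finally, we used a for loop to iterate over the variables of the dataset and printed each one.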
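
The time variable in the output above stores its values as offsets, with units of "days since 1950-01-01 00:00:00". The sketch below shows how such offsets map to calendar dates using only the standard library. The helper decode_days_since and the two offset values are hypothetical, and the approach assumes the standard calendar with dates in the modern era. For general use, the netCDF4 module offers its own num2date() function.

```python
from datetime import datetime, timedelta


def decode_days_since(values, units):
    """Convert 'days since <origin>' offsets into datetime objects."""
    prefix = "days since "
    if not units.startswith(prefix):
        raise ValueError("unsupported units: " + units)
    origin = datetime.strptime(units[len(prefix):], "%Y-%m-%d %H:%M:%S")
    return [origin + timedelta(days=float(v)) for v in values]


# Hypothetical offsets; real ones would come from dSet['time'][:].
dates = decode_days_since([25933.5, 25934.5], "days since 1950-01-01 00:00:00")
for d in dates:
    print(d.isoformat())
```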

We can also access a single variable. Consider the following example:

Example:

# importing the required module
import netCDF4 as nc

# defining the path to file
filePath = 'sample.nc'

# using the Dataset() function
dSet = nc.Dataset(filePath)

# printing the metadata of the prcp variable
print(dSet['prcp'])

Output:

<class 'netCDF4._netCDF4.Variable'>
float32 prcp(time, y, x)
    _FillValue: -9999.0
    long_name: daily total precipitation
    units: mm/day
    missing_value: -9999.0
    coordinates: lat lon
    grid_mapping: lambert_conformal_conic
    cell_methods: area: mean time: sum
unlimited dimensions: time
current shape = (31, 584, 284)
filling on

Explanation:

In the above code snippet, we printed the metadata of the prcp variable by indexing the dSet object with the variable's name.

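Summary

NetCDF files are commonly used for geographic time-series data. At first they can seem quite confusing, because of the large volume of data and a format unlike the more familiar CSV and raster files. NetCDF is a good way to document geographic data, because documentation and metadata are built in. This makes it easy for end users to understand what the data represents and reduces ambiguity. NetCDF data is accessed as NumPy arrays, which opens up many possibilities for analysis and for integration into existing tools and workflows.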