Analysis-Ready Cloud-Optimized Datasets¶
Overview¶
In this notebook, we will cover the main concepts behind creating Analysis-Ready Cloud-Optimized (ARCO) datasets for the geosciences:
- Analysis-Ready datasets
- Cloud-Optimized datasets
- FAIR principles
- Zarr format
Prerequisites¶
| Concepts | Importance | Notes |
| --- | --- | --- |
| Intro to Xarray | Necessary | Basic features |
| Radar Cookbook | Necessary | Radar basics |
| Intro to Zarr | Necessary | Zarr basics |
- Time to learn: 30 minutes
Imports¶
import xarray as xr
import fsspec
from glob import glob
import xradar as xd
import matplotlib.pyplot as plt
import cmweather
import numpy as np
import hvplot.xarray
from xarray.core.datatree import DataTree
from zarr.errors import ContainsGroupError
Analysis-Ready¶
Analysis-Ready data is a concept that emphasizes the preparation and structuring of datasets to be immediately usable for analysis. In the CrowdFlower Data Science Report 2016, the “How Data Scientists Spend Their Time” figure illustrates the distribution of time that data scientists allocate to various tasks. The figure highlights that the majority of a data scientist’s time is dedicated to preparing and cleaning data (~80%), which is often considered the most time-consuming and critical part of the data science workflow.
Here’s how AR caters to various aspects:
- Datasets instead of data files
- Pre-processed datasets, ensuring they are clean and well-organized
- Dataset enriched with comprehensive metadata
- Curated and cataloged
- Facilitates a more efficient and accurate analysis
- More time for fun (science)
Cloud-Optimized¶
NetCDF/Raw radar data formats are not cloud optimized. Other formats, like Zarr, aim to make accessing and reading data from the cloud fast and painless. Cloud-Optimized data is structured for efficient storage, access, and processing in cloud environments.
Cloud-Optimized data leverages scalable formats and parallel processing capabilities, as in the sketch below.
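Here is a minimal sketch (the bucket path and variable name are placeholders, not a real dataset) of what cloud-optimized access looks like in practice: opening a Zarr store is lazy, and only the chunks you actually touch are downloaded.
import fsspec
import xarray as xr

# Placeholder store; any Zarr store in object storage works the same way
store = fsspec.get_mapper("s3://some-bucket/some-dataset.zarr", anon=True)

# Only metadata is read here; array chunks are fetched on demand
ds = xr.open_zarr(store, consolidated=True, chunks={})

# e.g. this would download only the chunks needed for the selection
# (the variable name "DBZH" is hypothetical):
# ds["DBZH"].isel(time=0).mean().compute()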
FAIR data¶
FAIR data adheres to principles that ensure it is Findable, Accessible, Interoperable, and Reusable. These guidelines promote data sharing, collaboration, and long-term usability across various platforms and disciplines.
“FAIR sharing of data is beneficial for both data producers and consumers. Consumers gain access to interesting datasets that would otherwise be out of reach. Producers get citations to their work, when consumers publish their derivative work. OME-Zarr is the technology basis for enabling effective FAIR sharing of large image datasets.” Zarr illustrations
Courtesy: Zarr illustrations
Zarr format¶
Zarr is a flexible and efficient format for storing large, chunked, compressed, multi-dimensional arrays, enabling easy and scalable data access in both local and cloud environments. It supports parallel processing and is widely used in scientific computing for handling large datasets.
Courtesy: Zarr illustrations
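As a quick, self-contained sketch (the store name example.zarr and the toy variable are made up), here is how chunking and compression are expressed when writing a dataset to Zarr with xarray:
import numpy as np
import xarray as xr

# Toy dataset standing in for radar moments
toy = xr.Dataset(
    {"reflectivity": (("time", "range"), np.random.rand(100, 500).astype("float32"))}
)

# Chunk along time so that each chunk holds 10 time steps; Zarr compresses
# each chunk independently (Blosc by default)
encoding = {"reflectivity": {"chunks": (10, 500)}}
toy.to_zarr("example.zarr", mode="w", encoding=encoding, consolidated=True)

# Reading back is lazy; only requested chunks are decompressed
toy_back = xr.open_zarr("example.zarr", consolidated=True)
print(toy_back["reflectivity"].encoding.get("chunks"))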
ARCO Radar data¶
Leveraging the Climate and Forecast (CF) format-based FM301 hierarchical tree structure, endorsed by the World Meteorological Organization (WMO), and Analysis-Ready Cloud-Optimized (ARCO) formats, we developed an open data model to arrange, manage, and store radar data efficiently in cloud-storage buckets.
CfRadial2.1/FM301 standard¶
Xradar employs xarray.DataTree objects to organize radar sweeps within a single hierarchical structure, where each sweep is an xarray.Dataset containing the relevant metadata and variables.
Let’s see what this hierarchical DataTree looks like.
# Connection to the Pythia S3 bucket
URL = 'https://js2.jetstream-cloud.org:8001/'
path = 'pythia/radar/erad2024'
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))
# C-band radar files
path = "pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/*.nc"
radar_files = fs.glob(path)
radar_files[:3]
['pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/MonteLema_202405221300.nc',
'pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/MonteLema_202405221305.nc',
'pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/MonteLema_202405221310.nc']
# download the first five files and cache them locally via fsspec's simplecache
local_files = [
fsspec.open_local(
f"simplecache::{URL}{i}", s3={"anon": True}, filecache={"cache_storage": "."}
)
for i in radar_files[:5]
]
We can open one of these nc files using the xradar.io.open_cfradial1_datatree method.
dt = xd.io.open_cfradial1_datatree(local_files[0])
display(dt)
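As a quick aside (a small sketch, not part of the original workflow), each sweep node in the tree can be pulled out as a plain xarray.Dataset and inspected like any other dataset; the exact variable names depend on the radar file:
# Extract the lowest sweep as a regular Dataset and peek at its contents
ds_sweep0 = dt["sweep_0"].to_dataset()
print(ds_sweep0.sizes)
print(list(ds_sweep0.data_vars)[:5])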
Let’s create our first ARCO dataset using the .to_zarr method.
dt.to_zarr("radar.zarr", consolidated=True)
We can check that a new Zarr store has been created (object storage).
!ls
ARCO-Datasets.ipynb QPE-QVPs.ipynb radar.zarr
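Optionally (a sketch assuming zarr-python v2, which is imported above for ContainsGroupError), the store can also be inspected directly with the zarr library to see its group layout:
import zarr

# Open the consolidated metadata and list the top-level groups (the sweeps)
root = zarr.open_consolidated("radar.zarr")
for name, _group in root.groups():
    print(name)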
This is stored locally, but it could also be stored in a bucket in the cloud. Let’s open it back using xarray.backends.api.open_datatree.
dt_back = xr.backends.api.open_datatree(
"radar.zarr",
consolidated=True
)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'netcdf4' fails while guessing
warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'h5netcdf' fails while guessing
warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'scipy' fails while guessing
warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
display(dt_back)
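As a rough sanity check (a sketch, not a formal test), we can compare one sweep between the original DataTree and the one read back from the Zarr store; variable names depend on the file, so we simply pick one that both datasets share:
# Compare one shared variable between the original and round-tripped sweep
ds_orig = dt["sweep_0"].to_dataset()
ds_zarr = dt_back["sweep_0"].to_dataset()

common_vars = sorted(set(ds_orig.data_vars) & set(ds_zarr.data_vars))
xr.testing.assert_allclose(ds_orig[common_vars[0]], ds_zarr[common_vars[0]])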
Radar data time-series¶
Concatenating xarray.DataTree objects along a temporal dimension is a great way to create a more organized and comprehensive dataset. By doing so, you can maintain a cohesive dataset that is both easier to manage and more meaningful for temporal analysis.
# let's reuse our locally cached files
len(local_files)
5
To create an ARCO dataset, we need to ensure that all radar volumes are properly aligned. To achieve this, we developed the following function:
def fix_angle(ds: xr.Dataset, tolerance: float = None, **kwargs) -> xr.Dataset:
    """
    Reindex the radar azimuth angle so that all sweeps start and end at the same angle.
    @param ds: xarray dataset containing an xradar sweep
    @param tolerance: Tolerance for interpolation between azimuth angles.
                      Defaults to the radar azimuth angle resolution.
    @return: azimuth-reindexed xarray dataset
    """
    ds["time"] = ds.time.load()
    angle_dict = xd.util.extract_angle_parameters(ds)
    start_ang = angle_dict["start_angle"]
    stop_ang = angle_dict["stop_angle"]
    direction = angle_dict["direction"]
    ds = xd.util.remove_duplicate_rays(ds)
    az = len(np.arange(start_ang, stop_ang))
    ar = np.round(az / len(ds.azimuth.data), 2)
    tolerance = ar if not tolerance else tolerance
    ds = xd.util.reindex_angle(
        ds,
        start_ang,
        stop_ang,
        ar,
        direction,
        method="nearest",
        tolerance=tolerance,
        **kwargs,
    )
    return ds
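As a quick illustration (a sketch; the exact sizes depend on the radar file), fix_angle can be applied to a single sweep from the DataTree we opened earlier to see how the azimuth dimension changes after reindexing:
# Reindex one sweep and compare azimuth sizes before and after
ds_sweep = dt["sweep_0"].to_dataset()
ds_fixed = fix_angle(ds_sweep)
print(ds_sweep.sizes.get("azimuth"), "->", ds_fixed.sizes.get("azimuth"))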
Now, we can use the xarray.open_mfdataset function to open all the nc files at once. We iterate over each sweep and concatenate the volumes along the volume_time dimension.
# listing all the sweeps within each nc file
sweeps = [
i[1:] for i in list(dt.groups) if i.startswith("/sweep") if i not in ["/"]
]
sweeps
['sweep_0',
'sweep_1',
'sweep_2',
'sweep_3',
'sweep_4',
'sweep_5',
'sweep_6',
'sweep_7',
'sweep_8',
'sweep_9',
'sweep_10',
'sweep_11',
'sweep_12',
'sweep_13',
'sweep_14',
'sweep_15',
'sweep_16',
'sweep_17',
'sweep_18',
'sweep_19']
for sweep in sweeps:
    root = {}
    # open the same sweep across all five volumes and concatenate them
    # along a new volume_time dimension
    ds = xr.open_mfdataset(
        local_files,
        preprocess=fix_angle,
        engine="cfradial1",
        group=sweep,
        concat_dim="volume_time",
        combine="nested",
    ).xradar.georeference()
    root[f"{sweep}"] = ds
    dtree = DataTree.from_dict(root)
    try:
        # the first sweep creates the Zarr store
        dtree.to_zarr(
            "radar_ts.zarr",
            consolidated=True,
        )
    except ContainsGroupError:
        # subsequent sweeps are appended as new groups
        dtree.to_zarr(
            "radar_ts.zarr",
            consolidated=True,
            mode="a",
        )
    del dtree, ds
Let’s take a look at our new radar time-series dataset.
dtree = xr.backends.api.open_datatree(
"radar_ts.zarr",
consolidated=True,
chunks={}
)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'netcdf4' fails while guessing
warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'h5netcdf' fails while guessing
warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'scipy' fails while guessing
warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
dtree
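To give an idea of how this time series can be used (a small sketch; the reflectivity variable name and color map are assumptions), we can extract the lowest sweep, confirm the new volume_time dimension, and select a single volume:
# Pull the lowest sweep out of the time series as a Dataset
ds_ts = dtree["sweep_0"].to_dataset()
print(ds_ts.sizes)  # should now include the volume_time dimension

# Select the first volume in the series
ds_first = ds_ts.isel(volume_time=0)
# e.g. plot reflectivity for that volume (variable name may differ):
# ds_first["DBZH"].plot(x="x", y="y", cmap="ChaseSpectral", vmin=-10, vmax=60)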
We have successfully created an Analysis-Ready Cloud-Optimized dataset.
Summary¶
We discussed the concept of Analysis-Ready Cloud-Optimized (ARCO) datasets, emphasizing the importance of datasets that are pre-processed, clean, and well-organized. Leveraging the Climate and Forecast (CF) format-based FM301 hierarchical tree structure, endorsed by the World Meteorological Organization (WMO), we developed an open data model to arrange, manage, and store radar data efficiently in cloud-storage buckets. The ultimate goal of radar ARCO data is to streamline the data science process, making datasets immediately usable without the need for extensive preprocessing.
What’s next?¶
Now, we can explore some quantitative precipitation estimation (QPE) and quasi-vertical profile (QVP) demos in the QPE-QVPs notebook.