
Analysis-Ready Cloud-Optimized Datasets



Overview

In this notebook, we will cover the main concepts behind creating Analysis-Ready Cloud-Optimized (ARCO) datasets for the geosciences.

  1. Analysis-Ready datasets
  2. Cloud-Optimized datasets
  3. FAIR principles
  4. Zarr format

Prerequisites

ConceptsImportanceNotes
Intro to XarrayNecessaryBasic features
Radar CookbookNecessaryRadar basics
Intro to ZarrNecessaryZarr basics
  • Time to learn: 30 minutes

Imports

import xarray as xr
import fsspec
from glob import glob
import xradar as xd
import matplotlib.pyplot as plt
import cmweather
import numpy as np
import hvplot.xarray
from xarray.core.datatree import DataTree
from zarr.errors import ContainsGroupError  
Loading...

Analysis-Ready

Analysis-Ready data is a concept that emphasizes the preparation and structuring of datasets to be immediately usable for analysis. In the CrowdFlower Data Science Report 2016, the “How Data Scientists Spend Their Time” figure illustrates the distribution of time that data scientists allocate to various tasks. The figure highlights that the majority of a data scientist’s time is dedicated to preparing and cleaning data (~80%), which is often considered the most time-consuming and critical part of the data science workflow.

[Figure: “How Data Scientists Spend Their Time”, CrowdFlower Data Science Report 2016]

Here is what Analysis-Ready (AR) data entails:

  • Datasets instead of data files
  • Pre-processed datasets, ensuring they are clean and well-organized
  • Dataset enriched with comprehensive metadata
  • Curated and cataloged
  • Facilitates a more efficient and accurate analysis
  • More time for fun (science)

Cloud-Optimized

NetCDF/Raw radar data formats are not cloud optimized. Other formats, like Zarr, aim to make accessing and reading data from the cloud fast and painless. Cloud-Optimized data is structured for efficient storage, access, and processing in cloud environments.

[Diagram: moving data to the cloud]

Cloud-Optimized data leverages scalable formats and parallel processing capabilities.

FAIR data

FAIR data adheres to principles that ensure it is Findable, Accessible, Interoperable, and Reusable. These guidelines promote data sharing, collaboration, and long-term usability across various platforms and disciplines.


“FAIR sharing of data is beneficial for both data producers and consumers. Consumers gain access to interesting datasets that would otherwise be out of reach. Producers get citations to their work, when consumers publish their derivative work. OME-Zarr is the technology basis for enabling effective FAIR sharing of large image datasets.” Zarr illustrations

Fair

Courtesy: Zarr illustrations

Zarr format

Zarr is a flexible and efficient format for storing large, chunked, compressed, multi-dimensional arrays, enabling easy and scalable data access in both local and cloud environments. It supports parallel processing and is widely used in scientific computing for handling large datasets.

zarr

Courtesy: Zarr illustrations

ARCO Radar data

Leveraging the Climate and Forecast (CF) format-based FM301 hierarchical tree structure, endorsed by the World Meteorological Organization (WMO), and Analysis-Ready Cloud-Optimized (ARCO) formats, we developed an open data model to arrange, manage, and store radar data in cloud-storage buckets efficiently.

CfRadial2.1/FM301 standard

Xradar employs xarray.DataTree objects to organize radar sweeps within a single hierarchical structure, where each sweep is an xarray.Dataset containing relevant metadata and variables.

xradar

Let’s see what this hierarchical DataTree looks like

# Connection to the Pythia S3 bucket
URL = 'https://js2.jetstream-cloud.org:8001/'
path = 'pythia/radar/erad2024'

fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))

# C-band radar files
path = "pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/*.nc"
radar_files = fs.glob(path)
radar_files[:3]
['pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/MonteLema_202405221300.nc', 'pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/MonteLema_202405221305.nc', 'pythia/radar/erad2024/20240522_MeteoSwiss_ARPA_Lombardia/Data/Cband/MonteLema_202405221310.nc']
# open files locally
local_files = [
    fsspec.open_local(
        f"simplecache::{URL}{i}", s3={"anon": True}, filecache={"cache_storage": "."}
    )
    for i in radar_files[:5]
]

We can open one of these NetCDF files using the xradar.io.open_cfradial1_datatree method

dt = xd.io.open_cfradial1_datatree(local_files[0])
display(dt)
Loading...

Let’s create our first ARCO dataset using the .to_zarr method.

dt.to_zarr("radar.zarr", consolidated=True)

We can check that a new Zarr store has been created (object storage)

!ls 
ARCO-Datasets.ipynb  QPE-QVPs.ipynb  radar.zarr

This store was created locally, but it could equally live in a cloud bucket. Let’s open it back using xr.backends.api.open_datatree

dt_back = xr.backends.api.open_datatree(
    "radar.zarr", 
    consolidated=True
)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'netcdf4' fails while guessing
  warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'h5netcdf' fails while guessing
  warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'scipy' fails while guessing
  warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
display(dt_back)
Loading...

Radar data time-series

Concatenating radar volumes along a temporal dimension is a great way to build a more organized and comprehensive dataset. By doing so, you maintain a cohesive dataset that is both easier to manage and more meaningful for temporal analysis.
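As a minimal sketch of the concatenation idea, consider two toy “volumes” stacked along a new `volume_time` dimension (the variable name and sizes below are invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

def toy_volume(timestamp):
    """Build a tiny fake radar 'volume' stamped with its scan time."""
    return xr.Dataset(
        {"reflectivity": (("azimuth", "range"), np.random.rand(4, 6))},
        coords={"azimuth": np.arange(4), "range": np.arange(6)},
    ).expand_dims(volume_time=[pd.Timestamp(timestamp)])

# Stack two scans along the new 'volume_time' dimension
series = xr.concat(
    [toy_volume("2024-05-22 13:00"), toy_volume("2024-05-22 13:05")],
    dim="volume_time",
)
print(series.sizes["volume_time"])  # 2
```

The real workflow below does the same thing per sweep, using `open_mfdataset` to handle the concatenation while reading the files.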

# let's use our local files
len(local_files)
5

To create an ARCO dataset, we need to ensure that all radar volumes are properly aligned. To achieve this, we developed the following function:

def fix_angle(ds: xr.Dataset, tolerance: float = None, **kwargs) -> xr.Dataset:
    """
    Reindex the radar azimuth angle so that all sweeps start and end at the same angle.
    @param ds: xarray Dataset containing an xradar sweep
    @param tolerance: Tolerance for interpolation between azimuth angles.
                      Defaults to the radar azimuth angle resolution.
    @return: azimuth-reindexed xarray Dataset
    """
    ds["time"] = ds.time.load()  # load the time coordinate into memory
    angle_dict = xd.util.extract_angle_parameters(ds)
    start_ang = angle_dict["start_angle"]
    stop_ang = angle_dict["stop_angle"]
    direction = angle_dict["direction"]
    ds = xd.util.remove_duplicate_rays(ds)
    az = len(np.arange(start_ang, stop_ang))
    ar = np.round(az / len(ds.azimuth.data), 2)  # azimuth angle resolution
    tolerance = ar if not tolerance else tolerance
    ds = xd.util.reindex_angle(
        ds, 
        start_ang,  
        stop_ang, 
        ar, 
        direction, 
        method="nearest", 
        tolerance=tolerance, **kwargs
    )
    return ds

Now, we can use the xarray.open_mfdataset method to open all the NetCDF files simultaneously. We can iterate over each sweep and concatenate them along the volume_time dimension.

# listing all the sweeps within each nc file
sweeps = [
    i[1:] for i in list(dt.groups) if i.startswith("/sweep") if i not in ["/"]
]
sweeps
['sweep_0', 'sweep_1', 'sweep_2', 'sweep_3', 'sweep_4', 'sweep_5', 'sweep_6', 'sweep_7', 'sweep_8', 'sweep_9', 'sweep_10', 'sweep_11', 'sweep_12', 'sweep_13', 'sweep_14', 'sweep_15', 'sweep_16', 'sweep_17', 'sweep_18', 'sweep_19']
for sweep in sweeps:
    root = {}
    ds = xr.open_mfdataset(
        local_files,
        preprocess=fix_angle,
        engine="cfradial1",
        group=sweep,
        concat_dim="volume_time",
        combine="nested",
    ).xradar.georeference()
    root[f"{sweep}"] = ds
    dtree = DataTree.from_dict(root)
    
    try:
        dtree.to_zarr(
            "radar_ts.zarr", 
            consolidated=True,
        )
    except ContainsGroupError:
        dtree.to_zarr(
            "radar_ts.zarr", 
            consolidated=True, 
            mode="a", 
        )
    del dtree, ds

Let’s look at our new radar time-series dataset

dtree = xr.backends.api.open_datatree(
    "radar_ts.zarr",
    consolidated=True,
    chunks={}
)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'netcdf4' fails while guessing
  warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'h5netcdf' fails while guessing
  warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
/srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/plugins.py:149: RuntimeWarning: 'scipy' fails while guessing
  warnings.warn(f"{engine!r} fails while guessing", RuntimeWarning)
dtree
Loading...

We have successfully created an Analysis-Ready Cloud-Optimized dataset.


Summary

We discussed the concept of Analysis-Ready Cloud-Optimized (ARCO) datasets, emphasizing the importance of datasets that are pre-processed, clean, and well-organized. Leveraging the Climate and Forecast (CF) format-based FM301 hierarchical tree structure, endorsed by the World Meteorological Organization (WMO), we developed an open data model to arrange, manage, and store radar data in cloud-storage buckets efficiently. The ultimate goal of radar ARCO data is to streamline the data science process, making datasets immediately usable without the need for extensive preprocessing.

What’s next?

Now, we can explore some quantitative precipitation estimation (QPE) and quasi-vertical profile (QVP) demos in the QPE-QVPs notebook.

Resources and references