Figure 1: Analysis-Ready Cloud-Optimized (ARCO) concept diagram

Analysis-Ready Cloud-Optimized Datasets¶

Overview¶

In this notebook, we explore Analysis-Ready Cloud-Optimized (ARCO) radar datasets using Canadian weather radar data. You’ll learn about:

Analysis-Ready datasets - Pre-processed data ready for immediate analysis
Cloud-Optimized formats - Efficient storage and access in cloud environments
FAIR principles - Making data Findable, Accessible, Interoperable, and Reusable
Zarr format - Modern chunked storage for large scientific datasets

We’ll use Canadian radar data from the May 2022 Ontario Derecho severe weather event.

Prerequisites¶

Table 1: Prerequisites for this tutorial

Concepts	Importance	Notes
Intro to Xarray	Necessary	Basic features
Radar Cookbook	Necessary	Radar basics
Intro to Zarr	Necessary	Zarr basics

Time to learn: 30 minutes

Imports¶

Analysis-Ready Data¶

Analysis-Ready data means datasets are prepared and structured to be immediately usable for scientific analysis. A commonly cited estimate is that data scientists spend around 80% of their time preparing and cleaning data rather than analyzing it.

Analysis-Ready datasets solve this by providing:

Clean, pre-processed data that’s ready to use
Rich metadata that explains what the data contains
Standardized formats that work well with analysis tools
Quality control that ensures data reliability

This means more time for science and discovery! 🚀

Figure 2: Analysis-Ready data workflow visualization

Key Benefits of Analysis-Ready Data:

✅ Datasets instead of scattered files - Organized collections of related data
✅ Pre-processed and clean - No need to spend hours fixing data issues
✅ Rich metadata included - Clear documentation of what the data represents
✅ Cataloged and discoverable - Easy to find relevant datasets
✅ Immediate analysis capability - Start analyzing right away
✅ More time for science! - Focus on research questions, not data wrangling

Cloud-Optimized Data¶

Traditional radar data formats (like individual NetCDF files) work well on local disks, but reading them from cloud object storage is often slow because a reader typically has to fetch a whole file to use a small part of it. Cloud-Optimized formats like Zarr are designed specifically for fast, efficient access from cloud storage.

Figure 3: Traditional vs Cloud-Optimized data access patterns

Why Cloud-Optimized matters:

Parallel access - Multiple users can read different parts simultaneously
Chunked storage - Only download the data you need
Fast streaming - No need to download entire files
Scalable processing - Handle datasets too large for local computers
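
To make the "only download the data you need" point concrete, here is a minimal sketch in plain Python (no Zarr required) that works out which chunks a request touches. The array shape, the chunk shape, and the `chunks_for_region` helper are illustrative assumptions, not values taken from the Canadian radar dataset.

```python
from itertools import product
from math import ceil


def chunks_for_region(region, chunk_shape):
    """Return the grid index of every chunk that overlaps `region`.

    `region` holds one (start, stop) pair per dimension, with `stop`
    exclusive, like a Python slice.
    """
    per_dim = []
    for (start, stop), size in zip(region, chunk_shape):
        first = start // size          # chunk holding the first element
        last = (stop - 1) // size      # chunk holding the last element
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


# Hypothetical layout: (time, azimuth, range gate)
array_shape = (288, 360, 1000)
chunk_shape = (24, 360, 250)

# Request: first 12 time steps, all azimuths, first 200 range gates
region = [(0, 12), (0, 360), (0, 200)]
needed = chunks_for_region(region, chunk_shape)

# Total number of chunks in the whole array
total = 1
for n, c in zip(array_shape, chunk_shape):
    total *= ceil(n / c)

print(f"{len(needed)} of {total} chunks needed: {needed}")
# prints: 1 of 48 chunks needed: [(0, 0, 0)]
```

With this layout the request is served by a single chunk, so a reader would transfer roughly 1/48 of the array. A chunk shape that matches the way the data is usually read keeps that fraction small.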

FAIR Data Principles¶

FAIR data follows principles that make scientific data more valuable and reusable:

Findable - Easy to discover through catalogs and search
Accessible - Available through standard protocols
Interoperable - Works with different tools and systems
Reusable - Well-documented for future use by others

Figure 4: FAIR (Findable, Accessible, Interoperable, Reusable) data principles diagram

FAIR data benefits everyone:

Data producers get citations when others use their datasets
Data consumers access interesting datasets that would otherwise be unavailable
Science advances through improved data sharing and collaboration

Figure 5: FAIR data reuse and collaboration cycle

Image courtesy: Zarr illustrations

Zarr format¶

Zarr is a modern storage format designed for large scientific datasets. Instead of storing data in single large files, Zarr breaks data into small “chunks” that can be:

Compressed to save storage space
Accessed in parallel by multiple users
Streamed efficiently from cloud storage
Processed on-demand without downloading everything

Think of it like having a library where you can grab just the books you need, rather than having to check out the entire library!
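
The idea can be sketched with a toy chunk store built only from the Python standard library. This is not the Zarr API or its on-disk layout: the `store` dictionary stands in for a cloud bucket, and `CHUNK`, `write_chunks`, and `read_value` are hypothetical names chosen for illustration.

```python
import zlib
from array import array

CHUNK = 4  # values per chunk (tiny on purpose)


def write_chunks(values, store):
    """Split `values` into fixed-size chunks and store each one compressed."""
    for i in range(0, len(values), CHUNK):
        raw = array("f", values[i:i + CHUNK]).tobytes()
        store[str(i // CHUNK)] = zlib.compress(raw)


def read_value(store, index):
    """Fetch one value by decompressing only the chunk that holds it."""
    blob = store[str(index // CHUNK)]
    chunk = array("f")
    chunk.frombytes(zlib.decompress(blob))
    return chunk[index % CHUNK]


store = {}  # stands in for a cloud bucket: key -> compressed bytes
write_chunks([float(v) for v in range(10)], store)

print(sorted(store))         # ['0', '1', '2']
print(read_value(store, 6))  # 6.0
```

Reading index 6 touches only the chunk stored under key `'1'`; the other two chunks are never decompressed. Real Zarr stores follow the same key-to-chunk pattern, and add array metadata, N-dimensional chunk grids, and configurable compressors.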

Figure 6: Monolithic vs chunked data storage comparison showing Zarr’s advantage

Image courtesy: Zarr illustrations

We’ll create Analysis-Ready Cloud-Optimized radar datasets using the CfRadial2.1/FM301 standard - a hierarchical structure endorsed by the World Meteorological Organization (WMO). This standard organizes radar data efficiently for both storage and analysis.

CfRadial2.1/FM301 standard¶

The DataTree structure organizes radar data hierarchically:

Root level: Contains general radar metadata (location, time, etc.)
Sweep levels: Each elevation angle gets its own dataset with radar variables
This structure mirrors how meteorologists think about radar scans
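
As a rough sketch of that hierarchy, the snippet below models one radar volume as a mapping from group path to contents, using plain Python dictionaries. The group names (`/sweep_0`, ...), the attribute names (`instrument_name`, `elevation_deg`), the variable names, and all values are placeholders chosen for illustration; they are not read from the standard or from the Canadian data.

```python
# Toy stand-in for one radar volume: group path -> contents.
# All names and values below are illustrative placeholders.
volume = {
    "/": {
        "attrs": {"instrument_name": "example_radar", "latitude": 0.0, "longitude": 0.0},
    },
    "/sweep_0": {"attrs": {"elevation_deg": 0.5}, "variables": ["DBZH", "VRADH"]},
    "/sweep_1": {"attrs": {"elevation_deg": 1.5}, "variables": ["DBZH", "VRADH"]},
    "/sweep_2": {"attrs": {"elevation_deg": 2.5}, "variables": ["DBZH", "VRADH"]},
}


def sweep_paths(tree):
    """Return the sweep group paths in numeric order."""
    paths = [p for p in tree if p.startswith("/sweep_")]
    return sorted(paths, key=lambda p: int(p.rsplit("_", 1)[1]))


def elevations(tree):
    """Map each sweep group to its elevation angle."""
    return {p: tree[p]["attrs"]["elevation_deg"] for p in sweep_paths(tree)}


# Root-level metadata applies to the whole volume
print(volume["/"]["attrs"]["instrument_name"])

# Each sweep carries its own angle and variables
for path, angle in elevations(volume).items():
    print(path, angle, volume[path]["variables"])
```

Keying groups by path keeps volume-wide metadata in one place and lets an analysis open a single sweep without touching the others. xarray's `DataTree.from_dict` accepts a similar path-keyed mapping, with an `xarray.Dataset` as each value.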

Figure 7: CfRadial2.1/FM301 hierarchical DataTree structure for radar data organization

Summary¶

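We learned what makes a dataset Analysis-Ready and Cloud-Optimized (ARCO), and how the FAIR principles, the Zarr format, and the CfRadial2.1/FM301 standard support that goal.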

🎯 Key Learning Outcomes:¶

📊 Analysis-Ready: Pre-processed, clean datasets ready for immediate scientific analysis
☁️ Cloud-Optimized: Efficient Zarr format enabling fast access from cloud storage
🌐 FAIR Principles: Making data Findable, Accessible, Interoperable, and Reusable
📈 Time Series: Combined multiple radar volumes to track storm evolution
🏗️ Standardized Structure: Used WMO-endorsed FM301 hierarchical organization

🚀 What This Enables:¶

Faster Research: No more data preprocessing - start analyzing immediately
Cloud Analytics: Process large datasets without downloading everything
Reproducible Science: Standardized formats work across different tools
Collaboration: Easy data sharing following FAIR principles
Storm Tracking: Time series analysis of severe weather events

The Ontario Derecho case study demonstrates how ARCO datasets streamline radar meteorology research and education! 🌪️
