Analysis-Ready Cloud-Optimized Datasets


Overview

In this notebook, we will explore Analysis-Ready Cloud-Optimized (ARCO) radar datasets using Canadian weather radar data. You’ll learn:

  1. Analysis-Ready datasets - Pre-processed data ready for immediate analysis
  2. Cloud-Optimized formats - Efficient storage and access in cloud environments
  3. FAIR principles - Making data Findable, Accessible, Interoperable, and Reusable
  4. Zarr format - Modern chunked storage for large scientific datasets

We’ll use Canadian radar data from the May 2022 Ontario Derecho severe weather event.

Prerequisites

Table 1: Prerequisites for this tutorial

Concepts         Importance   Notes
Intro to Xarray  Necessary    Basic features
Radar Cookbook   Necessary    Radar basics
Intro to Zarr    Necessary    Zarr basics

  • Time to learn: 30 minutes

Imports

Analysis-Ready Data

Analysis-Ready data means datasets are prepared and structured so they can be used for scientific analysis right away. A commonly cited estimate is that data scientists spend roughly 80% of their time preparing and cleaning data rather than analyzing it.

Analysis-Ready datasets solve this by providing:

  • Clean, pre-processed data that’s ready to use
  • Rich metadata that explains what the data contains
  • Standardized formats that work well with analysis tools
  • Quality control that ensures data reliability

This means more time for science and discovery! 🚀

Figure 2: Analysis-Ready data workflow visualization

Key Benefits of Analysis-Ready Data:

  • Datasets instead of scattered files - Organized collections of related data
  • Pre-processed and clean - No need to spend hours fixing data issues
  • Rich metadata included - Clear documentation of what the data represents
  • Cataloged and discoverable - Easy to find relevant datasets
  • Immediate analysis capability - Start analyzing right away
  • More time for science! - Focus on research questions, not data wrangling

Cloud-Optimized Data

Traditional radar data formats (like individual NetCDF files) work well on local computers but are slow and inefficient in cloud environments. Cloud-Optimized formats like Zarr are designed specifically for fast, efficient access from cloud storage.

Figure 3: Traditional vs Cloud-Optimized data access patterns

Why Cloud-Optimized matters:

  • Parallel access - Multiple users can read different parts simultaneously
  • Chunked storage - Only download the data you need
  • Fast streaming - No need to download entire files
  • Scalable processing - Handle datasets too large for local computers
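
To make "only download the data you need" concrete, here is a minimal sketch in plain Python of the index arithmetic a chunked reader has to do: given a requested window, work out which chunks overlap it. The array shape, chunk shape, and request are hypothetical values chosen for illustration, and `chunks_touched` and `total_chunks` are illustrative helper names, not functions from any library.

```python
from itertools import product


def chunks_touched(request, chunk_shape):
    """Return the chunk indices that overlap a half-open (start, stop) request."""
    per_axis = []
    for (start, stop), size in zip(request, chunk_shape):
        first = start // size          # chunk holding the first requested element
        last = (stop - 1) // size      # chunk holding the last requested element
        per_axis.append(range(first, last + 1))
    return list(product(*per_axis))


def total_chunks(shape, chunk_shape):
    """Count all chunks in the array, including partial chunks at the edges."""
    count = 1
    for dim, size in zip(shape, chunk_shape):
        count *= -(-dim // size)       # ceiling division
    return count


# Hypothetical sweep: 360 azimuths x 1000 range gates, chunked as 90 x 250.
shape = (360, 1000)
chunk_shape = (90, 250)

# Hypothetical request: the first 45 azimuths and the first 200 range gates.
request = [(0, 45), (0, 200)]

needed = chunks_touched(request, chunk_shape)
print(f"{len(needed)} of {total_chunks(shape, chunk_shape)} chunks needed: {needed}")
```

With these assumed numbers the request falls inside a single chunk out of 16, so only that one chunk would need to be fetched. A request that straddles chunk boundaries touches more chunks, which is why chunk shape is usually chosen to match the expected access pattern.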

FAIR Data Principles

FAIR data follows principles that make scientific data more valuable and reusable:

  • Findable - Easy to discover through catalogs and search
  • Accessible - Available through standard protocols
  • Interoperable - Works with different tools and systems
  • Reusable - Well-documented for future use by others

Figure 4: FAIR (Findable, Accessible, Interoperable, Reusable) data principles diagram

FAIR data benefits everyone:

  • Data producers get citations when others use their datasets
  • Data consumers access interesting datasets that would otherwise be unavailable
  • Science advances through improved data sharing and collaboration

Figure 5: FAIR data reuse and collaboration cycle

Image courtesy: Zarr illustrations

Zarr format

Zarr is a modern storage format designed for large scientific datasets. Instead of storing data in single large files, Zarr breaks data into small “chunks” that can be:

  • Compressed to save storage space
  • Accessed in parallel by multiple users
  • Streamed efficiently from cloud storage
  • Processed on-demand without downloading everything

Think of it like having a library where you can grab just the books you need, rather than having to check out the entire library!

Figure 6: Monolithic vs chunked data storage comparison showing Zarr’s advantage

Image courtesy: Zarr illustrations
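
The chunking idea can be sketched with nothing but the Python standard library. The toy store below is not the Zarr library and does not follow the Zarr specification. It only mimics the general pattern: split a 2D grid into blocks, compress each block independently, and store each under its own key. The helper names `write_chunks` and `read_value`, the `"row.col"` key style, and the sample grid are all assumptions made for illustration.

```python
import zlib
from array import array


def write_chunks(grid, chunk_rows, chunk_cols):
    """Split a 2D list of ints into compressed chunks keyed as 'row.col'."""
    store = {}
    n_rows, n_cols = len(grid), len(grid[0])
    for r0 in range(0, n_rows, chunk_rows):
        for c0 in range(0, n_cols, chunk_cols):
            block = array("i")
            # Flatten this block row by row (edge blocks may be smaller).
            for row in grid[r0:r0 + chunk_rows]:
                block.extend(row[c0:c0 + chunk_cols])
            key = f"{r0 // chunk_rows}.{c0 // chunk_cols}"
            store[key] = zlib.compress(block.tobytes())
    return store


def read_value(store, row, col, chunk_rows, chunk_cols, n_cols):
    """Read one value by decompressing only the chunk that contains it."""
    key = f"{row // chunk_rows}.{col // chunk_cols}"
    block = array("i")
    block.frombytes(zlib.decompress(store[key]))
    # Width of this particular chunk; the last column of chunks may be narrower.
    c0 = (col // chunk_cols) * chunk_cols
    width = min(chunk_cols, n_cols - c0)
    return block[(row % chunk_rows) * width + (col % chunk_cols)]


# Hypothetical 8 x 10 grid where each value encodes its own position.
grid = [[r * 100 + c for c in range(10)] for r in range(8)]
store = write_chunks(grid, chunk_rows=4, chunk_cols=5)

print(sorted(store))                          # the chunk keys
print(read_value(store, 6, 7, 4, 5, n_cols=10))  # touches only chunk '1.1'
```

In this sketch a plain dictionary stands in for cloud object storage. Because each chunk is an independent entry, different readers can fetch and decompress different keys at the same time, which is the property the bullets above describe.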

We’ll create Analysis-Ready Cloud-Optimized radar datasets using the CfRadial2.1/FM301 standard - a hierarchical structure endorsed by the World Meteorological Organization (WMO). This standard organizes radar data efficiently for both storage and analysis.

CfRadial2.1/FM301 standard

The DataTree structure organizes radar data hierarchically:

  • Root level: Contains general radar metadata (location, time, etc.)
  • Sweep levels: Each elevation angle gets its own dataset with radar variables
  • This structure mirrors how meteorologists think about radar scans

Figure 7: CfRadial2.1/FM301 hierarchical DataTree structure for radar data organization
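
The root-plus-sweeps layout described above can be sketched as a nested dictionary. This is a plain-Python stand-in, not an `xarray.DataTree` and not a conforming CfRadial2.1/FM301 file. The group names, attribute names, variable names, and all values are hypothetical placeholders chosen only to show the shape of the hierarchy.

```python
# Plain-Python stand-in for a root group with one child group per sweep.
# Every name and value here is a hypothetical placeholder.
radar_tree = {
    "attrs": {"instrument_name": "EXAMPLE_RADAR", "latitude": 0.0, "longitude": 0.0},
    "variables": [],
    "children": {
        "sweep_0": {
            "attrs": {"fixed_angle": 0.5},
            "variables": ["DBZH", "VRADH"],
            "children": {},
        },
        "sweep_1": {
            "attrs": {"fixed_angle": 1.5},
            "variables": ["DBZH", "VRADH"],
            "children": {},
        },
    },
}


def walk(node, path="/"):
    """Yield (path, node) pairs depth-first, starting with the root."""
    yield path, node
    for name, child in node["children"].items():
        yield from walk(child, f"{path.rstrip('/')}/{name}")


# List every group with its attributes and the variables it holds.
for path, node in walk(radar_tree):
    print(path, node["attrs"], node["variables"])
```

The point of the layout is that metadata shared by the whole volume lives once at the root, while each sweep carries its own elevation angle and its own radar variables, so a single sweep can be selected by path without touching the others.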


Summary

We learned about Analysis-Ready Cloud-Optimized (ARCO) datasets and the ideas behind them.

🎯 Key Learning Outcomes:

📊 Analysis-Ready: Pre-processed, clean datasets ready for immediate scientific analysis
☁️ Cloud-Optimized: Efficient Zarr format enabling fast access from cloud storage
🌐 FAIR Principles: Making data Findable, Accessible, Interoperable, and Reusable
📈 Time Series: Combined multiple radar volumes to track storm evolution
🏗️ Standardized Structure: Used WMO-endorsed FM301 hierarchical organization

🚀 What This Enables:

  • Faster Research: No more data preprocessing - start analyzing immediately
  • Cloud Analytics: Process large datasets without downloading everything
  • Reproducible Science: Standardized formats work across different tools
  • Collaboration: Easy data sharing following FAIR principles
  • Storm Tracking: Time series analysis of severe weather events

The Ontario Derecho case study demonstrates how ARCO datasets streamline radar meteorology research and education! 🌪️