Data Handling Module

The data handling module provides essential functions for loading, preprocessing, and preparing atmospheric data for budget analysis. It handles both single NetCDF files and multiple GFS files with proper coordinate system management.

Main Functions

src.data_handling.load_data(infile, longitude_indexer, args, app_logger)[source]

Loads data from a specified NetCDF file, handling both single files and multiple GFS files.

Parameters:
  • infile (str) – Path to the input .nc file or a pattern matching multiple files.

  • args – Parsed command-line arguments containing flags and options.

  • app_logger (logging.Logger) – Logger for recording messages about script progress and issues.

Returns:

The loaded dataset.

Return type:

xr.Dataset

Raises:
  • FileNotFoundError – If the input file or files specified by infile do not exist.

  • Exception – For any other issues encountered during data loading.

src.data_handling.preprocess_data(data, df_namelist, args, app_logger)[source]

Preprocesses the loaded data by sorting, slicing, and adjusting units as necessary.

Parameters:
  • data (xr.Dataset) – The loaded dataset to preprocess.

  • df_namelist (pd.DataFrame) – DataFrame containing namelist information such as variable names.

  • args – Parsed command-line arguments containing flags and options.

  • app_logger (logging.Logger) – Logger for recording messages about script progress and issues.

Returns:

The preprocessed dataset.

Return type:

xr.Dataset

Raises:
  • ValueError – If critical namelist variables are missing or if data preprocessing encounters an issue.

  • Exception – For any other issues encountered during data preprocessing.

Core Functions

Data Loading

load_data() - Primary function for loading atmospheric data from NetCDF files

  • Supports single NetCDF files and multi-file GFS datasets

  • Handles GRIB format conversion using cfgrib engine

  • Implements Dask parallel processing for large datasets

  • Automatic longitude coordinate conversion

  • Robust error handling and logging

Data Preprocessing

preprocess_data() - Comprehensive preprocessing pipeline for atmospheric data

  • Unit standardization (pressure levels to Pascal)

  • Coordinate sorting for consistent data arrangement

  • Domain slicing for computational efficiency

  • Radian coordinate assignment for mathematical calculations

  • Cosine latitude weighting preparation

Key Features

Multi-Format Support

NetCDF Files: * Standard atmospheric reanalysis data (ERA5, NCEP, etc.) * Single file or concatenated multi-file datasets * Automatic metadata preservation

GFS GRIB Files: * Operational weather model data * Multi-file time series handling * Pressure level filtering (isobaricInhPa) * Parallel loading with nested concatenation

Coordinate Management

Longitude Conversion: * Automatic detection of longitude conventions (0-360° vs -180-180°) * Standardization to consistent coordinate system * Proper handling of dateline crossing

Coordinate Sorting: * Longitude, latitude, and pressure level ordering * Ensures consistent integration results * Handles data from different sources uniformly

Radian Coordinates: * Conversion to radians for mathematical operations * Cosine latitude weighting for area calculations * Proper spherical coordinate handling

Data Optimization

Domain Slicing: * Reduces memory footprint by extracting relevant regions * Faster processing for regional analysis * Configurable through command-line arguments

Unit Standardization: * Pressure levels converted to Pascal (Pa) * Consistent physical units throughout analysis * MetPy integration for unit handling

Dask Integration: * Lazy loading for large datasets * Parallel processing capabilities * Memory-efficient chunked operations * Large chunk splitting configuration

Error Handling

Comprehensive error management:

  • FileNotFoundError - Missing input files

  • ValueError - Invalid namelist configurations

  • Exception - General processing errors

  • Detailed logging for debugging and monitoring

Preprocessing Pipeline

The preprocess_data() function implements a standardized pipeline:

  1. Validation - Check critical namelist variables

  2. Unit Conversion - Standardize pressure coordinates to Pa

  3. Sorting - Order coordinates consistently

  4. Domain Slicing - Extract relevant spatial/temporal regions

  5. Radian Assignment - Add mathematical coordinate systems

  6. Weighting Preparation - Compute cosine latitude factors

Usage Examples

Basic Data Loading

from src.data_handling import load_data, preprocess_data
import pandas as pd
import argparse
import logging

# Setup logging
logger = logging.getLogger(__name__)

# Load atmospheric data
dataset = load_data(
    infile='era5_data.nc',
    longitude_indexer='longitude',
    args=args,
    app_logger=logger
)

GFS Multi-File Loading

# Load GFS GRIB files (args.gfs = True)
gfs_dataset = load_data(
    infile='gfs_*.grib2',
    longitude_indexer='longitude',
    args=args,
    app_logger=logger
)

Complete Preprocessing

# Load namelist configuration
namelist_df = pd.read_csv('namelist.csv', index_col=0)

# Preprocess the loaded data
processed_data = preprocess_data(
    data=dataset,
    df_namelist=namelist_df,
    args=args,
    app_logger=logger
)

# Access processed coordinates
print(f"Pressure levels: {processed_data.level.values}")
print(f"Latitude range: {processed_data.latitude.values[[0,-1]]}")
print(f"Radian coordinates available: {'rlats' in processed_data.coords}")

Data Pipeline Integration

# Complete data preparation workflow
def prepare_atmospheric_data(input_file, namelist_path, args, logger):
    # Load namelist
    namelist_df = pd.read_csv(namelist_path, index_col=0)
    longitude_var = namelist_df.loc['Longitude']['Variable']

    # Load and preprocess data
    raw_data = load_data(input_file, longitude_var, args, logger)
    processed_data = preprocess_data(raw_data, namelist_df, args, logger)

    return processed_data, namelist_df

Supported Data Sources

ATMOS-BUD can work with any atmospheric dataset that contains the required meteorological variables, as long as the inputs/namelist file is configured correctly to map the variable names and coordinate systems.

Dataset Flexibility: * Any NetCDF or GRIB format atmospheric dataset * Custom variable names supported through namelist configuration * Flexible coordinate system handling (longitude, latitude, pressure, time) * Automatic unit conversion and standardization

Configuration Requirements:

To use any dataset, configure the inputs/namelist file as follows:

;standard_name;Variable;Units
Air Temperature;air_temperature;T;K
Geopotential;geopotential;Z;m**2/s**2
Specific Humidity;specific_humidity;Q;kg/kg
Omega Velocity;omega;W;Pa/s
Eastward Wind Component;eastward_wind;U;m/s
Northward Wind Component;northward_wind;V;m/s
Longitude;;longitude
Latitude;;latitude
Time;;time
Vertical Level;;level

Required Variables: * Temperature * Specific humidity * Horizontal wind components (u, v) * Vertical velocity (omega) * Geopotential or geopotential height * Coordinate arrays (longitude, latitude, pressure, time)