Data Handling Module
The data handling module provides essential functions for loading, preprocessing, and preparing atmospheric data for budget analysis. It handles both single NetCDF files and multiple GFS files with proper coordinate system management.
Main Functions
- src.data_handling.load_data(infile, longitude_indexer, args, app_logger)
Loads data from a specified NetCDF file, handling both single files and multiple GFS files.
- Parameters:
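infile (str) – Path to the input .nc file or a pattern matching multiple files.
longitude_indexer (str) – Name of the longitude coordinate in the dataset, as mapped in the namelist.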
args – Parsed command-line arguments containing flags and options.
app_logger (logging.Logger) – Logger for recording messages about script progress and issues.
- Returns:
The loaded dataset.
- Return type:
xr.Dataset
- Raises:
FileNotFoundError – If the input file or files specified by infile do not exist.
Exception – For any other issues encountered during data loading.
- src.data_handling.preprocess_data(data, df_namelist, args, app_logger)
Preprocesses the loaded data by sorting, slicing, and adjusting units as necessary.
- Parameters:
data (xr.Dataset) – The loaded dataset to preprocess.
df_namelist (pd.DataFrame) – DataFrame containing namelist information such as variable names.
args – Parsed command-line arguments containing flags and options.
app_logger (logging.Logger) – Logger for recording messages about script progress and issues.
- Returns:
The preprocessed dataset.
- Return type:
xr.Dataset
- Raises:
ValueError – If critical namelist variables are missing or if data preprocessing encounters an issue.
Exception – For any other issues encountered during data preprocessing.
Core Functions
Data Loading
load_data() - Primary function for loading atmospheric data from NetCDF files
* Supports single NetCDF files and multi-file GFS datasets
* Handles GRIB format conversion using cfgrib engine
* Implements Dask parallel processing for large datasets
* Automatic longitude coordinate conversion
* Robust error handling and logging
Data Preprocessing
preprocess_data() - Comprehensive preprocessing pipeline for atmospheric data
* Unit standardization (pressure levels to Pascal)
* Coordinate sorting for consistent data arrangement
* Domain slicing for computational efficiency
* Radian coordinate assignment for mathematical calculations
* Cosine latitude weighting preparation
Key Features
Multi-Format Support
NetCDF Files:
* Standard atmospheric reanalysis data (ERA5, NCEP, etc.)
* Single file or concatenated multi-file datasets
* Automatic metadata preservation
GFS GRIB Files:
* Operational weather model data
* Multi-file time series handling
* Pressure level filtering (isobaricInhPa)
* Parallel loading with nested concatenation
Coordinate Management
Longitude Conversion:
* Automatic detection of longitude conventions (0-360° vs -180-180°)
* Standardization to consistent coordinate system
* Proper handling of dateline crossing
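The arithmetic behind the longitude conversion can be sketched in plain Python. This is only an illustration, not the module's implementation: the helper names `uses_0_360` and `to_minus180_180` are hypothetical, and it assumes the target convention is -180 to 180°. With xarray, the same idea is commonly expressed with `assign_coords` followed by `sortby`, so that the data are reordered together with the coordinate.

```python
def uses_0_360(lons):
    """Return True if any longitude exceeds 180, which implies the 0-360 convention."""
    return any(lon > 180.0 for lon in lons)


def to_minus180_180(lons):
    """Wrap longitudes into [-180, 180) and return them in ascending order."""
    wrapped = [((lon + 180.0) % 360.0) - 180.0 for lon in lons]
    return sorted(wrapped)


lons = [0.0, 90.0, 180.0, 270.0, 359.0]
if uses_0_360(lons):
    lons = to_minus180_180(lons)
print(lons)  # [-180.0, -90.0, -1.0, 0.0, 90.0]
```

Sorting after wrapping matters: without it, the western hemisphere values would sit after the eastern ones and the coordinate would no longer be monotonic.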
Coordinate Sorting:
* Longitude, latitude, and pressure level ordering
* Ensures consistent integration results
* Handles data from different sources uniformly
Radian Coordinates:
* Conversion to radians for mathematical operations
* Cosine latitude weighting for area calculations
* Proper spherical coordinate handling
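To make the role of the radian coordinates and cosine latitude weights concrete, here is a minimal sketch using only the standard library. The function names are hypothetical and the module itself works on xarray objects; the point is only that grid cells shrink toward the poles on a regular latitude-longitude grid, so an area mean weights each latitude by cos(latitude).

```python
import math


def radian_coords(lats_deg):
    """Return latitudes in radians and their cosines (the area weights)."""
    rlats = [math.radians(lat) for lat in lats_deg]
    coslats = [math.cos(rlat) for rlat in rlats]
    return rlats, coslats


def area_weighted_mean(values, lats_deg):
    """Average one value per latitude, weighting each by cos(latitude)."""
    _, weights = radian_coords(lats_deg)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# A value at 60 degrees counts half as much as one at the equator.
print(area_weighted_mean([0.0, 10.0], [0.0, 60.0]))
```

With equal weights the mean above would be 5; the cosine weighting pulls it toward the equatorial value.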
Data Optimization
Domain Slicing:
* Reduces memory footprint by extracting relevant regions
* Faster processing for regional analysis
* Configurable through command-line arguments
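As a rough illustration of domain slicing, the sketch below extracts a latitude-longitude box from a small 2-D field stored as nested lists. The function `slice_domain` and its argument layout are assumptions made for this example; the module operates on xarray datasets, where label-based selection with slices generally expects a monotonic coordinate, which is one reason sorting precedes slicing in the pipeline.

```python
def slice_domain(field, lats, lons, lat_bounds, lon_bounds):
    """Return the part of `field` (rows = latitude, columns = longitude)
    that lies inside the inclusive bounds, with the matching coordinates."""
    lat_min, lat_max = lat_bounds
    lon_min, lon_max = lon_bounds
    rows = [i for i, lat in enumerate(lats) if lat_min <= lat <= lat_max]
    cols = [j for j, lon in enumerate(lons) if lon_min <= lon <= lon_max]
    subset = [[field[i][j] for j in cols] for i in rows]
    return subset, [lats[i] for i in rows], [lons[j] for j in cols]


lats = [-40.0, -30.0, -20.0, -10.0]
lons = [-60.0, -50.0, -40.0]
field = [[10 * i + j for j in range(len(lons))] for i in range(len(lats))]

subset, sub_lats, sub_lons = slice_domain(
    field, lats, lons, lat_bounds=(-30.0, -20.0), lon_bounds=(-50.0, -40.0)
)
print(subset)  # [[11, 12], [21, 22]]
```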
Unit Standardization:
* Pressure levels converted to Pascal (Pa)
* Consistent physical units throughout analysis
* MetPy integration for unit handling
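The pressure conversion itself is a single factor (1 hPa = 100 Pa). The sketch below shows it with plain arithmetic rather than MetPy; the function name and the set of accepted unit strings are assumptions for illustration only.

```python
HPA_TO_PA = 100.0  # 1 hPa (equivalently 1 mb) is 100 Pa


def levels_to_pascal(levels, units):
    """Return pressure levels in Pa, converting from hPa or mb when needed."""
    if units == "Pa":
        return [float(level) for level in levels]
    if units in ("hPa", "mb", "millibar"):
        return [level * HPA_TO_PA for level in levels]
    raise ValueError(f"Unsupported pressure unit: {units!r}")


print(levels_to_pascal([1000, 850, 500], "hPa"))  # [100000.0, 85000.0, 50000.0]
```

Raising on an unknown unit, instead of passing values through unchanged, avoids silently mixing hPa and Pa later in a budget calculation.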
Dask Integration:
* Lazy loading for large datasets
* Parallel processing capabilities
* Memory-efficient chunked operations
* Large chunk splitting configuration
Error Handling
Comprehensive error management:
* FileNotFoundError - Missing input files
* ValueError - Invalid namelist configurations
* Exception - General processing errors
* Detailed logging for debugging and monitoring
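The pattern of logging a problem and then raising a specific exception can be sketched as follows. This is not the module's code: `resolve_input_files` is a hypothetical helper, and it assumes that file patterns are expanded with the standard `glob` module.

```python
import glob
import logging


def resolve_input_files(infile, app_logger):
    """Expand a path or glob pattern and fail loudly when nothing matches."""
    paths = sorted(glob.glob(infile))
    if not paths:
        app_logger.error("No input files match %s", infile)
        raise FileNotFoundError(f"No input files match: {infile}")
    app_logger.info("Found %d input file(s)", len(paths))
    return paths


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_handling_example")

try:
    files = resolve_input_files("this_pattern_matches_nothing_*.nc", logger)
except FileNotFoundError as error:
    # The caller can report the problem and stop before any analysis starts.
    logger.info("Stopping: %s", error)
```

Sorting the matched paths keeps multi-file time series in a reproducible order when file names encode the forecast time.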
Preprocessing Pipeline
The preprocess_data() function implements a standardized pipeline:
1. Validation - Check critical namelist variables
2. Unit Conversion - Standardize pressure coordinates to Pa
3. Sorting - Order coordinates consistently
4. Domain Slicing - Extract relevant spatial/temporal regions
5. Radian Assignment - Add mathematical coordinate systems
6. Weighting Preparation - Compute cosine latitude factors
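The validation step at the start of the pipeline might look like the sketch below. Which namelist rows count as critical is not stated here, so the `CRITICAL_ROWS` tuple is an assumption, as are the function name and the use of a plain dictionary in place of the namelist DataFrame.

```python
# Assumed set of critical rows; the real pipeline may check a different set.
CRITICAL_ROWS = ("Longitude", "Latitude", "Time", "Vertical Level")


def validate_namelist(variables, critical_rows=CRITICAL_ROWS):
    """Return the critical name mapping, or raise ValueError listing what is missing.

    `variables` maps namelist row labels to variable names in the dataset.
    """
    missing = [row for row in critical_rows if not variables.get(row)]
    if missing:
        raise ValueError(
            "Namelist is missing critical variables: " + ", ".join(missing)
        )
    return {row: variables[row] for row in critical_rows}


variables = {
    "Longitude": "longitude",
    "Latitude": "latitude",
    "Time": "time",
    "Vertical Level": "level",
}
print(validate_namelist(variables))
```

Running this check first means a misconfigured namelist fails with a clear message before any unit conversion or slicing is attempted.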
Usage Examples
Basic Data Loading
from src.data_handling import load_data, preprocess_data
import pandas as pd
import argparse
import logging

# Setup logging
logger = logging.getLogger(__name__)

# `args` is the argparse.Namespace produced by the program's command-line parser

# Load atmospheric data
dataset = load_data(
    infile='era5_data.nc',
    longitude_indexer='longitude',
    args=args,
    app_logger=logger
)
GFS Multi-File Loading
# Load GFS GRIB files (args.gfs = True)
gfs_dataset = load_data(
    infile='gfs_*.grib2',
    longitude_indexer='longitude',
    args=args,
    app_logger=logger
)
Complete Preprocessing
# Load namelist configuration (the namelist is semicolon-delimited)
namelist_df = pd.read_csv('namelist.csv', sep=';', index_col=0)

# Preprocess the loaded data
processed_data = preprocess_data(
    data=dataset,
    df_namelist=namelist_df,
    args=args,
    app_logger=logger
)
# Access processed coordinates
print(f"Pressure levels: {processed_data.level.values}")
print(f"Latitude range: {processed_data.latitude.values[[0,-1]]}")
print(f"Radian coordinates available: {'rlats' in processed_data.coords}")
Data Pipeline Integration
# Complete data preparation workflow
def prepare_atmospheric_data(input_file, namelist_path, args, logger):
    # Load namelist (semicolon-delimited, row labels in the first column)
    namelist_df = pd.read_csv(namelist_path, sep=';', index_col=0)
    longitude_var = namelist_df.loc['Longitude']['Variable']

    # Load and preprocess data
    raw_data = load_data(input_file, longitude_var, args, logger)
    processed_data = preprocess_data(raw_data, namelist_df, args, logger)
    return processed_data, namelist_df
Supported Data Sources
ATMOS-BUD can work with any atmospheric dataset that contains the required meteorological variables, as long as the inputs/namelist file is configured correctly to map the variable names and coordinate systems.
Dataset Flexibility:
* Any NetCDF or GRIB format atmospheric dataset
* Custom variable names supported through namelist configuration
* Flexible coordinate system handling (longitude, latitude, pressure, time)
* Automatic unit conversion and standardization
Configuration Requirements:
To use any dataset, configure the inputs/namelist file as follows:
;standard_name;Variable;Units
Air Temperature;air_temperature;T;K
Geopotential;geopotential;Z;m**2/s**2
Specific Humidity;specific_humidity;Q;kg/kg
Omega Velocity;omega;W;Pa/s
Eastward Wind Component;eastward_wind;U;m/s
Northward Wind Component;northward_wind;V;m/s
Longitude;;longitude
Latitude;;latitude
Time;;time
Vertical Level;;level
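To show how the namelist above maps descriptive row labels to dataset variable names, here is a small sketch that parses the same semicolon-delimited text with the standard `csv` module. The `read_namelist` helper is hypothetical and only mirrors the file layout shown; ATMOS-BUD itself passes the namelist around as a pandas DataFrame.

```python
import csv
import io

NAMELIST_TEXT = """;standard_name;Variable;Units
Air Temperature;air_temperature;T;K
Omega Velocity;omega;W;Pa/s
Longitude;;longitude
Vertical Level;;level
"""


def read_namelist(text):
    """Parse namelist text into {row label: {column name: value}}."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    columns = next(reader)[1:]  # skip the empty label of the index column
    table = {}
    for row in reader:
        if not row:
            continue
        label, values = row[0], row[1:]
        # Coordinate rows omit trailing fields; pad them with empty strings.
        values += [""] * (len(columns) - len(values))
        table[label] = dict(zip(columns, values))
    return table


namelist = read_namelist(NAMELIST_TEXT)
print(namelist["Air Temperature"]["Variable"])  # T
print(namelist["Longitude"]["Variable"])        # longitude
```

The coordinate rows carry no standard name or units, so only their Variable column is meaningful when looking up dimension names.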
Required Variables:
* Temperature
* Specific humidity
* Horizontal wind components (u, v)
* Vertical velocity (omega)
* Geopotential or geopotential height
* Coordinate arrays (longitude, latitude, pressure, time)