Reliability First ================= .. verified:: 2025-11-12 :reviewer: Christof Buchbender The ops-db-api is designed with one overriding principle: **observatory operations must never fail due to network issues**. This document explains how we achieve this through transaction buffering, LSN tracking, and eventual consistency. .. contents:: Table of Contents :local: :depth: 2 The Core Requirement -------------------- During telescope observations at 5600m altitude in Chile: **What must work - Operations**: * Recording observation start/end times * Logging instrument configurations * Registering raw data files * Tracking observation status changes **What can fail**: * Network connection to main database (Cologne) * Replication lag to local replica * Remote API calls for auxiliary services **The guarantee**: Observatory operations **never** block or fail due to network issues. Transaction Buffering: The Safety Net -------------------------------------- How It Works ~~~~~~~~~~~~ When operating at a secondary site (observatory): 1. **Write request arrives** at critical endpoint (e.g., create observation) 2. **Transaction is built** with pre-generated IDs 3. **Buffer to Redis** immediately (sub-millisecond) 4. **Return success** to client with pre-generated ID 5. **Background processor** executes transaction on main DB asynchronously 6. **LSN tracking** confirms when data reaches replica 7. **Cache cleanup** occurs after replication confirmed .. mermaid:: sequenceDiagram participant Client participant API participant Redis participant BG as Background
Processor participant MainDB as Main DB
(Cologne) participant Replica as Local
Replica Client->>API: POST /executed_obs_units/start Note over API: Build transaction
Pre-generate UUID API->>Redis: LPUSH to buffer (instant) Redis-->>API: OK API-->>Client: 201 Created (UUID: abc123) Note over Client: Client continues
operation immediately loop Background Processing BG->>Redis: RPOP from buffer BG->>MainDB: Execute transaction MainDB-->>BG: LSN: 0/12345678 BG->>Replica: Poll replay LSN alt Replicated Replica-->>BG: LSN: 0/12345678 (caught up) BG->>Redis: Cleanup cache else Not Yet Replica-->>BG: LSN: 0/12345600 (behind) BG->>Redis: Extend cache TTL end end Benefits ~~~~~~~~ **Immediate Response**: * Client never waits for remote database * Sub-millisecond latency for buffered operations * No difference in experience during network issues **Guaranteed Execution**: * Redis persistence (AOF) ensures durability * Automatic retry with exponential backoff * Failed transactions move to dead-letter queue for investigation **Operational Continuity**: * Telescope observations never pause * Operators get immediate feedback * Data accumulates in buffer until network restored Pre-Generated IDs ~~~~~~~~~~~~~~~~~ A critical feature: IDs are generated **before** database insertion: .. code-block:: python # Traditional approach (doesn't work for buffering) def create_observation(data): obs = Observation(**data) db.add(obs) db.commit() return obs.id # ID only available after DB commit! # Our approach (works with buffering) def create_observation(data): obs_id = uuid.uuid4() # Generate immediately transaction = builder.create( model_class=Observation, data={"id": obs_id, **data} ) buffer_transaction(transaction) return obs_id # Available immediately! **Why this matters**: * Client can reference the observation immediately * Can create related records before DB commit (e.g., files linked to observation) * Status queries work immediately (check buffer + DB) See :doc:`../deep-dive/transaction-buffering/transaction-builder` for implementation. LSN Tracking: Precision Over Guesswork --------------------------------------- The Problem with Guessing ~~~~~~~~~~~~~~~~~~~~~~~~~~ **Without LSN tracking**, we'd have to guess when data has replicated: .. code-block:: python # Naive approach (unreliable) def wait_for_replication(record_id): await asyncio.sleep(5) # Hope it's replicated by now? return check_replica(record_id) # Might not be there yet! **Problems**: * Hard-coded delays waste time (too long) or miss replication (too short) * No way to know actual replication state * Can't adjust to varying network conditions * Cache management is guesswork LSN: The Solution ~~~~~~~~~~~~~~~~~ **PostgreSQL Log Sequence Numbers (LSN)** provide precision: **After write to main database**: .. code-block:: sql -- Get LSN after transaction commit SELECT pg_current_wal_lsn(); -- Returns: '0/12345678' **Check replica progress**: .. code-block:: sql -- Get LSN that replica has replayed SELECT pg_last_wal_replay_lsn(); -- Returns: '0/12345600' (slightly behind) **Compare**: .. code-block:: python main_lsn = "0/12345678" replica_lsn = "0/12345600" if replica_lsn >= main_lsn: print("Replica has our data!") else: bytes_behind = parse_lsn(main_lsn) - parse_lsn(replica_lsn) print(f"Replica is {bytes_behind} bytes behind") How We Use LSN ~~~~~~~~~~~~~~ **1. Track Transaction LSN**: When background processor executes a buffered transaction: .. code-block:: python # Execute transaction await executor.execute_transaction(transaction, session) await session.commit() # Capture LSN result = await session.execute("SELECT pg_current_wal_lsn()") lsn = result.scalar() # Store with transaction await redis.setex( f"transaction:{tx_id}:lsn", ttl=3600, value=lsn ) **2. Poll Replica Status**: Periodically check if replica has caught up: .. code-block:: python async def check_replication(transaction_lsn): # Query replica result = await replica_session.execute( "SELECT pg_last_wal_replay_lsn()" ) replica_lsn = result.scalar() # Compare if replica_lsn >= transaction_lsn: return True # Replicated! else: return False # Still waiting **3. Smart Cache Management**: Adjust cache TTL based on replication state: .. code-block:: python if replicated: # Data is in replica, can cleanup buffer await cleanup_transaction_cache(tx_id) else: # Still replicating, keep cache longer await extend_cache_ttl(tx_id, additional_seconds=60) **4. Timeout Handling**: If replication takes too long: .. code-block:: python if time_waiting > LSN_TIMEOUT: # Replication very slow or stuck logger.warning( f"LSN {transaction_lsn} not replicated after {LSN_TIMEOUT}s" ) # Keep cache indefinitely (or until manual cleanup) await set_cache_ttl(tx_id, ttl=None) See :doc:`../deep-dive/transaction-buffering/lsn-tracking` for full implementation. Automatic Retry with Exponential Backoff ----------------------------------------- When transaction execution fails, we retry intelligently: Retry Configuration ~~~~~~~~~~~~~~~~~~~ .. code-block:: bash # .env TRANSACTION_RETRY_ATTEMPTS=3 TRANSACTION_RETRY_DELAY=5 # Initial delay in seconds Retry Strategy ~~~~~~~~~~~~~~ .. code-block:: python async def execute_with_retry(transaction): attempt = 0 delay = TRANSACTION_RETRY_DELAY while attempt < TRANSACTION_RETRY_ATTEMPTS: try: result = await execute_transaction(transaction) return result except NetworkError as e: attempt += 1 if attempt >= TRANSACTION_RETRY_ATTEMPTS: # Move to failed queue await move_to_failed_queue(transaction) raise # Exponential backoff logger.warning( f"Transaction failed (attempt {attempt}), " f"retrying in {delay}s: {e}" ) await asyncio.sleep(delay) delay *= 2 # Exponential backoff **Why exponential backoff**: * Transient network issues often resolve quickly (first retry) * Persistent issues need longer wait (later retries) * Prevents overwhelming failed service with retry storm Failed Transaction Handling ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If all retries fail: .. code-block:: python # Move to dead-letter queue await redis.lpush( f"site:{site_name}:failed_transactions", json.dumps(transaction) ) # Log for investigation logger.error( f"Transaction {tx_id} failed after {attempts} attempts: {error}" ) # Alert monitoring system await alert_monitoring( severity="error", message=f"Transaction {tx_id} failed permanently" ) **Human intervention needed**: * Investigate root cause (network? database? bug?) * Manually replay failed transaction * Or mark as resolved if no longer needed See :doc:`../development/debugging-buffering` for debugging failed transactions. Eventual Consistency: Accept the Lag ------------------------------------- Why Eventual Is Good Enough ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Strong consistency** would require: .. code-block:: python def create_observation(data): # Write to main DB await main_db.insert(data) # Wait for all replicas to confirm for replica in replicas: await replica.wait_for_replication() # Blocks! # If any replica unreachable: FAIL return success **This doesn't work at observatory** because network to Cologne can be down. **Eventual consistency** means: .. code-block:: python def create_observation(data): # Buffer locally await buffer_transaction(data) # Never fails! # Return immediately return success # Client continues # Background: sync when possible # Eventually all sites will have consistent data **The guarantee**: Data *will* become consistent, we just can't say exactly *when*. How Long is "Eventually"? ~~~~~~~~~~~~~~~~~~~~~~~~~~ **Best case** (good network): * Buffered: < 1ms (Redis write) * Executed on main: 1-5 seconds (background processor) * Replicated to observatory: 1-10 seconds (depending on lag) * **Total**: Seconds **Worst case** (network down): * Buffered: < 1ms (Redis write) * Executed on main: Hours to days (when network restored) * Replicated to observatory: Hours to days + replication time * **Total**: Until network restoration + sync time **Acceptable because**: * Observatory operations don't block (buffering works) * Scientists query data hours/days later (lag doesn't matter) * Real-time monitoring uses WebSockets (immediate local updates) Smart Query Manager: Consistent View ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Even during replication lag, reads are consistent: .. code-block:: python async def get_observations(obs_unit_id): # Query local replica db_records = await db.query( Observation ).filter( Observation.obs_unit_id == obs_unit_id ) # Query buffer buffered_records = await get_buffered_records( "Observation", {"obs_unit_id": obs_unit_id} ) # Query read buffer (updates to buffered records) read_buffer_updates = await get_read_buffer_updates( "Observation", {"obs_unit_id": obs_unit_id} ) # Merge: buffer + updates override DB return merge_records( db_records, buffered_records, read_buffer_updates ) **Result**: Client always sees latest state, even if not yet in database. See :doc:`../deep-dive/transaction-buffering/smart-query-manager` for details. Read Buffer: Mutable Buffered Records -------------------------------------- The Problem ~~~~~~~~~~~ Buffered records sometimes need updates before replication: .. code-block:: python # Create observation (buffered) obs_id = await create_observation({ "status": "running", "start_time": "2025-01-01T00:00:00Z" }) # 10 seconds later: observation finishes # But original record still in buffer (not in DB yet) # How to update it? **Can't modify database** (record not there yet) **Can't modify buffer** (breaks transaction atomicity) The Solution: Read Buffer ~~~~~~~~~~~~~~~~~~~~~~~~~~ Separate buffer for tracking updates to buffered records: .. code-block:: python # Update buffered observation await update_read_buffer( model_class="Observation", record_id=obs_id, updates={ "status": "completed", "end_time": "2025-01-01T00:10:00Z" }, transaction_id=original_tx_id ) **Read buffer tracks**: * Which buffered record to update * What fields changed * When update happened * Which transaction created original record **Smart query manager merges**: 1. Database record (if replicated) 2. Buffered record (if still in buffer) 3. Read buffer updates (latest changes) **Cleanup happens** when LSN confirms replication. See :doc:`../deep-dive/transaction-buffering/read-buffer-manager` for implementation. UI Can Tolerate Staleness -------------------------- Why UI Users Don't Mind ~~~~~~~~~~~~~~~~~~~~~~~~ The web UI (ops-db-ui) queries the API for data visualization: **Typical queries**: * Transfer overview dashboard * Observation history for past week * Source visibility calculations * Raw data package listings **Staleness tolerance**: * **Seconds of lag**: Imperceptible to human users * **Minutes of lag**: Acceptable for most dashboards * **Hours of lag**: Only matters for real-time monitoring **Real-time needs** (WebSockets): For truly real-time data (active transfers, current observations): * WebSockets provide immediate updates * Redis pub/sub broadcasts changes * UI updates instantly regardless of replication **Example**: Transfer monitoring .. code-block:: python # REST API: May be seconds behind GET /api/transfer/overview # WebSocket: Immediate updates WS /api/transfer/ws/overview # Receives: { "transfer_id": 123, "status": "completed" } # Instantly! Health Monitoring ----------------- The API exposes health information about the buffering system: Health Endpoint ~~~~~~~~~~~~~~~ .. code-block:: bash curl http://localhost:8000/health Response: .. code-block:: json { "status": "healthy", "database": "connected", "redis": "connected", "transaction_buffer": { "size": 12, "failed": 0, "oldest_pending_seconds": 5.2 }, "background_processor": { "status": "running", "last_run": "2025-01-01T00:00:00Z", "transactions_processed": 150, "success_rate": 0.98 }, "replication_lag": { "current_lag_seconds": 2.5, "main_lsn": "0/12345678", "replica_lsn": "0/12345600" } } Buffer Statistics ~~~~~~~~~~~~~~~~~ .. code-block:: bash curl http://localhost:8000/buffer-stats Response: .. code-block:: json { "pending_transactions": 12, "failed_transactions": 0, "processing_rate": 10.5, "average_execution_time_ms": 125, "buffer_size_mb": 2.3 } Monitoring Alerts ~~~~~~~~~~~~~~~~~ Set up monitoring for: **High buffer size**: * Alert if buffer > 1000 transactions * Indicates network issue or main DB problem **Failed transactions**: * Alert on any failed transaction * Requires human investigation **Replication lag**: * Alert if lag > 5 minutes * Indicates replica performance issue **Background processor stopped**: * Alert if processor not running * Critical - buffer won't drain Summary ------- Reliability is achieved through: **Transaction Buffering**: * Immediate buffering to Redis * Never blocks client * Automatic async execution * Pre-generated IDs for immediate reference **LSN Tracking**: * Precise replication state knowledge * Smart cache management * No guesswork about consistency **Automatic Retry**: * Exponential backoff * Failed transaction queue * Human oversight for permanent failures **Eventual Consistency**: * Accept lag in exchange for reliability * Smart queries merge buffered + persisted * Read buffer tracks updates to buffered records **Monitoring**: * Health endpoints * Buffer statistics * Alerting for anomalies **The result**: Observatory operations continue regardless of network state, data eventually becomes consistent, and we know precisely when that has happened. Next Steps ---------- * :doc:`../deep-dive/transaction-buffering/overview` - Technical implementation details * :doc:`../development/debugging-buffering` - Troubleshooting reliability issues