Reliability First
=================
.. verified:: 2025-11-12
:reviewer: Christof Buchbender
The ops-db-api is designed with one overriding principle: **observatory operations must never fail due to network issues**. This document explains how we achieve this through transaction buffering, LSN tracking, and eventual consistency.
.. contents:: Table of Contents
:local:
:depth: 2
The Core Requirement
--------------------
During telescope observations at 5600m altitude in Chile:
**What must work** (operations):
* Recording observation start/end times
* Logging instrument configurations
* Registering raw data files
* Tracking observation status changes
**What can fail**:
* Network connection to main database (Cologne)
* Replication lag to local replica
* Remote API calls for auxiliary services
**The guarantee**: Observatory operations **never** block or fail due to network issues.
Transaction Buffering: The Safety Net
--------------------------------------
How It Works
~~~~~~~~~~~~
When operating at a secondary site (observatory):
1. **Write request arrives** at critical endpoint (e.g., create observation)
2. **Transaction is built** with pre-generated IDs
3. **Buffer to Redis** immediately (sub-millisecond)
4. **Return success** to client with pre-generated ID
5. **Background processor** executes transaction on main DB asynchronously
6. **LSN tracking** confirms when data reaches replica
7. **Cache cleanup** occurs after replication confirmed
.. mermaid::

    sequenceDiagram
        participant Client
        participant API
        participant Redis
        participant BG as Background Processor
        participant MainDB as Main DB (Cologne)
        participant Replica as Local Replica

        Client->>API: POST /executed_obs_units/start
        Note over API: Build transaction<br/>Pre-generate UUID
        API->>Redis: LPUSH to buffer (instant)
        Redis-->>API: OK
        API-->>Client: 201 Created (UUID: abc123)
        Note over Client: Client continues<br/>operation immediately

        loop Background Processing
            BG->>Redis: RPOP from buffer
            BG->>MainDB: Execute transaction
            MainDB-->>BG: LSN: 0/12345678
            BG->>Replica: Poll replay LSN
            alt Replicated
                Replica-->>BG: LSN: 0/12345678 (caught up)
                BG->>Redis: Cleanup cache
            else Not Yet
                Replica-->>BG: LSN: 0/12345600 (behind)
                BG->>Redis: Extend cache TTL
            end
        end
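Steps 3-5 hinge on the background processor draining the Redis buffer independently of the request path. A minimal sketch of that loop, assuming a ``redis.asyncio`` client, JSON-serialized transactions, and the ``execute_with_retry`` helper shown later in this document (the buffer key name is hypothetical):

.. code-block:: python

    import json

    import redis.asyncio as aioredis

    BUFFER_KEY = "site:observatory:transaction_buffer"  # hypothetical key name

    async def drain_buffer(redis_client: aioredis.Redis) -> None:
        """Pop buffered transactions (FIFO) and execute them on the main DB."""
        while True:
            # BRPOP blocks until a transaction is available; together with
            # LPUSH on the write path this gives first-in, first-out order
            _key, raw = await redis_client.brpop(BUFFER_KEY)
            transaction = json.loads(raw)
            try:
                await execute_with_retry(transaction)
            except Exception:
                # Retries exhausted: execute_with_retry has already moved
                # the transaction to the dead-letter queue
                continue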
Benefits
~~~~~~~~
**Immediate Response**:
* Client never waits for remote database
* Sub-millisecond latency for buffered operations
* No difference in experience during network issues
**Guaranteed Execution**:
* Redis persistence (AOF) ensures durability
* Automatic retry with exponential backoff
* Failed transactions move to dead-letter queue for investigation
**Operational Continuity**:
* Telescope observations never pause
* Operators get immediate feedback
* Data accumulates in buffer until network restored
Pre-Generated IDs
~~~~~~~~~~~~~~~~~
A critical feature: IDs are generated **before** database insertion:
.. code-block:: python
# Traditional approach (doesn't work for buffering)
def create_observation(data):
obs = Observation(**data)
db.add(obs)
db.commit()
return obs.id # ID only available after DB commit!
# Our approach (works with buffering)
def create_observation(data):
obs_id = uuid.uuid4() # Generate immediately
transaction = builder.create(
model_class=Observation,
data={"id": obs_id, **data}
)
buffer_transaction(transaction)
return obs_id # Available immediately!
**Why this matters**:
* Client can reference the observation immediately
* Can create related records before DB commit (e.g., files linked to an observation; see the sketch below)
* Status queries work immediately (check buffer + DB)
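For instance, a file record can reference the observation while both are still in the buffer. A short sketch reusing the helpers above (``RawDataFile`` and the file path are illustrative):

.. code-block:: python

    import uuid

    obs_id = create_observation({"status": "running"})

    # The observation exists only in the buffer, yet its id is already
    # a valid reference for related records
    file_tx = builder.create(
        model_class=RawDataFile,
        data={
            "id": uuid.uuid4(),
            "observation_id": obs_id,
            "path": "/raw/scan_001.fits",
        },
    )
    buffer_transaction(file_tx)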
See :doc:`../deep-dive/transaction-buffering/transaction-builder` for implementation.
LSN Tracking: Precision Over Guesswork
---------------------------------------
The Problem with Guessing
~~~~~~~~~~~~~~~~~~~~~~~~~~
**Without LSN tracking**, we'd have to guess when data has replicated:
.. code-block:: python

    # Naive approach (unreliable)
    async def wait_for_replication(record_id):
        await asyncio.sleep(5)  # Hope it's replicated by now?
        return await check_replica(record_id)  # Might not be there yet!
**Problems**:
* Hard-coded delays waste time (too long) or miss replication (too short)
* No way to know actual replication state
* Can't adjust to varying network conditions
* Cache management is guesswork
LSN: The Solution
~~~~~~~~~~~~~~~~~
**PostgreSQL Log Sequence Numbers (LSN)** provide precision:
**After write to main database**:
.. code-block:: sql
-- Get LSN after transaction commit
SELECT pg_current_wal_lsn();
-- Returns: '0/12345678'
**Check replica progress**:
.. code-block:: sql
-- Get LSN that replica has replayed
SELECT pg_last_wal_replay_lsn();
-- Returns: '0/12345600' (slightly behind)
**Compare**:

.. code-block:: python

    main_lsn = "0/12345678"
    replica_lsn = "0/12345600"

    # Compare numerically -- plain string comparison is unreliable
    # once the hex parts differ in length (e.g. '0/9' vs '0/10')
    if parse_lsn(replica_lsn) >= parse_lsn(main_lsn):
        print("Replica has our data!")
    else:
        bytes_behind = parse_lsn(main_lsn) - parse_lsn(replica_lsn)
        print(f"Replica is {bytes_behind} bytes behind")
How We Use LSN
~~~~~~~~~~~~~~
**1. Track Transaction LSN**:
When background processor executes a buffered transaction:
.. code-block:: python

    from sqlalchemy import text

    # Execute transaction
    await executor.execute_transaction(transaction, session)
    await session.commit()

    # Capture the LSN of the commit we just made
    result = await session.execute(text("SELECT pg_current_wal_lsn()"))
    lsn = result.scalar()

    # Store with the transaction (setex arguments: key, TTL seconds, value)
    await redis.setex(f"transaction:{tx_id}:lsn", 3600, lsn)
**2. Poll Replica Status**:
Periodically check if replica has caught up:
.. code-block:: python

    from sqlalchemy import text

    async def check_replication(transaction_lsn: str) -> bool:
        # Ask the replica which LSN it has replayed so far
        result = await replica_session.execute(
            text("SELECT pg_last_wal_replay_lsn()")
        )
        replica_lsn = result.scalar()

        # Numeric comparison via parse_lsn (see above)
        return parse_lsn(replica_lsn) >= parse_lsn(transaction_lsn)
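PostgreSQL can also do the comparison server-side: ``pg_wal_lsn_diff(a, b)`` returns ``a - b`` in bytes. A sketch of the same check, reusing ``replica_session`` from above:

.. code-block:: python

    from sqlalchemy import text

    async def check_replication_sql(transaction_lsn: str) -> bool:
        # A non-positive difference means the replica has replayed
        # at least up to our transaction's LSN
        result = await replica_session.execute(
            text("SELECT pg_wal_lsn_diff(CAST(:lsn AS pg_lsn), "
                 "pg_last_wal_replay_lsn())"),
            {"lsn": transaction_lsn},
        )
        return result.scalar() <= 0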
**3. Smart Cache Management**:
Adjust cache TTL based on replication state:
.. code-block:: python
if replicated:
# Data is in replica, can cleanup buffer
await cleanup_transaction_cache(tx_id)
else:
# Still replicating, keep cache longer
await extend_cache_ttl(tx_id, additional_seconds=60)
**4. Timeout Handling**:
If replication takes too long:
.. code-block:: python
if time_waiting > LSN_TIMEOUT:
# Replication very slow or stuck
logger.warning(
f"LSN {transaction_lsn} not replicated after {LSN_TIMEOUT}s"
)
# Keep cache indefinitely (or until manual cleanup)
await set_cache_ttl(tx_id, ttl=None)
See :doc:`../deep-dive/transaction-buffering/lsn-tracking` for full implementation.
Automatic Retry with Exponential Backoff
-----------------------------------------
When transaction execution fails, we retry intelligently:
Retry Configuration
~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
# .env
TRANSACTION_RETRY_ATTEMPTS=3
TRANSACTION_RETRY_DELAY=5 # Initial delay in seconds
Retry Strategy
~~~~~~~~~~~~~~
.. code-block:: python
async def execute_with_retry(transaction):
attempt = 0
delay = TRANSACTION_RETRY_DELAY
while attempt < TRANSACTION_RETRY_ATTEMPTS:
try:
result = await execute_transaction(transaction)
return result
except NetworkError as e:
attempt += 1
if attempt >= TRANSACTION_RETRY_ATTEMPTS:
# Move to failed queue
await move_to_failed_queue(transaction)
raise
# Exponential backoff
logger.warning(
f"Transaction failed (attempt {attempt}), "
f"retrying in {delay}s: {e}"
)
await asyncio.sleep(delay)
delay *= 2 # Exponential backoff
**Why exponential backoff**:
* Transient network issues often resolve quickly (first retry)
* Persistent issues need longer wait (later retries)
* Prevents overwhelming failed service with retry storm
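With the configuration above (initial delay 5s, 3 attempts), only two delays are slept before the transaction moves to the failed queue; a quick sketch:

.. code-block:: python

    def backoff_schedule(initial_delay: int, attempts: int) -> list[int]:
        """Delays slept between successive attempts (none after the last)."""
        return [initial_delay * 2**i for i in range(attempts - 1)]

    # TRANSACTION_RETRY_DELAY=5, TRANSACTION_RETRY_ATTEMPTS=3:
    # backoff_schedule(5, 3) == [5, 10] -> retry after 5s, then 10s, then fail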
Failed Transaction Handling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If all retries fail:
.. code-block:: python
# Move to dead-letter queue
await redis.lpush(
f"site:{site_name}:failed_transactions",
json.dumps(transaction)
)
# Log for investigation
logger.error(
f"Transaction {tx_id} failed after {attempts} attempts: {error}"
)
# Alert monitoring system
await alert_monitoring(
severity="error",
message=f"Transaction {tx_id} failed permanently"
)
**Human intervention needed**:
* Investigate root cause (network? database? bug?)
* Manually replay the failed transaction (a sketch follows below)
* Or mark as resolved if no longer needed
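Replaying can be as simple as moving the transaction back onto the normal buffer so the background processor retries it with a fresh attempt budget. A minimal sketch (``buffer_transaction`` as above; error handling omitted):

.. code-block:: python

    import json

    async def replay_failed_transaction(redis_client, site_name: str) -> bool:
        # Take one transaction from the dead-letter queue
        raw = await redis_client.rpop(f"site:{site_name}:failed_transactions")
        if raw is None:
            return False  # Nothing to replay

        # Re-buffer it; the background processor will pick it up
        await buffer_transaction(json.loads(raw))
        return True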
See :doc:`../development/debugging-buffering` for debugging failed transactions.
Eventual Consistency: Accept the Lag
-------------------------------------
Why Eventual Is Good Enough
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Strong consistency** would require:
.. code-block:: python

    async def create_observation(data):
        # Write to main DB
        await main_db.insert(data)

        # Wait for all replicas to confirm
        for replica in replicas:
            await replica.wait_for_replication()  # Blocks!

        # If any replica is unreachable: FAIL
        return success
**This doesn't work at observatory** because network to Cologne can be down.
**Eventual consistency** means:
.. code-block:: python

    async def create_observation(data):
        # Buffer locally
        await buffer_transaction(data)  # Never fails!

        # Return immediately
        return success  # Client continues

    # Background: sync when possible
    # Eventually all sites will have consistent data
**The guarantee**: Data *will* become consistent; we just can't say exactly *when*.
How Long is "Eventually"?
~~~~~~~~~~~~~~~~~~~~~~~~~~
**Best case** (good network):
* Buffered: < 1ms (Redis write)
* Executed on main: 1-5 seconds (background processor)
* Replicated to observatory: 1-10 seconds (depending on lag)
* **Total**: Seconds
**Worst case** (network down):
* Buffered: < 1ms (Redis write)
* Executed on main: Hours to days (when network restored)
* Replicated to observatory: Hours to days + replication time
* **Total**: Until network restoration + sync time
**Acceptable because**:
* Observatory operations don't block (buffering works)
* Scientists query data hours/days later (lag doesn't matter)
* Real-time monitoring uses WebSockets (immediate local updates)
Smart Query Manager: Consistent View
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Even during replication lag, reads are consistent:
.. code-block:: python
async def get_observations(obs_unit_id):
# Query local replica
db_records = await db.query(
Observation
).filter(
Observation.obs_unit_id == obs_unit_id
)
# Query buffer
buffered_records = await get_buffered_records(
"Observation",
{"obs_unit_id": obs_unit_id}
)
# Query read buffer (updates to buffered records)
read_buffer_updates = await get_read_buffer_updates(
"Observation",
{"obs_unit_id": obs_unit_id}
)
# Merge: buffer + updates override DB
return merge_records(
db_records,
buffered_records,
read_buffer_updates
)
**Result**: Client always sees latest state, even if not yet in database.
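``merge_records`` is left undefined above; one way to implement it, assuming records are dicts with an ``id`` field and that read-buffer updates map record ids to changed fields:

.. code-block:: python

    def merge_records(db_records, buffered_records, read_buffer_updates):
        # Start from what the replica already has
        merged = {r["id"]: dict(r) for r in db_records}

        # Whole buffered records override / extend the DB view
        for record in buffered_records:
            merged[record["id"]] = dict(record)

        # Read-buffer updates are partial: apply field-by-field on top
        for record_id, updates in read_buffer_updates.items():
            merged.setdefault(record_id, {"id": record_id}).update(updates)

        return list(merged.values())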
See :doc:`../deep-dive/transaction-buffering/smart-query-manager` for details.
Read Buffer: Mutable Buffered Records
--------------------------------------
The Problem
~~~~~~~~~~~
Buffered records sometimes need updates before replication:
.. code-block:: python
# Create observation (buffered)
obs_id = await create_observation({
"status": "running",
"start_time": "2025-01-01T00:00:00Z"
})
# 10 seconds later: observation finishes
# But original record still in buffer (not in DB yet)
# How to update it?
We **can't modify the database** (the record isn't there yet), and we **can't modify the buffer** (that would break transaction atomicity).
The Solution: Read Buffer
~~~~~~~~~~~~~~~~~~~~~~~~~~
Separate buffer for tracking updates to buffered records:
.. code-block:: python
# Update buffered observation
await update_read_buffer(
model_class="Observation",
record_id=obs_id,
updates={
"status": "completed",
"end_time": "2025-01-01T00:10:00Z"
},
transaction_id=original_tx_id
)
**Read buffer tracks**:
* Which buffered record to update
* What fields changed
* When update happened
* Which transaction created original record
**Smart query manager merges**:
1. Database record (if replicated)
2. Buffered record (if still in buffer)
3. Read buffer updates (latest changes)
**Cleanup happens** when LSN confirms replication.
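One plausible implementation of the ``update_read_buffer`` helper used above, storing each update as a Redis hash (the key scheme is hypothetical; the module-level ``redis`` client matches the other snippets):

.. code-block:: python

    import json
    import time

    async def update_read_buffer(model_class, record_id, updates, transaction_id):
        key = f"read_buffer:{model_class}:{record_id}"
        await redis.hset(key, mapping={
            "updates": json.dumps(updates),         # which fields changed
            "updated_at": time.time(),              # when the update happened
            "transaction_id": str(transaction_id),  # original transaction
        })
        # Keep until LSN tracking confirms replication, then clean up
        await redis.expire(key, 3600)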
See :doc:`../deep-dive/transaction-buffering/read-buffer-manager` for implementation.
UI Can Tolerate Staleness
--------------------------
Why UI Users Don't Mind
~~~~~~~~~~~~~~~~~~~~~~~~
The web UI (ops-db-ui) queries the API for data visualization:
**Typical queries**:
* Transfer overview dashboard
* Observation history for past week
* Source visibility calculations
* Raw data package listings
**Staleness tolerance**:
* **Seconds of lag**: Imperceptible to human users
* **Minutes of lag**: Acceptable for most dashboards
* **Hours of lag**: Only matters for real-time monitoring
**Real-time needs** (WebSockets):
For truly real-time data (active transfers, current observations):
* WebSockets provide immediate updates
* Redis pub/sub broadcasts changes
* UI updates instantly regardless of replication
**Example**: Transfer monitoring
.. code-block:: text

    # REST API: may be seconds behind
    GET /api/transfer/overview

    # WebSocket: immediate updates
    WS /api/transfer/ws/overview
    # Receives: { "transfer_id": 123, "status": "completed" } -- instantly!
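A minimal client sketch using the third-party ``websockets`` package; the URL and payload shape follow the example above, and the host is assumed:

.. code-block:: python

    import asyncio
    import json

    import websockets  # pip install websockets

    async def watch_transfers():
        uri = "ws://localhost:8000/api/transfer/ws/overview"  # host assumed
        async with websockets.connect(uri) as ws:
            async for message in ws:
                event = json.loads(message)
                print(f"Transfer {event['transfer_id']}: {event['status']}")

    asyncio.run(watch_transfers())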
Health Monitoring
-----------------
The API exposes health information about the buffering system:
Health Endpoint
~~~~~~~~~~~~~~~
.. code-block:: bash
curl http://localhost:8000/health
Response:
.. code-block:: json
{
"status": "healthy",
"database": "connected",
"redis": "connected",
"transaction_buffer": {
"size": 12,
"failed": 0,
"oldest_pending_seconds": 5.2
},
"background_processor": {
"status": "running",
"last_run": "2025-01-01T00:00:00Z",
"transactions_processed": 150,
"success_rate": 0.98
},
"replication_lag": {
"current_lag_seconds": 2.5,
"main_lsn": "0/12345678",
"replica_lsn": "0/12345600"
}
}
Buffer Statistics
~~~~~~~~~~~~~~~~~
.. code-block:: bash
curl http://localhost:8000/buffer-stats
Response:
.. code-block:: json
{
"pending_transactions": 12,
"failed_transactions": 0,
"processing_rate": 10.5,
"average_execution_time_ms": 125,
"buffer_size_mb": 2.3
}
Monitoring Alerts
~~~~~~~~~~~~~~~~~
Set up monitoring for:
**High buffer size**:
* Alert if buffer > 1000 transactions
* Indicates network issue or main DB problem
**Failed transactions**:
* Alert on any failed transaction
* Requires human investigation
**Replication lag**:
* Alert if lag > 5 minutes
* Indicates replica performance issue
**Background processor stopped**:
* Alert if processor not running
* Critical - buffer won't drain
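These rules can be checked directly against the ``/health`` payload shown earlier. A minimal sketch using the third-party ``httpx`` client (thresholds taken from the rules above):

.. code-block:: python

    import httpx

    MAX_BUFFER_SIZE = 1000
    MAX_LAG_SECONDS = 300  # 5 minutes

    def check_health(base_url: str = "http://localhost:8000") -> list[str]:
        """Return alert messages derived from the /health payload."""
        health = httpx.get(f"{base_url}/health").json()
        alerts = []
        if health["transaction_buffer"]["size"] > MAX_BUFFER_SIZE:
            alerts.append("Buffer too large: network or main DB problem?")
        if health["transaction_buffer"]["failed"] > 0:
            alerts.append("Failed transactions require investigation")
        if health["replication_lag"]["current_lag_seconds"] > MAX_LAG_SECONDS:
            alerts.append("Replication lag above 5 minutes")
        if health["background_processor"]["status"] != "running":
            alerts.append("CRITICAL: background processor stopped")
        return alerts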
Summary
-------
Reliability is achieved through:
**Transaction Buffering**:
* Immediate buffering to Redis
* Never blocks client
* Automatic async execution
* Pre-generated IDs for immediate reference
**LSN Tracking**:
* Precise replication state knowledge
* Smart cache management
* No guesswork about consistency
**Automatic Retry**:
* Exponential backoff
* Failed transaction queue
* Human oversight for permanent failures
**Eventual Consistency**:
* Accept lag in exchange for reliability
* Smart queries merge buffered + persisted
* Read buffer tracks updates to buffered records
**Monitoring**:
* Health endpoints
* Buffer statistics
* Alerting for anomalies
**The result**: Observatory operations continue regardless of network state, data eventually becomes consistent, and we know precisely when that has happened.
Next Steps
----------
* :doc:`../deep-dive/transaction-buffering/overview` - Technical implementation details
* :doc:`../development/debugging-buffering` - Troubleshooting reliability issues