Data Model
==========
.. verified:: 2025-11-25
:reviewer: Christof Buchbender
ops-db does not store the data files themselves; it tracks metadata about where each file exists.
This is essential for managing data distributed across multiple sites and storage types.
File Hierarchy
--------------
Data files are organized in a hierarchy from individual files to transfer bundles:

.. mermaid::

   graph TB
       RDF[RawDataFile]
       RDP[RawDataPackage]
       DTP[DataTransferPackage]

       RDF -->|bundled into| RDP
       RDP -->|bundled into| DTP

       style RDF fill:#e1f5ff
       style RDP fill:#fff4e1
       style DTP fill:#ffe1f5

RawDataFile
-----------
:py:class:`~ccat_ops_db.models.RawDataFile` represents an individual data file produced
by an instrument module. It uses a UUID because files are created by telescope systems
that may be offline and therefore cannot rely on a central database to assign
identifiers. See also :doc:`/ops-db-api/docs/index`.
For complete attribute details, see :py:class:`~ccat_ops_db.models.RawDataFile`.
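
The offline argument can be illustrated with a minimal sketch. The helper
``new_file_record`` and the dictionary layout are hypothetical and only show why a
locally generated UUID needs no database round trip; they are not part of
``ccat_ops_db``.

.. code-block:: python

   import uuid


   def new_file_record(name: str) -> dict:
       """Build a record for a new file without contacting the database.

       Hypothetical helper for illustration; not part of ccat_ops_db.
       """
       # uuid4() is generated locally from random bits, so no central
       # sequence (and no network connection) is needed to obtain an ID.
       return {"id": uuid.uuid4(), "name": name}


   # Two telescope-side systems can create records independently and
   # synchronize them with the database later.
   record_a = new_file_record("scan_0001.dat")
   record_b = new_file_record("scan_0002.dat")
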
RawDataPackage
--------------
:py:class:`~ccat_ops_db.models.RawDataPackage` is a collection of related
:py:class:`~ccat_ops_db.models.RawDataFile` objects bundled into a tar archive. Each
package groups the files of one `ExecutedObsUnit` and `InstrumentModule`. Archiving
thousands of small files individually is inefficient, so packaging in this way
consolidates them into manageable units of closely related files that will be
processed together. The archive preserves directory structure and metadata, so
unpacking the data restores the original directory structure.
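
As a rough illustration of how a tar archive keeps the directory layout intact, the
following sketch packs and unpacks a small tree with Python's standard ``tarfile``
module. The function names and the example paths are assumptions made for this
sketch; the actual packaging is performed by the data-transfer package.

.. code-block:: python

   import tarfile
   import tempfile
   from pathlib import Path


   def pack(source_dir: Path, archive: Path) -> None:
       """Bundle every file under source_dir into one tar archive."""
       with tarfile.open(archive, "w") as tar:
           # arcname="." stores paths relative to source_dir, so the
           # directory layout is kept but the absolute prefix is not.
           tar.add(source_dir, arcname=".")


   def unpack(archive: Path, target_dir: Path) -> None:
       """Restore the archived directory tree under target_dir."""
       with tarfile.open(archive, "r") as tar:
           tar.extractall(target_dir)


   with tempfile.TemporaryDirectory() as tmp:
       root = Path(tmp)
       src = root / "obs_unit"
       (src / "module_a" / "scan1").mkdir(parents=True)
       (src / "module_a" / "scan1" / "data.bin").write_bytes(b"\x00\x01")

       pack(src, root / "package.tar")
       unpack(root / "package.tar", root / "restored")

       # The nested path exists again after unpacking.
       restored = root / "restored" / "module_a" / "scan1" / "data.bin"
       round_trip_ok = restored.read_bytes() == b"\x00\x01"
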
**State Meanings**:
* **WAITING** (yellow hourglass in UI): Only exists in primary location
* **TRANSFERRING** (blue circle): Part of an active DataTransferPackage
* **ARCHIVED** (green checkmark): Successfully stored in long-term archive
* **FAILED** (red cross): Transfer or archive failed
For complete attribute details, see :py:class:`~ccat_ops_db.models.RawDataPackage`.
RawDataPackageMetadata
----------------------
:py:class:`~ccat_ops_db.models.RawDataPackageMetadata` stores the additional metadata
needed to generate IVOA-compatible metadata. Storing it separately keeps the core
:py:class:`~ccat_ops_db.models.RawDataPackage` model clean while allowing flexible
metadata storage.
For complete attribute details, see :py:class:`~ccat_ops_db.models.RawDataPackageMetadata`.
DataTransferPackage
-------------------
:py:class:`~ccat_ops_db.models.DataTransferPackage` bundles multiple
:py:class:`~ccat_ops_db.models.RawDataPackage` objects for efficient network transfer.
Bundling improves transfer efficiency because many packages are moved with fewer
transfer operations. For long-distance transfers, the optimal package size lies in the
range of 10-50 TB. One :py:class:`~ccat_ops_db.models.DataTransferPackage` can have
multiple :py:class:`~ccat_ops_db.models.DataTransfer` records, one for each destination
the same bundle is sent to.
For complete attribute details, see :py:class:`~ccat_ops_db.models.DataTransferPackage`.
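
The bundling idea can be sketched as a simple greedy grouping by size. This is an
illustration only: the function ``bundle_packages``, the package names, and the 10 TB
target are assumptions for the example, not the algorithm used by data-transfer.

.. code-block:: python

   from typing import Iterable, List, Tuple

   TB = 10**12  # bytes, decimal terabyte


   def bundle_packages(
       packages: Iterable[Tuple[str, int]], target_size: int
   ) -> List[List[str]]:
       """Group (name, size_in_bytes) pairs into bundles of at most target_size.

       A package larger than target_size ends up in a bundle of its own.
       """
       bundles: List[List[str]] = []
       current: List[str] = []
       current_size = 0
       for name, size in packages:
           # Start a new bundle when adding this package would exceed the target.
           if current and current_size + size > target_size:
               bundles.append(current)
               current, current_size = [], 0
           current.append(name)
           current_size += size
       if current:
           bundles.append(current)
       return bundles


   packages = [
       ("rdp-1", 4 * TB),
       ("rdp-2", 5 * TB),
       ("rdp-3", 3 * TB),
       ("rdp-4", 9 * TB),
   ]
   bundles = bundle_packages(packages, target_size=10 * TB)
   # bundles == [["rdp-1", "rdp-2"], ["rdp-3"], ["rdp-4"]]

Four packages are moved with three transfer operations here; with realistic package
counts the reduction is correspondingly larger.
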
Physical Copy Tracking
----------------------
The Physical Copy System
^^^^^^^^^^^^^^^^^^^^^^^^
:py:class:`~ccat_ops_db.models.PhysicalCopy` tracks where each file/package physically
exists across all storage locations. Data can exist in multiple places simultaneously
(buffer, archive, staging area), enabling safe deletion and staged unpacking. It is
polymorphic with subclasses:
:py:class:`~ccat_ops_db.models.RawDataFilePhysicalCopy`,
:py:class:`~ccat_ops_db.models.RawDataPackagePhysicalCopy`, and
:py:class:`~ccat_ops_db.models.DataTransferPackagePhysicalCopy`. Each subclass has a
``full_path`` property that constructs the actual filesystem/S3 path.
For complete attribute details, see :py:class:`~ccat_ops_db.models.PhysicalCopy` and its
subclasses.
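
The polymorphic ``full_path`` idea can be sketched with plain dataclasses. All class
names, fields, and the path layout below are hypothetical; the real models are
database-backed and define their own path construction.

.. code-block:: python

   from dataclasses import dataclass
   from pathlib import PurePosixPath


   @dataclass
   class PhysicalCopySketch:
       """Common fields; names are illustrative, not the real columns."""

       location_root: str  # e.g. a mount point or bucket prefix
       relative_path: str

       @property
       def full_path(self) -> str:
           return str(PurePosixPath(self.location_root) / self.relative_path)


   @dataclass
   class PackageCopySketch(PhysicalCopySketch):
       """A subclass may build its path differently, here under a subfolder."""

       subfolder: str = "packages"

       @property
       def full_path(self) -> str:
           root = PurePosixPath(self.location_root)
           return str(root / self.subfolder / self.relative_path)


   copies = [
       PhysicalCopySketch("/buffer", "scan1/data.bin"),
       PackageCopySketch("/archive", "obs42.tar"),
   ]
   # Callers use the same property regardless of the concrete subclass.
   paths = [copy.full_path for copy in copies]
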
PhysicalCopyStatus Enum
^^^^^^^^^^^^^^^^^^^^^^^
:py:class:`~ccat_ops_db.models.PhysicalCopyStatus` tracks the lifecycle state of a
physical copy:

.. list-table:: PhysicalCopyStatus Values
   :header-rows: 1
   :widths: 30 70

   * - Status
     - Meaning
   * - PRESENT
     - File exists and is available
   * - STAGED
     - Package unpacked, original archive removed to save space
   * - DELETION_POSSIBLE
     - Eligible for cleanup (exists in other locations)
   * - DELETION_PENDING
     - Scheduled for removal
   * - DELETION_SCHEDULED
     - Cleanup task queued
   * - DELETION_IN_PROGRESS
     - Currently being deleted
   * - DELETION_FAILED
     - Deletion attempt failed
   * - DELETED
     - Successfully removed

For complete enum details, see :py:class:`~ccat_ops_db.models.PhysicalCopyStatus`.
Physical Copy Relationships
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. mermaid::

   graph TB
       RDP[RawDataPackage]
       PC1[PhysicalCopy<br/>at Location 1]
       PC2[PhysicalCopy<br/>at Location 2]
       PC3[PhysicalCopy<br/>at Location 3]
       DL1[DataLocation 1<br/>Chile Buffer]
       DL2[DataLocation 2<br/>Cologne Archive]
       DL3[DataLocation 3<br/>Processing]

       RDP -->|has| PC1
       RDP -->|has| PC2
       RDP -->|has| PC3
       PC1 -->|at| DL1
       PC2 -->|at| DL2
       PC3 -->|at| DL3

       style RDP fill:#e1f5ff
       style PC1 fill:#fff4e1
       style PC2 fill:#fff4e1
       style PC3 fill:#fff4e1

**Example**: A :py:class:`~ccat_ops_db.models.RawDataPackage` might have 3 physical copies:
* One PRESENT at Chile buffer
* One PRESENT at Cologne archive
* One STAGED at processing location (unpacked, archive removed)
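
The safe-deletion check that this tracking enables can be sketched as follows. The
rule used here (at least one other location must hold a PRESENT copy) and the names
``CopyStatus``, ``Copy``, and ``deletion_possible`` are assumptions for illustration;
the actual policy lives in the data-transfer package.

.. code-block:: python

   from enum import Enum
   from typing import List, NamedTuple


   class CopyStatus(Enum):
       # Subset of the PhysicalCopyStatus values; string values are assumed.
       PRESENT = "present"
       STAGED = "staged"
       DELETED = "deleted"


   class Copy(NamedTuple):
       location: str
       status: CopyStatus


   def deletion_possible(copies: List[Copy], location: str) -> bool:
       """Return True if the copy at `location` can go without losing data.

       Assumed rule: at least one other location must hold a PRESENT copy.
       """
       return any(
           c.location != location and c.status is CopyStatus.PRESENT
           for c in copies
       )


   copies = [
       Copy("chile-buffer", CopyStatus.PRESENT),
       Copy("cologne-archive", CopyStatus.PRESENT),
       Copy("processing", CopyStatus.STAGED),
   ]
   buffer_can_be_cleared = deletion_possible(copies, "chile-buffer")

With the three copies from the example, the buffer copy may be removed because the
Cologne archive still holds a PRESENT copy. If only the buffer copy and the staged
copy existed, the same check would refuse the deletion.
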
Status and State Management
---------------------------
Status Enum
^^^^^^^^^^^
:py:class:`~ccat_ops_db.models.Status` is used for operations (transfer, archive, staging):

.. list-table:: Status Values
   :header-rows: 1
   :widths: 30 70

   * - Status
     - Meaning
   * - PENDING
     - Queued but not started
   * - SCHEDULED
     - Assigned to worker
   * - IN_PROGRESS
     - Currently executing
   * - COMPLETED
     - Finished successfully
   * - FAILED
     - Failed and won't retry

For complete enum details, see :py:class:`~ccat_ops_db.models.Status`.
PackageState Enum
^^^^^^^^^^^^^^^^^
:py:class:`~ccat_ops_db.models.PackageState` is used for data lifecycle:

.. list-table:: PackageState Values
   :header-rows: 1
   :widths: 30 70

   * - State
     - Meaning
   * - WAITING
     - Only in primary location
   * - TRANSFERRING
     - Being transferred
   * - ARCHIVED
     - In long-term archive
   * - FAILED
     - Operation failed

For complete enum details, see :py:class:`~ccat_ops_db.models.PackageState`.
Why This Structure?
-------------------
**Separation of File and Package Levels**
Allows tracking at two granularities:
* File-level: For detailed provenance and access
* Package-level: For efficient transfer and storage management
**Physical Copy Tracking**
Enables:
* Knowing exactly where each copy exists
* Safe deletion (only delete if copies exist elsewhere)
* Staged unpacking (remove archive after extraction to save space)
**Status/State Separation**
Allows:
* Tracking operational progress (status)
* Tracking data lifecycle state (state)
* Retry logic and failure handling
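
A minimal sketch of how the two enums can stay separate while still informing each
other is shown below. The enums are re-declared locally with assumed string values,
and the mapping in ``state_after_archive_operation`` is a hypothetical rule, not the
logic implemented in ``ccat_ops_db`` or data-transfer.

.. code-block:: python

   from enum import Enum


   class Status(Enum):
       """Progress of a single operation (values are assumed)."""

       PENDING = "pending"
       SCHEDULED = "scheduled"
       IN_PROGRESS = "in_progress"
       COMPLETED = "completed"
       FAILED = "failed"


   class PackageState(Enum):
       """Lifecycle state of a package (values are assumed)."""

       WAITING = "waiting"
       TRANSFERRING = "transferring"
       ARCHIVED = "archived"
       FAILED = "failed"


   def state_after_archive_operation(
       current: PackageState, op: Status
   ) -> PackageState:
       """Map the status of an archive operation onto the package state.

       Hypothetical rule: only a finished operation changes the lifecycle.
       """
       if op is Status.COMPLETED:
           return PackageState.ARCHIVED
       if op is Status.FAILED:
           return PackageState.FAILED
       # PENDING, SCHEDULED and IN_PROGRESS leave the lifecycle state unchanged.
       return current

Because the operation status is stored separately, an operation can be retried or
replaced by a new one without rewriting the package's lifecycle history.
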
Relationship to Data Transfer
-----------------------------
The :doc:`/data-transfer/docs/index` package orchestrates moving data based on these
records. ops-db only tracks metadata; data-transfer performs the actual file operations.
For detailed data flow documentation, see the :doc:`/data-transfer/docs/index`
documentation.
Related Documentation
---------------------
* Complete API reference: :doc:`../api_reference/models`
* Location model: :doc:`location_model`
* Transfer model: :doc:`transfer_model`
* Data transfer workflows: :doc:`/data-transfer/docs/index`