Skip to main content

Data Aging

1. Types of Data

  1. Static / Catalog Data

    • Small tables (e.g., reference data, configuration settings).
    • Typically, have a “valid-from” / “valid-to” column to capture effective dates but are not subject to archiving or frequent purging.
    • Volume is usually small and stable.
  2. Audit / Logs

    • Time-based data (e.g., process history, system logs, journaling).
    • Potentially large volume.
    • Often can be moved outside the primary transactional database (for instance, to Elasticsearch or a data lake).
    • Retention Policies: Deleted after a time period, or must remain available as long as the “owning” entity is active. Once that entity is fully archived, the audit data may be removed.
  3. Time-Based Entities

    • Examples: Orders, Tickets, Service Requests.
    • Have an expected lifecycle: created → processed → (possibly long-running) → closed/completed → archived.
    • Primarily partitioned by creation date or expected “end-of-life” date.
  4. Master Data Entities

    • Examples: Customer, Inventory Items, Products.
    • May remain in the system for a very long time or indefinitely.
    • Often do not have a predictable “end-of-life,” so they may not be partitioned by date.
    • For especially large master data sets, consider hash-based partitioning or range partitioning by ID.
  5. Child Entities

    • Examples: OrderLine (child of Order), TicketRelatedEntity (child of Ticket).
    • Contain a foreign key to their parent entity and inherit the lifecycle of the parent.
    • If the parent moves to HOT, WARM, or ARCHIVED, the child must move in sync.
    • When the parent entity is queried, the system looks up the child entities in the same partition or lifecycle status.

2. Lifecycle Status

We distinguish four main data lifecycle statuses. These statuses determine both how data is stored (e.g., which partitions) and how the application handles them (e.g., default queries):

  1. LIVE

    • Currently active data.
    • Requires fast read/write access.
    • High transaction rate (updates and inserts).
  2. HOT

    • Still active, but less frequently updated (e.g., a ticket that remains open with few updates).
    • Access may be slightly slower than LIVE, but still needs relatively high performance.
  3. WARM

    • Historical data that is rarely accessed.
    • Mostly read-only operations.
    • Slower/cheaper storage can be used.
    • Records have a REMOVAL_TIME attribute, which tells the archiving system when the record should move to ARCHIVED state.
  4. ARCHIVED

    • Historical data retained mostly for compliance or very infrequent lookup.
    • Normally “detached” from the main database, or kept in a form not immediately searchable.
    • If needed, data can be reattached or restored back into an accessible partition (WARM) temporarily.
    • By default, not included in primary (public) queries.

Lifecycle Transitions

  1. LIVE → HOT

    • Occurs when a record remains active beyond an expected threshold or is flagged as “low-update frequency.”
    • The partition or status label changes from something like LIVE_202401 to HOT_202401Q1 (for example).
  2. HOT → WARM

    • When an entity’s business process completes, and the data is no longer expected to be updated.
    • A REMOVAL_TIME is set.
  3. WARM → ARCHIVED

    • Reached at or after the specified REMOVAL_TIME.
    • The data is moved (or flagged) to ARCHIVED, possibly detaching its partition.
  4. ARCHIVED → WARM (Reattachment)

    • Data is attached again for reading or maintenance.
    • Usually a temporary operation.

3. PostgreSQL Partitioning

3.1 Partitioning by Time

Recommended for time-based data (e.g., Orders, Tickets, or other entities that have creation dates and predictable aging).

  • Partition Granularity:

    • Can be hourly, daily, weekly, monthly, quarterly, or yearly.
    • Depends on data volume and acceptable partition sizes (commonly ~5–10 GB each).
  • Naming Convention Examples:

    • Hourly: LIVE_2024010105
    • Daily: LIVE_20240101
    • Weekly: LIVE_202401W1
    • Monthly: LIVE_202401
    • Quarterly: LIVE_2024Q1
    • Yearly: LIVE_2024, HOT_2024, WARM_2024

Data Movement:

  • When time-based entities cross a threshold (e.g., older than X months) but are still active, move them to a HOT_<period> partition.
  • Once an entity is finalized, a REMOVAL_TIME is assigned and the status changes to WARM. Data moves to WARM_<year>, WARM_<month>, etc.
  • After the REMOVAL_TIME passes, the status becomes ARCHIVED_<yearmonth> and the partition can be detached to reduce overhead.

3.2 Partitioning by ID

Recommended for large master data sets with no clear time-based lifecycle.

  • Hash Partitioning
    • Partition by a hash of the primary key.
    • Spreads inserts/updates evenly among partitions.

3.3 Tips for PostgreSQL Partitioning

  1. Keep Partitions Manageable in Size

    • Aim for partitions that can be quickly maintained, detached, or reattached (~10 GB each, if possible).
    • WARM partitions can be larger, as they are mostly read-only.
  2. Indexing

    • Each partition has its own local indexes.
    • PostgreSQL does not support truly global indexes across all partitions.
    • If you have many partitions, be mindful that too many can degrade performance. Keep the number of active partitions manageable.
  3. Maintenance

    • Set up autovacuum parameters according to DB size and the number of partitions.
    • Monitor partition utilization.
    • WARM and detached (ARCHIVED) partitions typically require minimal maintenance.
  4. Query Performance

    • Queries should include the partition key in the WHERE clause (time range or ID range) for partition pruning.
    • This helps PostgreSQL skip scanning irrelevant partitions.

4. Elasticsearch Index Tiers

For data needing full-text search or flexible queries, Elasticsearch is often used. It supports a tiered structure for hot/warm/cold/frozen data:

  1. Content Tier (LIVE)

    • Default for active data.
    • Fast SSD storage.
    • High-performance indexing/search.
  2. Hot Tier (HOT)

    • Time-series data, still frequently queried or updated.
    • Also on SSD.
  3. Warm Tier (WARM)

    • Historical, infrequent access.
    • Slower or cheaper storage.
  4. Cold / Frozen Tiers

    • Rarely accessed data.
    • Often stored in snapshot repositories (e.g., object storage).
    • Reattached automatically on query, if needed.

Index Lifecycle Management (ILM)

  • Allows automation of index movement from hot → warm → cold/frozen based on age or usage patterns.

5. Queries and Application Behavior

5.1 Default Queries

  • By default, public APIs consult LIVE, HOT, and WARM data only for faster performance.
  • ARCHIVED data is excluded from normal searches.

5.2 Searching Historical Data

  • To include historical data, a client or service must explicitly request it.
  • Partition Reattachment is needed to access ARCHIVED_* partitions. Currently, this reattachment is done manually (e.g., an administrative operation). In the future, this may be automated.
  • In Elasticsearch, one may query multiple index patterns or use aliases (e.g., myapp-logs-*, including older indices).

5.3 Using X-Data-Partition Header

  • Microservices in the tSM ecosystem and Kafka events can utilize an X-Data-Partition HTTP header to optimize queries.
  • This header can contain a pattern like LIVE*, HOT*, or multiple values such as LIVE*,HOT*.
    • Example: a service wanting to query only active processes might set X-Data-Partition: LIVE*,HOT*.
  • When querying child entities for a parent entity, the exact partition is provided. For instance, if an Order is in LIVE_202401, its OrderLine records must be looked up in LIVE_202401 as well.

5.3.1 Multiple partitioned tables endpoint

  • When calling endpoint that query multiple partitioned tables, you may use the following HTTP header syntax, especially, when queried entities use different partitioning schema or belongs to different partitions.
  • You may append the entity name to the partitioning header name. Since the HTTP header names are not case-sensitive (RFC 7230), there is no need to take care about case sensitivity.
  • Multiple HTTP headers may be sent.
  • For example, X-Data-Partition-entity1: LIVE, X-Data-Partition-entity2: HOT will apply partition queries
    • "entity1 in LIVE"
    • "entity2 in HOT"
  • The "entity name" MUST be set correctly on entity partition filter (FilterDef.name) - you must follow the endpoint documentation.

5.4 Restoring Archived Data

  • If an archived entity must be updated (e.g., a late correction on a closed order):
    1. Reattach or reindex the relevant ARCHIVED_* partition.
    2. Move the entity into a writable state (WARM).
    3. Make the update.
    4. Potentially archive it again afterward.

6. Practical Examples

Example A: Time-Based Ticketing System

This example shows how to manage ticket data over time using a combination of LIVE, HOT, WARM, and ARCHIVE partitions, along with sample ticket records.


Partition Strategy

  • LIVE

    • Two partitions for the current month and the previous month.
    • Example partition names:
      • LIVE_202501 (current month)
      • LIVE_202412 (previous month)
  • HOT

    • A single partition for all tickets that are active but updated less frequently.
    • Example partition name:
      • HOT
  • WARM

    • One partition per year, retaining up to 10 years back.
    • Example partition names:
      • WARM_2022
      • WARM_2023
      • WARM_2024
  • ARCHIVE

    • One partition per month for archived data, retaining up to 5 additional months.
    • Example partition names:
      • ARCHIVE_202310
      • ARCHIVE_202311
      • ARCHIVE_202312
      • ARCHIVE_202401
      • ARCHIVE_202402

Note: Only tickets in LIVE or HOT status are considered to have an active process running. Once a ticket moves to WARM or ARCHIVE, its process is considered finished or closed.


Sample Tickets Across Lifecycle Stages

Below are examples of how ticket records might appear in each partition phase. Attributes shown include:

  • ID (UUID)
  • KEY (human-readable identifier, e.g., INC-23)
  • NAME (title, e.g., Server outage)
  • LIFECYCLE_STATUS (e.g., LIVE, HOT, WARM, ARCHIVE)
  • REMOVAL_TIME (if assigned, indicates when the ticket should move to ARCHIVE)
  1. LIVE (Current Month)

    ID:               879b6b06-1afa-4ff5-bfcb-c3d630f7b7af
    KEY: INC-23
    NAME: Server outage
    LIFECYCLE_STATUS: LIVE_202501
    REMOVAL_TIME: (not set)
    • This ticket is actively updated and queried in real time.
  2. LIVE (Previous Month)

    ID:               454f82c2-c920-415f-aa31-62f5f996a02f
    KEY: INC-24
    NAME: Database performance issue
    LIFECYCLE_STATUS: LIVE_202412
    REMOVAL_TIME: (not set)
    • Still in LIVE, but created in the prior month’s partition.
  3. HOT

    ID:               2e365532-7541-41b7-b8d3-2a840081c6af
    KEY: INC-25
    NAME: Network upgrade
    LIFECYCLE_STATUS: HOT_2025
    REMOVAL_TIME: (not set)
    • This ticket is still active, but updates are infrequent.
    • Remains accessible with relatively high performance.
    • Active process continues to run because it’s HOT.
  4. WARM

    ID:               be87b77d-5366-4b7f-bd20-ce3cf53df43f
    KEY: INC-26
    NAME: Hardware replacement
    LIFECYCLE_STATUS: WARM_2024
    REMOVAL_TIME: 2025-09-01
    • Ticket is considered closed.
    • Rarely accessed (read-only).
    • After REMOVAL_TIME, this ticket will move to ARCHIVE.
  5. ARCHIVE

    ID:               be87b77d-5366-4b7f-bd20-ce3cf53df43f
    KEY: INC-26
    NAME: Hardware replacement
    LIFECYCLE_STATUS: ARCHIVE_202509
    REMOVAL_TIME: 2025-09-01
    • Ticket is archived and partition is typically detached.
    • Not included in default queries.
    • If someone needs to view or update it (extremely rare), an admin would reattach this partition to make it accessible.

Key Point:
Only LIVE and HOT tickets are considered to have “active” processes running. Once a ticket transitions to WARM or ARCHIVE, its process is deemed finished or closed, and the data is no longer part of standard real-time queries.

Example B: Master Data (Customers)

  • Partition by ID Hash.
  • Lifecycle status is typically LIVE (unless a business event triggers archiving).
  • If archiving occurs:
    • The record is flagged as INACTIVE or ARCHIVED.
    • It moves or replicates to a dedicated partition.

Example C: Child Entities (Order and OrderLine)

  • Order is LIVE_202405.
  • All associated OrderLine records are also in LIVE_202405.
  • When Order transitions to HOT_202407, every OrderLine transitions to HOT_202407 as well.

Example D: Audit Logs

  • Stored primarily in Elasticsearch with ILM:
    • Hot for 7 days
    • Warm for 30 days
    • Cold for 180 days
    • Frozen or deleted afterward
  • If compliance requires, logs can also be mirrored in PostgreSQL partitions and detached monthly or quarterly.

7. Additional Tips and Considerations

  1. Simplicity vs. Granularity

    • More partitions → finer granularity for archiving/reattaching but more overhead.
    • Fewer partitions → easier administration but less flexibility.
  2. Business Requirements

    • Align lifecycle statuses, retention periods, and partitioning with legal/compliance mandates.
    • Avoid deleting or archiving data prematurely.
  3. Automation

    • Automate partition creation, detachment, and lifecycle transitions.
    • In Elasticsearch, rely on ILM policies.
    • In PostgreSQL, consider scheduling maintenance jobs.
  4. Monitoring

    • Track disk usage, partition sizes, and query performance.
    • Adjust partitioning or rebalancing as data grows/shrinks.
  5. Backup and Recovery

    • Ensure backups cover all partitions (including detached ones).
    • Test restoring from archives or snapshots periodically.

Summary

This document outlines a data lifecycle approach that categorizes data into LIVE, HOT, WARM, and ARCHIVED statuses. By combining a well-defined lifecycle model with carefully chosen PostgreSQL partitioning strategies (time-based or ID-based) and Elasticsearch ILM (hot/warm/cold/frozen tiers), organizations can:

  • Efficiently manage large volumes of active and historical data.
  • Meet compliance and retention requirements.
  • Maintain good query performance for day-to-day operations.
  • Still retain access to historical (archived) data when truly needed.