tSM Management Service

The tSM Management Service is a core microservice that oversees the health of the tSM platform and manages dynamic scaling. It monitors all tSM microservices and the external systems they depend on, gathering both technical and business metrics to support real-time scaling decisions.

Core Features

Health Monitoring: Tracks the health and readiness of each microservice through standardized endpoints.
Dynamic Scaling: Adjusts service processing based on system load and external dependencies.
Business Metrics: Gathers and processes important business-related metrics to assess platform performance.
Manual Overrides: Allows manual control over default system behaviors, such as pausing processing or increasing parallel consumers.

Health Monitoring

The tSM Management Service periodically checks the health of all registered microservices and related external systems. It monitors:

Whether a tSM service is alive and able to process requests.
The health of external systems (e.g., databases, Kafka).
The number of unprocessed requests in a queue (Kafka consumer lag, message backlog).
The status of system and business metrics (e.g., number of active processes).

All this information is stored in Redis for fast access (alternatively, it can be stored in a database).
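
The checks listed above can be sketched as a single polling pass that runs a set of probes and stores the result. This is a minimal illustration, not the actual implementation: the HealthSnapshot fields, the probe callables, and the "tsm:health:<service>" key format are all assumptions, and a plain dictionary stands in for Redis.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict


@dataclass
class HealthSnapshot:
    """Result of one health-check pass (field names are hypothetical)."""
    service: str
    alive: bool
    external_systems: Dict[str, bool]  # e.g. {"database": True, "kafka": False}
    consumer_lag: int                  # unprocessed messages in the queue
    checked_at_ms: int


def _safe(probe: Callable[[], bool]) -> bool:
    """Run a probe; a probe that raises is treated as unhealthy."""
    try:
        return bool(probe())
    except Exception:
        return False


def collect_snapshot(
    service: str,
    probe_alive: Callable[[], bool],
    probe_externals: Dict[str, Callable[[], bool]],
    probe_lag: Callable[[], int],
) -> HealthSnapshot:
    """Run every probe once and bundle the results."""
    return HealthSnapshot(
        service=service,
        alive=_safe(probe_alive),
        external_systems={name: _safe(p) for name, p in probe_externals.items()},
        consumer_lag=probe_lag(),
        checked_at_ms=int(time.time() * 1000),
    )


def store_snapshot(store: Dict[str, str], snapshot: HealthSnapshot) -> None:
    """Persist the snapshot as JSON; `store` stands in for Redis."""
    store[f"tsm:health:{snapshot.service}"] = json.dumps(asdict(snapshot))
```

In a real deployment the probes would call the services' health endpoints and query Kafka for consumer lag; they are injected here so the sketch stays self-contained.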

Dynamic Scaling and Service Suspension

Based on the gathered metrics and health checks, the tSM Management Service dynamically adjusts service behavior:

Dynamic Scaling: The number of parallel consumers a service runs (e.g., Kafka consumers) can be scaled up or down based on system load.
Service Suspension: If an external downstream system is unavailable or overloaded, the tSM Management Service can pause the processing of requests (e.g., stop processing messages in a Kafka queue if the downstream system is down).

These scaling and suspension rules are defined declaratively in the tSM configuration, but custom logic can be added based on business requirements.
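
As an illustration of how such a rule might be evaluated, the sketch below combines suspension and scaling in one decision function. The rule fields (min_consumers, max_consumers, lag_per_consumer) and the sizing formula are assumptions made for this example; the actual rule format is defined by the tSM configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical declarative rule for one consumer group."""
    min_consumers: int
    max_consumers: int
    lag_per_consumer: int  # backlog one consumer is expected to handle


@dataclass(frozen=True)
class Decision:
    paused: bool
    consumers: int


def decide(rule: ScalingRule, consumer_lag: int, downstream_healthy: bool) -> Decision:
    """Derive the desired processing state from the current metrics."""
    if not downstream_healthy:
        # Service suspension: stop consuming while the downstream system is down.
        return Decision(paused=True, consumers=0)
    # Ceiling division: enough consumers to cover the current backlog.
    wanted = -(-consumer_lag // rule.lag_per_consumer)
    # Clamp the result to the configured bounds.
    consumers = max(rule.min_consumers, min(rule.max_consumers, wanted))
    return Decision(paused=False, consumers=consumers)
```

Keeping the decision a pure function of the rule and the latest metrics makes it straightforward to test and to re-evaluate on every monitoring cycle.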

Manual Overrides via tSM Management Service

In addition to automated scaling, the tSM Management Service allows operators to manually override system behaviors. Examples of overrides include:

Manually Pause Processing: Temporarily stop processing in specific services or queues.
Increase Parallel Consumers: Manually increase the number of Kafka consumers to speed up processing under high load.

These overrides can be useful during maintenance windows or in response to urgent business needs.
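
One way to model an override is as an optional layer applied on top of the automatically computed state. The sketch below assumes that an operator-set value always takes precedence over the automatic one and that unset fields fall through; both the precedence rule and the type names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProcessingState:
    paused: bool
    consumers: int


@dataclass(frozen=True)
class ManualOverride:
    """Operator-set values; None means 'keep the automatic value'."""
    paused: Optional[bool] = None
    consumers: Optional[int] = None


def apply_override(
    automatic: ProcessingState, override: Optional[ManualOverride]
) -> ProcessingState:
    """Return the effective state after applying an optional manual override."""
    if override is None:
        return automatic
    return ProcessingState(
        paused=automatic.paused if override.paused is None else override.paused,
        consumers=(
            automatic.consumers if override.consumers is None else override.consumers
        ),
    )
```

For example, an operator could raise the consumer count during a load peak with ManualOverride(consumers=12) while leaving the paused flag under automatic control.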

Microservice State Management

Originally, microservices were listed in a static "known-services" configuration, and their state was monitored via standard actuator endpoints. In the current approach, microservices register themselves dynamically with the tSM Management Service on startup.

Self-Registration of Microservices

Upon startup, each microservice registers itself with the tSM Management Service by calling a REST endpoint and providing the following information:

Microservice name and description.
URL at which it is running (determining this can be a challenge in Kubernetes; the options are still to be explored).
Health and metrics endpoints.

The microservice keeps retrying the registration until it succeeds (e.g., when the entire cluster is restarting and the tSM Management Service is not yet available).
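
The registration and retry behaviour might look roughly like the following sketch. The payload field names, the actuator-style endpoint paths, and the exponential backoff are assumptions; the `send` callable is a hypothetical stand-in for the actual REST call so that the example stays self-contained.

```python
import time
from typing import Callable, Dict


def build_registration(name: str, description: str, base_url: str) -> Dict[str, str]:
    """Assemble the registration payload (field names are hypothetical)."""
    return {
        "name": name,
        "description": description,
        "url": base_url,
        # Assumed paths, following the usual actuator layout.
        "healthEndpoint": f"{base_url}/actuator/health",
        "metricsEndpoint": f"{base_url}/actuator/metrics",
    }


def register_until_success(
    send: Callable[[Dict[str, str]], None],
    payload: Dict[str, str],
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it stops raising; return the number of attempts."""
    attempts = 0
    delay = initial_delay_s
    while True:
        attempts += 1
        try:
            send(payload)
            return attempts
        except OSError:
            # Connection errors are subclasses of OSError: wait, then retry
            # with a doubled delay, capped at max_delay_s.
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

Capping the delay keeps a service that starts before the tSM Management Service from waiting excessively long once the management service becomes available.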

Service State in Redis

The state of each microservice is stored in Redis (or alternatively in a database). The following information is tracked:

Microservice State:
- Overall status (alive, ready).
- Kafka queues (e.g., consumer lag, queue status).
Instance State:
- Instance-specific status (e.g., up, down).
- Per-instance queue metrics (e.g., message backlog, processing status).
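
The two levels of state can be pictured as a per-instance record that is rolled up into a per-microservice summary. The sketch below is only one possible shape: the type names, the rule that a microservice is alive when at least one instance is up and ready when all instances are up, and the summing of per-instance backlog are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceState:
    instance_id: str
    status: str  # "up" or "down"
    queue_backlog: Dict[str, int] = field(default_factory=dict)  # queue -> backlog


@dataclass
class MicroserviceState:
    name: str
    alive: bool
    ready: bool
    queue_lag: Dict[str, int]  # queue -> total backlog across instances


def summarize(name: str, instances: List[InstanceState]) -> MicroserviceState:
    """Roll instance-level state up into a microservice-level summary."""
    up = [inst for inst in instances if inst.status == "up"]
    lag: Dict[str, int] = {}
    for inst in instances:
        for queue, backlog in inst.queue_backlog.items():
            lag[queue] = lag.get(queue, 0) + backlog
    return MicroserviceState(
        name=name,
        alive=len(up) > 0,                                    # assumed rule
        ready=len(instances) > 0 and len(up) == len(instances),  # assumed rule
        queue_lag=lag,
    )
```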

Monitoring API

The tSM Management Service exposes an API that provides real-time status information for monitoring and management purposes. This data is also visualized in a UI for easier monitoring by operations teams.

Business Metrics Collection

The tSM Management Service is responsible for gathering and processing key business metrics that provide insights into the platform's operational performance.

Business Metric Definition

Each business metric is defined by the following properties:

id: A unique identifier for the metric record.
eventType: The type of event, e.g., "Order.Ackn" for the time from order creation to acknowledgment of processing.
timestamp: The exact time the event occurred (in milliseconds).
duration: The time taken for processing in milliseconds.
metrics: Additional time-based metrics relevant to the event. These typically include a breakdown of processing times, e.g., {"orderCreation": 50, "processStart": 154, "validation": 46}.
ownerId / ownerType: Identifies the "owner" of the metric, e.g., "Order" and its UUID.
correlationId: Identifies the transaction for correlation purposes (if relevant in the context).
userId: Identifies the user associated with the event (if relevant).

Example

A business metric might capture the time it takes to process an order from its creation to completion. The metric includes both the total time and a breakdown of how long each step took (e.g., order creation, validation, process start).
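
A concrete record for this case could look like the sketch below, which uses the property names from the definition above. It assumes that the timestamp is expressed as epoch milliseconds and that the total duration equals the sum of the step timings; neither is stated by the definition, and the helper name is hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class BusinessMetric:
    id: str
    eventType: str
    timestamp: int           # assumed: epoch time in milliseconds
    duration: int            # processing time in milliseconds
    metrics: Dict[str, int]  # per-step breakdown in milliseconds
    ownerId: str
    ownerType: str
    correlationId: Optional[str] = None
    userId: Optional[str] = None


def order_ackn_metric(
    order_id: str, timestamp_ms: int, steps: Dict[str, int]
) -> BusinessMetric:
    """Build an "Order.Ackn" metric from a breakdown of step timings."""
    return BusinessMetric(
        id=str(uuid.uuid4()),
        eventType="Order.Ackn",
        timestamp=timestamp_ms,
        duration=sum(steps.values()),  # assumed: total equals the sum of steps
        metrics=dict(steps),
        ownerId=order_id,
        ownerType="Order",
    )


metric = order_ackn_metric(
    order_id=str(uuid.uuid4()),
    timestamp_ms=1_700_000_000_000,  # illustrative value
    steps={"orderCreation": 50, "processStart": 154, "validation": 46},
)
payload = json.dumps(asdict(metric))  # JSON form, e.g. for sending or storing
```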

Conclusion

The tSM Management Service provides critical infrastructure for monitoring, scaling, and managing the tSM platform. It dynamically adjusts processing based on system and business metrics, allowing the platform to scale efficiently and respond to failures in external systems. The service also facilitates the manual control of system behaviors and ensures that the platform's operational and business metrics are continuously collected and analyzed.

For more details on how to configure and use the tSM Management Service, refer to the tSM Configuration Guide.