Process and mailing monitoring via Prometheus

Monitoring processes and mailings via Prometheus lets you track performance, processing stability, and message queue status.

Collecting metrics allows you to:

track mailing processing speed and individual lead processing stages;
identify delays in scenario execution and data processing;
monitor message publishing status in RabbitMQ;
detect queue growth, re-sends, and lost events;
analyze load on campaign, procworkflow, and proctrigger processes;
build dashboards and configure alerts in Prometheus and Grafana;
find bottlenecks during performance degradation or mailing processing errors.

Which processes support metrics

Currently, metrics are supported by the campaign, procworkflow, and proctrigger processes. Mailing metrics can be delivered in two ways:

via Pushgateway when campaign runs separately;
via the pull model inside procworkflow and proctrigger, if campaigns are executed within these processes.

Monitoring type	Processes	Collection method
Pull	`procworkflow`, `proctrigger`	Prometheus scrapes `/metrics`
Push	`campaign`	Metrics are sent to Pushgateway

Configuring pull metrics

The pull model is used to collect metrics from the procworkflow and proctrigger processes. In this mode, Prometheus periodically queries the HTTP endpoint /metrics exposed by the corresponding process.

Metrics from procworkflow and proctrigger also include mailing metrics if mailings are executed within these processes.

Pull metrics configuration example

Example platform configuration for enabling pull metrics for both procworkflow and proctrigger simultaneously:

{
  "PROMETHEUS_METRICS": {
    "ENABLE": true,
    "PROCESSES": [
      "procworkflow",
      "proctrigger"
    ]
  },
  "WF_METRIC_HOST": "0.0.0.0",
  "WF_METRIC_PORT": 8911,
  "PROC_TRIGGER_METRIC_HOST": "0.0.0.0",
  "PROC_TRIGGER_METRIC_PORT": 8912
}

If the PROCESSES array is empty ([]), metrics are automatically enabled for all supported processes.
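The selection rule above can be sketched in Python. This is an illustration of the documented behavior, not the platform's code: the helper name `enabled_metric_processes` is hypothetical, and the handling of a missing `ENABLE` or `PROCESSES` key is an assumption.

```python
# Processes that the documentation lists as supporting metrics.
SUPPORTED_PROCESSES = ("campaign", "procworkflow", "proctrigger")


def enabled_metric_processes(config: dict) -> list:
    """Return the processes for which metrics would be enabled.

    Hypothetical helper: it only mirrors the rule described in the text.
    """
    section = config.get("PROMETHEUS_METRICS", {})
    if not section.get("ENABLE", False):  # assumption: a missing key means disabled
        return []
    requested = section.get("PROCESSES", [])
    if not requested:
        # Documented rule: an empty array enables all supported processes.
        return list(SUPPORTED_PROCESSES)
    return [name for name in requested if name in SUPPORTED_PROCESSES]
```

For the configuration shown above, the helper returns `["procworkflow", "proctrigger"]`.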

Add metric scrape jobs to the Prometheus configuration:

scrape_configs:
  - job_name: 'procworkflow'
    metrics_path: /metrics
    static_configs:
      - targets:
          - '10.200.5.25:8911'

  - job_name: 'proctrigger'
    metrics_path: /metrics
    static_configs:
      - targets:
          - '10.200.5.25:8912'

After starting the processes, verify that the metric services are listening on the specified ports:

netstat -tlpn | grep 8911
netstat -tlpn | grep 8912

Example output:

tcp6       0      0 :::8911                 :::*                    LISTEN
tcp6       0      0 :::8912                 :::*                    LISTEN

Check the availability of the /metrics endpoint:

curl http://10.200.5.25:8911/metrics
curl http://10.200.5.25:8912/metrics

If the services are configured correctly, the endpoint will return a list of Prometheus metrics for the procworkflow and proctrigger processes.
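If you want to script this check rather than read the raw curl output, the response can be reduced to a set of metric names. The sketch below uses only the Python standard library and assumes the endpoint returns the standard Prometheus text exposition format; the function names are illustrative.

```python
import urllib.request


def fetch_metrics_text(url: str, timeout: float = 5.0) -> str:
    """Download the raw /metrics payload from a process."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def metric_names(exposition_text: str) -> set:
    """Collect sample names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        # A sample line looks like "name{labels} value" or "name value".
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names
```

For example, `metric_names(fetch_metrics_text("http://10.200.5.25:8911/metrics"))` returns the names exposed by procworkflow. An empty set suggests that the service responds but publishes no metrics yet.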

Pull metric parameters

Parameter	Description
`PROMETHEUS_METRICS.ENABLE`	Global enable of Prometheus metrics
`PROMETHEUS_METRICS.PROCESSES`	List of processes for which metric collection is enabled
`WF_METRIC_HOST`	Address on which `procworkflow` publishes metrics
`WF_METRIC_PORT`	Port of the `procworkflow` metrics service
`PROC_TRIGGER_METRIC_HOST`	Address on which `proctrigger` publishes metrics
`PROC_TRIGGER_METRIC_PORT`	Port of the `proctrigger` metrics service

When configuring pull metrics, choose the WF_METRIC_HOST and PROC_TRIGGER_METRIC_HOST values based on where Prometheus runs relative to the platform server.

Value	When to use
`127.0.0.1`	Prometheus is installed on the same server as the platform
`0.0.0.0`	Prometheus is installed on a separate server

The WF_METRIC_HOST and PROC_TRIGGER_METRIC_HOST parameters define the internal address on which the processes will accept requests to the /metrics endpoint.

The WF_METRIC_PORT and PROC_TRIGGER_METRIC_PORT parameters set the metric service ports. You can use ports in the range 1024 to 9999, excluding ports occupied by other services.

Caution

If Prometheus is located on a separate server, specify 0.0.0.0 in the WF_METRIC_HOST and PROC_TRIGGER_METRIC_HOST parameters.

Do not use the same port for WF_METRIC_PORT and PROC_TRIGGER_METRIC_PORT. Each process must serve metrics on a separate port.

If the metrics service does not start after changing WF_METRIC_HOST, WF_METRIC_PORT, PROC_TRIGGER_METRIC_HOST, or PROC_TRIGGER_METRIC_PORT, verify that the specified port is free and available on the server.
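The host and port rules above can be checked before restarting the processes. The following Python sketch is a hypothetical pre-flight helper, not part of the platform. It checks the 1024 to 9999 range, the requirement for distinct ports, and the host value when Prometheus runs on a separate server.

```python
def check_pull_settings(config: dict, prometheus_is_remote: bool) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_ports = {}
    pairs = (
        ("WF_METRIC_HOST", "WF_METRIC_PORT"),
        ("PROC_TRIGGER_METRIC_HOST", "PROC_TRIGGER_METRIC_PORT"),
    )
    for host_key, port_key in pairs:
        port = config.get(port_key)
        if not isinstance(port, int) or not 1024 <= port <= 9999:
            problems.append(f"{port_key} must be an integer in 1024-9999, got {port!r}")
        elif port in seen_ports:
            problems.append(f"{port_key} reuses port {port} from {seen_ports[port]}")
        else:
            seen_ports[port] = port_key
        # 127.0.0.1 accepts connections only from the same server.
        if prometheus_is_remote and config.get(host_key) == "127.0.0.1":
            problems.append(f"{host_key} is 127.0.0.1 but Prometheus is on another server")
    return problems
```

The helper does not test whether a port is already occupied. Use `netstat -tlpn` on the platform server for that.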

Configuring push metrics for `campaign`

The push model is used to send metrics from the campaign process to the Prometheus Pushgateway.

In this mode, the campaign process pushes its metrics to the Pushgateway, and Prometheus then scrapes them from the gateway.

Info

The push model is supported only for the campaign process. Metrics do not appear in Pushgateway until a mailing has been started at least once.

Before configuration, you must deploy and start the Prometheus Pushgateway.

The platform does not start Pushgateway automatically. In the ADDRESS parameter, you must specify the address of an already running Pushgateway.

Pushgateway launch example via systemd

Example unit file:

[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target

[Service]
User=pushgateway
Group=pushgateway
Type=simple
ExecStart=/usr/local/bin/pushgateway

[Install]
WantedBy=multi-user.target

After starting Pushgateway, configure metric delivery in the platform configuration:

{
  "PROMETHEUS_METRICS_PUSH_GATEWAY": {
    "ENABLE": true,
    "ADDRESS": "10.200.5.20:9091",
    "PERIOD_SEC": 5
  }
}

Parameter description:

Parameter	Description
`ENABLE`	Enables sending metrics to Pushgateway
`ADDRESS`	Address and port of the already running Pushgateway
`PERIOD_SEC`	Metric send interval in seconds

Add Pushgateway to the Prometheus configuration:

scrape_configs:
  - job_name: 'pushgateway'
    static_configs:
      - targets:
          - '10.200.5.20:9091'

Check Pushgateway metrics availability:

curl http://10.200.5.20:9091/metrics

Metrics from the campaign process appear in the Pushgateway only after a mailing has been launched.

Grouping metrics by campaign ID

The CAMPAIGN_ID_PROMETHEUS_GROUPING_ENABLE parameter controls whether push metrics are grouped by mailing (campaign) ID.

Example configuration:

{
  "CAMPAIGN_ID_PROMETHEUS_GROUPING_ENABLE": true
}

Value	Behavior
`true`	Metrics are grouped by mailing ID
`false`	All metrics are sent to a single group

By default, the parameter is enabled (true).

When grouping is enabled, separate metric groups are created in the Pushgateway for each mailing. This simplifies:

analyzing performance of individual mailings;
building Grafana dashboards;
configuring alerting rules;
finding issues in a specific mailing.

When grouping is disabled, metrics from all mailings are aggregated into a single Pushgateway group.
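Pushgateway identifies a metric group by the grouping key in the push URL, which has the form `/metrics/job/<job>{/<label>/<value>}`. The Python sketch below shows how the two modes could map to such URLs. The job name `campaign` and the label name `campaign_id` are assumptions for illustration; this document does not state which names the platform actually uses.

```python
from urllib.parse import quote


def pushgateway_group_url(address, job, campaign_id=None, grouping_enabled=True):
    """Build the Pushgateway URL that addresses one metric group.

    The "campaign_id" label name is hypothetical; the IDs are assumed
    to be simple values without slashes.
    """
    url = f"http://{address}/metrics/job/{quote(job, safe='')}"
    if grouping_enabled and campaign_id is not None:
        # One extra grouping label gives each mailing its own group.
        url += f"/campaign_id/{quote(str(campaign_id), safe='')}"
    return url
```

With grouping enabled, mailings 41 and 42 are addressed by different URLs and therefore stored as separate groups. With grouping disabled, both share the single group `/metrics/job/campaign`.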

RabbitMQ publisher metrics

RabbitMQ publisher business metrics are available for the procworkflow and proctrigger processes.

These metrics are used to monitor message publishing, delivery confirmation time, and the number of retry attempts.

Configuring histogram bucket values

For the total_duration and confirm_duration metrics, you can configure custom histogram bucket boundaries. As the MSEC_BUCKETS key name suggests, the values are upper bounds in milliseconds.

Example configuration:

{
  "PROMETHEUS_METRICS_RMQ_PUBLISHER": {
    "MSEC_BUCKETS": {
      "total_duration": [10, 25, 50, 75.5],
      "confirm_duration": [10, 25, 50, 75.5]
    }
  }
}
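To see what these boundaries mean, recall that Prometheus histogram buckets are cumulative: each bucket counts all observations less than or equal to its upper bound, and an implicit `+Inf` bucket counts everything. The Python sketch below reproduces that counting for a list of sample durations. It illustrates histogram semantics only and is not the platform's implementation.

```python
import math


def cumulative_bucket_counts(observations_ms, upper_bounds_ms):
    """Count observations per cumulative bucket, as a Prometheus histogram does."""
    bounds = sorted(upper_bounds_ms) + [math.inf]  # +Inf bucket is always present
    return {
        bound: sum(1 for value in observations_ms if value <= bound)
        for bound in bounds
    }


# Six publish durations checked against the boundaries from the example config.
counts = cumulative_bucket_counts([4, 12, 30, 60, 80, 200], [10, 25, 50, 75.5])
```

Here the two durations above 75.5 ms are counted only in the `+Inf` bucket, so the histogram cannot distinguish 80 ms from 200 ms. Choose boundaries that bracket the latencies you expect to see.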

Available metrics

Metric	Description
`total_duration`	Total message publishing duration
`confirm_duration`	Publishing confirmation duration
`retry_counts`	Number of retry attempts
`retried_count`	Number of messages sent with at least one retry
`lost_failed_events_count`	Number of messages discarded after exceeding the retry limit

Metric interpretation

When analyzing RabbitMQ publisher metrics, pay attention to the following changes:

Metric	Possible cause
Increase in `confirm_duration`	RabbitMQ slowdown or network issues
Increase in `retry_counts`	Unstable message delivery
Increase in `retried_count`	Elevated number of publishing errors
Non-zero `lost_failed_events_count`	Event loss after exceeding the retry limit

Mailing metrics

Mailing metrics are used to monitor lead processing performance, individual stage execution time, and error counts during mailing execution.

The tables below list the main metrics. Actual names in Prometheus may contain additional prefixes, suffixes, and labels depending on the platform configuration.

Lag metrics

Metric	Description
`cursor_lag_milliseconds`	Mailing processing lag relative to the current queue state

An increase in cursor_lag_milliseconds may indicate insufficient resources, queue overload, or slowed lead processing.

General lead processing metrics

Metric	Description
`lead_prepare_milliseconds`	Lead preparation time
`lead_processing_milliseconds`	Lead processing time
`lead_wait_milliseconds`	Total wait time
`lead_total_milliseconds`	Total lead processing time

Stage processing metrics

Metric	Description
`lead_suppress_lists_check_milliseconds`	Suppress list check
`lead_policy_check_milliseconds`	Policy check
`lead_static_milliseconds`	Static data processing
`lead_form_milliseconds`	Form processing
`lead_relation_milliseconds`	Relations processing
`lead_query_milliseconds`	Query execution
`lead_loyalty_milliseconds`	Loyalty data processing
`lead_loyalty_program_milliseconds`	Loyalty program processing
`lead_site_milliseconds`	Site data processing
`lead_json_milliseconds`	JSON processing
`lead_render_milliseconds`	Content rendering
`lead_links_milliseconds`	Link generation
`lead_sends_milliseconds`	Message sending

Stage processing metrics are used to find bottlenecks during mailing execution.
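One way to locate a bottleneck is to rank stages by mean duration. The Python sketch below assumes that each stage metric exposes a cumulative sum and a sample count, as Prometheus histograms and summaries do through their `_sum` and `_count` series. Whether the platform publishes the stage metrics in that form is an assumption, and the helper name is illustrative.

```python
def slowest_stages(samples: dict, top: int = 3) -> list:
    """Rank stages by mean duration.

    samples maps a stage metric name to a (sum_ms, count) pair.
    Stages with no observations are skipped to avoid division by zero.
    """
    means = [
        (name, total_ms / count)
        for name, (total_ms, count) in samples.items()
        if count > 0
    ]
    return sorted(means, key=lambda item: item[1], reverse=True)[:top]
```

For example, given `{"lead_query_milliseconds": (4000.0, 10), "lead_render_milliseconds": (900.0, 10)}`, the query stage ranks first with a mean of 400 ms per lead.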

Stage wait metrics

Metric	Description
`lead_suppress_lists_check_wait_milliseconds`	Wait time for suppress list check
`lead_policy_check_wait_milliseconds`	Wait time for policy check
`lead_static_wait_milliseconds`	Wait time for static data processing
`lead_form_wait_milliseconds`	Wait time for form processing
`lead_relation_wait_milliseconds`	Wait time for relations processing
`lead_query_wait_milliseconds`	Wait time for query execution
`lead_loyalty_wait_milliseconds`	Wait time for loyalty data
`lead_loyalty_program_wait_milliseconds`	Wait time for loyalty programs
`lead_site_wait_milliseconds`	Wait time for site data
`lead_json_wait_milliseconds`	Wait time for JSON processing
`lead_render_wait_milliseconds`	Wait time for rendering
`lead_links_wait_milliseconds`	Wait time for link generation

An increase in wait metrics typically indicates insufficient resources, locks, or overloaded dependent services.

Stage error metrics

Metric	Description
`lead_suppress_list_check_failure_count`	Suppress list check errors
`lead_policy_check_failure_count`	Policy check errors
`lead_static_failure_count`	Static data processing errors
`lead_form_failure_count`	Form processing errors
`lead_relation_failure_count`	Relations processing errors
`lead_query_failure_count`	Query execution errors
`lead_loyalty_failure_count`	Loyalty data processing errors
`lead_loyalty_program_failure_count`	Loyalty program processing errors
`lead_site_failure_count`	Site data processing errors
`lead_json_failure_count`	JSON processing errors
`lead_render_failure_count`	Rendering errors
`lead_links_failure_count`	Link generation errors
`lead_sends_failure_count`	Message sending errors

An increase in failure metrics indicates errors at specific mailing processing stages and can be used to configure alerting rules in Prometheus and Grafana.

Monitoring verification

After configuration, verify that Prometheus receives metrics.

For pull metrics, run the following requests:

curl http://10.200.5.25:8911/metrics
curl http://10.200.5.25:8912/metrics

For push metrics, check the Pushgateway:

curl http://10.200.5.20:9091/metrics

Also check the target status in the Prometheus UI. Jobs procworkflow, proctrigger, and pushgateway should show status UP.
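The same target check can be automated through the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports the health of each active target. The Python sketch below lists the expected jobs that have no healthy target. The Prometheus address is not given in this document, so pass the one used in your deployment.

```python
import json
import urllib.request

EXPECTED_JOBS = ("procworkflow", "proctrigger", "pushgateway")


def jobs_not_up(targets_response: dict, expected_jobs=EXPECTED_JOBS) -> list:
    """Return the expected jobs that have no target with health "up"."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    up_jobs = {
        target.get("labels", {}).get("job")
        for target in active
        if target.get("health") == "up"
    }
    # A job counts as up if at least one of its targets is healthy.
    return [job for job in expected_jobs if job not in up_jobs]


def fetch_targets(prometheus_url: str, timeout: float = 5.0) -> dict:
    """Query the targets endpoint of a Prometheus server."""
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/targets", timeout=timeout) as response:
        return json.load(response)
```

An empty list from `jobs_not_up(fetch_targets(...))` means that all three jobs are being scraped successfully.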

Common issues

Issue	Possible cause	What to check
Prometheus does not receive pull metrics	Process listens only on the local interface	Check `WF_METRIC_HOST` and `PROC_TRIGGER_METRIC_HOST` values
`/metrics` endpoint is unavailable	Metrics service did not start	Check the port with `netstat -tlpn`
Metrics service does not start	Port is occupied by another process	Specify a free port in the `1024–9999` range
No `campaign` metrics in Pushgateway	Mailing has not been launched yet	Launch the mailing and re-check
All campaign metrics fall into a single group	Mailing ID grouping is disabled	Check `CAMPAIGN_ID_PROMETHEUS_GROUPING_ENABLE`
