How We Tripled Automation Throughput on the Same Hardware

Automate everything at scale with just 16 vCPUs

Synerise Automation is a low-code workflow platform that handles both event-driven automation and ETL workloads. What makes it special is its scale, multi-tenancy support, and broad applicability across use cases. On an average day it consumes 1 billion events, matches roughly 400 million of them against active workflows, and triggers 200K scheduled jobs — all while keeping sub-second reaction time for 99.99% of events. Achieving that headroom required deliberate engineering investment. We pushed throughput from ~300K to ~900K messages per minute while cutting CPU usage from 60 vCPUs down to 16 vCPUs. This post breaks down the architecture that made it possible and the optimization patterns that got us there.

Automation execution stats from the last 24 hours:

Automation Architecture

Our Automation module is designed as a collection of independently scalable services running on a Kubernetes platform. These services communicate asynchronously via Kafka, allowing the system to process large volumes of events efficiently, and to remain resilient under varying workloads. Most Automation workflow node types — such as delay, event generation, client filter, metric filter, or any external integration — are implemented as dedicated microservices. This design enables fine-grained and independent scaling and clear separation of responsibilities, simplifying maintenance and future development.

At the center of the system is the Automation Heart — the orchestration service responsible for executing automation workflows. It operates in coordination with the Automation Brain, which defines and manages workflow configurations through the user interface. Automation Heart consumes events from a global real-time event stream. When an event arrives, it identifies matching workflow definitions and either initiates new workflow instances (multiple workflows may start from the same event) or advances existing ones. As a workflow progresses, Automation Heart updates the workflow status and per-node visit statistics. When a visited node is implemented by an external automation microservice, a Kafka message is sent carrying the node configuration and referenced data, e.g. the triggering event body.

Performance-First Engineering

Reaching a throughput of 1 billion messages per day didn’t happen by accident — it was the result of deliberate effort from day one. Performance was one of the primary design goals, not an afterthought. That mindset influenced the technology stack — we chose Rust for its predictable performance, memory safety without garbage collection, and fine-grained control over resources. However, Rust is not without trade-offs. Development is slower compared to a mature Java or Scala stack, the library ecosystem is younger and less battle-tested, and performance measurement tooling still lags far behind what the JVM offers. For our use case — a long-running, CPU-bound message processing core — the runtime predictability justified that cost. It also shaped our observability approach: from the first prototype, we embedded internal metrics and built extensive Grafana dashboards that show message flow, latency, Kafka group lags, and resource utilization in real time. These dashboards became our primary feedback loop for every optimization that followed.

Once the system was stable in production, the next step was to push message processing performance even further. We began by establishing a clear baseline through load testing and iterative profiling — an essential foundation for any data-driven optimization effort.

Establishing the Baseline

We used the Grafana k6 framework to simulate realistic message workloads and measure end-to-end throughput under controlled conditions. These early tests gave us consistent reference points for evaluating every subsequent change. With the baseline established, we instrumented our Heart service with Rust profiling tools to identify hot paths and inefficiencies:

  • dhat – for analyzing memory allocations and identifying potential leaks or unnecessary heap pressure.
  • flamegraph – for visualizing sampled CPU call stacks, where wider bars indicate more time spent.
  • criterion – for precise benchmarking of critical execution paths and validating performance improvements.

A flamegraph of an automation workflow revealed several wide bars, clearly identifying CPU hotspots. We followed a disciplined loop of test → profile → fix → measure, gradually removing one bottleneck after another. Each iteration built on the gains of the previous one, creating a compound effect where small optimizations accumulated into a substantial overall speedup.

Key Optimization Patterns

Through this process, we identified several recurring performance themes — lessons applicable to most large-scale message processing systems regardless of programming language:

1. Kafka consumer batch size: We increased the Kafka consumer batch size from 64 to 512, which improved throughput while keeping total batch processing time under 125 ms. Larger batches also made it practical to parallelize processing within each batch.
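The batching loop can be sketched as follows — a minimal, standard-library-only illustration, assuming a poll-style message source; `drain_batch` and the message type are hypothetical names, not the actual Synerise code:

```rust
// Illustrative batching loop: drain up to BATCH_SIZE messages per iteration,
// then hand the whole batch to the processing stage in one go.
const BATCH_SIZE: usize = 512;

fn drain_batch<I: Iterator<Item = String>>(source: &mut I) -> Vec<String> {
    // Pre-size the buffer so the batch itself never reallocates.
    let mut batch = Vec::with_capacity(BATCH_SIZE);
    while batch.len() < BATCH_SIZE {
        match source.next() {
            Some(msg) => batch.push(msg),
            None => break, // no more messages available right now
        }
    }
    batch
}

fn main() {
    // Stand-in for a Kafka consumer yielding messages.
    let mut source = (0..1000).map(|i| format!("msg-{i}"));
    let first = drain_batch(&mut source);
    let second = drain_batch(&mut source);
    println!("batches: {} + {}", first.len(), second.len());
}
```

Larger batches amortize per-poll overhead, but the batch size must stay small enough that total batch processing time remains within the latency budget (here, 125 ms).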

2. Concurrent processing of messages: We grouped messages by customer/entity so each group could be processed concurrently. Think of it as partitioning a batch into independent units of work and running them in parallel using a restricted number of concurrent threads — a pattern available in any language with async or thread-pool support.
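The grouping idea can be sketched with just the standard library — partition the batch by customer id, then run each group on its own worker. This is an illustration, not the production code: a real system would use a bounded thread pool or async tasks, and `process_batch` and the `(u64, String)` message shape are hypothetical:

```rust
use std::collections::HashMap;
use std::thread;

// Partition a batch into per-customer groups and process groups concurrently.
// Ordering is preserved within each group, which is what correctness requires.
fn process_batch(batch: Vec<(u64, String)>) -> usize {
    // Group messages by customer id: each group is an independent unit of work.
    let mut groups: HashMap<u64, Vec<String>> = HashMap::new();
    for (customer_id, payload) in batch {
        groups.entry(customer_id).or_default().push(payload);
    }

    // One worker per group; a production system would cap concurrency.
    let handles: Vec<_> = groups
        .into_iter()
        .map(|(_customer, msgs)| thread::spawn(move || msgs.len()))
        .collect();

    // Collect per-group results; here just the message counts.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let batch = vec![
        (1u64, String::from("a")),
        (2, String::from("b")),
        (1, String::from("c")),
    ];
    println!("processed {} messages", process_batch(batch));
}
```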

3. Kafka message filtering: We introduced early filtering based on Kafka headers to skip irrelevant messages before deserialization. Using two primary headers, we avoided parsing 30-40% of inbound messages and reduced CPU utilization. This required upstream producer changes and a compact match tree for fast pre-checks:

// Extract optional header values from the Kafka message.
match parse_opt_message_headers(&msg) {
    (Some(business_id), Some(event_action)) => {
        // Shallow match tree with just two conditions for a fast pre-check.
        if prematch_tree.matches(&business_id, &event_action) {
            self.process_message(msg)
        } else {
            SKIPPED_NUMBER.inc();
            None
        }
    }
    // Headers missing: fall back to full processing.
    _ => self.process_message(msg),
}

4. Conditional logging: We discovered that info-level logs consumed nearly 10% of total execution time. Moving irrelevant messages to debug level and using dedicated log targets improved throughput significantly. We also made sure to guard expensive log argument evaluation — logging macro arguments are eagerly evaluated, so any costly computation runs even when the log level won't emit the message.

use log::{debug, log_enabled, Level};

// This guard prevents expensive formatting and computation
// when the log level will not emit the message.
if log_enabled!(Level::Debug) {
    debug!("{}", do_the_slow_and_expensive_stuff());
}

5. Memory reallocations: We pre-allocated collection sizes based on known input dimensions instead of letting data structures resize dynamically. In one case, replacing a multi-pass unzip-then-flatten with a single-pass pre-allocated approach eliminated thousands of unnecessary reallocations per batch.
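The single-pass pre-allocation pattern looks roughly like this — a minimal sketch, with the input shape (`&[Vec<u32>]`) chosen for illustration rather than taken from the actual code:

```rust
// Flatten nested rows in one pre-allocated pass instead of unzip-then-flatten.
fn flatten_prealloc(rows: &[Vec<u32>]) -> Vec<u32> {
    // Known input dimensions let us size the output exactly once.
    let total: usize = rows.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(total);
    for row in rows {
        out.extend_from_slice(row); // single pass, no reallocations
    }
    out
}

fn main() {
    let rows = vec![vec![1, 2], vec![3], vec![4, 5, 6]];
    let flat = flatten_prealloc(&rows);
    println!("flattened {} values without reallocating", flat.len());
}
```

Without `with_capacity`, a growing `Vec` doubles and copies repeatedly; at thousands of elements per batch those copies add up.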

6. Replacing our cache implementation: Originally, all automation workflow instances were cached using a cache built on Mutex + RefCell. Replacing it with a purpose-built concurrent cache library (moka) significantly reduced lock contention.
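The core reason a concurrent cache helps can be sketched with plain `std` sharding — readers and writers of different shards no longer contend on one global lock. This only illustrates the contention argument; moka goes much further (eviction policies, TTLs, mostly lock-free reads) and is what we actually use:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Shard-per-bucket map: contention drops roughly by the shard count,
// because a lock now guards only a fraction of the keyspace.
struct ShardedMap {
    shards: Vec<Mutex<HashMap<u64, String>>>,
}

impl ShardedMap {
    fn new(shard_count: usize) -> Self {
        let shards = (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect();
        Self { shards }
    }

    fn shard(&self, key: u64) -> &Mutex<HashMap<u64, String>> {
        &self.shards[(key as usize) % self.shards.len()]
    }

    fn insert(&self, key: u64, value: String) {
        self.shard(key).lock().unwrap().insert(key, value);
    }

    fn get(&self, key: u64) -> Option<String> {
        self.shard(key).lock().unwrap().get(&key).cloned()
    }
}

fn main() {
    let cache = ShardedMap::new(16);
    cache.insert(42, String::from("workflow-instance"));
    println!("hit: {:?}", cache.get(42));
}
```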

7. Kafka producer tuning: We tuned Kafka producer parameters like linger.ms and batch.size, choosing values that balanced throughput for large batches with fast response times for smaller ones. Producer-side tuning had a materially larger impact than expected, particularly for large batches of messages.
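For reference, producer tuning of this kind boils down to a handful of librdkafka properties. The values below are illustrative, not the ones we run in production — the right numbers are workload-dependent:

```
# Illustrative librdkafka producer properties (values are examples only)
linger.ms=5           # wait briefly so small sends coalesce into larger batches
batch.size=131072     # 128 KiB per-partition batch buffer
compression.type=lz4  # trade a little CPU for smaller network payloads
```

A higher `linger.ms` raises throughput for bursty large batches but adds latency to sparse traffic, which is why the balance point has to be measured rather than guessed.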

Results

After applying these optimizations and validating each one against our k6 baselines:

  • Throughput: ~300K → ~900K messages/min on the same hardware (3x)
  • CPU usage: 60 vCPUs → 16 vCPUs (73% reduction)
  • Efficiency per vCPU: 10-12x improvement

None of these changes required additional hardware, extra Kafka partitions, or architectural rewrites. The work spanned six sprints over three months, with each change rolled out gradually, usually once per week. No optimization broke an integration test, violated a service contract, or altered business logic — every one was a pure efficiency gain under identical behavior. Every gain came from profiling, understanding hot paths and incoming event distributions, and systematically applying proven optimization patterns.

What Surprised Us

There was no single silver bullet. No optimization on its own delivered a dramatic win. The 10-12x efficiency gain came from compounding many small, disciplined improvements. The real payoff came during Black Week and the Christmas season, when peak traffic did not require firefighting or emergency patches. The only component that needed additional scaling was the backend database, to keep up with Automation's pace.

With performance off the table, we shifted focus to cost. We began experimenting with spot instances and dynamic scaling driven by Kafka consumer group lag, looking to push operational costs even lower without sacrificing the headroom we'd earned.

Article written by Piotr Masko, Automation Team Lead at Synerise
