Research Paper

When to Start Communicating: Adaptive Stigmergy Gates Improve Multi-Agent RL Training Dynamics

Ashish KhandelwalIndependent ResearcherApril 6, 2026DOI: 10.5281/zenodo.19436476

Abstract

Shared communication channels are often treated as an unconditional good in multi-agent reinforcement learning (MARL): giving agents access to messages, shared memory, or stigmergic traces should improve coordination. Yet communication can also destabilize learning when it is available before it is informative, inducing spurious correlations that interfere with credit assignment and representation learning.

We present an Adaptive Stigmergy Engine that controls when agents may communicate through a shared pheromone field. The engine uses a simple gate evaluated mid-episode to activate the field only under population-level distress signals (population decline, energy crisis, or delivery inequality).

Across 10 seeds in a clean end-to-end run (10M training steps) and a 20-experiment development arc, adaptive deployment outperforms both ALWAYS_ON and ALWAYS_OFF baselines, improving hard-task deliveries by +34% and +19% respectively while yielding the lowest population variance.

Cross-domain validation with LLM agent collectives (481 runs across two models and three task domains) confirms the principle: on adversarial tasks with planted decoy bugs, communication exposure increases false consensus capture monotonically (0% at 0 shared rounds to 100% at 3 shared rounds), demonstrating that communication costs depend on information quality in both RL and LLM settings.

1. Introduction

Coordination is a central challenge in multi-agent reinforcement learning (MARL). Agents that learn independently must still solve joint problems: avoid redundant exploration, allocate roles, and exploit complementary information under partial observability. A common response is to provide a shared information channel — explicit communication, differentiable message passing, shared memory, or an external “blackboard” — so that coordination can be learned end-to-end.

Stigmergy offers a particularly lightweight variant: agents coordinate indirectly by writing to and reading from a persistent medium (e.g., pheromone trails), as in social insects.

However, a coordination channel is not a free lunch. In most MARL systems, communication is treated as static: the channel is either always available (always-on) or fully removed (always-off). This design choice implicitly assumes that “more information” is monotonically helpful. In practice, an always-on channel can be actively harmful early in learning.

1.1 When communication hurts

We study a multi-agent foraging task in which agents must discover food, carry it back to a nest, and maintain viability under evolutionary pressure (birth, death, and reproduction). Agents can optionally interact through a shared pheromone field.

Surprisingly, always-on pheromone access can degrade performance during training. In our setting, the ALWAYS_ON condition exhibits the worst performance at 5M training steps, even compared to a no-communication baseline. A plausible explanation is that at step 0 the pheromone field is effectively uninformative: deposits are random, spatial structure has not yet emerged, and the field provides a high-dimensional input stream that is easy for a function approximator to overfit.

This observation reframes the communication design problem: the question is not only what agents should share, but when the channel should be available.

1.2 Adaptive stigmergy as a channel-curriculum

We propose an Adaptive Stigmergy Engine that gates access to the pheromone field based on population-level distress signals. Rather than learning a continuous attention weight over communication, we start with a simple, interpretable mechanism: a rule-based gate evaluated at a fixed midpoint of each episode that decides whether the pheromone channel is active.

The key hypothesis is that adaptive deployment induces a channel-curriculum: by default, agents learn to forage using only local observations, which encourages stable individual competence and reduces reliance on a noisy shared channel. When conditions warrant it, the gate activates and agents learn to use pheromone information as a residual correction layered on top of a competent base policy.

1.3 Results preview

Across easy and hard variants of the task, the adaptive condition dominates both static baselines: it achieves the highest mean deliveries while also exhibiting the lowest population variance and reduced tail risk. The gains are nonlinear in training duration: at 5M training steps, all modes are comparable, but by 10M steps adaptive gating pulls ahead, consistent with a learning-dynamics explanation rather than an execution-time trick.

2. Method

2.1 Environment

We study a central-place foraging task implemented as a discrete-time grid-world. The default world is a 20 × 20 grid containing (i) a fixed nest at the center, (ii) F food sources placed in the environment, and (iii) a population of up to N_max = 32 agents. Each agent occupies a single cell and acts synchronously at each environment step.

Agents have a discrete action space of size 5: stay, up, down, left, right. The core loop is central-place foraging: an agent searches for a food source, picks up food upon contact, carries it back to the nest, and delivers it for reward.

Each agent maintains a scalar energy variable that decreases by a fixed amount each step (energy drain). If an agent's energy reaches zero, it dies and is removed from the active population. Reproduction is triggered when an agent returns to the nest with energy exceeding a threshold.

2.2 Pheromone field

Agents may coordinate through a shared pheromone field that acts as an external, persistent communication medium. The field is a dense tensor P ∈ ℜ^H×W×C aligned with the grid world, with H = W = 20 and C = 4 channels.

At each step, each agent can (i) read pheromone values from a local neighborhood around its position, and (ii) write by incrementing the pheromone value at its current cell according to channel-specific rules.

Channel 0 (Recruitment): A pheromone trail used to recruit other agents toward recently successful foraging locations. Deposits are success-gated: only agents currently carrying food are permitted to deposit on this channel.

2.3 Adaptive stigmergy gate

A stigmergic pheromone field is only useful after it encodes information about successful interaction histories. At the start of an episode (and, more importantly, at the start of training), the field is typically empty or reflects random exploration, so reading it provides little signal and substantial opportunity for overfitting.

The adaptive gate operationalizes a two-phase structure directly: it enforces an initial individual-learning phase (field OFF), then conditionally enables stigmergic coordination (field ON) only when the population appears to need additional coordination capacity.

Step 0Step 250Step 499

Phase 1: Individual Foraging

Field OFF — agents learn independently

Phase 2: Collective Coordination

Field ON — agents read/write pheromone trails

Gate fires when: population decline, energy crisis, or delivery inequality

Performance Comparison — Hard Task

ALWAYS_OFF

OFF

43 del

ALWAYS_ON

38 del

ADAPTIVE

OFF

GATE

51 del ★

Figure 1: Schematic of the adaptive gate mechanism showing Phase 1 (individual foraging, field OFF), gate evaluation at step 250, and Phase 2 (collective coordination, field ON). Performance comparison on hard task: ADAPTIVE achieves 51 deliveries vs 43 (ALWAYS_OFF) and 38 (ALWAYS_ON).

Phase 1 (steps 0–249): We set field_enabled = False. All pheromone reads and writes are masked to zero, ensuring that agents learn a baseline foraging policy without access to stigmergic state.

Gate evaluation (step 250): At the midpoint of each episode, we evaluate three OR-conditions using population-level statistics:

Population decline: current alive population N_alive is less than the starting population N₀.
Energy crisis: mean energy among alive agents is less than 0.5 times the starting energy.
Delivery inequality: the Gini coefficient of per-agent delivery counts exceeds 0.6.

If any condition is true, we set field_enabled = True for the remainder of the episode; otherwise it remains false.

3. Experiments & Results

3.1 Experimental setup

All experiments use 10 random seeds per condition. Episodes have fixed horizon T = 500 steps. We report two primary evaluation metrics: (i) total food deliveries per episode (higher is better), and (ii) final population count at episode end (higher indicates greater viability).

We compare three communication modes: ALWAYS_OFF, ALWAYS_ON, and ADAPTIVE. Training uses PPO for 10M environment steps with 32 parallel environments.

3.2 Main results

System	Easy Del	Easy Pop	Hard Del	Hard Pop
ALWAYS_OFF	36 ± 18	9.0 ± 9.6	43 ± 21	22.0 ± 8.2
ALWAYS_ON	38 ± 19	1.6 ± 2.7	38 ± 23	19.5 ± 10.1
ADAPTIVE	44 ± 21	12.5 ± 10.2	51 ± 20	24.8 ± 7.9

Table 1: Main evaluation results (mean ± std over 10 seeds). “Pop” denotes final population count.

ADAPTIVE achieves the highest mean deliveries on both tasks (44 on Easy and 51 on Hard) and the highest final population on both tasks. On the hard task, ADAPTIVE also has the lowest population variance (7.9 vs 8.2 for ALWAYS_OFF and 10.1 for ALWAYS_ON). The hard-task delivery advantage is substantial: ADAPTIVE improves over ALWAYS_OFF by 19% and over ALWAYS_ON by 34%.

Total Deliveries

Final Population

Figure 2: Main evaluation results (bar chart with error bars). Left: Total deliveries. Right: Final population. ADAPTIVE (green) achieves the highest mean and lowest variance on both metrics across easy and hard tasks.

3.3 Timing ablation

To test whether the evaluation-time timing of field activation explains performance, we ran a timing ablation on the hard task in which the pheromone field becomes available at a fixed step t_on (or never), keeping all other components unchanged.

All timing conditions fall within 15% of the overall mean (threshold 20%), and we do not observe a systematic advantage attributable to turning the field on earlier or later at evaluation time. This is critical: the benefit of ADAPTIVE cannot be explained as an execution-time switching trick, and instead points to differences in training dynamics.

Condition	Deliveries	Population
ALWAYS_OFF	31.8 ± 17.7	18.3 ± 7.6
ALWAYS_ON	32.3 ± 12.9	20.5 ± 7.6
EARLY_ON (100)	35.1 ± 24.7	19.3 ± 11.1
ADAPTIVE (250)	30.6 ± 13.4	19.0 ± 7.5
LATE_ON (350)	35.4 ± 18.9	21.1 ± 8.1

Table 2: Timing ablation on the hard task (mean ± std over 10 seeds).

3.4 Training duration effect

At 5M steps the three modes are statistically indistinguishable, whereas by 10M steps ADAPTIVE pulls ahead nonlinearly. This pattern supports the channel-curriculum hypothesis: the policy appears to require sufficient optimization time to cross a “field legibility” threshold at which pheromone information becomes reliably exploitable.

Figure 3: Training duration effect showing nonlinear ADAPTIVE advantage. At 5M steps all modes are comparable; by 10M steps ADAPTIVE pulls ahead, consistent with a channel-curriculum mechanism rather than an execution-time trick.

Figure 4: Timing ablation on hard-task deliveries. Similar means across activation times indicate no systematic execution-time timing effect. Dashed line denotes the overall mean.

4. Cross-Domain Validation: LLM Agent Collectives

The RL results show that communication timing affects training dynamics in gradient-based multi-agent learning. A natural question is whether the same principle — that premature communication can be harmful — transfers to a fundamentally different multi-agent paradigm: large language model (LLM) agents performing collaborative reasoning. We conducted a series of experiments totaling 481 runs across two models, three task domains, and six communication conditions.

4.1 Experimental design

We instantiate three LLM agents that independently investigate Python bug-finding tasks over five deliberation rounds. Each agent maintains a structured scratchpad containing its current hypothesis, a confidence score in [0, 1], supporting findings, and free-form reasoning. At the end of each round, a sharing policy determines whether agents see each other's scratchpads.

We test six conditions: NEVER_SHARE, ALWAYS_SHARE, FIXED_DELAY_1 through FIXED_DELAY_3, and ADAPTIVE. Models tested: Claude Sonnet 4.5 and Claude Haiku 4.5.

4.2 Standard tasks: ceiling effect

Condition	Sonnet / Debug	Haiku / Debug	Haiku / Incident
NEVER_SHARE	50%	70%	85%
ALWAYS_SHARE	100%	100%	100%
FIXED_DELAY_1	100%	100%	100%
FIXED_DELAY_2	100%	100%	100%
FIXED_DELAY_3	100%	100%	100%
ADAPTIVE	100%	100%	100%

Table 3: Accuracy on standard tasks by condition, model, and domain (20 tasks per cell).

On standard tasks, communication is redundant but harmless: any sharing strategy reaches 100% accuracy. ADAPTIVE functions as an efficiency optimizer — same outcome, less overhead. ADAPTIVE shares scratchpads for approximately 1 of 5 rounds (80% less communication overhead) because the gate delays sharing until convergence or saturation is detected.

4.3 Adversarial task: epistemic cascades

To test whether premature sharing can actively harm accuracy — as ALWAYS_ON harmed RL training — we designed an adversarial task with a planted epistemic trap. The task contains a real bug (a subtle async race condition) and a decoy bug (an obvious non-atomic read-modify-write). Three agents each receive different code files, with the hypothesis that early sharing causes anchoring on the decoy.

Dosage-response curve

Shared rounds	Decoy capture rate
0	0–20%
1	60%
2	80%
3	100%

Table 4: Dosage-response — decoy capture rate as a function of shared rounds (Haiku 4.5, 3-round regime).

Figure 5: Dosage-response: decoy capture rate increases monotonically with communication exposure (Haiku 4.5, 3-round regime). Each additional shared round increases false consensus by ~20-30 percentage points.

Decoy capture increases monotonically with communication exposure. Each additional shared round increases decoy capture by approximately 20–30 percentage points. Error propagation is fast (one shared round is sufficient to anchor), while error correction is slow (requiring 4–5 rounds of exposure to a dissenting view).

5. Analysis & Discussion

5.1 The channel-curriculum mechanism

The central empirical pattern is that adaptive activation improves mean performance and stability, while timing ablations show that execution-time switching alone does not explain the gains. We therefore interpret ADAPTIVE as a training-dynamics intervention.

ALWAYS_ON: learning with an initially weak channel. Under ALWAYS_ON training, the policy receives pheromone features from step 0, when the field is mostly uninformative. PPO updates can then fit transient correlations between noisy field values and actions. As trails become meaningful later, the policy must reconfigure those representations, which increases interference and variance.

ADAPTIVE: staged learning via delayed channel exposure. Under ADAPTIVE training, Phase 1 (field OFF) first builds robust individual control (search, homing, and energy management). Phase 2 (field ON) introduces pheromone features only after useful trail structure emerges. This encourages communication to act as a residual correction on top of a competent base policy, rather than as a brittle early dependency.

5.2 Stigmergy as both stabilizer and amplifier

Across our experiments, stigmergic communication has a horizon-dependent role. Early in training, it acts mainly as a risk controller (variance and tail-risk effects); later, once trails are legible, it acts as a performance amplifier (mean-delivery gains with stability retained).

In our setting, the key is not communication per se, but whether learning has crossed the legibility horizon. Adaptive gating is effective because it suppresses early interference while preserving late-stage amplification.

5.3 Communication is not monotonically helpful

The LLM experiments confirm and extend the RL finding that communication has information-quality-dependent costs. Both domains demonstrate the same principle — premature access to a shared information channel degrades collective performance when the channel content is misleading.

In the RL domain, an empty pheromone field acts as noise: agents that read it from step 0 overfit to uninformative gradients. In the LLM domain, a compelling decoy acts as a loud wrong signal: agents that share early anchor on it, and more sharing rounds monotonically increase decoy capture.

6. Conclusion

Adaptive stigmergic communication can outperform static communication strategies in multi-agent learning when the channel is informative only after agents have acquired basic competence. In a JAX multi-agent foraging simulation trained with PPO and evolutionary pressure, ADAPTIVE improves mean deliveries on the hard task by 19% over ALWAYS_OFF and 34% over ALWAYS_ON, while also reducing population variance across 10-seed evaluations.

We argue that the mechanism is a channel-curriculum: delaying access to the pheromone field allows policies to first learn robust individual foraging skills, then incorporate pheromone information as residual corrections once meaningful trails exist.

Cross-domain validation with LLM agent collectives extends these findings beyond RL. On standard tasks, adaptive gating achieves the same accuracy as unconditional sharing while reducing communication overhead by 80%. On an adversarial task with a planted decoy bug, we observe dosage-dependent epistemic cascades: each additional round of shared deliberation increases false consensus capture by 20–30 percentage points, reaching 100% at 3 shared rounds.

The practical takeaway is simple: do not coordinate until there is something worth coordinating about — and regulate how long the channel stays open, because errors propagate faster than corrections.

Prefer the full academic paper with all figures, tables, and references?

Download PDF