Discussion about this post

User's avatar
Pawel Jozefiak's avatar

The local-to-production gap in multi-agent systems is brutal. Three agents coordinate perfectly in dev, then one misfires in prod and you have no visibility into why.

OpenTelemetry for this is the right call. Went through a version of this with my own setup - built a flat-file state log before finding proper tooling. Running Claude with computer use made it more obvious where agent coordination was breaking (https://thoughts.jock.pl/p/claude-cowork-dispatch-computer-use-honest-agent-review-2026).

The automated evaluation part is where I'm still figuring things out. How are you handling evaluation when expected output isn't well-defined?

No posts

Ready for more?