FlowSeg: Dynamic Semantic Flow for LLM-Conditioned Segmentation

Abstract

LLM-conditioned segmentation has recently advanced by coupling large language models with iterative mask generation frameworks. However, current query-based propose-then-select pipelines can generate high-quality mask candidates while still failing to select the mask that matches the linguistic condition. FlowSeg addresses this semantic misalignment by introducing dynamic semantic guidance through a bidirectional semantic flow between intermediate decoding states and LLM-derived condition embeddings.

Language conditions actively guide mask refinement at each decoding stage, while condition embeddings are progressively updated by emerging visual evidence. A lightweight boundary-aware refinement module further enhances uncertain regions without perturbing confident interiors. Experiments on referring expression segmentation and reasoning segmentation demonstrate consistent improvements and state-of-the-art performance.

Motivation

In query-based LLM-conditioned segmentation, the model may already produce a candidate mask that overlaps well with the target object, but the final matching step can select a semantically wrong candidate. FlowSeg treats language grounding as part of the generation dynamics rather than only a post-hoc selection signal.

Method

FlowSeg is built on a standard LLM-segmentor scaffold with dual visual encoders and a query-based segmentation decoder. Its key contribution is the decoder-side Bidirectional Semantic Flow, where condition embeddings guide query refinement and are updated by decoder queries throughout the generation process.

Semantic Cross-Attention: Queries attend to LLM-derived condition embeddings at each decoder layer, injecting linguistic constraints during mask generation.
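
As a rough illustration of queries attending to condition embeddings, the sketch below implements plain single-head scaled dot-product cross-attention with a residual update. It is not the FlowSeg implementation. The function names, the absence of learned projections, and the toy two-dimensional vectors are all assumptions made for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def semantic_cross_attention(queries, conditions):
    """Let each decoder query attend to the condition embeddings.

    queries:    list of N vectors (decoder queries), each of length d
    conditions: list of M vectors (condition embeddings), each of length d
    Returns N updated query vectors. Assumption: no learned weights,
    only a residual connection around the attention output.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        # Scaled dot-product similarity between this query and every condition.
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in conditions]
        weights = softmax(scores)
        # Attention-weighted mix of the condition embeddings.
        context = [sum(w * c[k] for w, c in zip(weights, conditions)) for k in range(dim)]
        # Residual update: the query keeps its content and absorbs the language signal.
        updated.append([qk + ck for qk, ck in zip(q, context)])
    return updated

queries = [[1.0, 0.0], [0.0, 1.0]]
conditions = [[0.5, 0.5]]
print(semantic_cross_attention(queries, conditions))
# [[1.5, 0.5], [0.5, 1.5]]
```

With a single condition embedding the attention weight is 1, so each query simply adds that embedding. A real decoder layer would typically add learned query, key, and value projections, multiple heads, and normalization.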

Condition Refinement: Condition embeddings absorb emerging visual evidence from refined queries, making language representations visually grounded.
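
One way to picture the reverse direction, and the alternation between the two directions, is sketched below. It is a hedged illustration rather than the authors' method. The mixing weight `alpha`, the layer count, the helper names, and the use of unprojected dot-product attention are hypothetical choices.

```python
import math

def attend(targets, sources):
    """Single-head scaled dot-product attention: each target reads from the sources."""
    dim = len(targets[0])
    contexts = []
    for t in targets:
        scores = [sum(a * b for a, b in zip(t, s)) / math.sqrt(dim) for s in sources]
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        contexts.append([sum(e / total * s[k] for e, s in zip(exps, sources))
                         for k in range(dim)])
    return contexts

def refine_conditions(conditions, queries, alpha=0.5):
    """Blend each condition embedding with the visual evidence it reads from the queries.

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    contexts = attend(conditions, queries)
    return [[(1 - alpha) * c + alpha * v for c, v in zip(cond, ctx)]
            for cond, ctx in zip(conditions, contexts)]

def bidirectional_flow(queries, conditions, num_layers=3, alpha=0.5):
    """Alternate language-to-query and query-to-language updates across layers."""
    for _ in range(num_layers):
        # Language guides the queries (residual update).
        lang_ctx = attend(queries, conditions)
        queries = [[q + c for q, c in zip(qv, cv)] for qv, cv in zip(queries, lang_ctx)]
        # The refined queries then update the condition embeddings.
        conditions = refine_conditions(conditions, queries, alpha)
    return queries, conditions

print(refine_conditions([[0.0, 0.0]], [[2.0, 4.0]], alpha=0.5))
# [[1.0, 2.0]]
```

In the printed example the single condition attends to the single query with weight 1, so the result is the midpoint between the two vectors. Calling `bidirectional_flow` repeats both updates once per layer, which mirrors the idea that the condition embeddings seen by later layers already reflect earlier visual evidence.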

Boundary-Aware Refinement: A lightweight refinement module selectively improves uncertain boundary regions while preserving confident mask interiors.
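
The selective behaviour described here can be sketched as an uncertainty gate: pixels whose coarse probability lies close to 0.5 take the refined prediction, and confident pixels are left alone. This is a minimal sketch under stated assumptions. The `band` threshold, the function name, and the idea that the refined map comes from a separate refinement head are hypothetical and not taken from the paper.

```python
def refine_uncertain(probs, refined, band=0.2):
    """Replace only uncertain pixels with the refined prediction.

    probs:   2D list of coarse foreground probabilities in [0, 1]
    refined: 2D list of probabilities from a (hypothetical) refinement head
    band:    half-width of the uncertainty band around 0.5 (assumed value)
    """
    output = []
    for coarse_row, refined_row in zip(probs, refined):
        row = []
        for p, r in zip(coarse_row, refined_row):
            # A pixel counts as uncertain when its probability is near 0.5.
            uncertain = abs(p - 0.5) < band
            row.append(r if uncertain else p)
        output.append(row)
    return output

coarse = [[0.95, 0.55], [0.45, 0.05]]
refined = [[0.10, 0.90], [0.10, 0.90]]
print(refine_uncertain(coarse, refined))
# [[0.95, 0.9], [0.1, 0.05]]
```

The confident values 0.95 and 0.05 pass through unchanged even though the refined map disagrees with them, while the two near-boundary values are overwritten.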

Figure: Main pipeline of FlowSeg. Bidirectional Semantic Flow lets language condition embeddings guide mask generation at each decoding stage, while the embeddings are progressively updated with emerging query embeddings.

Results

Referring Expression Segmentation

FlowSeg improves over prior methods on RefCOCO, RefCOCO+, and RefCOCOg, with stronger gains on more challenging splits.

Reasoning Segmentation

On the ReasonSeg test split, FlowSeg reaches 54.7 cIoU, outperforming the baseline by 13.7 points.
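
For readers unfamiliar with the metric, cIoU (cumulative IoU) is commonly computed as the total intersection divided by the total union, both accumulated over the whole evaluation set, rather than as an average of per-image IoU values. The sketch below shows that computation on toy binary masks. The function name and the flat-list mask format are illustrative assumptions, and the paper's evaluation code may differ in detail.

```python
def cumulative_iou(pred_masks, gt_masks):
    """cIoU: total intersection over total union, accumulated over all samples.

    Each mask is a flat list of 0/1 pixel labels; paired masks have equal length.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        total_inter += sum(1 for p, g in zip(pred, gt) if p and g)
        total_union += sum(1 for p, g in zip(pred, gt) if p or g)
    # Guard against an empty union (no foreground anywhere).
    return total_inter / total_union if total_union else 0.0

preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]
print(cumulative_iou(preds, gts))  # 2 / 6, about 0.333
```

The function returns a fraction, and reported scores such as 54.7 are this value expressed as a percentage. Because pixels are pooled across images, large objects contribute more to cIoU than small ones.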

Qualitative Results

FlowSeg produces more accurate masks with finer details compared with prior work, especially for ambiguous referring expressions and complex object boundaries.

Figure: Qualitative comparison on RefCOCO, RefCOCO+, and RefCOCOg.

Citation

@inproceedings{flowseg2026,
  title     = {FlowSeg: Dynamic Semantic Flow for LLM-Conditioned Segmentation},
  author    = {Zekang Zhang and Guangyu Gao and Youyun Tang and ChengJing Wu and Xiaochao Qu and Chi Harold Liu and Jianbo Jiao and Yunchao Wei and Luoqi Liu and Ting Liu},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}