HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

Abstract

Video outpainting generates plausible visual content beyond a video's original spatial extent, playing a key role in adapting videos to diverse display formats. To support such use cases, it must handle large spatial extrapolation over long sequences. However, most existing methods address only one of these two challenges, or lack explicit mechanisms for them, and are therefore limited in this setting.

In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct a Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, the GCG is built via a global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling.

This enables the GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint performs high-resolution outpainting to generate spatially detailed and temporally consistent content. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios with wide spatial extrapolation and long video sequences.

Method Overview

Overall framework of the proposed HL-OutPaint. (a) HL-OutPaint consists of two stages: Global Coarse Guidance Construction and GCG-Guided Video Outpainting. (b) Global Coarse Guidance Construction generates GCG from a spatio-temporally compressed video; at every diffusion timestep, global-local frame swapping aligns global keyframes with their local temporal windows, producing a globally consistent yet locally well-aligned GCG. (c) GCG-Guided Video Outpainting uses the GCG to outpaint large-scale videos.

Key Idea

The keyframes at the top provide global temporal anchors: they sample the video sparsely over a long time span, so they are good at preserving the overall scene layout and long-range motion. However, because they are sampled sparsely, a keyframe alone can miss local details that occur between keyframes, such as the correct shape and position of a moving sign.

To recover these missing cues, HL-OutPaint also builds a local temporal window around each keyframe, shown in the middle row. This window contains neighboring frames around the selected keyframe, so it carries short-term motion and object appearance that the global keyframe sequence may overlook. During early denoising, global-local frame swapping copies the latent state of the same time instant between the global keyframe path and its local window path.
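The pairing of sparse keyframes with local temporal windows can be illustrated with a small indexing sketch. It is not the authors' implementation: the uniform `stride`, the symmetric `radius`, and the function names are all hypothetical choices made only for illustration.

```python
from typing import Dict, List, Tuple


def keyframe_indices(num_frames: int, stride: int) -> List[int]:
    """Sparse keyframe indices. Uniform spacing is an assumption, not from the paper."""
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    return list(range(0, num_frames, stride))


def local_window(center: int, radius: int, num_frames: int) -> List[int]:
    """Frame indices within `radius` of `center`, clipped to the video bounds."""
    start = max(0, center - radius)
    stop = min(num_frames, center + radius + 1)
    return list(range(start, stop))


def build_paths(
    num_frames: int, stride: int, radius: int
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Return the global keyframe path and one local window per keyframe."""
    keys = keyframe_indices(num_frames, stride)
    windows = {k: local_window(k, radius, num_frames) for k in keys}
    return keys, windows


if __name__ == "__main__":
    keys, windows = build_paths(num_frames=20, stride=8, radius=2)
    print(keys)         # global path: sparse frames spanning the whole video
    print(windows[8])   # local path: dense neighbors around keyframe 8
```

Each keyframe index appears both in the global path and in its own window, which is the frame the two paths have in common.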

Intuitively, the global keyframes lend long-range structure, while the local windows lend nearby temporal evidence. Without this exchange, the outpainted keyframe can become structurally inconsistent, as shown in (a). With swapping, as shown in (b), the keyframe inherits the local window's fine cues while staying aligned with the global video context, producing a GCG that is both globally stable and locally coherent.
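As a rough sketch of the exchange described above, the snippet below copies the latent of the shared time instant between a global keyframe path and its local window paths. It is an illustration under stated assumptions, not the paper's method: latents are treated as opaque objects, and the copy `direction`, the `early_fraction` schedule, and all function names are hypothetical.

```python
from typing import Any, List, Tuple

Latent = Any  # opaque per-frame latent (for example a tensor); placeholder type


def swap_keyframe_latents(
    global_latents: List[Latent],
    local_latents: List[List[Latent]],
    key_positions: List[int],
    direction: str = "local_to_global",
) -> Tuple[List[Latent], List[List[Latent]]]:
    """Copy the shared frame's latent from one path to the other.

    global_latents[k] is keyframe k on the global path; local_latents[k] is
    the window around it; key_positions[k] is where that keyframe sits inside
    the window. The copy direction is an assumption of this sketch.
    """
    if direction not in ("local_to_global", "global_to_local"):
        raise ValueError(f"unknown direction: {direction!r}")
    if not (len(global_latents) == len(local_latents) == len(key_positions)):
        raise ValueError("expected one window and one position per keyframe")

    # Work on shallow copies so the inputs are left untouched.
    new_global = list(global_latents)
    new_local = [list(window) for window in local_latents]
    for k, pos in enumerate(key_positions):
        if direction == "local_to_global":
            new_global[k] = local_latents[k][pos]
        else:
            new_local[k][pos] = global_latents[k]
    return new_global, new_local


def should_swap(step: int, num_steps: int, early_fraction: float) -> bool:
    """True for the first `early_fraction` of steps (step 0 is the noisiest).

    Setting early_fraction to 1.0 applies the swap at every step.
    """
    return step < int(num_steps * early_fraction)
```

In a sampling loop, one would denoise both paths for a step and then call `swap_keyframe_latents` whenever `should_swap` returns true, so that the two paths agree on the frames they share.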

BibTeX

@inproceedings{HLOutPaint,
  author    = {Park, Jeongeun and Han, Janghyeok and Kim, Geonung and Lee, Hyun-Seung and Choi, Kyuha and Han, Youngseok and Cho, Sunghyun},
  title     = {HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos},
  year      = {2026},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  series    = {SIGGRAPH Conference Papers '26}
}
