The browser-use project has released video-use, an open-source tool that turns video editing into a conversation with a coding agent. Developers drop raw footage into a folder, describe the desired output in natural language, and the agent writes and executes the code to produce a finished video. The browser-use project has reached over 10,800 GitHub stars.
How It Works
The agent never watches the video. It reads it through two layers: an audio transcript with word-level timestamps and speaker diarization from ElevenLabs Scribe, and an on-demand visual composite that generates filmstrip, waveform, and word label views for any time range, according to the project’s README.
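To make the transcript layer concrete, here is a minimal sketch of how word-level timestamps with speaker labels might be represented and queried for a time range. The `Word` fields and both helper functions are hypothetical and chosen for illustration; they are not the ElevenLabs Scribe response schema or video-use's internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word (hypothetical shape, not the Scribe schema)."""
    text: str
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip
    speaker: str  # diarization label, e.g. "A" or "B"


def words_in_range(words, t0, t1):
    """Return the words that overlap the half-open interval [t0, t1)."""
    return [w for w in words if w.end > t0 and w.start < t1]


def label_line(words):
    """Render words as a compact text view an agent could read."""
    return " ".join(f"[{w.start:.2f} {w.speaker}] {w.text}" for w in words)


transcript = [
    Word("hello", 0.0, 0.4, "A"),
    Word("um", 0.5, 0.7, "A"),
    Word("world", 1.2, 1.6, "B"),
]
print(label_line(words_in_range(transcript, 0.45, 1.0)))
```

A text view like this is what lets a language model reason about timing without decoding any frames.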
The system cuts filler words and dead space between takes, automatically color-grades each segment, adds 30-millisecond audio fades at every cut to prevent pops, and burns in subtitles in a configurable style. For animation overlays, it spawns parallel sub-agents that generate content through HyperFrames, Remotion, Manim, or PIL. Before showing anything to the user, the system self-evaluates the rendered output at every cut boundary.
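The cutting step can be sketched as follows, assuming the transcript is available as `(text, start, end)` tuples. The filler list, the `max_gap` threshold, and both function names are assumptions made for illustration, not video-use's actual code. The fade string uses ffmpeg's `afade` filter with its `t`, `st`, and `d` options.

```python
FILLERS = {"um", "uh", "er"}  # hypothetical filler list
FADE_S = 0.030                # 30 ms fade, as described in the article


def keep_segments(words, max_gap=0.5):
    """Merge kept words into (start, end) segments.

    A new segment starts after any filler word, or when the silence since
    the previous kept word exceeds max_gap seconds (an assumed threshold).
    """
    segments = []
    split_next = True  # the first kept word always opens a segment
    for text, start, end in words:
        if text.lower().strip(".,!?") in FILLERS:
            split_next = True  # never merge across a removed filler
            continue
        if not split_next and start - segments[-1][1] <= max_gap:
            segments[-1][1] = end  # extend the current segment
        else:
            segments.append([start, end])
        split_next = False
    return [(s, e) for s, e in segments]


def fade_filter(duration, fade=FADE_S):
    """Build an ffmpeg afade chain that fades a segment in and out."""
    out_start = max(duration - fade, 0.0)
    return (
        f"afade=t=in:st=0:d={fade:.3f},"
        f"afade=t=out:st={out_start:.3f}:d={fade:.3f}"
    )


words = [
    ("So", 0.0, 0.2), ("um", 0.3, 0.5), ("welcome", 0.6, 1.0),
    ("back", 1.05, 1.4), ("today", 3.0, 3.4),
]
for start, end in keep_segments(words):
    print(start, end, fade_filter(end - start))
```

Each filter string would be applied to a segment that has already been trimmed so that its audio starts at zero; the short fades at both ends are what suppress the click at each cut.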
Session memory persists in a project.md file, so returning to a project weeks later picks up where the previous session ended. The tool works with Claude Code, Codex, Hermes, OpenClaw, or any agent with shell access.
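As a rough illustration of file-based session memory, the sketch below appends dated notes to a `project.md` file and reads them back at the start of a later session. The heading layout and both function names are hypothetical; the article does not describe how video-use structures the file.

```python
from datetime import date
from pathlib import Path


def append_session_note(project_dir, note, today=None):
    """Append a dated note to project.md (the layout is an assumption)."""
    path = Path(project_dir) / "project.md"
    stamp = (today or date.today()).isoformat()
    with path.open("a", encoding="utf-8") as f:
        f.write(f"\n## Session {stamp}\n{note.strip()}\n")
    return path


def load_memory(project_dir):
    """Return the accumulated notes, or an empty string for a new project."""
    path = Path(project_dir) / "project.md"
    return path.read_text(encoding="utf-8") if path.exists() else ""
```

Because the memory is a plain Markdown file in the project folder, any agent with shell access can read it, which fits the tool's agent-agnostic design.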
Agents in Creative Production
Most agent frameworks have focused on administrative, coding, or data tasks. Video-use represents a different category: creative production work where the quality of the user-facing output matters as much as whether the task completed. Cutting a talking-head video, assembling a montage, or producing a tutorial all require subjective judgment about pacing, emphasis, and visual flow. The agent handles this by proposing a strategy and waiting for approval before executing, keeping a human in the loop on creative decisions while automating the mechanical work.
The Anthropic Stack noted that the project points to a shift in which production work that used to demand a skilled operator becomes a prompt. For anyone generating programmatic video at scale, whether content creators, marketing teams, or agencies, the tool offers a path to editing without learning a timeline-based editor.
Practical Constraints
Video-use is not a consumer tool. It requires comfort with a command line, an ElevenLabs API key, and ffmpeg installed locally. The quality ceiling depends on the underlying language model’s ability to make good editing decisions from transcript and visual data rather than from watching the footage directly. For simple cuts and assembly work, the transcript-based approach should perform well. For complex creative editing that requires understanding visual composition, motion, or emotional timing, the two-layer reading approach has inherent limitations compared to a human editor watching the footage.
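A quick preflight check for these prerequisites might look like the sketch below. The environment variable name `ELEVENLABS_API_KEY` is an assumption, not something the article confirms, and the function itself is hypothetical rather than part of video-use.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return human-readable problems; an empty list means ready to run.

    env and which are injectable so the check can be tested without
    touching the real environment.
    """
    env = os.environ if env is None else env
    missing = []
    if which("ffmpeg") is None:
        missing.append("ffmpeg not found on PATH")
    # Assumed variable name; check the project's README for the real one.
    if not env.get("ELEVENLABS_API_KEY"):
        missing.append("ELEVENLABS_API_KEY is not set")
    return missing


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```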
The 10,800-star count signals demand for this kind of tool. Whether it holds up under the variety and messiness of real production footage will determine whether it stays a developer experiment or becomes a production workflow.