Our Story · Our Values

PlotPunk.

An easy but controllable tool for video generation. The user directs. The AI crew renders. Episode by episode, frame by frame — without losing the thread.

We believe creativity belongs to humans. The AI handles the plumbing — prompts, retries, asset linking, model selection. The human handles the story, the cast, the camera, the cut. That distinction is the whole point of PlotPunk.

Jury Access

URL: plotpunk.com/auth/login
Email: jury@plotpunk.demo
Password: PlotpunkJury2026!
user_id: 85f288cf-e394-4630-8f0d-a258d83257d2

01 · The Problem We Started With

One-shot prompts are easy. Telling a story across them is the hard part.

Generative video has reached the point where a single prompt can produce something genuinely impressive. The catch is what happens after that first clip. The moment a creator wants a second shot of the same character, a third with a different camera angle, a fourth in a new location at the same time of day — the tools we tried tend to lose the thread. Identity drifts. Wardrobe shifts between cuts. Lighting resets.

The workaround today is to stitch together a dozen separate services — a generator for stills, another for video, a TTS service, a lipsync tool, a non-linear editor, a prompt-engineering scratchpad, a colour grading step — and to spend most of your time managing handoffs between them rather than directing your story. That isn't filmmaking. It's plumbing.

We wanted to know whether a single environment could absorb all of that plumbing without absorbing the creative authorship along with it.

02 · The Idea

The user is the Studio Boss. The AI plays the crew.

PlotPunk is built on a deliberate distinction: the technical complexity belongs to the system, the creative decisions belong to the user. Modern one-shot video tools have a tendency to collapse both into the same prompt box — and in doing so they take the most interesting part of filmmaking, the directing, away from the human and hand it to the model.

We took the opposite stance. The user brings the characters, the story, the visual language, the pacing. The system handles prompt construction, asset linking, model selection, retry logic, and cross-shot context propagation. Bulk operations — multi-aspect exports, social-media dailies, alternate cuts — are supported as expansions of a user-authored idea, never as substitutes for one. A creator who has never written a prompt in their life should be able to realise their vision; a creator who knows exactly what they want should not be forced to surrender control to get it produced.

The mental model we ship: the user is the Studio Boss. AI agents play production-crew roles (a Showrunner, a Casting Director, a Production Designer, a Screenwriter, a DOP, a Sound Designer, an Editor), each with its own persistent chat history and domain-expert system prompt. They propose work. The user approves it, edits it, or rejects it with a reason. The reason flows back into the next iteration, so the persona learns what the boss does not want, not just what they do.

Three approval gates bound cost and protect the creative process from runaway inference spend: G1 after the script, G2 after the keyframes, G3 after final assembly. Nothing expensive runs until the previous step is signed off.
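The gate rule can be sketched as a small in-memory model. This is an illustration only: the class and method names (`GatedPipeline`, `can_run`, `approve`, `reject`) and the simplified stage list are assumptions, not PlotPunk's actual code.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stage order: each gate sits right after the step it signs off.
STAGES = ["script", "G1", "keyframes", "G2", "videos", "assembly", "G3"]
GATES = {"G1", "G2", "G3"}


@dataclass
class GatedPipeline:
    approved: set = field(default_factory=set)    # gates the user has signed off
    feedback: dict = field(default_factory=dict)  # gate -> list of rejection reasons

    def can_run(self, stage: str) -> bool:
        """A stage may run only if every gate before it has been approved."""
        earlier = STAGES[: STAGES.index(stage)]
        return all(g in self.approved for g in earlier if g in GATES)

    def approve(self, gate: str) -> None:
        if gate not in GATES:
            raise ValueError(f"{gate!r} is not a gate")
        if not self.can_run(gate):
            raise RuntimeError(f"an earlier gate is still open before {gate}")
        self.approved.add(gate)

    def reject(self, gate: str, reason: str) -> None:
        """Re-close the gate and keep the reason for the next proposal."""
        if gate not in GATES:
            raise ValueError(f"{gate!r} is not a gate")
        self.approved.discard(gate)
        self.feedback.setdefault(gate, []).append(reason)
```

In this sketch a rejection at G1 blocks keyframe rendering again, and the stored reasons are what a persona would read before proposing its next draft.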

03 · How We Got Here

Three Runway models, one production pipeline.

We did not begin with this architecture. The journey to a working pipeline shaped both what we built and where Runway's API became most useful to us.

The concept came first. We began with GPT-4.1, asking it to take the user's raw idea and produce a Production Concept: a cast, a visual style, the pacing, the target audience, the background information that would later anchor every prompt downstream. The Production Concept became the persistent context every later persona reads from. We later split pre-production into discrete Showrunner, Casting Director, and Production Designer steps so each creative decision has its own conversation and its own approval point.

Keyframes followed. With the Production Concept locked, we asked GPT-4.1 to write image prompts for the start frame of each shot, which we call a keyframe. We tested several Runway image backends and settled on gemini_2.5_flash (colloquially "Nano Banana") as our primary, with gen4_image as a safety fallback. Fed with the previous shot's keyframe plus a camera-only textual instruction, this chain reliably retained character identity, attire, and environment from one shot to the next. The user reviews every keyframe at G2, with an entire episode's frames laid out in a grid, and approves, rejects, or requests specific re-renders before any expensive video generation runs.
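The chaining idea, where each shot receives the previous keyframe plus a camera-only instruction, might look roughly like this. The function name and the injected `render` callable are hypothetical stand-ins for the real image call, and nothing here reflects Runway's request schema.

```python
from typing import Callable, List, Optional

# render(prompt, reference_image) -> image reference.
# Hypothetical stand-in for the real image-generation call.
Render = Callable[[str, Optional[str]], str]


def render_keyframe_chain(opening_prompt: str,
                          camera_instructions: List[str],
                          render: Render) -> List[str]:
    """Render one keyframe per shot, feeding each shot the previous keyframe."""
    # The first shot has no predecessor, so it gets the full descriptive prompt.
    keyframes = [render(opening_prompt, None)]
    for instruction in camera_instructions:
        # Later shots carry only a camera change; identity, attire and
        # environment come from the previous keyframe used as reference.
        keyframes.append(render(instruction, keyframes[-1]))
    return keyframes
```

Because every shot depends on the one before it, this sketch renders sequentially; a rejected keyframe at G2 would mean re-rendering that shot and, possibly, the shots that reference it.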

The dialogue problem

This is where the journey was the most instructive. We evaluated three Runway-backed approaches end-to-end before landing on the right architecture for our use case.

Path 1 · TTS · retired

eleven_multilingual_v2

Runway-proxied ElevenLabs gave us natural audio from user-written lines — but no motion. A speaking character needed pixels too. Dropped from the active pipeline once seedance2 covered both.

Path 2 · Avatars · dropped

gwm1_avatars

Native lipsync from still + audio. We built the full integration. Two deal-breakers: visual fidelity sat below the cinematic bar, and the model imposes a fixed head-and-shoulders camera that cannot be overridden — incompatible with a tool whose point is shot-by-shot framing.

Path 3 · Seedance · live

seedance2 → veo3.1

A single inference produces motion and synchronised speech with full per-shot framing freedom — wide, OTS, tracking, insert. This is the spine of every shot today. When seedance2 moderation rejects a prompt we cascade to veo3.1, which also renders audio natively so dialogue is never lost on fallback.
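A minimal sketch of the moderation cascade, assuming moderation refusals surface as a distinct error. `ModerationRejected`, `generate_clip`, and the injected `generate` callable are hypothetical names for illustration; only the model identifiers come from the text above.

```python
from typing import Callable, Tuple

VIDEO_MODELS = ["seedance2", "veo3.1"]  # cascade order described in the text


class ModerationRejected(Exception):
    """Hypothetical error type for a prompt refused by a model's moderation."""


def generate_clip(prompt: str, keyframe: str,
                  generate: Callable[[str, str, str], str]) -> Tuple[str, str]:
    """Try each model in order and return (model_used, clip)."""
    last_error = None
    for model in VIDEO_MODELS:
        try:
            return model, generate(model, prompt, keyframe)
        except ModerationRejected as err:
            # Only moderation refusals trigger the cascade;
            # any other error propagates to the caller unchanged.
            last_error = err
    raise ModerationRejected(f"all models refused the prompt: {last_error}")
```

Recording which model produced the clip lets later steps know whether a fallback render is in the cut.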

The trade we accepted: the exact spoken words come from the model, not the writer. That is the one creative authority we do not yet hand back to the user — and it is the single limit the rest of this submission orbits around.

04 · The System Today

Eight steps, three gates, one Runway key.

The user moves through a single web application:

Concept → Casting → Style → Script → G1 → Shots → Keyframes → G2 → Videos → Audio → Assembly → G3

Under the hood, Inngest orchestrates every long-running inference call as a durable, retry-aware step. Each Runway call uses a kickoff + poll pattern so no single network round-trip exceeds a few seconds — long generations run reliably without hitting edge-function timeouts. Coolify deploys the whole stack to a Hetzner VPS.
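The kickoff + poll pattern can be sketched as below. The function name, the `state` field, and the `SUCCEEDED` / `FAILED` values are assumptions for illustration rather than Runway's actual response schema, and in an Inngest-orchestrated system each poll would more likely be its own durable step than an iteration of an in-process loop.

```python
import time
from typing import Callable


def run_task(kickoff: Callable[[], str],
             get_status: Callable[[str], dict],
             interval_s: float = 5.0,
             max_polls: int = 120,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Start a task, then check on it with short requests until it finishes."""
    task_id = kickoff()                   # one short request: returns an id immediately
    for _ in range(max_polls):
        status = get_status(task_id)      # one short request per poll
        if status["state"] == "SUCCEEDED":
            return status
        if status["state"] == "FAILED":
            raise RuntimeError(f"task {task_id} failed: {status.get('error')}")
        sleep(interval_s)                 # wait between polls instead of holding a connection
    raise TimeoutError(f"task {task_id} still running after {max_polls} polls")
```

No single call in this sketch waits for the generation itself, which is what keeps each round-trip short regardless of how long the render takes.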

A small dedicated FFmpeg service handles post-production: per-shot concat, ambience mixing, libass subtitle burn-in, and parallel multi-aspect export (16:9 plus 9:16 plus 1:1 from a single generation pass). The generative layer is consolidated behind a Runway API key pool for image and video — gemini_2.5_flash and gen4_image for keyframes, seedance2 with veo3.1 as audio-preserving fallback for clips. The only second provider in the system is OpenAI for the LLM layer.
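As an illustration of the multi-aspect export, the sketch below builds one FFmpeg command per target aspect from a single source file. It assumes a 16:9 source and a plain centre crop; the function name, the output naming scheme, and the choice of crop over any smarter reframing are assumptions, not a description of the actual service.

```python
from typing import List

# Crop filters for a 16:9 source. FFmpeg's crop filter centres the
# window when no x/y offset is given.
CROPS = {
    "16x9": None,               # source aspect, no crop needed
    "9x16": "crop=ih*9/16:ih",  # vertical
    "1x1": "crop=ih:ih",        # square
}


def export_commands(source: str, stem: str) -> List[List[str]]:
    """Return one ffmpeg argument list per target aspect ratio."""
    commands = []
    for name, crop in CROPS.items():
        cmd = ["ffmpeg", "-y", "-i", source]
        if crop:
            cmd += ["-vf", crop, "-c:a", "copy"]  # re-encode video, keep audio as-is
        else:
            cmd += ["-c", "copy"]                 # same aspect: copy all streams
        cmd.append(f"{stem}_{name}.mp4")
        commands.append(cmd)
    return commands
```

Each argument list can be passed to `subprocess.run`, and since the three exports are independent they could be launched concurrently.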

The asset library — characters, voices, styles, reference images — is account-scoped, so a creator who has built a recurring cast in one project can pull them straight into the next. Bindings into projects are time-scoped and snapshotted, so a generated asset records the exact inputs it was made from and is never retroactively invalidated by an upstream edit. This is what lets a creator iterate confidently over weeks without their earlier work silently shifting under them.
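One way to picture the snapshot behaviour: when an asset is generated, its inputs are serialised and frozen alongside it, so later edits to the library entry cannot alter the record. The names below (`AssetSnapshot`, `snapshot`) and the use of a SHA-256 fingerprint are assumptions made for this sketch.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetSnapshot:
    asset_id: str
    inputs_json: str   # serialised copy of the inputs at generation time
    fingerprint: str   # hash of those inputs, useful for detecting drift


def snapshot(asset_id: str, inputs: dict) -> AssetSnapshot:
    """Freeze the inputs an asset was generated from."""
    # Serialising copies the data, so later edits to `inputs` cannot leak in.
    frozen = json.dumps(inputs, sort_keys=True)
    digest = hashlib.sha256(frozen.encode("utf-8")).hexdigest()
    return AssetSnapshot(asset_id, frozen, digest)
```

Comparing the stored fingerprint with one computed from the current library entry shows whether an upstream edit has happened since generation, without invalidating the existing asset.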

05 · What's Next

One feature could close the last hole.

The work surfaced a productive limit in today's API-accessible generative video — and a concrete opportunity we're hopeful Runway will unlock next. The three properties a fully directable narrative pipeline needs:

(1) Arbitrary user-provided speech audio: the exact words and performance the writer authored.
(2) Cinematic visual fidelity: at the bar that lets generated clips live next to a storyboard.
(3) Unconstrained camera framing: wide, OTS, tracking, insert, the full DOP vocabulary.

No single model we tested clears all three. TTS hits (1) but gives no moving image. Avatar models hit (1) and a degraded (2), but lock framing — you lose (3). seedance2 and veo3.1 hit (2) and (3) beautifully — but the spoken words come from the model, not the writer. That is the gap. In our view it is the single most useful feature Runway could add in the next year: a video model that accepts an audio waveform as input alongside a reference image and a free-form camera instruction. Such a model would close the last hole in a fully directable narrative-video pipeline.

On our side, the roadmap is full of additions we've already scoped: a Genre-Bender that remixes an entire episode into noir, silent film, or pulp horror with one click; per-scene alt-takes the user can pick from side-by-side; an optional nature-documentary narrator track over the final cut.

PlotPunk is what a generative video tool looks like when you take seriously the idea that filmmaking is directing, not prompt-writing. The technical complexity sits with the system. The creative work is exactly where it belongs: with the human in the chair. The Studio Boss approves the work. The AI crew renders it. That distinction is the whole point.


Calluna Labs · Hamburg · plotpunk.com