Text-to-Video AI: How It Works & What It Costs

Text-to-video AI turns a written prompt into a short video clip — no camera, no footage, no editing timeline. You describe a scene; the model generates it. The quality you get depends almost entirely on how you write the prompt. Here's how it works, how to prompt it well, and what a clip actually costs.

TL;DR

What it is: generate a 5–12 second clip from a text description alone.
How to prompt: cover five parts — subject, action, camera, setting, style. Vague prompts get vague video.
Cost: ~300 credits for a 5-second 720p clip on Seedance 2; the cost is shown before you run. Free credits to start.
When to use image-to-video instead: if you already have a photo of the look you want, animating it gives more control for fewer credits.

[Figure: Anatomy of a good text-to-video prompt: subject, action, camera, setting, style]

How text-to-video AI works

You write a prompt; a video model generates frames that match it, with motion that's coherent from one frame to the next. Modern models (like Seedance 2) handle physical motion, camera moves, and lighting reasonably well, and can add audio. You don't direct it frame by frame — you describe the shot, and the model interprets it. That's why the prompt is the whole game: the model fills in everything you don't specify, and its guesses are generic.

How to write a prompt that works

Cover these five parts and you remove most of the guesswork:

Subject — who or what. "A red fox," not "an animal."
Action — what it does. "Trotting through fresh snow."
Camera — the shot and move. "Slow dolly-in, low angle." This is the most-skipped part and the one that most changes the result.
Setting — where. "Pine forest at dawn."
Style / lighting — the look. "Soft morning light, cinematic, shallow depth of field."

Put together: "A red fox trotting through fresh snow, slow dolly-in at a low angle, pine forest at dawn, soft morning light, cinematic, shallow depth of field." That gives the model a clear shot to build. Compare it to "a cinematic fox video" — which leaves everything to chance.
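The five-part structure can be sketched as a small prompt builder. This is only an illustration: `ShotPrompt` and `render` are hypothetical names, not part of Seedance 2 or any real API, and the comma-joined layout is just one reasonable way to assemble the parts.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Hypothetical container for the five prompt parts (not a real API)."""
    subject: str
    action: str
    camera: str
    setting: str
    style: str

    def render(self) -> str:
        # Refuse to build a prompt if any of the five parts is empty.
        missing = [name for name, value in vars(self).items() if not value.strip()]
        if missing:
            raise ValueError(f"prompt is missing: {', '.join(missing)}")
        # Subject and action read as one phrase; the other parts follow, comma-separated.
        parts = [f"{self.subject.strip()} {self.action.strip()}",
                 self.camera, self.setting, self.style]
        return ", ".join(p.strip() for p in parts) + "."


fox = ShotPrompt(
    subject="A red fox",
    action="trotting through fresh snow",
    camera="slow dolly-in at a low angle",
    setting="pine forest at dawn",
    style="soft morning light, cinematic, shallow depth of field",
)
print(fox.render())
# A red fox trotting through fresh snow, slow dolly-in at a low angle,
# pine forest at dawn, soft morning light, cinematic, shallow depth of field.
# (printed on one line)
```

Keeping the parts in separate fields makes it easy to change one thing at a time, for example swapping only the camera move between drafts.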

More tips:

One clear action beats three competing ones.
Name the camera move explicitly (pan, dolly, orbit, static) — "static shot" and "slow orbit" feel completely different.
Add lighting and time of day; they drive the mood.
Draft cheaply, then finalize (see costs below).
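The "name the camera move" tip can be checked mechanically before you spend credits. The sketch below is an assumption-laden helper, not a feature of any video model: `named_camera_moves` is a hypothetical function, and its vocabulary is limited to the four moves listed above.

```python
import re

# Matches the four camera moves named in the tips, plus simple inflections
# such as "pans", "panning", or "orbiting". Word boundaries keep "panda"
# and "Japan" from counting as "pan".
_CAMERA_PATTERN = re.compile(r"\b(pan|dolly|orbit|static)(?:s|ning|ing)?\b")


def named_camera_moves(prompt: str) -> list[str]:
    """Return the camera moves explicitly named in a prompt (hypothetical helper)."""
    return _CAMERA_PATTERN.findall(prompt.lower())


print(named_camera_moves("A red fox trotting through fresh snow, slow dolly-in at a low angle"))
# ['dolly']
print(named_camera_moves("a cinematic fox video"))
# []
```

An empty result suggests the prompt leaves the camera to the model's defaults.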

What text-to-video costs

On Seedance 2, generation runs on a single credit pool and the cost is previewed before each run:

Mode (5-second clip, 720p)	Approx. credits
Text-to-video — Seedance 2	~300
Text-to-video — Seedance 2 Fast / 2.0 Mini	~210
Image-to-video — Seedance 2	~150

Credits scale with resolution, duration, and audio. A practical workflow: iterate your prompt on the cheaper Fast / 2.0 Mini tier (~210 credits), then re-render the winner on standard Seedance 2. New accounts start with free credits; after that, the cheapest pack is $29.90 for 3,150 credits, enough for about ten standard text-to-video clips at ~300 credits each.
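The arithmetic behind these figures can be sketched as follows. The credit counts and pack price are the approximate values quoted in this article; the function and key names are hypothetical, and real costs vary with resolution, duration, and audio.

```python
# Approximate figures from this article (5-second 720p clips, cheapest pack).
PACK_PRICE_USD = 29.90
PACK_CREDITS = 3150

CREDITS_PER_CLIP = {
    "text_to_video": 300,       # standard Seedance 2
    "text_to_video_fast": 210,  # Fast / 2.0 Mini tier
    "image_to_video": 150,
}


def clip_cost_usd(mode: str) -> float:
    """Dollar cost of one clip at the cheapest pack rate."""
    return CREDITS_PER_CLIP[mode] * PACK_PRICE_USD / PACK_CREDITS


def clips_per_pack(mode: str) -> int:
    """Whole clips one pack covers in the given mode."""
    return PACK_CREDITS // CREDITS_PER_CLIP[mode]


def workflow_credits(drafts: int, finals: int = 1) -> int:
    """Credits for drafting on the Fast tier, then rendering finals on standard."""
    return (drafts * CREDITS_PER_CLIP["text_to_video_fast"]
            + finals * CREDITS_PER_CLIP["text_to_video"])


print(f"${clip_cost_usd('text_to_video'):.2f}")  # $2.85
print(clips_per_pack("text_to_video"))           # 10
print(workflow_credits(drafts=3))                # 930
```

Three Fast drafts plus one standard render come to about 930 credits, against roughly 1,200 for four standard renders.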

Text-to-video vs image-to-video — which to use

Text-to-video when you're starting from an idea and want the model to build the whole scene (~300 credits).
Image-to-video when you already have a photo of the subject or look. Animating it gives tighter control over the result for fewer credits (~150).

Many creators use both: generate a base look, then animate the best frame.

FAQ

What is text-to-video AI?
Software that generates a short video clip from a written description, with no footage or editing required.

How much does it cost?
On Seedance 2, about 300 credits for a 5-second 720p clip (~$2.85 at the cheapest pack rate). The cost is shown before you generate, and failed runs aren't charged.

Is text-to-video free?
New accounts get free credits to start. After that it's pay-as-you-go credit packs — no subscription required.

Why is my video generic or off-prompt?
The prompt is probably under-specified. Add the camera move, setting, and lighting, and keep to one clear action.

Can it generate audio?
Yes, on settings that support it. Seedance 2 can add audio to a clip; enabling it raises the credit cost.