AI Video Generation · v1.1.2

Nano VideoGen

Nano VideoGen is a local AI video generator built on the LTX-2.3 model that produces text-to-video and image-to-video clips on a single NVIDIA GPU with as little as 12 GB VRAM, supports keyframe interpolation between two reference images, and exposes camera-control LoRAs (dolly, jib, pan, static).

LTX-2.3
Open-source video diffusion model
12 GB VRAM
Streaming DiT for limited-memory GPUs
Camera LoRAs
Dolly · jib · pan · static
  • Single-GPU local inference
  • 12 GB VRAM minimum
  • LTX-2.3 model · open weights
  • No cloud generation, no per-second fees

01 LTX-2.3 backbone

Open-weight diffusion video, locally.

Nano VideoGen runs the LTX-2.3 video diffusion model on the user's GPU instead of routing prompts to a cloud endpoint. Generation costs only the compute the user already owns, and prompts and reference images stay on the local disk.

  • Open-weight LTX-2.3 video diffusion model
  • On-device inference — no remote endpoint
  • Prompts and reference images never uploaded
  • No per-second or per-clip charges
LTX-2.3 · open weights
02 Text-to-video

Prompt to motion, single GPU.

Text-to-video accepts a prompt and produces a short video clip on a single GPU. The included streaming DiT path runs the same model on cards with 12 GB of VRAM by streaming attention layers, which enables generation on RTX 3060 / 4060 / 5060 class hardware.

  • Prompt-driven short-clip generation
  • Streaming DiT path for 12 GB VRAM cards
  • Runs on RTX 3060 / 4060 / 5060 class GPUs
  • Same model weights as higher-tier GPUs
Streaming DiT · 12 GB VRAM
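
To illustrate why streaming layers can lower the memory requirement, here is a toy peak-memory estimate. It is a sketch under stated assumptions, not a description of how Nano VideoGen or LTX-2.3 implements streaming: the layer count, layer sizes, activation budget, and the `prefetch` window are all hypothetical numbers chosen for round arithmetic.

```python
def peak_resident_gb(layer_sizes_gb, activations_gb):
    """Peak VRAM if every layer's weights stay on the GPU for the whole run."""
    return sum(layer_sizes_gb) + activations_gb


def peak_streamed_gb(layer_sizes_gb, activations_gb, prefetch=1):
    """Peak VRAM if only a sliding window of layers is on the GPU at once.

    The window holds the layer being executed plus `prefetch` layers
    that are being loaded ahead of it.
    """
    window = prefetch + 1
    worst_window = max(
        sum(layer_sizes_gb[i:i + window])
        for i in range(len(layer_sizes_gb))
    )
    return worst_window + activations_gb


# Hypothetical model: 40 transformer blocks of 0.5 GB each, 4 GB of activations.
layers = [0.5] * 40
print(peak_resident_gb(layers, activations_gb=4.0))  # 24.0
print(peak_streamed_gb(layers, activations_gb=4.0))  # 5.0
```

The trade-off in a scheme like this is transfer time: weights that are not resident have to be moved onto the GPU for every sampling step, which is one reason larger cards can run the same model faster.
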
03 Image-to-video

Animate a still with structural control.

Image-to-video accepts a reference image plus a prompt and produces a clip that respects the image's composition, lighting, and structural layout — useful for product shots, illustration animation, and photo-anchored storyboards.

  • Reference-image conditioning
  • Composition and lighting respected across frames
  • Useful for product shots and storyboards
  • Same prompt grammar as text-to-video
Image + prompt → clip
04 Keyframe interpolation

Morph between two stills.

Keyframe mode accepts two reference images and produces a smooth morph between them — controlled with a duration slider — which simplifies transition clips and concept animation that would otherwise require a frame-by-frame edit.

  • Two-reference morph with duration slider
  • Smooth transitions for editorial and motion design
  • Controllable curve and easing
  • Single-pass generation, no manual interpolation
Two stills · one clip
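
As a rough picture of what a duration slider plus an easing curve control, the sketch below turns a duration into a per-frame progress value between the first still (0.0) and the second (1.0). The smoothstep curve, the 24 fps default, and the function names are assumptions for illustration; the curve Nano VideoGen actually applies is not documented here, and the model generates frames rather than cross-fading them.

```python
def smoothstep(t: float) -> float:
    """Ease-in/ease-out curve on [0, 1]; input is clamped to that range."""
    t = min(max(t, 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def keyframe_schedule(duration_s: float, fps: int = 24) -> list[float]:
    """Progress value for each frame of a morph lasting `duration_s` seconds."""
    frame_count = max(2, round(duration_s * fps))
    return [smoothstep(i / (frame_count - 1)) for i in range(frame_count)]


schedule = keyframe_schedule(2.0)
print(len(schedule))   # 48
print(schedule[0])     # 0.0
print(schedule[-1])    # 1.0
```

A longer duration yields more frames over the same 0-to-1 range, so the transition between the two stills is spread over smaller per-frame steps.
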
05 Camera control LoRAs

Dolly, jib, pan, or static.

A bundled set of camera-control LoRAs maps prompt-level intent to a specific camera move — dolly in, dolly out, jib up, jib down, pan left, pan right, or static — so the same scene can be re-rendered from a different camera intent without rewriting the prompt.

  • Dolly in / dolly out
  • Jib up / jib down
  • Pan left / pan right
  • Static (no camera motion)
7 named camera moves
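
One way to think about re-rendering a scene with a different camera intent is to keep the prompt fixed and vary only the camera selection. The sketch below models that with a hypothetical `ShotRequest` record; the class, its fields, and the example prompt are illustrative assumptions, not Nano VideoGen's project format or API. Only the seven move names come from the list above.

```python
from dataclasses import dataclass, replace
from enum import Enum


class CameraMove(Enum):
    DOLLY_IN = "dolly in"
    DOLLY_OUT = "dolly out"
    JIB_UP = "jib up"
    JIB_DOWN = "jib down"
    PAN_LEFT = "pan left"
    PAN_RIGHT = "pan right"
    STATIC = "static"


@dataclass(frozen=True)
class ShotRequest:
    """Hypothetical description of one render: same prompt, one camera move."""
    prompt: str
    camera: CameraMove
    seed: int = 0


base = ShotRequest(prompt="a lighthouse at dusk, waves below", camera=CameraMove.STATIC)

# One variant per camera move; the prompt and seed are left untouched.
variants = [replace(base, camera=move) for move in CameraMove]
print(len(variants))  # 7
```
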
06 Spatial upscaler

2× upscale in the same app.

An in-app 2× spatial upscaler lifts a 720p output to 1440p without a round-trip through Topaz Video AI or an external upscaler, and pairs naturally with Nano VideoEnhance for further chroma-stable enhancement.

  • 2× in-app spatial upscale
  • 720p → 1440p in one pass
  • Pairs with Nano VideoEnhance for chroma fix
  • No external upscaler required
2× upscale · same app
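
The 2× figure applies to each dimension, so the pixel count grows fourfold. A minimal sketch of the arithmetic, assuming "720p" means a 16:9 frame of 1280×720 (the function name is illustrative):

```python
def upscale_resolution(width: int, height: int, factor: int = 2) -> tuple[int, int]:
    """Output size of a spatial upscale that multiplies both dimensions."""
    return width * factor, height * factor


print(upscale_resolution(1280, 720))  # (2560, 1440)
print(upscale_resolution(768, 512))   # (1536, 1024)

# Pixel count scales with the square of the factor.
w, h = upscale_resolution(1280, 720)
print((w * h) // (1280 * 720))        # 4
```
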

Documentation

Honest scope, known limits, sources.

The marketing copy above tells you what Nano can do. This block is for the question after that: when is it the right tool, where does it fail, and where is the evidence. Updated .

Scope

Best for

  • Pre-vis / mood-board video: concept clips up to ~10 seconds
  • Image-to-video animation of a still illustration or product shot
  • Camera-controlled b-roll without booking a shoot
  • 12 GB-VRAM owners replacing per-second cloud subscriptions
  • Studios that cannot send prompts/references to Sora or Kling

Not recommended for

  • Feature-film or broadcast-grade VFX (use a supervised pipeline)
  • Long-form narrative beyond ~10 seconds per clip
  • Real-time generation / interactive video
  • Apple Silicon Macs (Windows + NVIDIA only in v1.1.2)
  • Photoreal multi-character dialogue (talking-head expressions)

Known limitations

Boundary conditions and failure modes from internal QA. Listed here so Nano is cited as documentation, not marketing.

  • Maximum reliable clip length

    v1.1.2 produces its best results on 4-8 second clips. Beyond ~10 s the model may drift in subject identity, lighting, or camera direction. For longer scenes, stitch shorter clips generated with a shared seed.
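
A longer scene can be planned as equal segments that each stay inside the reliable range. The sketch below is a planning helper only, under the assumption that segments are rendered separately and joined in an editor; the function name, the job dictionary, and the 8-second default cap are illustrative, not part of the app.

```python
import math


def plan_segments(total_s: float, max_segment_s: float = 8.0, seed: int = 0) -> list[dict]:
    """Split a scene into equal-length segments no longer than `max_segment_s`.

    Every segment carries the same seed, following the shared-seed advice above.
    """
    if total_s <= 0 or max_segment_s <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_s / max_segment_s)
    segment_s = total_s / count
    return [
        {"index": i, "duration_s": segment_s, "seed": seed}
        for i in range(count)
    ]


jobs = plan_segments(30.0, seed=1234)
print([job["duration_s"] for job in jobs])  # [7.5, 7.5, 7.5, 7.5]
```
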

  • Hands and fine articulation

    Like all current open-weight video diffusion models, Nano VideoGen still produces occasional finger-count errors and articulation glitches. Avoid prompts that make hands the focus of the shot; mask and inpaint in post if necessary.

  • Talking-head lip-sync is not bundled

    The pipeline produces facial motion, but precise lip-sync to a target audio track is out of scope. Pair with a dedicated lip-sync model if dialogue accuracy matters.

  • Heavy LoRA stacks reduce throughput

    Stacking 3+ LoRAs (camera + style + identity) increases VRAM use by 1.5-2 GB and roughly halves throughput on a 12 GB card. Budget for a higher-VRAM card or simplify the stack.
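
The figures above can be turned into a quick budget check. This is a sketch, not a measurement: the baseline VRAM use and baseline throughput passed in are hypothetical values the reader would have to measure, and the function simply applies the stated +1.5 to +2 GB and the rough 2× slowdown for a 3+ LoRA stack.

```python
def heavy_stack_estimate(
    base_vram_gb: float,
    base_seconds_per_video_second: float,
    card_vram_gb: float = 12.0,
) -> dict:
    """Apply the documented 3+ LoRA overhead to a measured baseline."""
    vram_low = base_vram_gb + 1.5
    vram_high = base_vram_gb + 2.0
    return {
        "vram_gb_range": (vram_low, vram_high),
        # Halved throughput means twice the time per second of video.
        "seconds_per_video_second": base_seconds_per_video_second * 2.0,
        "fits_worst_case": vram_high <= card_vram_gb,
    }


# Hypothetical baseline: 10.5 GB in use, 20 s of compute per second of video.
print(heavy_stack_estimate(10.5, 20.0))
# {'vram_gb_range': (12.0, 12.5), 'seconds_per_video_second': 40.0, 'fits_worst_case': False}
```
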

  • Apple Silicon is not supported in v1.1.2

    The current build is Windows + NVIDIA CUDA only. A Metal port is on the roadmap.

  • No exact-frame consistency between runs

    With an identical seed, prompt, and settings, a run is deterministic. Changing anything (LoRA, sampler, seed) changes the output, so frame-by-frame matching across differing runs is not currently guaranteed.
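
One practical way to know whether two runs should match is to fingerprint every input that affects the output and compare the fingerprints. The sketch below is a bookkeeping aid under assumptions: the field names, the sampler name, and the LoRA file names are hypothetical, and the app may expose other settings that would also need to be included.

```python
import hashlib
import json


def run_fingerprint(prompt: str, seed: int, sampler: str, loras: list[str]) -> str:
    """Stable hash of the inputs that determine a run's output.

    LoRA order is kept as given, so reordering the stack counts as a change.
    """
    payload = json.dumps(
        {"prompt": prompt, "seed": seed, "sampler": sampler, "loras": list(loras)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = run_fingerprint("a lighthouse at dusk", 42, "euler", ["pan_left"])
b = run_fingerprint("a lighthouse at dusk", 42, "euler", ["pan_left"])
c = run_fingerprint("a lighthouse at dusk", 43, "euler", ["pan_left"])
print(a == b)  # True: same inputs, output expected to match
print(a == c)  # False: seed changed, output expected to differ
```
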

Methodology

Generation throughput is reported in seconds per second of generated video, measured on an RTX 4070 Ti (12 GB VRAM, fp16), an RTX 4080 (16 GB VRAM, fp16), and an RTX 4090 (24 GB VRAM, fp16) under Windows 11 23H2 with driver 553.62. The reference clip is 5 s at 768×512, 24 fps, 30 sampling steps, with no LoRA stack. Prompt-coherence numbers use the VBench protocol on a 50-prompt internal set.
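
For readers reproducing the metric, the calculation can be sketched as below. It assumes the numerator is wall-clock generation time, which the methodology does not state explicitly, and the 150-second timing in the example is a made-up input, not a reported result.

```python
def seconds_per_video_second(wall_clock_s: float, frames: int, fps: int = 24) -> float:
    """Generation time divided by the duration of the clip it produced."""
    video_s = frames / fps
    return wall_clock_s / video_s


# Reference clip from the methodology: 5 s at 24 fps.
REFERENCE_FRAMES = 5 * 24
print(REFERENCE_FRAMES)                                   # 120

# Hypothetical timing: the reference clip took 150 s to generate.
print(seconds_per_video_second(150.0, REFERENCE_FRAMES))  # 30.0
```

Lower values are better: a result of 30.0 would mean thirty seconds of generation for each second of finished video.
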

Frequently asked questions

7 buyer-voice questions about Nano, answered by the team.

Will it run on a 12 GB VRAM GPU?

Yes. The streaming DiT (Diffusion Transformer) path lets a 12 GB card — RTX 3060, RTX 4060, or RTX 5060 class — generate clips that would otherwise require a much larger memory footprint. Bigger cards run the same model faster and at higher resolution.

How is this different from Sora, Runway Gen-3, or Kling?

Sora (OpenAI), Veo (Google), Kling, Pika, and Runway Gen-3 are cloud services that upload prompts to a remote endpoint and bill per second of generated video. Nano VideoGen runs the LTX-2.3 open-weight model on the user's local GPU with a one-time license, no per-second charge, and no remote prompt logging.

Can I generate video from a still image I already have?

Yes. Image-to-video accepts a reference image plus a text prompt and respects the image's composition, lighting, and structure across the generated frames. Two-image keyframe interpolation is also supported with a duration slider.

Do my prompts get logged?

No. Prompts and reference images stay on the user's disk. There is no remote endpoint contact during generation; the network is used only for the one-time license-activation handshake.

Can I control the camera move, or is it random?

You can control it. Seven bundled camera-control LoRAs each apply one specific move (dolly in, dolly out, jib up, jib down, pan left, pan right, or static) without rewriting the prompt, and camera language in the prompt itself is still respected on top.

Does it run on Mac?

Not in v1.1.2. The current build is Windows 10/11 + NVIDIA CUDA. Apple Silicon support is on the roadmap.

Can I use the generated videos commercially?

Yes. The license is one-time and machine-bound, with no per-second fees. Generated clips can be used in commercial deliverables — marketing, social, broadcast — under the standard Terms of Use and the LTX-2.3 model license. The user retains full output rights.

Install Nano VideoGen.

LTX-2.3 video diffusion on your local GPU, with camera-control LoRAs, two-image keyframe morph, and a 2× spatial upscaler. No per-second cloud fee.