AI Video Generation · v1.1.2

Nano VideoGen

Nano VideoGen is a local AI video generator built on the LTX-2.3 model that produces text-to-video and image-to-video clips on a single NVIDIA GPU with as little as 12 GB VRAM, supports keyframe interpolation between two reference images, and exposes camera-control LoRAs (dolly, jib, pan, static).

LTX-2.3

Open-source video diffusion model

12 GB VRAM

Streaming DiT for limited-memory GPUs

Camera LoRAs

Dolly · jib · pan · static

Single-GPU local inference
12 GB VRAM minimum
LTX-2.3 model · open weights
No cloud generation, no per-second fees

01 LTX-2.3 backbone

Open-weight diffusion video, locally.

Nano VideoGen runs the LTX-2.3 video diffusion model on the user's GPU instead of routing prompts to a cloud endpoint, so generation cost is the cost of compute the user already owns and prompts and reference images stay on the local disk.

Open-weight LTX-2.3 video diffusion model
On-device inference — no remote endpoint
Prompts and reference images never uploaded
No per-second or per-clip charges

LTX-2.3 · open weights

02 Text-to-video

Prompt to motion, single GPU.

Text-to-video accepts a prompt and produces a short video clip on a single GPU; the included streaming DiT path runs the same model on cards with 12 GB of VRAM by streaming attention layers, enabling generation on RTX 3060 / 4060 / 5060 class hardware.

Prompt-driven short-clip generation
Streaming DiT path for 12 GB VRAM cards
Runs on RTX 3060 / 4060 / 5060 class GPUs
Same model weights as higher-tier GPUs

Streaming DiT · 12 GB VRAM
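How the streaming path is implemented is not documented on this page. As a rough mental model only, the sketch below assumes "streaming" means keeping a limited number of equally sized transformer blocks resident on the GPU and copying the rest in as each denoising step reaches them. The function name, the block count, and the sizes are all hypothetical and are not LTX-2.3 figures.

```python
from dataclasses import dataclass


@dataclass
class StreamPlan:
    resident_blocks: int     # blocks held on the GPU at once
    peak_weights_gb: float   # weight memory while the window is full
    loads_per_step: int      # host-to-GPU block copies per denoising step


def plan_block_streaming(num_blocks: int, block_gb: float, budget_gb: float) -> StreamPlan:
    """Size a sliding window of blocks that fits a weight-memory budget.

    Assumption: all blocks are the same size, and when the model does not
    fit, every block is copied to the GPU once per denoising step.
    """
    if block_gb <= 0 or budget_gb < block_gb:
        raise ValueError("budget must hold at least one block")
    resident = min(num_blocks, int(budget_gb // block_gb))
    # Nothing is streamed when the whole model fits in the budget.
    loads = 0 if resident == num_blocks else num_blocks
    return StreamPlan(resident, resident * block_gb, loads)


# Hypothetical numbers: 48 blocks of 0.5 GB with 8 GB set aside for weights.
plan = plan_block_streaming(num_blocks=48, block_gb=0.5, budget_gb=8.0)
print(plan)
```

The trade-off this illustrates is general: a smaller resident window lowers peak memory but adds transfer work to every step, which is consistent with larger cards running the same weights faster.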

03 Image-to-video

Animate a still with structural control.

Image-to-video accepts a reference image plus a prompt and produces a clip that respects the image's composition, lighting, and structural layout — useful for product shots, illustration animation, and photo-anchored storyboards.

Reference-image conditioning
Composition and lighting respected across frames
Useful for product shots and storyboards
Same prompt grammar as text-to-video

Image + prompt → clip

04 Keyframe interpolation

Morph between two stills.

Keyframe mode accepts two reference images and produces a smooth morph between them — controlled with a duration slider — which simplifies transition clips and concept animation that would otherwise require a frame-by-frame edit.

Two-reference morph with duration slider
Smooth transitions for editorial and motion design
Controllable curve and easing
Single-pass generation, no manual interpolation

Two stills · one clip
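For readers unfamiliar with easing, the sketch below shows what an ease-in-out curve looks like as a per-frame schedule from the first still (0.0) to the second (1.0). It only illustrates the concept: the curve Nano VideoGen actually applies, and how the model uses it, are not specified here, and `ease_in_out` and `blend_schedule` are hypothetical names.

```python
def ease_in_out(t: float) -> float:
    """Smoothstep easing: 0 at t=0, 1 at t=1, with zero slope at both ends."""
    t = min(1.0, max(0.0, t))
    return t * t * (3.0 - 2.0 * t)


def blend_schedule(duration_s: float, fps: int) -> list[float]:
    """Per-frame progress from the first keyframe (0.0) to the second (1.0)."""
    frames = max(2, round(duration_s * fps))
    return [ease_in_out(i / (frames - 1)) for i in range(frames)]


schedule = blend_schedule(duration_s=2.0, fps=24)
print(len(schedule), schedule[0], schedule[-1])
```

A longer duration spreads the same curve over more frames, so the transition starts and ends gently regardless of clip length.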

05 Camera control LoRAs

Dolly, jib, pan, or static.

A bundled set of camera-control LoRAs maps prompt-level intent to a specific camera move — dolly in, dolly out, jib up, jib down, pan left, pan right, or static — so the same scene can be re-rendered from a different camera intent without rewriting the prompt.

Dolly in / dolly out
Jib up / jib down
Pan left / pan right
Static (no camera motion)

7 named camera moves
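For background, the general LoRA formulation from Hu et al. (2021) is reproduced below. It describes the technique in general; how the bundled camera LoRAs were trained, and which layers they modify, is not stated on this page.

```latex
% A frozen weight matrix W_0 is adapted by a trainable low-rank update.
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Because only the small matrices $A$ and $B$ differ between adapters, switching from one camera move to another means swapping a small adapter while the base model weights stay the same.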

06 Spatial upscaler

2× upscale in the same app.

An in-app 2× spatial upscaler lifts a 720p output to 1440p without a round-trip through Topaz Video AI or an external upscaler, and pairs naturally with Nano VideoEnhance for further chroma-stable enhancement.

2× in-app spatial upscale
720p → 1440p in one pass
Pairs with Nano VideoEnhance for chroma fix
No external upscaler required

2× upscale · same app
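As a quick check on what a 2× spatial upscale implies for frame size, the arithmetic can be sketched as follows. The 1280×720 input is an assumption about what "720p" denotes here, and `upscale_dims` is an illustrative helper, not part of the app.

```python
def upscale_dims(width: int, height: int, factor: int = 2) -> tuple[int, int]:
    """Scale both spatial dimensions by the same integer factor."""
    return width * factor, height * factor


src = (1280, 720)                 # assumed 720p frame size
dst = upscale_dims(*src)          # both sides doubled
pixel_ratio = (dst[0] * dst[1]) // (src[0] * src[1])
print(dst, pixel_ratio)
```

Doubling each side quadruples the pixel count per frame, which is worth keeping in mind when budgeting disk space for upscaled output.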

Documentation

Honest scope, known limits, sources.

The marketing copy above tells you what Nano can do. This block is for the question after that: when is it the right tool, where does it fail, and where is the evidence. Updated 2026-05-29.

Scope

Best for

Pre-vis / mood-board video for 5-10 second concept clips
Image-to-video animation of a still illustration or product shot
Camera-controlled b-roll without booking a shoot
12 GB-VRAM owners replacing per-second cloud subscriptions
Studios that cannot send prompts/references to Sora or Kling

Not recommended for

Feature-film or broadcast-grade VFX (use a supervised pipeline)
Long-form narrative beyond ~10 seconds per clip
Real-time generation / interactive video
Apple Silicon Macs (Windows + NVIDIA only in v1.1.2)
Photoreal multi-character dialogue (talking-head expressions)

Known limitations

Boundary conditions and failure modes from internal QA, listed here so this page can be cited as documentation rather than marketing.

Maximum reliable clip length
v1.1.2 produces its best results with 4-8 second clips. Beyond ~10 s the model may drift in subject identity, lighting, or camera direction. For longer scenes, stitch shorter clips generated with a shared seed.
Hands and fine articulation
Like all current open-weight video diffusion models, Nano VideoGen still produces occasional finger-count errors and articulation glitches. Avoid prompts that make hands the focus of the shot; mask and inpaint in post if necessary.
Talking-head lip-sync is not bundled
The pipeline produces facial motion, but precise lip-sync to a target audio track is out of scope. Pair with a dedicated lip-sync model if dialogue accuracy matters.
Heavy LoRA stacks reduce throughput
Stacking 3+ LoRAs (camera + style + identity) increases VRAM use by 1.5-2 GB and roughly halves throughput on a 12 GB card. Budget for a higher-VRAM card or simplify the stack.
Apple Silicon is not supported in v1.1.2
The current build is Windows + NVIDIA CUDA only. A Metal port is on the roadmap.
No exact-frame consistency across changed settings
With an identical seed, prompt, and settings, a run is deterministic; change anything (LoRA, sampler, seed) and the output differs. Frame-by-frame matching between such runs is not currently guaranteed.
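The advice above about stitching shorter clips can be planned ahead of time. The sketch below splits a target scene length into equal segments no longer than 8 s and assigns every segment the same seed. `plan_segments`, its defaults, and the dictionary layout are hypothetical, and the seed value is arbitrary; the app's own project format is not described here.

```python
import math


def plan_segments(total_s: float, max_clip_s: float = 8.0, seed: int = 1234) -> list[dict]:
    """Split a scene into equal clips of at most max_clip_s, all sharing one seed."""
    if total_s <= 0 or max_clip_s <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_s / max_clip_s)
    length = total_s / count
    return [
        {
            "index": i,
            "start_s": round(i * length, 3),
            "duration_s": round(length, 3),
            "seed": seed,  # shared across segments, per the guidance above
        }
        for i in range(count)
    ]


for segment in plan_segments(total_s=20.0):
    print(segment)
```

Equal-length segments avoid leaving a very short final clip. A shared seed does not guarantee that subject identity carries across segments, so review the joins before committing to a cut.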

Methodology

Generation throughput is reported as seconds of generation time per second of generated video, measured on an RTX 4070 Ti (12 GB VRAM, fp16), RTX 4080 (16 GB VRAM, fp16), and RTX 4090 (24 GB VRAM, fp16) under Windows 11 23H2 with driver 553.62. The reference clip is 5 s at 768×512, 24 fps, 30 sampling steps, with no LoRA stack. Prompt-coherence numbers use the VBench protocol on a 50-prompt internal set.
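To make the throughput unit concrete, the sketch below computes the nominal frame count of the reference clip and converts a generation time into seconds per second of video. The 150 s generation time is a made-up illustration, not a measured result, and both helper names are hypothetical.

```python
def frame_count(clip_s: float, fps: int) -> int:
    """Nominal number of frames in a clip of the given length."""
    return round(clip_s * fps)


def seconds_per_video_second(generation_s: float, clip_s: float) -> float:
    """Throughput metric: generation time divided by clip length (lower is faster)."""
    if clip_s <= 0:
        raise ValueError("clip length must be positive")
    return generation_s / clip_s


# Reference clip from the methodology: 5 s at 24 fps.
frames = frame_count(clip_s=5.0, fps=24)
# Illustrative generation time only, not a benchmark figure.
rate = seconds_per_video_second(generation_s=150.0, clip_s=5.0)
print(frames, rate)
```

Dividing by clip length rather than frame count keeps the metric comparable across frame rates, provided the resolution and step count match the reference settings.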

External references

Lightricks — LTX-Video model card (Hugging Face)
Open-weight backbone the v1.1.2 release ships against.
VBench: Comprehensive Benchmark Suite for Video Generation (arXiv:2311.17982, 2023)
Public protocol used for prompt-coherence numbers.
MovieGen / Lumiere / Sora research surveys
Reference for state-of-the-art video diffusion landscape.
Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685, 2021)
Foundation for the camera-control LoRA mechanism.

Frequently asked questions

7 buyer-voice questions about Nano, answered by the team.

Will it run on a 12 GB VRAM GPU?

Yes. The streaming DiT (Diffusion Transformer) path lets a 12 GB card — RTX 3060, RTX 4060, or RTX 5060 class — generate clips that would otherwise require a much larger memory footprint. Bigger cards run the same model faster and at higher resolution.

How is this different from Sora, Runway Gen-3, or Kling?

Sora (OpenAI), Veo (Google), Kling, Pika, and Runway Gen-3 are cloud services that upload prompts to a remote endpoint and bill per second of generated video. Nano VideoGen runs the LTX-2.3 open-weight model on the user's local GPU with a one-time license, no per-second charge, and no remote prompt logging.

Can I generate video from a still image I already have?

Yes. Image-to-video accepts a reference image plus a text prompt and respects the image's composition, lighting, and structure across the generated frames. Two-image keyframe interpolation is also supported with a duration slider.

Do my prompts get logged?

No. Prompts and reference images stay on the user's disk. There is no remote endpoint contact during generation; the network is used only for the one-time license-activation handshake.

Can I control the camera move, or is it random?

You can control it. Seven bundled camera-control LoRAs each select a specific move (dolly in, dolly out, jib up, jib down, pan left, pan right, or static) without rewriting the prompt. Camera language written into the prompt is still respected on top of the selected LoRA.

Does it run on Mac?

Not in v1.1.2. The current build is Windows 10/11 + NVIDIA CUDA. Apple Silicon support is on the roadmap.

Can I use the generated videos commercially?

Yes. The license is one-time and machine-bound, with no per-second fees. Generated clips can be used in commercial deliverables — marketing, social, broadcast — under the standard Terms of Use and the LTX-2.3 model license. The user retains full output rights.

Install Nano VideoGen.

LTX-2.3 video diffusion on your local GPU, with camera-control LoRAs, two-image keyframe morph, and a 2× spatial upscaler. No per-second cloud fee.