Technical reference · Last verified 2026-06-04

Face Swap Pipeline — Technical Reference

NanoPocket runs a diffusion identity stack for face swap, composing InstantID, PuLID, and IP-Adapter FaceID on a Flux.1 diffusion base. This document is a primary-source technical reference for the stack: its architecture, composition, failure modes, and deployment, and how it compares to the InsightFace inswapper_128 baseline that most other tools use today.

Summary

NanoPocket's face swap composes three identity adapters on a Flux.1 diffusion base: InstantID, PuLID, and IP-Adapter FaceID.

The stack is used both in NanoPocket's free in-browser demos at /face-swap and in the upcoming NanoPocket FaceSwap Pro 2.0 desktop release for Windows and macOS.

This document describes the components, how they compose, the failure modes the stack is designed to address, and the differences from a GAN-based pipeline.

Context: the GAN baseline

Most cloud face-swap services and most open-source face-swap tools use InsightFace's inswapper_128 face-swap model.

inswapper_128 is a generative adversarial network (GAN). It is trained at a fixed 128 × 128 pixel resolution. It is fast and inexpensive to run.

Failure modes that follow from the architecture: the swap is poorly conditioned at hard angles, in low light, and when the target face is small relative to the frame. The output appears 'pasted on' because the network has no useful prior over the rest of the head, hair, or scene.

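Typical tools in this class include Roop and FaceFusion (default pipeline), which ship inswapper_128 by default, along with Reface, DeepSwap, and a long tail of free online face-swap sites. SimSwap is a separate GAN-based face-swap model rather than an inswapper_128 wrapper, but it follows the same crop-based design.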

Source: https://github.com/deepinsight/insightface/tree/master/python-package/insightface/model_zoo and the inswapper_128 model card on Hugging Face.

Diffusion identity stack

A diffusion identity stack uses a diffusion model as the base and conditions the denoising process on the reference identity.

Diffusion bases produce output at full input resolution (NanoPocket's pipeline supports up to 4K), preserve accessories such as glasses and earrings, and degrade gracefully on hard angles because the diffusion prior covers the whole head and the surrounding scene.

The trade-off is computational cost: diffusion sampling is slower per frame than a 128-pixel GAN.

Component 1: InstantID

InstantID is a zero-shot identity-preserving generation method that conditions a diffusion model on a single face reference.

It contributes the high-level identity embedding (a face descriptor extracted from the reference image and projected into the diffusion conditioning space).

Primary references: paper at https://arxiv.org/abs/2401.07519, official repository at https://github.com/InstantID/InstantID, Hugging Face model card at https://huggingface.co/InstantX/InstantID.

Component 2: PuLID

PuLID is an identity customisation method that uses contrastive alignment between the identity branch and the diffusion branch during training.

It contributes identity-preservation pressure during sampling, which reduces the identity drift that InstantID alone exhibits at higher classifier-free guidance scales.

Primary references: paper at https://arxiv.org/abs/2404.16022, official repository at https://github.com/ToTheBeginning/PuLID, Hugging Face weights at https://huggingface.co/guozinan/PuLID.

Component 3: IP-Adapter FaceID

IP-Adapter is an image-prompt adapter for text-to-image diffusion models. The FaceID variant is specialised for face conditioning.

It contributes fine-grained face-feature conditioning (per-region appearance such as skin tone, micro-expression, and small accessories) that InstantID alone tends to under-specify.
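As a rough illustration of how an image-prompt adapter injects its signal, the IP-Adapter paper describes decoupled cross-attention: the image features get their own attention pass, and the result is added to the text attention output with a scale factor. The sketch below shows that idea on tiny Python lists. The function names, the toy vectors, and the `face_scale` value are assumptions made for illustration; how NanoPocket wires the adapter into Flux.1's transformer blocks is not specified in this document.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over small Python lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def decoupled_attention(query, text_kv, face_kv, face_scale=1.0):
    """Text attention plus a separately computed, scaled face attention."""
    text_out = attend(query, *text_kv)
    face_out = attend(query, *face_kv)
    return [t + face_scale * f for t, f in zip(text_out, face_out)]

# Toy inputs (hypothetical): one query, two text tokens, one face token.
query = [1.0, 0.0]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.1, 0.9]])
face_kv = ([[1.0, 1.0]], [[0.2, -0.2]])

text_only = decoupled_attention(query, text_kv, face_kv, face_scale=0.0)
with_face = decoupled_attention(query, text_kv, face_kv, face_scale=0.8)
```

Setting `face_scale` to zero recovers plain text attention, which is why a single scale knob is enough to trade prompt adherence against face conditioning in this formulation.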

Primary references: paper at https://arxiv.org/abs/2308.06721, repository at https://github.com/tencent-ailab/IP-Adapter, Hugging Face FaceID weights at https://huggingface.co/h94/IP-Adapter-FaceID.

Diffusion base: Flux.1

The pipeline runs on top of a Flux.1 diffusion base.

Flux.1 is an open-weight diffusion model from Black Forest Labs. It is a rectified-flow transformer with strong portrait priors.
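To make "rectified flow" concrete, here is a minimal sketch of Euler sampling along a straight noise-to-data path. It assumes the common convention x_t = (1 - t) * data + t * noise, with t running from 1 (noise) down to 0 (data). The `ideal_velocity` function is a toy stand-in that reads the answer directly, where a real model would predict the velocity from x_t and t. None of this is Flux.1's actual sampler code.

```python
def ideal_velocity(x, t, data):
    """Velocity of the straight path x_t = (1 - t) * data + t * noise.

    Along that path dx/dt = noise - data = (x_t - data) / t. A trained model
    would predict this from x_t and t; this toy version reads `data` directly.
    """
    return [(xi - di) / t for xi, di in zip(x, data)]

def euler_sample(noise, data, steps=8):
    """Integrate from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = ideal_velocity(x, t, data)
        # Step backwards in time: x_{t - dt} = x_t - dt * v
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy three-dimensional "image" and starting noise (hypothetical values).
data = [0.25, -0.5, 1.0]
noise = [1.3, 0.7, -0.4]
sample = euler_sample(noise, data, steps=8)
```

Because the path is a straight line, a perfect velocity field lets even a coarse Euler schedule land on the data point. With a learned velocity the path is only approximately straight, so more steps are usually needed.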

Primary references: official site at https://blackforestlabs.ai, repository at https://github.com/black-forest-labs/flux, Hugging Face weights at https://huggingface.co/black-forest-labs/FLUX.1-dev.

On the desktop apps, Flux.1 weights are bundled with the application and inference runs on the user's GPU. On the online demos, the same weights run on NanoPocket-hosted GPUs.

How the components compose

1. The reference face is processed by an off-the-shelf face detector and aligned. The aligned crop is the input to all three identity components.

2. InstantID extracts the high-level identity embedding and projects it into the diffusion conditioning space.

3. IP-Adapter FaceID extracts fine-grained face features from the same crop and produces a separate conditioning signal.

4. PuLID applies contrastive identity-preservation pressure during the denoising loop, weighted against the original prompt's classifier-free-guidance signal.

5. The Flux.1 base then runs standard diffusion sampling, conditioned on (target image latent) + (InstantID embedding) + (IP-Adapter FaceID conditioning) + (PuLID alignment).

6. The result is decoded and returned at the input resolution.
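The six steps above can be sketched as a data-flow skeleton. Every function and field name here is hypothetical, and each stage is stubbed with a string so the example runs without any model weights. It only shows which signal feeds which step, not how NanoPocket implements them.

```python
from dataclasses import dataclass, field

@dataclass
class Conditioning:
    """Container for the signals listed in step 5 (all names hypothetical)."""
    target_latent: str
    signals: dict = field(default_factory=dict)

def detect_and_align(reference):
    """Step 1: detect the face and return an aligned crop (stub)."""
    return f"aligned({reference})"

def instantid_embed(crop):
    """Step 2: high-level identity embedding (stub)."""
    return f"instantid({crop})"

def faceid_features(crop):
    """Step 3: fine-grained face features (stub)."""
    return f"faceid({crop})"

def pulid_alignment(crop):
    """Step 4: identity-preservation signal (stub)."""
    return f"pulid({crop})"

def run_swap(reference, target):
    """Wire steps 1 to 6 together; sampling and decoding are stubbed."""
    crop = detect_and_align(reference)
    cond = Conditioning(target_latent=f"latent({target})")
    # The same aligned crop feeds all three identity components.
    cond.signals["instantid"] = instantid_embed(crop)
    cond.signals["faceid"] = faceid_features(crop)
    cond.signals["pulid"] = pulid_alignment(crop)
    # Steps 5 and 6: sample on the combined conditioning, then decode.
    output = f"decoded(sample({cond.target_latent}))"
    return {"conditioning": cond, "output": output}

result = run_swap("ref.jpg", "target.jpg")
```

Keeping the three identity signals as separate named entries, rather than merging them early, mirrors the description above in which each adapter produces its own conditioning signal.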

The post-processing stage (NanoFace Vivid) is optional. It applies identity-locked face-detail restoration when the source image is over-smoothed, which is typical of output from Gemini, Firefly, GFPGAN with default settings, and most cloud face-swap services.

Post-processing: NanoFace Vivid

NanoFace Vivid is an identity-locked face-detail restorer.

It is designed to fix the 'plastic' or 'wax' look that AI portrait pipelines leave on faces. The most common upstream sources are Gemini 2.5 Flash Image (also known as Nano Banana), Adobe Firefly, Roop or FaceFusion with GFPGAN at high fidelity, and most cloud face-swap services.

Vivid only restores skin texture and lighting. It never changes the face. The identity-lock constraint is enforced by re-conditioning on the InstantID embedding extracted from the input.
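One way to check an identity-lock claim from the outside is to compare face embeddings of the input and the restored output. The sketch below is an independent verification idea, not the re-conditioning mechanism described above. The embeddings are toy three-dimensional vectors, and the 0.6 threshold is a hypothetical value chosen only for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def identity_preserved(input_emb, output_emb, threshold=0.6):
    """True when the output embedding stays close to the input embedding.

    The threshold is an assumption; a real check would calibrate it for the
    specific face-embedding model in use.
    """
    return cosine_similarity(input_emb, output_emb) >= threshold

# Toy embeddings (hypothetical): a near-identical output and a drifted one.
input_emb = [0.6, 0.8, 0.0]
restored_emb = [0.58, 0.81, 0.05]
drifted_emb = [0.8, -0.6, 0.0]

restored_ok = identity_preserved(input_emb, restored_emb)
drifted_ok = identity_preserved(input_emb, drifted_emb)
```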

Vivid is available as a free in-browser demo at /apps/nanoface-vivid. It is integrated as an in-pipeline stage in the upcoming NanoPocket FaceSwap Pro 2.0 desktop release.

Failure modes the stack addresses

Hard-angle preservation: the GAN baseline degrades sharply at non-frontal angles because the 128 × 128 patch has no prior over the surrounding head. The diffusion stack inherits Flux.1's full-head prior and degrades gracefully.

Small target face: the GAN baseline upsamples a 128-pixel patch into the original frame and the result becomes blurry relative to the rest of the image. The diffusion stack samples at the input resolution.
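To put numbers on the upsampling described above, the following sketch computes the linear scale factor applied when a 128-pixel model output is resized to cover the face region in the target frame. The face widths are illustrative values, not measurements, and the function name is hypothetical.

```python
def paste_back_scale(face_width_px, model_res=128):
    """Linear scale factor when a model_res-wide crop is resized to the face box."""
    if face_width_px <= 0:
        raise ValueError("face width must be positive")
    return face_width_px / model_res

# Illustrative face-box widths in pixels within the target frame.
scales = {width: paste_back_scale(width) for width in (96, 128, 512, 1024)}
# scales == {96: 0.75, 128: 1.0, 512: 4.0, 1024: 8.0}
```

Any factor above 1.0 means the swapped face carries less real detail than the pixels around it, which is where the blur relative to the rest of the image comes from.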

Accessories: the GAN baseline tends to drop or distort glasses, earrings, and hats because it regenerates only the aligned face crop and has no model of objects that overlap or border that crop. The diffusion stack preserves the surrounding region by construction (it samples the whole frame, not just the face crop).

Lighting mismatch: the GAN baseline cannot adjust the swapped face's lighting to match the target scene. The diffusion stack receives the target scene as conditioning and lighting is matched as part of sampling.

Failure modes the stack still has

Throughput: diffusion sampling is slower than a 128-pixel GAN. The desktop pipeline depends on the user's GPU; on consumer hardware (8 GB VRAM), photo swap takes several seconds per image and short-clip video swap takes minutes per second of output.
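The "minutes per second of output" figure follows from simple arithmetic when every frame is sampled independently. In the sketch below, the frame rate and the per-frame time are illustrative assumptions, not benchmarks of NanoPocket's pipeline.

```python
def compute_seconds_per_output_second(fps, seconds_per_frame):
    """Compute time needed to produce one second of video, frame by frame."""
    return fps * seconds_per_frame

def clip_minutes(clip_seconds, fps, seconds_per_frame):
    """Total compute time, in minutes, for a clip of the given length."""
    return clip_seconds * fps * seconds_per_frame / 60.0

# Assumed values: 24 fps video, 5 seconds of sampling per frame.
per_second = compute_seconds_per_output_second(fps=24, seconds_per_frame=5.0)
ten_second_clip = clip_minutes(clip_seconds=10, fps=24, seconds_per_frame=5.0)
# per_second == 120.0 (two minutes per second of output)
# ten_second_clip == 20.0 (minutes)
```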

Identity drift at extreme guidance: if the classifier-free guidance scale is set very high, the model may shift identity slightly toward the prompt distribution. PuLID mitigates this but does not eliminate it.
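For readers unfamiliar with the term, the standard classifier-free guidance formulation extrapolates from the unconditional prediction toward the prompt-conditioned one. The sketch below shows that textbook formula on toy vectors. The values are made up, and how NanoPocket's sampler applies guidance is not specified in this document.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Toy predictions (hypothetical values).
uncond = [0.0, 1.0]
cond = [0.5, 1.5]

at_scale_1 = cfg_combine(uncond, cond, scale=1.0)   # equals the conditional prediction
at_scale_7 = cfg_combine(uncond, cond, scale=7.0)   # [3.5, 4.5]
```

At a scale of 1.0 the result is just the conditional prediction. Larger scales push the output further along the prompt direction, which is the mechanism behind the drift described above.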

Hair preservation when the reference face is bald and the target has a full head of hair: the model occasionally hallucinates hair on the swapped face. This is an open issue across diffusion-based face-swap pipelines and is not specific to NanoPocket's stack.

Adversarial inputs: the stack is not designed to defeat or detect adversarially-crafted face-swap inputs. Use https://nanopocket.ai/security for vulnerability reports.

How the pipeline compares to alternatives

vs InsightFace inswapper_128 and similar crop-based GAN swappers (Roop, FaceFusion default, Reface, DeepSwap, SimSwap, most 'free no signup' sites): diffusion identity stack vs low-resolution GAN. Output quality, resolution ceiling, and angle robustness are higher with the diffusion stack; per-frame throughput is lower.

vs ComfyUI workflows that compose InstantID + PuLID + IP-Adapter FaceID manually: same underlying components. NanoPocket packages the stack as a consumer browser tool and a desktop application instead of a node graph; ComfyUI is more flexible.

vs Akool, HeyGen face swap: different product category. Akool and HeyGen are B2B / API-first cloud platforms. NanoPocket is a consumer browser tool with an optional local desktop release.

vs Nano Banana (Google Gemini 2.5 Flash Image): different products entirely. Nano Banana is a Google API. NanoPocket runs the InstantID + PuLID + IP-Adapter FaceID stack on Flux.1, locally on the desktop app or on NanoPocket-hosted GPUs in the online demo. It does not call any Google API.

Where the stack runs

Online demos at /face-swap: NanoPocket-hosted GPUs. Source files are processed in volatile memory and not used for training. Privacy summary at /privacy.

Desktop apps (Windows + macOS Apple Silicon, in development): the stack runs entirely on the user's GPU. Files do not leave the machine.

The Hugging Face commit IDs of the model weights used at runtime are listed at /verify for independent reproduction.
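A reproduction script would typically start from a manifest that pins each repository to a full commit ID. The sketch below validates such a manifest. The repository names come from the references in this document, but the commit IDs are placeholders, not the values published at /verify, and the manifest layout itself is an assumption.

```python
import re

# A full Git commit ID is 40 lowercase hexadecimal characters.
COMMIT_RE = re.compile(r"^[0-9a-f]{40}$")

# Placeholder pins: these are NOT the real commit IDs listed at /verify.
PINNED = {
    "InstantX/InstantID": "0" * 40,
    "guozinan/PuLID": "1" * 40,
    "h94/IP-Adapter-FaceID": "2" * 40,
    "black-forest-labs/FLUX.1-dev": "3" * 40,
}

def invalid_pins(pins):
    """Return the repos whose pinned revision is not a full 40-hex commit ID."""
    return sorted(repo for repo, rev in pins.items() if not COMMIT_RE.match(rev))

problems = invalid_pins(PINNED)
floating = invalid_pins({"example/repo": "main"})
```

Rejecting branch names such as `main` matters because a branch can move, while a full commit ID cannot. A validated ID can then be passed as the `revision` argument when downloading with the `huggingface_hub` library.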

External references

InstantID: paper https://arxiv.org/abs/2401.07519, repo https://github.com/InstantID/InstantID, model card https://huggingface.co/InstantX/InstantID.

PuLID: paper https://arxiv.org/abs/2404.16022, repo https://github.com/ToTheBeginning/PuLID, weights https://huggingface.co/guozinan/PuLID.

IP-Adapter FaceID: paper https://arxiv.org/abs/2308.06721, repo https://github.com/tencent-ailab/IP-Adapter, weights https://huggingface.co/h94/IP-Adapter-FaceID.

Flux.1: site https://blackforestlabs.ai, repo https://github.com/black-forest-labs/flux, weights https://huggingface.co/black-forest-labs/FLUX.1-dev.

InsightFace inswapper_128 (the GAN baseline this stack supersedes): https://github.com/deepinsight/insightface.

Disclosure

We are NanoPocket. The pipeline described above is the one that powers our /face-swap demos and the upcoming desktop release.

All third-party components of the stack named on this page are open-weight and openly published. The links above are to the original authors' canonical surfaces, not to NanoPocket-controlled mirrors.

If a claim on this page is inaccurate, please file an issue or email tech@nanopocket.ai. We will date-stamp the correction.

Canonical URL: https://nanopocket.ai/docs/face-swap-pipeline. Citation: NanoPocket. "Face Swap Pipeline — Technical Reference." Last verified 2026-06-04.