ComfyUI Infinite Talk: Create Realistic AI Talking Avatars from a Single Image
A step-by-step guide to transforming a static portrait into a realistic, talking video using ComfyUI.
Following up on previous explorations into consistent AI voice generation and controlling ComfyUI via its powerful API, the next logical step is to combine these concepts into a complete, automated pipeline.
The Goal: Animating the Inanimate
The core objective is to create a dynamic video from a single portrait image, making it appear as though the person in the photograph is speaking the words from a provided audio track.
Here’s the photo that will be animated.

I used NanoBanana to place me in a clean office setting.

Here’s a full breakdown of the ComfyUI workflow used to animate the image:

Prerequisites: Gathering the Tools and Models
Before the workflow can be executed, a specific set of custom nodes and AI models must be installed and correctly placed within the ComfyUI directory structure.
Required Custom Nodes
These nodes add essential functionality to ComfyUI and can be installed using the ComfyUI Manager:
- ComfyUI-WanVideoWrapper: A comprehensive toolkit for advanced video models.
- ComfyUI-VideoHelperSuite: Provides nodes for combining image frames and audio into a final video file.
- rgthree-comfy: A set of helpful utility nodes, including custom labels and workflow organization tools.
They are easily obtainable through the ComfyUI-Manager:

Required Models
Each model plays a specific role in the generation process. The placement directories below are relative to ComfyUI/models/.
| Model Filename | Download Link | Placement Directory | Purpose |
|---|---|---|---|
| wan2.1-i2v-14b-480p-Q4_0.gguf | Hugging Face | diffusion_models | The primary Image-to-Video (I2V) model. |
| lightx2v_I2V_14B_480p...safetensors | Hugging Face | loras | A LoRA that improves motion stability. |
| Wan2_1-InfiniteTalk_Single_Q8.gguf | Hugging Face | diffusion_models | The core model for lip-syncing and talking motion. |
| wan_2.1_vae.safetensors | Hugging Face | vae | Variational Autoencoder for image processing. |
| clip_vision_h.safetensors | Hugging Face | clip_vision | Encodes the input image for the model to understand. |
| wav2vec2-chinese-base_fp16.safetensors | Hugging Face | wav2vec2 | Processes the audio file to analyze speech patterns. |
| umt5-xxl-enc-bf16.safetensors | Hugging Face | clip | Text encoder used for interpreting the prompt. |
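If you script your setup, a small check like the one below can confirm that every file landed in the right subfolder before you open the workflow. This is only a convenience sketch: the COMFY_ROOT path is an assumption for a default installation, and the truncated LoRA filename must be replaced with the exact file you downloaded.

```python
# Sanity check (illustrative): verify the downloaded models sit in the
# expected subfolders of ComfyUI/models/ before loading the workflow.
from pathlib import Path

COMFY_ROOT = Path("~/ComfyUI").expanduser()  # assumption: default install location

REQUIRED_MODELS = {
    "diffusion_models": [
        "wan2.1-i2v-14b-480p-Q4_0.gguf",
        "Wan2_1-InfiniteTalk_Single_Q8.gguf",
    ],
    "loras": ["lightx2v_I2V_14B_480p...safetensors"],  # replace with the full filename you downloaded
    "vae": ["wan_2.1_vae.safetensors"],
    "clip_vision": ["clip_vision_h.safetensors"],
    "wav2vec2": ["wav2vec2-chinese-base_fp16.safetensors"],
    "clip": ["umt5-xxl-enc-bf16.safetensors"],
}

missing = [
    f"{folder}/{name}"
    for folder, names in REQUIRED_MODELS.items()
    for name in names
    if not (COMFY_ROOT / "models" / folder / name).exists()
]

print("All models in place." if not missing else f"Missing: {missing}")
```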
The "Infinite Talk" Workflow: A Step-by-Step Breakdown
The workflow operates by passing data through a series of connected nodes, each performing a specific transformation.
Step 1: Loading the AI Brains (Models & Loaders)
The process begins on the left side of the workflow, where all the necessary models are loaded into memory. This includes the WanVideoModelLoader, which is responsible for the image-to-video generation, the MultiTalkModelLoader for the speech animation, and loaders for the supporting VAE, CLIP, and Wav2Vec models. A WanVideoLoraSelect node is also used to load a LoRA model that enhances the stability and quality of the final video.
Step 2: Providing the Source Material (Image, Audio & Prompt)
The central inputs to the workflow are the source image and the audio file.
- The LoadImage node is used to select the portrait to be animated (michael_bureau.png), which is then passed through an ImageScale node to ensure its dimensions (832x480) are compatible with the model.
- The LoadAudio node loads the speech (explain.mp3) that will drive the animation.
- Finally, the WanVideoTextEncodeCached node takes a simple positive prompt (e.g., "a man is talking") and a detailed negative prompt to guide the generation process and prevent unwanted artifacts like distorted features or a static background. All three inputs can also be swapped programmatically, as sketched below.
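Because the larger goal is an automated pipeline, these inputs are the natural place to parameterize. The sketch below assumes the workflow has been exported with ComfyUI's "Save (API Format)" option under a hypothetical filename; the input field names (image, audio, positive_prompt) are assumptions and should be verified against your exported JSON.

```python
# Sketch: swap the image, audio, and prompt in an API-format workflow export.
# Field names in "inputs" are assumptions -- check them in your own JSON.
import json

with open("infinite_talk_workflow_api.json") as f:  # hypothetical export filename
    workflow = json.load(f)

for node in workflow.values():
    if node["class_type"] == "LoadImage":
        node["inputs"]["image"] = "michael_bureau.png"
    elif node["class_type"] == "LoadAudio":
        node["inputs"]["audio"] = "explain.mp3"
    elif node["class_type"] == "WanVideoTextEncodeCached":
        node["inputs"]["positive_prompt"] = "a man is talking"

with open("infinite_talk_run.json", "w") as f:
    json.dump(workflow, f, indent=2)
```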
Step 3: The Core of Generation (Processing & Sampling)
This is where the core synthesis occurs. The loaded audio is processed by the MultiTalkWav2VecEmbeds node to create audio embeddings, and these embeddings, together with the encoded image and text prompts, drive the WanVideoSampler, which generates the video frames in latent space.
Step 4: Final Assembly (Decoding & Video Combination)
Once the sampler has generated all the frames in latent space, the WanVideoDecode node translates them back into a sequence of viewable images. These frames are then assembled into the final video by the VHS_VideoCombine node, which also merges in the speech from the LoadAudio node so that picture and sound stay in sync.
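To close the loop on the automated pipeline mentioned at the start, the finished workflow can be queued over ComfyUI's HTTP API instead of clicking "Queue Prompt" in the UI. A minimal sketch, assuming ComfyUI is running locally on the default port and using the patched JSON from the previous step:

```python
# Sketch: queue the patched workflow through ComfyUI's HTTP API
# (the same approach as in the earlier post on controlling ComfyUI via its API).
import json
import urllib.request

with open("infinite_talk_run.json") as f:
    prompt = json.load(f)

payload = json.dumps({"prompt": prompt}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # response includes a prompt_id for tracking the job
```

The rendered video then lands in ComfyUI's output folder, just as it would when running the workflow from the UI.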
Practical Demonstration: The Results
To showcase the workflow's capability, two videos were generated.
Video 1: Generated with VibeVoice using my original voice as reference
Since I’m using an English model for lip-sync video creation, the results look a bit off when applied to German.
Video 2: Generated with a synthetic voice
The results demonstrate the workflow's flexibility in handling different audio sources while maintaining consistent character identity and accurate lip-syncing.
A Note on Performance: Managing VRAM
High-resolution video generation is a resource-intensive task, particularly on consumer-grade GPUs with limited VRAM. I added the WanVideoBlockSwap node to the workflow, which is designed to mitigate this issue by offloading parts of the model from VRAM into system RAM. If you run out of memory, simply enable this node.
Conclusion
By combining ComfyUI’s modular workflow system with the WanVideo and InfiniteTalk models, the process becomes predictable, scalable, and adaptable to different voices, prompts, and motion styles. With the right node configuration and model selection, the system delivers stable motion, clean lip-sync, and consistent character identity, even on limited hardware. This marks a clear step forward in my broader effort to build a fully automated pipeline for AI-generated video, setting the stage for more advanced experiments and future refinements.