        return (HEIGHT, WIDTH)

    def generate(
        self,
        input_prompt,
        ref_image_path,
        audio_path,
        enable_tts,
        tts_prompt_audio,
        tts_prompt_text,
        tts_text,
        num_repeat=1,
        pose_video=None,
        max_area=720 * 1280,
        infer_frames=80,
        shift=5.0,
        sample_solver='unipc',
        sampling_steps=40,
        guide_scale=5.0,
        n_prompt="",
        seed=-1,
        offload_model=True,
        init_first_frame=False,
    ):
        r"""
        Generates video frames from an input image and a text prompt using a diffusion process.

        Args:
            input_prompt (`str`):
                Text prompt for content generation.
            ref_image_path (`str`):
                Path to the input image.
            audio_path (`str`):
                Path to the audio that drives the video.
            num_repeat (`int`, *optional*, defaults to 1):
                Number of clips to generate. Automatically adjusted based on the audio length.
            pose_video (`str`, *optional*, defaults to None):
                If provided, uses a sequence of poses to drive the generated video.
            max_area (`int`, *optional*, defaults to 720*1280):
                Maximum pixel area for latent space calculation. Controls video resolution scaling.
            infer_frames (`int`, *optional*, defaults to 80):
                Number of frames to generate per clip. Should be a multiple of 4.
            shift (`float`, *optional*, defaults to 5.0):
                Noise schedule shift parameter. Affects temporal dynamics.
                [NOTE]: To generate a 480p video, a shift value of 3.0 is recommended.
            sample_solver (`str`, *optional*, defaults to 'unipc'):
                Solver used to sample the video.
            sampling_steps (`int`, *optional*, defaults to 40):
                Number of diffusion sampling steps. Higher values improve quality but slow generation.
            guide_scale (`float` or tuple[`float`], *optional*, defaults to 5.0):
                Classifier-free guidance scale. Controls prompt adherence vs. creativity.
                If a tuple, the first guide_scale is used for the low noise model and
                the second guide_scale is used for the high noise model.
            n_prompt (`str`, *optional*, defaults to ""):
                Negative prompt for content exclusion. If not given, `config.sample_neg_prompt` is used.
            seed (`int`, *optional*, defaults to -1):
                Random seed for noise generation. If -1, a random seed is used.
            offload_model (`bool`, *optional*, defaults to True):
                If True, offloads models to CPU during generation to save VRAM.
            init_first_frame (`bool`, *optional*, defaults to False):