I am trying to turn an image of a person generated with Stable Diffusion into a video, and then lip-sync the face in that video to a WAV audio file.
Steps for making a lip-sync video
From a single still image and a single audio file, a lip-sync video could be created easily.
- Generate an image of a person with Stable Diffusion
- Create a WAV file of the Japanese word “Konnichiwa” with VOICEBOX
- Generate a lip-sync video with the Stable Diffusion extension SadTalker
I was surprised at how easy it was.
SadTalker makes it very easy to create lip-sync videos in the Stable Diffusion webui.
Error
Depending on how the WAV file is exported, SadTalker may fail with the traceback shown below.
The error does not occur when the file is exported from the DAW Studio One.
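The last SadTalker frame in the traceback below calls random.choice on range(min(10, num_frames), min(int(num_frames/2), 70)). That range is empty whenever num_frames is 21 or fewer, and random.choice raises IndexError on an empty range. So one possible cause (a guess, not something confirmed here) is that some exports produce audio short enough to fall under that limit, which is plausible for a single word like “Konnichiwa”. The following is a minimal sketch for checking a WAV file before passing it to SadTalker. It assumes num_frames is derived from the audio duration at 25 frames per second, which is an assumption, and the helper names ASSUMED_FPS, blink_start_range, wav_info, and would_fail are hypothetical and not part of SadTalker. Python's wave module reads PCM WAV files only, so other encodings may raise wave.Error.

```python
import wave

# Assumption: audio length is converted to video frames at 25 fps.
ASSUMED_FPS = 25


def blink_start_range(num_frames):
    """Same range expression as the failing line in generate_batch.py."""
    return range(min(10, num_frames), min(int(num_frames / 2), 70))


def wav_info(path):
    """Return (sample_rate, channels, sample_width_bytes, duration_seconds)
    for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        return rate, wav.getnchannels(), wav.getsampwidth(), duration


def would_fail(duration_seconds, fps=ASSUMED_FPS):
    """True if the blink range would be empty for audio of this length."""
    num_frames = int(duration_seconds * fps)
    return len(blink_start_range(num_frames)) == 0
```

Calling wav_info on a failing export and on the Studio One export makes it easy to compare sample rate, channel count, bit depth, and duration side by side. If would_fail returns True for the failing file only, padding the audio with a little silence is one thing to try.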
  File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\routes.py", line 488, in run_predict
    output = await app.get_blocks().process_api(
  File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\blocks.py", line 1431, in process_api
    result = await self.call_function(
  File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\blocks.py", line 1103, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\utils.py", line 707, in wrapper
    response = f(*args, **kwargs)
  File "F:\stable-diffusion\sd.webui3\webui\modules\call_queue.py", line 13, in f
    res = func(*args, **kwargs)
  File "F:\stable-diffusion\sd.webui3\webui/extensions/SadTalker\src\gradio_demo.py", line 134, in test
    batch = get_data(first_coeff_path, audio_path, self.device, ref_eyeblink_coeff_path=ref_eyeblink_coeff_path, still=still_mode, idlemode=use_idle_mode, length_of_audio=length_of_audio, use_blink=use_blink) # longer audio?
  File "F:\stable-diffusion\sd.webui3\webui/extensions/SadTalker\src\generate_batch.py", line 81, in get_data
    ratio = generate_blink_seq_randomly(num_frames)      # T
  File "F:\stable-diffusion\sd.webui3\webui/extensions/SadTalker\src\generate_batch.py", line 43, in generate_blink_seq_randomly
    start = random.choice(range(min(10,num_frames), min(int(num_frames/2), 70)))
  File "random.py", line 378, in choice
IndexError: range object index out of range
Problems with Japanese lip-sync
There are some problems, though.
Many of the lip-sync implementations currently available in Python and other languages are based on English pronunciation.
From our point of view as Japanese speakers, Japanese lip-sync videos made with these libraries look subtly unnatural.