I am trying to turn a human image generated with Stable Diffusion into a video, and then lip-sync the face in that video to an audio WAV file.
Steps for making a lip-sync video
From a single still image and a single audio file, a lip-sync video can be created easily:
- Generate an image of a person with Stable Diffusion
- Create a Japanese “Konnichiwa” audio WAV file with VOICEVOX (a scripted example follows this list)
- Generate the lip-sync video with the Stable Diffusion extension SadTalker
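If you want to script the audio step instead of clicking through a GUI, the VOICEVOX engine exposes a local HTTP API. The sketch below is only a rough example based on that documented API (default port 50021, /audio_query and /synthesis endpoints); the speaker ID and exact parameters may differ depending on your VOICEVOX version, so treat them as assumptions.

```python
# Rough sketch: synthesize "こんにちは" (Konnichiwa) to a WAV via the local VOICEVOX engine.
# Assumes the VOICEVOX engine is already running on its default port 50021;
# the speaker ID and endpoint names follow its documented API and may differ
# in your version.
import requests

TEXT = "こんにちは"
SPEAKER = 1  # assumed speaker ID; pick one that exists in your installation

# 1) Build the audio query from the text
query = requests.post(
    "http://127.0.0.1:50021/audio_query",
    params={"text": TEXT, "speaker": SPEAKER},
).json()

# 2) Synthesize the query into WAV bytes
wav = requests.post(
    "http://127.0.0.1:50021/synthesis",
    params={"speaker": SPEAKER},
    json=query,
).content

with open("konnichiwa.wav", "wb") as f:
    f.write(wav)
```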
I was surprised at how easy it was.
SadTalker makes it very simple to create lip-sync videos inside the Stable Diffusion webui.
Error
In some cases, an error occurs depending on how the WAV file is exported.
With certain WAV files, the following error appears; it does not occur when the audio is exported from the DAW Studio One.
File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\routes.py", line 488, in run_predict
output = await app.get_blocks().process_api(
File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\blocks.py", line 1431, in process_api
result = await self.call_function(
File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\blocks.py", line 1103, in call_function
prediction = await anyio.to_thread.run_sync(
File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
File "F:\stable-diffusion\sd.webui3\system\python\lib\site-packages\gradio\utils.py", line 707, in wrapper
response = f(*args, **kwargs)
File "F:\stable-diffusion\sd.webui3\webui\modules\call_queue.py", line 13, in f
res = func(*args, **kwargs)
File "F:\stable-diffusion\sd.webui3\webui/extensions/SadTalker\src\gradio_demo.py", line 134, in test
batch = get_data(first_coeff_path, audio_path, self.device, ref_eyeblink_coeff_path=ref_eyeblink_coeff_path, still=still_mode, idlemode=use_idle_mode, length_of_audio=length_of_audio, use_blink=use_blink) # longer audio?
File "F:\stable-diffusion\sd.webui3\webui/extensions/SadTalker\src\generate_batch.py", line 81, in get_data
ratio = generate_blink_seq_randomly(num_frames) # T
File "F:\stable-diffusion\sd.webui3\webui/extensions/SadTalker\src\generate_batch.py", line 43, in generate_blink_seq_randomly
start = random.choice(range(min(10,num_frames), min(int(num_frames/2), 70)))
File "random.py", line 378, in choice
IndexError: range object index out of range
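Reading the traceback, the IndexError comes from generate_blink_seq_randomly: it picks a random blink start frame from range(min(10, num_frames), min(int(num_frames / 2), 70)), and that range is empty whenever num_frames is 21 or fewer (at SadTalker's frame rate, roughly under a second of audio). So the error suggests that SadTalker read the exported WAV as (almost) zero-length, presumably because some export settings produce a sample format or header its audio loader misreads. The sketch below shows the failing condition and one possible workaround; soundfile is an extra dependency, not part of SadTalker, and re-exporting as plain 16-bit PCM is only an assumption-based fix that I have not verified for every DAW.

```python
# Sketch: why the IndexError happens, plus a possible WAV re-export workaround.
import random
import soundfile as sf  # extra dependency, not part of SadTalker

def blink_start_frame(num_frames: int) -> int:
    # Same expression as SadTalker's generate_blink_seq_randomly.
    # The range is empty whenever num_frames <= 21, which raises IndexError.
    return random.choice(range(min(10, num_frames), min(int(num_frames / 2), 70)))

def reexport_wav(src: str, dst: str) -> None:
    # Re-save the audio as plain 16-bit PCM, keeping the original sample rate
    # and channels (assumption: the failing WAVs use a sample format or header
    # that SadTalker's loader misreads as near-zero length).
    data, sr = sf.read(src)
    sf.write(dst, data, sr, subtype="PCM_16")

# Quick check of the failing condition:
# blink_start_frame(20)  # -> IndexError: range object index out of range
# blink_start_frame(50)  # -> works
```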
Problems with Japanese lip-sync
There are still some problems, though.
Most of the lip-sync tools currently implemented in Python and other languages are built around English pronunciation.
To Japanese eyes, Japanese lip-sync videos made with these libraries look slightly off and unnatural.