464439 – Chinese characters are wrongly separated with spaces using speech to text

Status:	CONFIRMED

Alias:	None

Product:	kdenlive
Classification:	Applications
Component:	Video Effects & Transitions
Version:	22.12.1
Platform:	Microsoft Windows

Importance:	NOR normal
Target Milestone:	---
Assignee:	Jean-Baptiste Mardelle

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-01-18 04:45 UTC by sunruikang2000
Modified:	2023-04-26 06:58 UTC
CC List:	1 user

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description sunruikang2000 2023-01-18 04:45:55 UTC

When I use speech to text to recognize Chinese, the recognized characters are wrongly separated by spaces. I think it may be an issue with the CJK word recognition engine. CJK text should not be split up with spaces.

It looks like this:
希望 大家 明白
hope (wrong space) everyone (wrong space) understand

It should be like this:
希望大家明白
hope (no space) everyone (no space) understand

Comment 1 erjiang 2023-01-25 02:16:59 UTC

I think I’ve noticed this too in the past. My guess is that it’s just what vosk (the speech recognizer) outputs, but maybe we can just detect if the language is Chinese and remove the spaces in Kdenlive. A workaround is to edit the subtitle file in a text editor and remove the spaces.
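
To illustrate the "remove the spaces" idea, here is a minimal Python sketch. It is not Kdenlive's or vosk's code. The function name `remove_cjk_spaces` and the character ranges are assumptions made for this example: it treats only the basic CJK Unified Ideographs block and Extension A as "Chinese", and it only drops whitespace that sits directly between two such characters, so spaces around Latin words stay in place.

```python
import re

# Assumption for this sketch: "Chinese" means CJK Unified Ideographs
# Extension A (U+3400-U+4DBF) and the basic block (U+4E00-U+9FFF).
CJK = r"\u3400-\u4dbf\u4e00-\u9fff"

# Whitespace (space, tab, ideographic space) that has a CJK character
# immediately before it and immediately after it.
SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])[ \t\u3000]+(?=[{CJK}])")


def remove_cjk_spaces(text: str) -> str:
    """Drop whitespace between two CJK characters; keep all other spacing."""
    return SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(remove_cjk_spaces("希望 大家 明白"))
```

Because the pattern requires a CJK character on both sides, mixed text such as "希望 everyone 明白" would be left unchanged, which seems like the safer default for subtitles that mix languages.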

Comment 2 sunruikang2000 2023-04-26 06:58:37 UTC

(In reply to erjiang from comment #1)
> I think I’ve noticed this too in the past. My guess is that it’s just what
> vosk (the speech recognizer) outputs, but maybe we can just detect if the
> language is Chinese and remove the spaces in Kdenlive. A workaround is to
> edit the subtitle file in a text editor and remove the spaces.

Yes, editing the file in a text editor works as a temporary workaround. But it is still a bit difficult, because users may accidentally break the ".srt" format while editing.
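
One way to lower the risk of damaging the ".srt" structure is to let a small script do the edit instead of doing it by hand. The sketch below is only an illustration under stated assumptions: the function name `fix_srt` is hypothetical, the timing pattern assumes the usual `HH:MM:SS,mmm --> HH:MM:SS,mmm` layout, and "Chinese" is limited to the basic CJK Unified Ideographs block plus Extension A. Index lines and timing lines are passed through untouched, and line endings are preserved.

```python
import re

# Assumption: these two ranges are enough to cover the Chinese text involved.
CJK = r"\u3400-\u4dbf\u4e00-\u9fff"
SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])[ \t\u3000]+(?=[{CJK}])")

# Assumed timing-line layout: 00:00:01,000 --> 00:00:03,000
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def fix_srt(srt: str) -> str:
    """Remove spaces between CJK characters in subtitle text lines only."""
    out = []
    for line in srt.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.isdigit() or TIMING.match(stripped):
            # Cue index or timing line: copy through unchanged.
            out.append(line)
        else:
            out.append(SPACE_BETWEEN_CJK.sub("", line))
    return "".join(out)


if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:03,000\n希望 大家 明白\n\n"
    print(fix_srt(sample), end="")
```

To apply it to a real file, one could read the subtitle file with `encoding="utf-8"`, pass the contents to `fix_srt`, and write the result to a new file so the original stays available as a backup.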