315460 – SPHINX: Training data of words with special characters in word names cause error during model compilation

Status:	RESOLVED FIXED

Alias:	None

Product:	simon
Classification:	Applications
Component:	simon
Version:	unspecified
Platform:	Other Microsoft Windows

Importance:	NOR normal
Target Milestone:	---
Assignee:	Peter Grasch

Reported:	2013-02-19 16:40 UTC by Peter Grasch
Modified:	2013-06-03 13:35 UTC

Latest Commit:	http://commits.kde.org/simon/a247b7f4abdd8833a11c63341d02c7765e788774

Description Peter Grasch 2013-02-19 16:40:06 UTC

On Windows, file names containing special (non-ASCII) characters in the prompts file do not match the actual file names on disk because of encoding problems.

This causes the feature estimation phase of SPHINX to fail.
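To see which samples are at risk, one could scan the prompts file for names that contain bytes outside 7-bit ASCII. The following is only a sketch under assumptions: it treats the first whitespace-separated token of each line as the sample name, and the helpers `hasNonAscii` and `riskySampleNames` are hypothetical names that do not exist in Simon.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// True if any byte of s lies outside 7-bit ASCII. Such bytes are where
// UTF-8 and a single-byte code page produce different representations.
bool hasNonAscii(const std::string& s)
{
    for (unsigned char c : s) {
        if (c >= 0x80)
            return true;
    }
    return false;
}

// Hypothetical helper. Assumption: the first whitespace-separated token on
// each prompts line is the sample (file) name. Returns the names whose
// encoding could differ between the prompts file and the file system.
std::vector<std::string> riskySampleNames(std::istream& prompts)
{
    std::vector<std::string> risky;
    std::string line;
    while (std::getline(prompts, line)) {
        std::istringstream fields(line);
        std::string name;
        if (fields >> name && hasNonAscii(name))
            risky.push_back(name);
    }
    return risky;
}
```

Pure ASCII names are left out of the result because they are byte-identical in UTF-8 and in the common Windows code pages, so they cannot trigger the mismatch.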

Comment 1 root 2013-03-28 14:52:42 UTC

As I understand it, the problem is in ModelCompilationAdapterSPHINX::storeTranscriptionAndFields 
(we are using UTF-8-specific functions there). Is there an appropriate way to find out the encoding of a file (or string)?

Comment 2 Peter Grasch 2013-03-28 17:17:46 UTC

Sadly there is no way to reliably retrieve the encoding from a given text. The only way to know the encoding is to keep track of it.

Simon internally should (and does) use UTF-8. AFAIK, NTFS uses UTF-16.
It doesn't matter much if the file names themselves are broken, as these files will never be seen by an average end user, but the encoding needs to remain consistent between the file system and the prompts table.
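The consistency requirement can be illustrated with a small, self-contained sketch. It is not Simon's code: ISO-8859-1 (Latin-1) stands in for whatever local 8-bit code page is active on the Windows machine, and `utf8ToLatin1` and `matchesOnDisk` are hypothetical helpers. The point is that the same name has different bytes in UTF-8 and in an 8-bit charset, so a byte-wise comparison only succeeds once both sides use the same encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: re-encodes UTF-8 text as ISO-8859-1 (Latin-1), used
// here as a stand-in for a local 8-bit code page. Returns false for
// malformed input or for characters that Latin-1 cannot represent.
bool utf8ToLatin1(const std::string& utf8, std::string& latin1)
{
    latin1.clear();
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {
            // Plain ASCII is identical in both encodings.
            latin1.push_back(static_cast<char>(lead));
        } else if ((lead == 0xC2 || lead == 0xC3) && i + 1 < utf8.size()) {
            // Two-byte sequences with lead 0xC2/0xC3 cover U+0080..U+00FF.
            const unsigned char cont = static_cast<unsigned char>(utf8[++i]);
            if ((cont & 0xC0) != 0x80)
                return false; // not a continuation byte
            latin1.push_back(static_cast<char>(((lead & 0x03) << 6) | (cont & 0x3F)));
        } else {
            return false; // outside Latin-1 or malformed UTF-8
        }
    }
    return true;
}

// Hypothetical helper: checks a UTF-8 prompts entry against a file name as
// an 8-bit tool would see it, after bringing both to the same encoding.
bool matchesOnDisk(const std::string& promptsEntryUtf8,
                   const std::string& diskNameLatin1)
{
    std::string converted;
    return utf8ToLatin1(promptsEntryUtf8, converted)
        && converted == diskNameLatin1;
}
```

For example, "é" is the two bytes 0xC3 0xA9 in UTF-8 but the single byte 0xE9 in Latin-1, so comparing the raw strings fails while `matchesOnDisk` succeeds. A real implementation would have to use the code page that is actually active rather than assuming Latin-1.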

Comment 3 Peter Grasch 2013-06-03 13:35:01 UTC

Git commit a247b7f4abdd8833a11c63341d02c7765e788774 by Peter Grasch.
Committed on 03/06/2013 at 15:33.
Pushed by grasch into branch 'master'.

Encoding sample names in local 8 bit charset for SPHINX on Windows

M  +9    -2    simonlib/speechmodelcompilation/modelcompilationadaptersphinx.cpp

http://commits.kde.org/simon/a247b7f4abdd8833a11c63341d02c7765e788774