Bug 315460

Summary:	SPHINX: Training data for words with special characters in their names causes an error during model compilation
Product:	[Unmaintained] simon	Reporter:	Peter Grasch <me>
Component:	simon	Assignee:	Peter Grasch <me>
Status:	RESOLVED FIXED
Severity:	normal	CC:	root
Priority:	NOR
Version First Reported In:	unspecified
Target Milestone:	---
Platform:	Other
OS:	Microsoft Windows
Latest Commit:	http://commits.kde.org/simon/a247b7f4abdd8833a11c63341d02c7765e788774	Version Fixed/Implemented In:
Sentry Crash Report:

Description Peter Grasch 2013-02-19 16:40:06 UTC

On Windows, file names containing special (non-ASCII) characters, as listed in the prompts file, do not match the actual file names on disk because of encoding problems.

This causes the feature estimation phase of SPHINX to fail.
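To illustrate the kind of mismatch described above, here is a minimal sketch in plain C++ (no Qt). It assumes, purely for illustration, that the prompts file holds UTF-8 bytes while the other side uses a single-byte charset; Latin-1 stands in for the Windows local 8-bit code page, which in reality depends on the system locale. The helper names toUtf8 and toLatin1 are hypothetical and are not taken from the simon sources.

```cpp
#include <string>

// Hypothetical helper: encode Unicode code points as UTF-8 bytes.
std::string toUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: encode as Latin-1 (assumed stand-in for the local
// 8-bit charset). Code points above U+00FF cannot be represented and
// are replaced by '?'.
std::string toLatin1(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    }
    return out;
}
```

For a sample name such as "März", toUtf8 yields five bytes and toLatin1 yields four, so a byte-wise comparison between a prompts entry and a file name fails whenever the two sides disagree on the encoding. Pure ASCII names encode identically in both, which would explain why only words with special characters are affected.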

Comment 1 root 2013-03-28 14:52:42 UTC

As I understand it, the problem is in ModelCompilationAdapterSPHINX::storeTranscriptionAndFields 
(we are using UTF-8-specific functions). Is there an appropriate way to find out the encoding of a file (or string)?

Comment 2 Peter Grasch 2013-03-28 17:17:46 UTC

Sadly there is no way to reliably retrieve the encoding from a given text. The only way to know the encoding is to keep track of it.

Simon internally should (and does) use UTF-8. AFAIK, NTFS uses UTF-16.
It doesn't really matter that much if the file names themselves are broken, as these files will never be seen by an average end user, but the encoding needs to remain consistent between the file system and the prompts table.
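A small sketch of why the encoding cannot be reliably recovered from the bytes alone, in plain C++ without Qt. The function isValidUtf8 is a hypothetical helper, not part of simon. A structural check like this can rule UTF-8 out, but it can never prove it, because any byte sequence is also a legal Latin-1 string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is structurally valid UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates, and values above U+10FFFF.
bool isValidUtf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point
        unsigned long min = 0;   // smallest code point allowed for this length
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return false;        // continuation byte or invalid lead byte
        }
        if (extra > bytes.size() - i - 1) {
            return false;        // sequence is cut off at the end
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;    // expected a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;        // overlong, out of range, or surrogate
        }
        i += extra + 1;
    }
    return true;
}
```

The bytes C3 A4 pass this check and decode to "ä" as UTF-8, yet the same two bytes are also a perfectly legal Latin-1 string ("Ã¤"). A lone E4 followed by ASCII fails the check, so that name cannot be UTF-8, but the check still does not say which 8-bit charset it is. Hence the need to keep track of the encoding rather than guess it.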

Comment 3 Peter Grasch 2013-06-03 13:35:01 UTC

Git commit a247b7f4abdd8833a11c63341d02c7765e788774 by Peter Grasch.
Committed on 03/06/2013 at 15:33.
Pushed by grasch into branch 'master'.

Encoding sample names in local 8 bit charset for SPHINX on Windows

M  +9    -2    simonlib/speechmodelcompilation/modelcompilationadaptersphinx.cpp

http://commits.kde.org/simon/a247b7f4abdd8833a11c63341d02c7765e788774