Bug 315460

Summary: SPHINX: Training data of words with special characters in word names cause error during model compilation
Product: [Applications] simon
Reporter: Peter Grasch <me>
Component: simon
Assignee: Peter Grasch <me>
Status: RESOLVED FIXED
Severity: normal
CC: root
Priority: NOR
Version: unspecified
Target Milestone: ---
Platform: Other
OS: Microsoft Windows
Latest Commit:
Version Fixed In:

Description Peter Grasch 2013-02-19 16:40:06 UTC
On Windows, file names in the prompts file that contain special (non-ASCII) characters do not match the actual file names on disk because of encoding problems.

This causes the feature estimation phase of SPHINX to fail.
Comment 1 root 2013-03-28 14:52:42 UTC
As I understand it, the problem is in ModelCompilationAdapterSPHINX::storeTranscriptionAndFields
(we are using UTF-8-specific functions there). Is there an appropriate way to find out the encoding of a file (or string)?
Comment 2 Peter Grasch 2013-03-28 17:17:46 UTC
Sadly there is no way to reliably retrieve the encoding from a given text. The only way to know the encoding is to keep track of it.

Simon internally should (and does) use UTF-8. AFAIK, NTFS uses UTF-16.
It doesn't really matter that much if the file names themselves are broken, as these files will never be seen by an average end user, but the encoding needs to remain consistent between the file system and the prompts table.
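
To illustrate why the two sides can drift apart: the same character produces different bytes depending on the encoding. The following is a minimal C++ sketch, not code from Simon. It uses Latin-1 as an assumed stand-in for a Windows 8-bit code page, and the helper names encodeUtf8 and encodeLatin1 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (the encoding Simon uses internally).
// Surrogate code points are not rejected; this is an illustration only.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point in a single-byte charset.
// Assumption: Latin-1 stands in for the "local 8 bit" code page.
std::string encodeLatin1(std::uint32_t cp) {
    assert(cp <= 0xFF);
    return std::string(1, static_cast<char>(cp));
}
```

For U+00E4 (ä), encodeUtf8 yields the two bytes C3 A4 while encodeLatin1 yields the single byte E4, so a byte-wise comparison between a prompts entry and a file name written in different encodings fails for any non-ASCII name. ASCII-only names are identical in both encodings, which is why only words with special characters are affected.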
Comment 3 Peter Grasch 2013-06-03 13:35:01 UTC
Git commit a247b7f4abdd8833a11c63341d02c7765e788774 by Peter Grasch.
Committed on 03/06/2013 at 15:33.
Pushed by grasch into branch 'master'.

Encoding sample names in local 8 bit charset for SPHINX on Windows

M  +9    -2    simonlib/speechmodelcompilation/modelcompilationadaptersphinx.cpp

http://commits.kde.org/simon/a247b7f4abdd8833a11c63341d02c7765e788774
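
The diff itself is not reproduced here, so the following is only a sketch of the idea stated in the commit message: re-encode a UTF-8 sample name into a single-byte charset before writing it where SPHINX will read it. The helper utf8ToLatin1 is hypothetical, and Latin-1 is an assumed stand-in for the local 8-bit charset. Qt offers QString::toLocal8Bit() for this kind of conversion, but whether the commit uses it is not shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: re-encode a UTF-8 name as Latin-1.
// Only code points U+0000..U+00FF are representable; any other
// character (or a malformed sequence) is replaced with '?'.
std::string utf8ToLatin1(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        if (b0 < 0x80) {
            // Plain ASCII is identical in both encodings.
            out += static_cast<char>(b0);
            ++i;
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80) {
            // Two-byte sequence covering U+0080..U+00FF.
            const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            out += static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F));
            i += 2;
        } else {
            // Not representable in Latin-1: emit '?' and skip the
            // remaining continuation bytes of this character.
            out += '?';
            ++i;
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    // Sanity check: the single-byte form is never longer than the UTF-8 input.
    assert(out.size() <= utf8.size());
    return out;
}
```

The important design point is the one made in comment 2: whichever charset is chosen, the same conversion has to be applied both to the names written into the prompts file and to the names of the sample files on disk, otherwise the mismatch reappears.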