Bug 315460 - SPHINX: Training data of words with special characters in word names cause error during model compilation
Status: RESOLVED FIXED
Alias: None
Product: simon
Classification: Applications
Component: simon
Version: unspecified
Platform: Other Microsoft Windows
Importance: NOR normal
Target Milestone: ---
Assignee: Peter Grasch
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-02-19 16:40 UTC by Peter Grasch
Modified: 2013-06-03 13:35 UTC

See Also:
Latest Commit:
Version Fixed In:


Attachments

Description Peter Grasch 2013-02-19 16:40:06 UTC
On Windows, file names containing special (non-ASCII) characters, as listed in the prompts file, do not match the actual file names on disk due to encoding problems.

This causes the feature estimation phase of SPHINX to fail.
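To illustrate the kind of mismatch described here, the following is a minimal sketch, not code from simon. The helper names toUtf8 and toLatin1 are hypothetical, and Latin-1 is used only as a stand-in for a Windows local 8-bit code page. It shows that the same sample name yields different byte sequences depending on the encoding, so a name written to the prompts file in one encoding will not match a file name stored in the other.

```cpp
#include <string>

// Hypothetical helper: encode Unicode code points as UTF-8 bytes.
std::string toUtf8(const std::u32string &text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: encode as Latin-1, one byte per character.
// Characters outside Latin-1 are replaced with '?'.
std::string toLatin1(const std::u32string &text)
{
    std::string out;
    for (char32_t cp : text) {
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    }
    return out;
}
```

For a word such as "Äpfel", toUtf8 produces six bytes (starting with C3 84) while toLatin1 produces five (starting with C4). Pure ASCII names come out identical in both, which is consistent with the bug only affecting words with special characters.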
Comment 1 root 2013-03-28 14:52:42 UTC
As I understand it, the problem is in ModelCompilationAdapterSPHINX::storeTranscriptionAndFields 
(we are using UTF-8-specific functions there). Is there an appropriate way to find out the encoding of a file (or string)?
Comment 2 Peter Grasch 2013-03-28 17:17:46 UTC
Sadly there is no way to reliably retrieve the encoding from a given text. The only way to know the encoding is to keep track of it.

Simon internally should (and does) use UTF-8. AFAIK, NTFS uses UTF-16.
It doesn't really matter that much if the file names themselves are broken, as these files will never be seen by an average end user, but the encoding needs to remain consistent between the file system and the prompts table.
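As a rough illustration of why the encoding cannot be reliably recovered from the text alone, here is a minimal sketch. The function names looksLikeUtf8 and decodeLatin1 are hypothetical, and the UTF-8 check is simplified: it only checks lead and continuation bytes and does not reject overlong forms or surrogates. The point is that some byte sequences are plausible in more than one encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check (simplified, see lead-in).
bool looksLikeUtf8(const std::string &bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80) {
            extra = 0;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (i + extra >= bytes.size()) {
            return false;  // sequence is cut off
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;
            }
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: every byte is a valid Latin-1 character,
// so decoding as Latin-1 never fails.
std::u32string decodeLatin1(const std::string &bytes)
{
    std::u32string out;
    for (char c : bytes) {
        out += static_cast<char32_t>(static_cast<unsigned char>(c));
    }
    return out;
}
```

The two bytes C3 A4 pass looksLikeUtf8 (they would decode to "ä"), yet decodeLatin1 also accepts them and yields the two characters "Ã¤". Nothing in the bytes says which reading was intended, which is why the encoding has to be tracked and applied consistently rather than guessed.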
Comment 3 Peter Grasch 2013-06-03 13:35:01 UTC
Git commit a247b7f4abdd8833a11c63341d02c7765e788774 by Peter Grasch.
Committed on 03/06/2013 at 15:33.
Pushed by grasch into branch 'master'.

Encoding sample names in local 8 bit charset for SPHINX on Windows

M  +9    -2    simonlib/speechmodelcompilation/modelcompilationadaptersphinx.cpp

http://commits.kde.org/simon/a247b7f4abdd8833a11c63341d02c7765e788774