On Windows, file names with special (non-ASCII) characters in the prompts file do not match the actual file names due to encoding problems. This causes the feature estimation phase of SPHINX to fail.
As I understand it, the problem is in ModelCompilationAdapterSPHINX::storeTranscriptionAndFields (we are using UTF-8-specific functions). Is there an appropriate way to find out the encoding of a file (or string)?
Sadly, there is no way to reliably retrieve the encoding from a given text. The only way to know the encoding is to keep track of it. Simon internally should (and does) use UTF-8. AFAIK, NTFS uses UTF-16. It doesn't really matter that much if the file names themselves are broken, as these files will never be seen by an average end user, but the encoding needs to remain consistent between the file system and the prompts table.
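To illustrate the point about consistency, here is a minimal sketch in Python (not Simon's C++ code). The sample name, the helper `find_sample`, and the choice of cp1252 as the local 8-bit charset are all assumptions made for the example; the real local charset depends on the system locale.

```python
# Hypothetical sample name; not taken from the Simon sources.
SAMPLE_NAME = "Müller_01"

# Assumption: the files on disk were named using the local 8-bit charset
# (cp1252 here), so the "file system" holds these bytes.
fs_names = {SAMPLE_NAME.encode("cp1252")}


def find_sample(fs_names, prompt_name, charset):
    """Return True if prompt_name, encoded with charset, exists on disk."""
    return prompt_name.encode(charset) in fs_names


# Prompts written as UTF-8 do not match the 8-bit file names byte for byte.
found_with_utf8 = find_sample(fs_names, SAMPLE_NAME, "utf-8")

# Using the same charset on both sides restores the match.
found_with_local = find_sample(fs_names, SAMPLE_NAME, "cp1252")

# The UTF-8 bytes also decode without error as cp1252, just to the wrong
# text, which is why the encoding cannot be reliably guessed from the data.
misread = SAMPLE_NAME.encode("utf-8").decode("cp1252")
```

In this sketch the lookup fails only when the two sides disagree on the charset; which charset is used matters less than using the same one for both the file names and the prompts entries.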
Git commit a247b7f4abdd8833a11c63341d02c7765e788774 by Peter Grasch.
Committed on 03/06/2013 at 15:33.
Pushed by grasch into branch 'master'.

Encoding sample names in local 8 bit charset for SPHINX on Windows

M +9 -2 simonlib/speechmodelcompilation/modelcompilationadaptersphinx.cpp

http://commits.kde.org/simon/a247b7f4abdd8833a11c63341d02c7765e788774