Summary: | CSV plugin mishandles UTF-16 files | ||
---|---|---|---|
Product: | [Applications] kmymoney | Reporter: | allan <agander93> |
Component: | importer | Assignee: | KMyMoney Devel Mailing List <kmymoney-devel> |
Status: | RESOLVED NOT A BUG | ||
Severity: | normal | CC: | lukasz.wojnilowicz, ostroffjh |
Priority: | NOR | ||
Version: | 4.8.0 | ||
Target Milestone: | --- | ||
Platform: | Mint (Ubuntu based) | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
UTF-16 file
UTF-16 file |
Description
allan
2016-10-18 10:25:51 UTC
Created attachment 101617
UTF-16 file
Created attachment 101618
UTF-16 file
It looks like my attempts to provide an edited sample file either produce rubbish, or remove whatever causes the problem.
So, I may have to provide a complete file, but I don't wish to broadcast it, so would like to send it off-line. To whom?

Allan

(In reply to allan from comment #3)
> It looks like my attempts to provide an edited sample file either produce
> rubbish, or remove whatever causes the problem.
> So, I may have to provide a complete file, but I don't wish to broadcast it,
> so would like to send it off-line. To whom?
>
> Allan

I've sent a gpg'ed copy to Thomas.

Allan

Thomas suggested using Okteta to look at the data to see if the BOM was correct. Here are the first few lines :-

"
00000000  22 00 4D 00 52 00 20 00 41 00 4C 00  ".M.R. .A.L.
0000000C  4C 00 41 00 4E 00 20 00 41 00 4E 00  L.A.N. .A.N.
00000018  44 00 45 00 52 00 53 00 4F 00 4E 00  D.E.R.S.O.N.
"

So, no BOM, just the data, and still mis-formatted by the plugin.

Ah, I've just checked against the Libre Office Calc file and, just looking at the beginning, the bad one has "22 00", and the good one has "FF FE". So, the BOM is wrong on the bank version.

Allan

I think you might be able to use dos2unix to modify the file into a usable format, or at least confirm info about the encoding and BOM. It may take a while to wade through all the options and parameters.

[from Thomas]
"
Hi Allan,

you found out yourself: the BOM is not wrong, it's missing. I am sure you stumbled over https://en.wikipedia.org/wiki/UTF-16.

I have not looked at the code of the CSV importer at that point, but you could check the beginning of the file (4 bytes) and see if the first two match a BOM or you find two 0x00 in those four bytes (though that would probably not work for Asian countries, as they fill the upper byte with their characters).

Reading the data through a QTextStream allows you to set up the encoding/decoding.
Please take a look at QTextStream::setCodec() and setAutoDetectUnicode(), though according to the docs I have, the automatic detection should be the default.

Maybe you could add another UI selector for the encoding, in case you don't have one. See QTextCodec::availableCodecs() for a list of them. Check the Tools/Encoding menu in Kate/KWrite to see how this might look.

So much for now. If you don't get the codec stuff going, please tell me where to find the relevant source code and I'll take a look at it.

Thomas
"

[My reply]
"
I have/had -

    QTextStream inStream(&inFile);
    QTextCodec* codec = QTextCodec::codecForMib(m_codecs.value(m_encodeIndex)->mibEnum());
    inStream.setCodec(codec);

    QString buf = inStream.readAll();
    ...

(void CSVWizard::readFile(const QString& fname) line c843)

which I nicked from Qt, I think.

I have encoding selection in the file selector.

    ...
    QPointer<QLabel> label = new QLabel(i18n("Encoding"));
    dialog->layout()->addWidget(label);
    // Add encoding selection to FileDialog
    QPointer<QComboBox> comboBoxEncode = new QComboBox();
    setCodecList(m_codecs, comboBoxEncode);
    comboBoxEncode->setCurrentIndex(m_encodeIndex);
    connect(comboBoxEncode, SIGNAL(activated(int)), this, SLOT(encodingChanged(int)));
    dialog->layout()->addWidget(comboBoxEncode);

(bool CSVWizard::getInFileName(QString& inFileName) line c798)

I don't see a setAutoDetectUnicode(). I don't think I had auto-selection; encoding was by manual selection from the list of codecs, but UTF-16 seems not to work (in my code).

Allan
"

I'm afraid I'm not able to commit to coding, still.

I tried this on my KDE4, KMyMoney 4.8 production system (this is generated from HEAD on the 4.8 branch). What is annoying is that once I select a file, it starts automatically, with no way to change parameters. One should be able to start the process by pressing the OK button. This causes the UTF-16 data to display as garbage because of the 0 bytes it contains.
When I change the encoding in the dialog to UTF-16 before I select the file, then things seem to work properly.

I am looking at the following snippet in CSVDialog::readFile(const QString& fname):

    QFile m_inFile(m_inFileName);
    m_inFile.open(QIODevice::ReadOnly);  // allow a Carriage return - // QIODevice::Text
    QTextStream inStream(&m_inFile);
    QTextCodec* codec = QTextCodec::codecForMib(m_codecs.value(m_encodeIndex)->mibEnum());
    inStream.setCodec(codec);

    QString buf = inStream.readAll();

When selecting UTF-16 before selecting your file, QString buf contained the correct data. I verified this in the debugger, and the data displayed in spreadsheet form also seemed to be correct.

Hope that helps for further investigation.

(In reply to Thomas Baumgart from comment #8)
> I tried this on my KDE4, KMyMoney 4.8 production system (this is generated
> of HEAD on the 4.8 branch).
<snip>
> When I change the encoding in the dialog to UTF-16 before I select the file,
> then things seem to work properly.

Just to be clear, are you saying that you do not then see the problem I reported, of the data being garbled? If I ensure that UTF-16 is already selected - displayed in the file selector - then I definitely do see the corruption.

> I am looking at the following snippet in
> CSVDialog::readFile(const QString& fname):

Might this be a non-current git head version? There has recently been some reformatting of the source and I have void CSVWizard::readFile(const QString& fname), as does the current git head. In terms of the actual code, they are identical in this particular area.

Just to be clear, again: so far as the selector is concerned, yes, there are a couple of problems. However, if the encoding is set to UTF-16, whether from previous activity or from setting it prior to selecting the file, then the file should be read with the required encoding. Some tweaking looks to be necessary though.
Allan

>
> QFile m_inFile(m_inFileName);
> m_inFile.open(QIODevice::ReadOnly); // allow a Carriage return - //
> QIODevice::Text
> QTextStream inStream(&m_inFile);
> QTextCodec* codec =
> QTextCodec::codecForMib(m_codecs.value(m_encodeIndex)->mibEnum());
> inStream.setCodec(codec);
>
> QString buf = inStream.readAll();
>
> When selecting UTF-16 before selecting your file, QString buf contained the
> correct data. I verified this in the debugger and also data displayed in
> spread sheet form seemed to be correct.
>
> Hope that helps for further investigation.

From what I can tell, the data looks good if I read it using the UTF-16 decoder on the next page in the wizard. Nothing fancy or garbled. And I just omitted the return type void when I referenced the method. It is a current git head of the 4.8 branch.

Closing based on comment #8 and me successfully opening the provided file in the master version. Reopen if necessary. |
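
For anyone revisiting this report: the check Thomas describes earlier in the thread (look at the first four bytes, accept a BOM if one is present, otherwise look for two 0x00 bytes) could be sketched roughly as below. This is a standalone illustration in plain C++ and is not code from the KMyMoney importer; the names `Guess` and `guessUtf16` are invented for the example. The no-BOM branch is only a heuristic and, as Thomas points out, it will not work for text whose upper bytes are non-zero.

```cpp
#include <array>
#include <cstdint>

// Hypothetical result type for this sketch; not a KMyMoney or Qt type.
enum class Guess { Utf16LeBom, Utf16BeBom, Utf16LeNoBom, Utf16BeNoBom, Unknown };

// Hypothetical helper: guess the encoding from the first four bytes of a file.
Guess guessUtf16(const std::array<std::uint8_t, 4>& head)
{
    // 1. An explicit byte order mark decides the matter.
    //    (FF FE 00 00 is also the UTF-32LE BOM; this sketch ignores UTF-32.)
    if (head[0] == 0xFF && head[1] == 0xFE) return Guess::Utf16LeBom;
    if (head[0] == 0xFE && head[1] == 0xFF) return Guess::Utf16BeBom;

    // 2. No BOM: ASCII-range text stored as UTF-16 has 0x00 in every other
    //    byte. Zeros in the odd positions suggest little-endian, zeros in
    //    the even positions suggest big-endian.
    const bool oddZero  = head[1] == 0x00 && head[3] == 0x00;
    const bool evenZero = head[0] == 0x00 && head[2] == 0x00;
    if (oddZero && head[0] != 0x00 && head[2] != 0x00) return Guess::Utf16LeNoBom;
    if (evenZero && head[1] != 0x00 && head[3] != 0x00) return Guess::Utf16BeNoBom;

    // 3. Anything else: leave the choice to the user's encoding selector.
    return Guess::Unknown;
}
```

Fed the first four bytes of the bank file from the hex dump (22 00 4D 00), this returns `Guess::Utf16LeNoBom`; a file starting with FF FE, like the Libre Office Calc one, returns `Guess::Utf16LeBom`. A result of `Unknown` would simply mean falling back to whatever encoding is selected in the file dialog.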