Bug 371069 - CSV plugin mishandles UTF-16 files
Status: RESOLVED NOT A BUG
Alias: None
Product: kmymoney
Classification: Applications
Component: importer
Version: 4.8.0
Platform: Mint (Ubuntu based) Linux
Importance: NOR normal
Target Milestone: ---
Assignee: KMyMoney Devel Mailing List
 
Reported: 2016-10-18 10:25 UTC by allan
Modified: 2016-12-31 12:02 UTC

Attachments
UTF-16 file (460 bytes, text/csv)
2016-10-18 10:30 UTC, allan
UTF-16 file (456 bytes, text/plain)
2016-10-18 10:55 UTC, allan

Description allan 2016-10-18 10:25:51 UTC
My credit card company has revised their web site, and the statement importing procedure and format has changed.

The KMM CSV importer is not happy with this new format.  All characters have an interspersed null character (I think, probably 00). Each line is followed by an additional line containing the same odd character.  So, the import is incorrect.

If I import via Libre Office Calc, all is correct.  It is shown as UTF-16.  If I load via Kate, this too looks wrong, unless I change the encoding from UTF-8 to UTF-16.  In the CSV importer, however, changing the encoding from UTF-8 to UTF-16 makes no apparent difference and the original incorrect result still appears.

Reloading an earlier file is as normal.

So, it appears that the import encoding has changed and the CSV plugin does not handle it correctly.


Reproducible: Always

Steps to Reproduce:
1. Import a CSV file encoded in UTF-16.
2. The file will show the incorrect format.

Actual Results:  
As above.

Expected Results:  
The UTF-16 should be handled correctly.

There may be more to this, however.  When I tried to create a file in UTF-16, it could be imported correctly.  So, to be able to demonstrate the problem, I have used a file from my credit card company, deleted everything except the first line and then added a new first line of "12-10-2016,Description,1234.56"
Attachment follows.
Comment 1 allan 2016-10-18 10:30:59 UTC
Created attachment 101617 [details]
UTF-16 file
Comment 2 allan 2016-10-18 10:55:19 UTC
Created attachment 101618 [details]
UTF-16 file
Comment 3 allan 2016-10-18 18:24:42 UTC
It looks like my attempts to provide an edited sample file either produce rubbish, or remove whatever causes the problem.
So, I may have to provide a complete file, but I don't wish to broadcast it, so would like to send it off-line.  To whom?

Allan
Comment 4 allan 2016-10-21 11:18:45 UTC
(In reply to allan from comment #3)
> It looks like my attempts to provide an edited sample file either produce
> rubbish, or remove whatever causes the problem.
> So, I may have to provide a complete file, but I don't wish to broadcast it,
> so would like to send it off-line.  To whom?
> 
> Allan

I've sent a gpg'ed copy to Thomas.

Allan
Comment 5 allan 2016-10-21 21:03:12 UTC
Thomas suggested using Okteta to look at the data to see if the BOM was correct.
Here are the first few lines :-
"
00000000   22 00 4D 00  52 00 20 00  41 00 4C 00  ".M.R. .A.L.
0000000C   4C 00 41 00  4E 00 20 00  41 00 4E 00  L.A.N. .A.N.
00000018   44 00 45 00  52 00 53 00  4F 00 4E 00  D.E.R.S.O.N. "

So, no BOM, just the data, and still mis-formatted by the plugin.

Ah, I've just checked against the Libre Office Calc file and, just looking at the beginning, the bad one has "22 00", and the good one has "FF FE".  So, the BOM is wrong on the bank version.

Allan
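
Editor's sketch: the check done by eye in Okteta above can be expressed in a few lines. This is a minimal illustration in plain C++, not code from KMyMoney; the function name bomEncoding is hypothetical. It only looks at the leading bytes and reports which byte order mark, if any, is present.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of KMyMoney): report the encoding announced
// by a leading byte order mark, or "none" if the data starts with content.
// UTF-32 BOMs are deliberately not distinguished in this sketch.
std::string bomEncoding(const std::vector<unsigned char>& head)
{
    const std::size_t n = head.size();
    if (n >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        return "UTF-16BE";
    return "none";
}
```

With the bytes quoted above, a file starting "22 00 4D 00" yields "none", while the Libre Office Calc file starting "FF FE" yields "UTF-16LE".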
Comment 6 Jack 2016-10-21 21:52:56 UTC
I think you might be able to use dos2unix to modify the file into a usable format, or at least confirm info about the encoding and BOM.  It may take a while to wade through all the options and parameters.
Comment 7 allan 2016-10-22 10:29:02 UTC
[from Thomas]
"
Hi Allan,

you found out yourself: the BOM is not wrong, it's missing. I am sure, you 
stumbled over https://en.wikipedia.org/wiki/UTF-16.

I have not looked at the code of the CSV importer at that point, but you could 
check the beginning of the file (4 bytes) and see if the first two match a BOM 
or you find two 0x00 in those four bytes (where that would probably not work 
for Asian countries as they fill the upper byte with their characters).

Reading the data through a QTextStream allows you to set up the encoding/decoding. 
Please take a look at QTextStream::setCodec() and setAutoDetectUnicode(), 
though according to the docs I have, the automatic detection should be the 
default. Maybe you could add another UI selector for the encoding, in case you 
don't have it. See QTextCodec::availableCodecs() for a list of them. Check 
the Tools/Encoding menu in Kate/KWrite to see how this might look.

So much for now. If you don't get the codec stuff going, please tell me where 
to find the relevant source code and I take a look at it.

Thomas"
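
Editor's sketch: the four-byte heuristic Thomas describes might look roughly like the following. This is plain C++ rather than Qt, and the name guessUtf16WithoutBom is hypothetical. Inferring the byte order from which positions hold the 0x00 bytes is an assumption added here, not something stated in the thread. As Thomas notes, the guess fails for text whose upper bytes are non-zero (for example many Asian scripts), so it can only ever be a hint.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: guess whether BOM-less data is UTF-16 by looking for
// 0x00 bytes among the first four bytes. Assumes the text starts with two
// characters below U+0100, whose upper byte is 0x00 in UTF-16.
std::string guessUtf16WithoutBom(const std::vector<unsigned char>& head)
{
    if (head.size() < 4)
        return "unknown";

    const bool evenZero = head[0] == 0x00 && head[2] == 0x00;
    const bool oddZero  = head[1] == 0x00 && head[3] == 0x00;

    if (oddZero && !evenZero)
        return "UTF-16LE";  // low byte first, upper byte 0x00 second
    if (evenZero && !oddZero)
        return "UTF-16BE";  // upper byte 0x00 first, low byte second
    return "unknown";       // no clear pattern; fall back to user selection
}
```

For the bank file's first bytes "22 00 4D 00" this returns "UTF-16LE"; for ordinary single-byte text such as "12-1" it returns "unknown".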

[My reply]
"
I have /had -

QTextStream inStream(&inFile);
QTextCodec* codec = QTextCodec::codecForMib(m_codecs.value(m_encodeIndex)->mibEnum());
inStream.setCodec(codec);

QString buf = inStream.readAll();
...
(void CSVWizard::readFile(const QString& fname) line c843)
which I nicked from Qt, I think.

I have encoding selection in the file selector.
...
QPointer<QLabel> label = new QLabel(i18n("Encoding"));
dialog->layout()->addWidget(label);
// Add encoding selection to FileDialog
QPointer<QComboBox> comboBoxEncode = new QComboBox();
setCodecList(m_codecs, comboBoxEncode);
comboBoxEncode->setCurrentIndex(m_encodeIndex);
connect(comboBoxEncode, SIGNAL(activated(int)), this, SLOT(encodingChanged(int)));
dialog->layout()->addWidget(comboBoxEncode);

(bool CSVWizard::getInFileName(QString& inFileName) line c798)

I don't see a setAutoDetectUnicode().

I don't think I had auto-selection, but encoding was by manual selection
from the list of codecs, but UTF-16 seems not to work. (in my code).

Allan"

I'm afraid I'm not able to commit to coding, still.
Comment 8 Thomas Baumgart 2016-10-22 10:57:22 UTC
I tried this on my KDE4, KMyMoney 4.8 production system (built from HEAD of the 4.8 branch).

What is annoying is that once I select a file, the import starts automatically, with no way to change parameters first. One should be able to start the process by pressing the OK button. As it is, the UTF-16 data is displayed as weird data because of the 0x00 bytes it contains.

When I change the encoding in the dialog to UTF-16 before I select the file, then things seem to work properly. I am looking at the following snippet in CSVDialog::readFile(const QString& fname):

  QFile  m_inFile(m_inFileName);
  m_inFile.open(QIODevice::ReadOnly);  // allow a Carriage return - // QIODevice::Text
  QTextStream inStream(&m_inFile);
  QTextCodec* codec = QTextCodec::codecForMib(m_codecs.value(m_encodeIndex)->mibEnum());
  inStream.setCodec(codec);

  QString buf = inStream.readAll();

When selecting UTF-16 before selecting your file, QString buf contained the correct data. I verified this in the debugger, and the data displayed in spreadsheet form also seemed to be correct.

Hope that helps for further investigation.
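
Editor's sketch: to illustrate why the result depends on the codec being set before the bytes are read, here is a stand-alone decoder for little-endian UTF-16 in plain C++. It is not the Qt codec and not KMyMoney code; the names appendUtf8 and utf16leToUtf8 are hypothetical. Each pair of bytes becomes one 16-bit unit, so the 0x00 bytes that show up as stray characters under a UTF-8 reading simply disappear into the upper half of each unit.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append one Unicode code point to a UTF-8 encoded string.
static void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode little-endian UTF-16 bytes (with or without
// a leading FF FE BOM) into UTF-8. A trailing odd byte is ignored and
// unpaired surrogates are replaced with U+FFFD.
std::string utf16leToUtf8(const std::vector<unsigned char>& bytes)
{
    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2;  // skip the BOM if present

    while (i + 1 < bytes.size()) {
        std::uint32_t unit = static_cast<std::uint32_t>(bytes[i])
                           | (static_cast<std::uint32_t>(bytes[i + 1]) << 8);
        i += 2;

        // Combine a high surrogate with a following low surrogate.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < bytes.size()) {
            const std::uint32_t low = static_cast<std::uint32_t>(bytes[i])
                                    | (static_cast<std::uint32_t>(bytes[i + 1]) << 8);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        if (unit >= 0xD800 && unit <= 0xDFFF)
            unit = 0xFFFD;  // unpaired surrogate

        appendUtf8(out, unit);
    }
    return out;
}
```

Feeding it the first ten bytes from the Okteta dump in comment 5 (22 00 4D 00 52 00 20 00 41 00) gives the five characters "MR A, including the opening quote, with no interspersed nulls.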
Comment 9 allan 2016-10-22 13:06:19 UTC
(In reply to Thomas Baumgart from comment #8)
> I tried this on my KDE4, KMyMoney 4.8 production system (this is generated
> of HEAD on the 4.8 branch).
<snip>
> When I change the encoding in the dialog to UTF-16 before I select the file,
> then things seem to work properly.

Just to be clear, are you saying that you do not then see the problem I reported, of the data being garbled?  If I ensure that UTF-16 is already selected - displayed in the file selector - then I definitely do see the corruption.

> I am looking at the following snippet in
> CSVDialog::readFile(const QString& fname):

Might this be a non-current git head version?  There has recently been some reformatting of the source and I have void CSVWizard::readFile(const QString& fname), as does the current git head.  In terms of the actual code, they are identical in this particular area.  Just to be clear, again.

So far as the selector is concerned, then, yes, there are a couple of problems, although if the decoding is for UTF-16, from previous activity or from setting it prior to selecting the file, then the file should have the required encoding.  Some tweaking looks to be necessary though.

Allan
> 
>   QFile  m_inFile(m_inFileName);
>   m_inFile.open(QIODevice::ReadOnly);  // allow a Carriage return - //
> QIODevice::Text
>   QTextStream inStream(&m_inFile);
>   QTextCodec* codec =
> QTextCodec::codecForMib(m_codecs.value(m_encodeIndex)->mibEnum());
>   inStream.setCodec(codec);
> 
>   QString buf = inStream.readAll();
> 
> When selecting UTF-16 before selecting your file, QString buf contained the
> correct data. I verified this in the debugger and also data displayed in
> spread sheet form seemed to be correct.
> 
> Hope that helps for further investigation.
Comment 10 Thomas Baumgart 2016-10-25 13:43:26 UTC
From what I can tell, the data looks good if I read it using the UTF-16 decoder on the next page in the wizard. Nothing fancy or garbled.

And I just omitted the return type void when I referenced the method. It is a current git head of the 4.8 branch.
Comment 11 wojnilowicz 2016-12-31 12:02:16 UTC
Closing based on comment #8 and on my successfully opening the provided file in the master version. Reopen if necessary.