463708 – CSV Import very slow

Bug 463708 - CSV Import very slow

Summary: CSV Import very slow

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	kmymoney
Classification:	Applications
Component:	importer (show other bugs)
Version:	5.1.3
Platform:	unspecified Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	KMyMoney Devel Mailing List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-01-01 22:04 UTC by Giuseppe Fuggiano
Modified:	2023-02-01 05:06 UTC (History)
CC List:	1 user (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
test data (920.76 KB, text/csv) 2023-01-01 23:19 UTC, Giuseppe Fuggiano	Details
File with updated payee name from excessive 213 chars to reasonable. (151.04 KB, text/csv) 2023-01-02 23:30 UTC, alx.kuzza	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Giuseppe Fuggiano 2023-01-01 22:04:55 UTC

SUMMARY
When importing a simple and small CSV kmymoney takes 100% CPU and it's getting really too much time (more than 15 minutes).


STEPS TO REPRODUCE
1. Import a CSV file with about 4k lines


OBSERVED RESULT
Too slow and 100% CPU used.

EXPECTED RESULT
Fast import.

SOFTWARE/OS VERSIONS
Windows: 
macOS: 
Linux/KDE Plasma: 
(available in About System)
KDE Plasma Version: 
KDE Frameworks Version: 
Qt Version: 

ADDITIONAL INFORMATION

Comment 1 Jack 2023-01-01 22:41:25 UTC

I"m not sure I'd call 4000 lines a small csv, but I do agree it shouldn't take 15 minutes to import.  First question, which distribution are you running, and where did you download the flatpak from?  I am not aware of any official flatpak version of KMyMoney.
Next step is have you tried importing a csv with only ten or so rows?  This could help determine if the program is getting stuck importing any csv data, or if the problem is related to the size of the file.  Also probably not relevant, but how large is your kmy file?  Just a few accounts, or hundreds?  Again, it shouldn't really matter, but I'm looking for any information which might help figure out where it is getting stuck.  Finally, how long did it eventually take for the import, or did you interrupt it before it finished?

Comment 2 Giuseppe Fuggiano 2023-01-01 23:19:59 UTC

Created attachment 154945 [details]
test data

Use this data file to reproduce the bug. I've imported it to a new XML file. It tool more that 10 minutes to finish.

Comment 3 Jack 2023-01-01 23:55:09 UTC

I'm going to skip a repeated request for more information (which distribution, where did you download KMM from) and go straight to the csv file.  I see only three columns in the file.  The first is always "01-01-2023" a valid date.  The second is always -1.  I don't know what this is, but it doesn't look like an amount or payee name or a category to me.  The third is a consistent very long string followed by a space and some digits.  While the digits might be an amount depending on your currency, they are not in a separate column, so I don't see any way KMyMoney could parse the value.
Please tell us how you mapped the columns to the various fields in the csv import wizard, or at least what you expect KMyMoney to do with those three values.

Comment 4 Jack 2023-01-02 00:07:15 UTC

What I will ask is what type of PC are you using?  I just did import the csv to a newly created XML file with one checking account.  It accepted column 1 as the date, column 2 as the payee, and column 3 as the payment amount.  I have no idea what currency processing it used, but it did take the digits at the end of the values in the third column as whole dollars.  The entire import took well under one minute, although I didn't use a stopwatch.  My PC has an AMD Ryzen at about 3.5GHz.  The multiple cores are irrelevant, as KMyMoney only uses a single thread.
The cause of your slow import seems not to be your XML file, the csv file, or KMyMoney.  Are both your files on a local hard drive or on a network?  If local, are you sure the drive is healthy?   If network, have you testing network throughput?  Is there perhaps anything else odd or uncommon about your setup?

Comment 5 Giuseppe Fuggiano 2023-01-02 00:29:14 UTC

Hi, sorry for the few information provided. You imported my sample data as expected. I'm using a "standard" setup, no network involved. My drive is healthy. My PC is a DELL with 6 cores i7-8750H CPU @ 2.20GHz. 

I'm under Alpine Linux. I'm checking under another distro.

Comment 6 Giuseppe Fuggiano 2023-01-02 00:39:11 UTC

No sorry you should import it as follows:

first column: date
second column: amount
third column: payee

I'm using the third column as payee because I don't have the payee specified in my CSV exported from the bank. So I'll try to match the correct payee later using the automated matching feature.

I've downloaded and installed kmymoney using the official Alpine Linux repository.

Can you reproduce the bug now?

Comment 7 alx.kuzza 2023-01-02 01:28:31 UTC

confirmed, yeah, this is what I have:
1) same 100% cpu
2) took 20 minutes to import that file.

Comment 8 Jack 2023-01-02 01:28:56 UTC

Importing 100 rows took under 3 seconds.  I only let the whole file go about 7 minutes before killing it, so it may well have gone near or over 10 minutes..  My best guess is that the time to process each row is increasing linearly with the number of payees, because it is creating a new payee for each row (the text may be the same, but the digits at the end are different) and for each row, it needs to compare against all the existing payees to see if the payee already exists.  I would personally call your sample csv is a pathological example (meaning it triggers a possibly problem behavior in the worst possible way.)  To be sure, I would time the import of 100, 200, 300, ... rows to see how the time compares to number of rows.

My personal data file, which has data for well over 10 years, has 950 accounts (including 723 stock accounts, many closed) but only 335 payees.  Note I don't keep a separate payee for absolutely every different possible one - I have some like "Miscellaneous gas station" and "Miscellaneous grocery store" with individual ones only for Payees I frequently use.  

So - I see how that file can take an excessively long time to import, but I think it is because the file is not really a good example of real data.  If you really do expect to be importing thousands of transactions at a time, with very little possibility of recognizing payees as already known, then you will end up with slow imports, and I can't think of any way to make the program faster at what it has to do.  Note it might well be notably faster if you used shorter payee examples, as your samples are all over 200 characters.

Comment 9 Jack 2023-01-02 17:49:10 UTC

I have a suggestion.  Add a fourth value to each row, "payee."  Then import that column (4) as the payee, and import column 3 as the memo.  I did this, and it took 45 seconds.  If your point is to manually assign the payee later, then you probably don't really want 4000 new payees you will have to delete anyway.  I am also assuming that column 3 is intended to be what the bank provides, which actually includes multiple bits of information, so it can't directly be the payee.
Let me know how this works for you.

Comment 10 Giuseppe Fuggiano 2023-01-02 17:59:04 UTC

Hi Jack, thanks for your suggestion.

I can indeed import it as Memo, as it really is.   But I also need to import it as a Payee because I can't match the Memo field to determine the Payee (and thus it's category) automatically, using the bank-provided information alone.  

As you can imagine, I can't / don't want to do that job manually each time I import a CSV, even a small one.  Instead, I need something that can add all the metadata automatically (including payees and categories).   If I could match the Memo field and popupate the Payee and the Category fields, for example using another kMyMoney feature with regular expressions, as done in the Payee tab, it would be perfect.  Is that possible?  If not, do you have some other suggestions? 

Remember: my bank-provided CSVs don't contain the Payee nor Category information but only the Description field.  My banks don't give me that information ready to import (and even if so, I would want to use my own categories...).

Thank you.

Comment 11 Jack 2023-01-02 18:20:17 UTC

First, I do not believe there is any way to match the Category on import.  The Category is set if the imported transaction is matched to an existing transaction.  If there is not already a wishlist for that, perhaps we can eventually alter this bug for that purpose.
Second, there are some import features available for online OFX import which might help, but are unfortunately not available for csv import.  I have no idea how hard it would be to extend those features to file import.
Finally, I'm actually not sure I understand how you plan to find the actual Payee from the long imported Payee.  Payee matching happens at import time - I'm not aware of being able to do it after the import.
Have you read the Handbook chapter on Payees, specifically the section on Transaction Matching Settings?  https://docs.kde.org/stable5/en/kmymoney/kmymoney/details.payees.personalinformation.html#details.payees.personalinformation.matching  Before I make any suggestions, I would need to see some real examples of the Description field your bank provides.  Can you provide some?  If it does include the Payee name, then you could set that payee to match on that name.
Finally, I wonder if what you really want to do is to pre-process your downloaded csv to extract explicit payee and category columns from the description.  It might be an extra step, but once you write the script, the extra work for each import might be less than what you plan.
Separately, even if you continue as you started, how may payees do you eventually expect to have?  The excessive import time is because each row needs to compare the "payee" (description) to every existing  payee, looking for a match.  If you are only importing a hundred rows at a time, and you only have a few hundred Payees, it won't be instant, but much less than 20 minutes.

Comment 12 Giuseppe Fuggiano 2023-01-02 18:40:24 UTC

Hi Jack,

thank you for your suggestions.  Since there is no other solution, I think I will end up preprocessing the CSV as you said.

Regards

Comment 13 alx.kuzza 2023-01-02 23:29:45 UTC

First: what is the point to match a payee name having 213 characters?

I updated a sample file to have "Some Payee Name" and an import finished in less than 2 minutes, well it is very slow, but not too excessive. 
For the future we need to see why it is slow. I would say it should run for maybe like 5-10 seconds.

File may contain everything, this is an example what i'm importing:
"Trans. Date","Post Date",Description,Amount,Category
2022-02-14,02/14/2022,"ZYBOOK COLLEGE TXTBOOK 4086349926 CA",72.00,Services

"Services" is  a category, it gets automatically imported and matched.

NOTE: updated file is attached.

Comment 14 alx.kuzza 2023-01-02 23:30:53 UTC

Created attachment 154977 [details]
File with updated payee name from excessive 213 chars to reasonable.

Comment 15 Bug Janitor Service 2023-01-17 05:17:05 UTC

Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!

Comment 16 Bug Janitor Service 2023-02-01 05:06:15 UTC

This bug has been in NEEDSINFO status with no change for at least
30 days. The bug is now closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

Thank you for helping us make KDE software even better for everyone!