477294 – Import from CSV fails when double-quoted text contains commas and quotes are stripped

Status:	RESOLVED FIXED

Alias:	None

Product:	LabPlot2
Classification:	Applications
Component:	frontend (other bugs)
Version First Reported In:	2.8.2
Platform:	Ubuntu Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Alexander Semke

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-11-20 17:27 UTC by Dreas Nielsen
Modified:	2023-11-22 20:19 UTC (History)
CC List:	0 users

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Attachments
attachment-3918188-0.html (1.36 KB, text/html) 2023-11-20 21:58 UTC, Dreas Nielsen	Details
attachment-4151682-0.html (6.93 KB, text/html) 2023-11-22 18:18 UTC, Dreas Nielsen	Details

Description Dreas Nielsen 2023-11-20 17:27:33 UTC

SUMMARY
***
Import from a CSV file to a spreadsheet produces mis-aligned columns when the CSV file contains double-quoted text containing commas, and quotes are stripped on import, and commas are also used as column delimiters.

It appears that double quotes are stripped *before* parsing each row into columns, instead of *after*.
***


STEPS TO REPRODUCE
1. Create a CSV file containing double-quoted text columns where the text includes a comma, and commas are also used as column delimiters.
2. Use File/Import to import that CSV file to a spreadsheet, using a custom data format and checking the "Remove quotes" checkbox.

OBSERVED RESULT
Tokens separated by commas within double-quoted text are put in separate columns, pushing all subsequent columns to the right.

EXPECTED RESULT
Tokens separated by commas within double-quoted text should all be in the same column.
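
The suspected ordering problem can be illustrated with a small Python sketch. This is not LabPlot's import code; the sample row and both function names are hypothetical and only contrast "strip quotes, then split" with "parse, then strip".

```python
import csv
import io

# Hypothetical sample: the second field of the data row contains a comma
# inside double quotes, and commas are also the column delimiter.
SAMPLE = 'id,name,value\n1,"Smith, John",42\n'


def strip_then_split(text):
    """Suspected faulty order: remove quotes first, then split on commas."""
    return [line.replace('"', '').split(',') for line in text.splitlines()]


def parse_then_strip(text):
    """Expected order: a CSV parser honors the quotes while splitting,
    and the quotes are dropped from the resulting fields."""
    return list(csv.reader(io.StringIO(text)))


print(strip_then_split(SAMPLE)[1])  # ['1', 'Smith', ' John', '42']
print(parse_then_strip(SAMPLE)[1])  # ['1', 'Smith, John', '42']
```

The first result has four fields for a three-column header, which matches the observed shift of subsequent columns to the right.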

SOFTWARE/OS VERSIONS
System: Pop!_OS_22.04 LTS
Installed from Ubuntu repository, pulling in all KDE dependencies
KDE Plasma Version: 
KDE Frameworks Version:  5.92.0
Qt Version:  5.15.3

ADDITIONAL INFORMATION

Comment 1 Alexander Semke 2023-11-20 21:08:11 UTC

(In reply to Dreas Nielsen from comment #0)
> SUMMARY
> ***
> Import from a CSV file to a spreadsheet produces mis-aligned columns when
> the CSV file contains double-quoted text containing commas, and quotes are
> stripped on import, and commas are also used as column delimiters.
> 
> It appears that double quotes are stripped *before* parsing each row into
> columns, instead of *after*.
> ***
This was fixed in 2.9. Is it possible for you to update to this release and to try again?

Comment 2 Dreas Nielsen 2023-11-20 21:58:15 UTC

Created attachment 163325 [details]
attachment-3918188-0.html

v. 2.9 is not in the Ubuntu repository.  Is there another place to get 
a .deb package?

By the way, another bug I've encountered occurs when the first column 
of a CSV file contains a text value, but the column is interpreted as 
integers--but that's more easily worked around.

Comment 3 Alexander Semke 2023-11-21 07:30:52 UTC

(In reply to Dreas Nielsen from comment #2)
> Created attachment 163325 [details]
> attachment-3918188-0.html
> 
> v. 2.9 is not in the Ubuntu repository.  Is there another place to get 
> a .deb package?
You can use Flatpak to get the new version of LabPlot. There are also Ubuntu packages (as well as Flatpaks) available for the current development version. Please check the information on https://labplot.kde.org/download/ to see which option is more feasible for you.

Comment 4 Dreas Nielsen 2023-11-22 18:18:25 UTC

Created attachment 163370 [details]
attachment-4151682-0.html

Thanks.  I closed that bug report.

I installed the Windows version and evaluated that.  The purpose of my 
evaluation was to determine whether LabPlot2 is a good tool to provide 
to a group of scientists and engineers who want to easily visualize 
data (primarily spatially-explicit data).  This group includes some who 
are comfortable wielding R or Python for data analysis, but also a 
group who are technically oriented, but will not devote a lot of time 
to learning new software, so ease of use is extremely important.  Tools 
that they currently have available include Orange 
(<https://orangedatamining.com/>), GeoDa 
(<https://geodacenter.github.io/>), KNIME (<https://www.knime.com/>), 
and mapdata.py (<https://mapdata.readthedocs.io/en/latest/>).  LabPlot2 
looks promising because it can display data in a spreadsheet (familiar 
to all potential users) and can potentially assemble multiple plots and 
text in something that looks like a dashboard.

After evaluation, I won't be adding LabPlot2 to our toolbox, for the 
reasons illustrated by the following comments that I made during my 
evaluation.  These don't necessarily qualify as bugs, but they are 
usability issues that perhaps you would want to consider.

1. Import of data from a CSV file or spreadsheet should be simpler: a 
project should be created by default if one does not exist, an import 
option should be on the right-click context menu for a project, and 
import to a new spreadsheet should be the default.

2. Data selection is carried out by masking values of particular 
columns, rather than selecting them. Users may be interested in only a 
subset of a large data set, and masking everything that is *not* of 
interest requires them to know everything about the data set that they 
do not care about.  Masking inverts the importance of elements of the 
data set.

3. When values of a column are masked, they still show up in column 
statistics.

4. The right-click menu for a column header does not include a 'Mask' 
(or 'Select') option, which it reasonably ought to--this functionality 
is on a 'Manipulate data' submenu. Most of the other options on this 
submenu are greyed out because they cannot be applied to imported data. 
 Data selection options should be more obvious and require fewer clicks.

5. Plotting functionality should be more accessible. When a spreadsheet 
has been imported and is displayed, there is no 'Plot' menu or button 
bar option to easily plot data from the spreadsheet.

6. If 'Plot data' is selected from the right-click menu for a column, 
only that column can be selected as the X and Y variable. It is 
necessary to click on one column and control-click on a second column 
to allow an X-Y plot of different variables. This is not obvious and is 
not documented.

7. When an X-Y plot is made, the points are connected by a line, by 
default, but the X values are not sorted in order. There is no property 
or option to allow sorting of X values after the plot is created. If 
the column of X values in the spreadsheet is sorted after the plot is 
displayed, the plot is not automatically updated, and there is no option 
associated with the plot that allows it to be refreshed.

8. When plots are created in a new worksheet, both the plot and the 
worksheet are very small and need to be manually resized to a 
reasonable size.

9. The axis title can be resized, but the axis labels (i.e., data 
values) evidently cannot.

10. If multiple columns are selected in the spreadsheet, and a plot 
then produced, on returning to the spreadsheet the columns are not 
selected--i.e., selections are not preserved.

11. When a box plot of multiple variables is produced, there are no 
X-axis labels identifying the data column corresponding to each 
box-and-whisker figure.

12. When a histogram of a single variable is produced, there is no way 
to modify the number of bins used.

13. When the 'Sort' option is selected from the right-click menu for a 
column, or the sorting buttons on the button bar are used, only that 
column is sorted rather than the entire spreadsheet. This can (almost 
certainly will) lead to a loss of data integrity.

14. The relationships between folders, workbooks, spreadsheets, 
matrixes, and worksheets are not obvious from the UI. It is not clear 
which of them may be required, and how they optionally can be used 
together for different purposes. The documentation describes them 
individually but does not illustrate alternative workflows.

Comment 5 Alexander Semke 2023-11-22 20:19:22 UTC

(In reply to Dreas Nielsen from comment #4)
> Created attachment 163370 [details]
> attachment-4151682-0.html
> 
> Thanks.  I closed that bug report.
> 
> I installed the Windows version and evaluated that.  The purpose of my 
> evaluation was to determine whether LabPlot2 is a good tool to provide 
> to a group of scientists and engineers who want to easily visualize 
> data (primarily spatially-explicit data).  This group includes some who 
> are comfortable wielding R or Python for data analysis, but also a 
> group who are technically oriented, but will not devote a lot of time 
> to learning new software, so ease of use is extremely important.  Tools 
> that they currently have available include Orange 
> (<https://orangedatamining.com/>), GeoDa 
> (<https://geodacenter.github.io/>), KNIME (<https://www.knime.com/>), 
> and mapdata.py (<https://mapdata.readthedocs.io/en/latest/>).  LabPlot2 
> looks promising because it can display data in a spreadsheet (familiar 
> to all potential users) and can potentially assemble multiple plots and 
> text in something that looks like a dashboard.
>
> After evaluation, I won't be adding LabPlot2 to our toolbox, for the 
> reasons illustrated by the following comments that I made during my 
> evaluation.  These don't necessarily qualify as bugs, but they are 
> usability issues that perhaps you would want to consider.
Thank you for your feedback. Let me quickly comment on the points you raised.


> 1. Import of data from a CSV file or spreadsheet should be simpler: a 
> project should be created by default if one does not exist, an import 
> option should be on the right-click context menu for a project, and 
> import to a new spreadsheet should be the default.
In the application settings you can determine what should happen on application start: do nothing, create a new project, create a new project with a spreadsheet, etc. In 2.11 we additionally implemented an option for which notebook to create on startup (Python, R, etc.).


> 2. Data selection is carried out by masking values of particular 
> columns, rather than selecting them. Users may be interested in only a 
> subset of a large data set, and masking everything that is *not* of 
> interest requires them to know everything about the data set that they 
> do not care about.  Masking inverts the importance of elements of the 
> data set.
Plotting only parts of the data is something that needs more elaboration on our side for the UX part; the technical implementation is straightforward. Masking is the only solution right now, with the problems you mentioned. We'll definitely address this in future releases.

 
> 3. When values of a column are masked, they still show up in column 
> statistics.
This is a bug that we'll fix for the upcoming release 2.11.
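
The expected behavior can be sketched in Python as follows. This is only an illustration of statistics that respect a mask, not LabPlot's implementation; `masked_mean` and the sample data are hypothetical.

```python
from statistics import mean


def masked_mean(values, mask):
    """Mean of the values whose mask flag is False (i.e. not masked)."""
    visible = [v for v, masked in zip(values, mask) if not masked]
    return mean(visible) if visible else None


values = [1.0, 2.0, 300.0]
mask = [False, False, True]       # the outlier in the last row is masked

print(mean(values))               # 101.0, the masked value still counts
print(masked_mean(values, mask))  # 1.5, the masked value is excluded
```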


> 4. The right-click menu for a column header does not include a 'Mask' 
> (or 'Select') option, which it reasonably ought to--this functionality 
> is on a 'Manipulate data' submenu. Most of the other options on this 
> submenu are greyed out because they cannot be applied to imported data. 
>  Data selection options should be more obvious and require fewer clicks.
Manipulation of data is also possible for imported data, of course. The entries are greyed out most probably since you don't have any numeric data in this column. Did you pay attention to the decimal separator settings during the import step so the data is properly imported as numeric and not as text?
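
Why the decimal separator matters can be shown with a short Python sketch. It assumes a simplified importer that treats a column as numeric only if every value parses under the configured separator; `parse_number` and `column_is_numeric` are hypothetical names, not LabPlot functions.

```python
def parse_number(text, decimal_sep="."):
    """Return a float if text is numeric under decimal_sep, else None."""
    normalized = text.strip()
    if decimal_sep != ".":
        normalized = normalized.replace(decimal_sep, ".")
    try:
        return float(normalized)
    except ValueError:
        return None


def column_is_numeric(values, decimal_sep="."):
    """A column counts as numeric only if every value parses."""
    return all(parse_number(v, decimal_sep) is not None for v in values)


values = ["1,5", "2,75", "3,0"]        # hypothetical column using a decimal comma
print(column_is_numeric(values, "."))  # False, so the column falls back to text
print(column_is_numeric(values, ","))  # True, so the column is imported as numeric
```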


> 5. Plotting functionality should be more accessible. When a spreadsheet 
> has been imported and is displayed, there is no 'Plot' menu or button 
> bar option to easily plot data from the spreadsheet.
In the context menu of the spreadsheet columns there is a "Plot Data" menu to quickly plot the data. You can check our first tutorial video to see how it works (https://labplot.kde.org/video-tutorials/).


> 6. If 'Plot data' is selected from the right-click menu for a column, 
> only that column can be selected as the X and Y variable. It is 
> necessary to click on one column and control-click on a second column 
> to allow an X-Y plot of different variables. This is not obvious and is 
> not documented.
Lack of good and detailed documentation is an issue, yes. If you select a single column and want to plot it on y, LabPlot tries to find another column in the spreadsheet with the "plot designation" x and uses it as the column for x. If the plot designation is not properly set, this logic fails and you need to select two columns.
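
A rough Python sketch of the lookup described above. The `Column` class, its `designation` values, and `find_x_column` are hypothetical and simplified; the real logic in LabPlot may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    name: str
    designation: str  # hypothetical values: "x", "y", or "" when unset


def find_x_column(columns: List[Column], selected: Column) -> Optional[Column]:
    """Return the first other column designated as x, or None if there is none."""
    for col in columns:
        if col is not selected and col.designation == "x":
            return col
    return None


sheet = [Column("depth", "x"), Column("temperature", "y")]
print(find_x_column(sheet, sheet[1]))   # the "depth" column is used for x

unset = [Column("depth", ""), Column("temperature", "")]
print(find_x_column(unset, unset[1]))   # None, so two columns must be selected
```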


> 7. When an X-Y plot is made, the points are connected by a line, by 
> default, but the X values are not sorted in order. There is no property 
> or option to allow sorting of X values after the plot is created. If 
> the column of X values in the spreadsheet is sorted after the plot is 
> displayed, the plot is not automatically updated, and there is no option 
> associated with the plot that allows it to be refreshed.
I cannot reproduce this issue - the plot is properly updated if the data is sorted in the spreadsheet. It would be great if you could report this issue in a separate bug ticket and provide the steps for how to reproduce.


> 8. When plots are created in a new worksheet, both the plot and the 
> worksheet are very small and need to be manually resized to a 
> reasonable size.
With the help of "templates" you can define your own default values for any object properties that you need to modify.


> 9. The axis title can be resized, but the axis labels (i.e., data 
> values) evidently cannot.
You can change the font size of the axis value labels by selecting the axis and modifying the size in the tab "Labels" in the properties explorer.


> 10. If multiple columns are selected in the spreadsheet, and a plot 
> then produced, on returning to the spreadsheet the columns are not 
> selected--i.e., selections are not preserved.
This is the correct behavior UX-wise since the previous object doesn't have the focus anymore and there is a new selection in the application.


> 11. When a box plot of multiple variables is produced, there are no 
> X-axis labels identifying the data column corresponding to each 
> box-and-whisker figure.
This is possible by using a custom column for the axis label type, with arbitrary text values that are then used for the labels.


> 12. When a histogram of a single variable is produced, there is no way 
> to modify the number of bins used.
For this you need to select "By Number" for the binning method and to specify the number of bins you need.
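
As a generic illustration of binning by a fixed number of bins, here is a Python sketch using equal-width bins. It is an assumption about what such a method typically does, not a description of LabPlot's algorithm; `histogram_by_number` is a hypothetical name.

```python
def histogram_by_number(values, bin_count):
    """Count values in bin_count equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count
    counts = [0] * bin_count
    for v in values:
        if width == 0:
            index = 0  # all values are identical, so use the first bin
        else:
            # The maximum value belongs to the last bin, not a new one.
            index = min(int((v - lo) / width), bin_count - 1)
        counts[index] += 1
    return counts


data = [1, 2, 2, 3, 5, 8, 9, 10]     # hypothetical sample
print(histogram_by_number(data, 3))  # [4, 1, 3]
```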


> 13. When the 'Sort' option is selected from the right-click menu for a 
> column, or the sorting buttons on the button bar are used, only that 
> column is sorted rather than the entire spreadsheet. This can (almost 
> certainly will) lead to a loss of data integrity.
In 2.11 we reworked the UX so this part is clearer and more consistent. Prior to 2.11 you need to use the sort option from the context menu of the spreadsheet to sort multiple columns together.
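
The difference between the two sort modes can be sketched in Python. The data and function names are hypothetical; the sketch only shows why sorting one column in isolation breaks the row alignment.

```python
rows = [("C", 3.0), ("A", 1.0), ("B", 2.0)]  # hypothetical (label, value) rows


def sort_single_column(rows, index):
    """Sort one column in isolation; the other columns keep their order."""
    columns = [list(col) for col in zip(*rows)]
    columns[index].sort()
    return list(zip(*columns))


def sort_rows_by(rows, index):
    """Sort whole rows by one key column, keeping each row intact."""
    return sorted(rows, key=lambda row: row[index])


print(sort_single_column(rows, 0))  # [('A', 3.0), ('B', 1.0), ('C', 2.0)]
print(sort_rows_by(rows, 0))        # [('A', 1.0), ('B', 2.0), ('C', 3.0)]
```

In the first result the label "A" is now paired with 3.0 instead of 1.0, which is the loss of data integrity described in point 13.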

> 14. The relationships between folders, workbooks, spreadsheets, 
> matrixes, and worksheets are not obvious from the UI. It is not clear 
> which of them may be required, and how they optionally can be used 
> together for different purposes. The documentation describes them 
> individually but does not illustrate alternative workflows.
Spreadsheet and Matrix are used for different data layouts. A Workbook can be seen as a parent object that holds multiple such "sheets" together, similar to Excel. The documentation should be improved, yes.

Feel free to reach out to us via the support email if more clarification or help is needed.