Bug 470338 - CSV Import only supports ASCII instead of via locale such as unicode, utf-8
Status: RESOLVED FIXED
Alias: None
Product: LabPlot2
Classification: Applications
Component: general
Version: 2.10.0
Platform: Arch Linux Linux
Importance: NOR minor
Target Milestone: ---
Assignee: Alexander Semke
 
Reported: 2023-05-27 18:33 UTC by Mikael Lövqvist
Modified: 2023-06-21 20:06 UTC (History)
Version Fixed In: 2.10.1


Description Mikael Lövqvist 2023-05-27 18:33:11 UTC
SUMMARY

STEPS TO REPRODUCE
1. Create new spreadsheet and select a text file for import
2. Make note that the only generic CSV option is "ASCII data" but hope it still uses your locale

OBSERVED RESULT
Data is interpreted as ASCII.

EXPECTED RESULT
I was expecting to be able to select encoding but in the absence of this my next expectation was that LabPlot2 would use the system locale and not ASCII. I did not expect this modern and nice software to not support utf-8.

SOFTWARE/OS VERSIONS
KDE Frameworks Version: 5.105.0
Qt Version: 5.15.9 (built against 5.15.8)

ADDITIONAL INFORMATION
Debug build 
May 6 2023, 00:04:35
System: Arch Linux
Locale: English, United States (Decimal point '.', Group separator ',')
Number settings: Decimal point '.', Group separator ',', Exponential 'e', Zero digit '0', Percent '%', Positive/Negative sign '+'/'-' (Updated on restart)
Architecture: x86_64-little_endian-lp64
Kernel: linux 6.3.1-arch1-1
C++ Compiler: GNU 13.1.1
C++ Compiler Flags: -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -Wp,-D_GLIBCXX_ASSERTIONS -g -ffile-prefix-map=/build/labplot/src=/usr/src/debug/labplot -flto=auto -fno-operator-names -fno-exceptions -Wall -Wextra -Wcast-align -Wchar-subscripts -Wformat-security -Wno-long-long -Wpointer-arith -Wundef -Wnon-virtual-dtor -Woverloaded-virtual -Werror=return-type -Werror=init-self -Wvla -Wdate-time -Wsuggest-override -Wlogical-op -Wall -Wextra -Wundef -Wpointer-arith -Wunreachable-code -Wunused -Wdeprecated-declarations -fno-omit-frame-pointer -fstack-protector -fexceptions -std=c++11 -O2 -Wcast-align -Wswitch-enum -fvisibility=default -pedantic -Wzero-as-null-pointer-constant
Comment 1 Bug Janitor Service 2023-06-06 16:39:02 UTC
A possibly relevant merge request was started @ https://invent.kde.org/education/labplot/-/merge_requests/306
Comment 2 Alexander Semke 2023-06-09 11:05:01 UTC
Git commit 85adf7723afebf81801d1e0f948a38c7b1b43123 by Alexander Semke.
Committed on 09/06/2023 at 08:15.
Pushed by marmsoler into branch 'master'.

[ascii import] when reading data, assume it's UTF8 encoded and not Latin1.

UTF8 is the commonly used encoding nowadays. We used it in the past and switched to Latin1 as a performance optimization,
which was clearly a very bad decision. We go back to UTF8 now. This won't solve the problems with UTF16 reported in
https://invent.kde.org/education/labplot/-/issues/541, and probably with other encodings that are not a subset of UTF8, but this needs to be
addressed differently if we want to support them properly in the future.
FIXED-IN: 2.10.1

M  +28   -0    tests/import_export/ASCII/AsciiFilterTest.cpp
M  +1    -0    tests/import_export/ASCII/AsciiFilterTest.h
A  +3    -0    tests/import_export/ASCII/data/utf8_cyrillic.txt

https://invent.kde.org/education/labplot/-/commit/85adf7723afebf81801d1e0f948a38c7b1b43123
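
The effect described in the commit message above can be reproduced outside LabPlot. The Python sketch below is not LabPlot code; the sample header text and the helper name `decode_header` are made up for illustration. It decodes the same bytes once as UTF-8 and once as Latin-1 to show why non-ASCII column names were garbled while plain ASCII files looked fine.

```python
# Illustration only: this is not LabPlot code, and the sample text is hypothetical.

def decode_header(raw: bytes, encoding: str) -> list:
    """Decode one CSV header line and split it on commas."""
    return raw.decode(encoding).rstrip("\n").split(",")

# A header line with Cyrillic column names, stored on disk as UTF-8.
raw = "время,значение\n".encode("utf-8")

utf8_cols = decode_header(raw, "utf-8")      # column names as written
latin1_cols = decode_header(raw, "latin-1")  # every Cyrillic letter turns into two unrelated characters

print(utf8_cols)
print(latin1_cols)

# Pure ASCII input decodes identically under both encodings,
# which is why the problem only shows up with non-ASCII text.
assert decode_header(b"x,y\n", "latin-1") == decode_header(b"x,y\n", "utf-8")
```

Latin-1 maps every byte to exactly one character, so decoding never fails; it silently produces wrong text instead, which makes this kind of bug easy to miss in tests that only use ASCII data.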
Comment 3 Alexander Semke 2023-06-09 11:12:12 UTC
(In reply to Mikael Lövqvist from comment #0)
Thank you for reporting this issue. We have fixed it, and the fix will be part of the 2.10 patch release that we plan to publish in the coming weeks. If you need the fix earlier, please consider using our nightly builds, available in the download section of the homepage.

Comment 4 Alexander Semke 2023-06-21 20:06:33 UTC
Git commit f0f65a46633b1fdf80c3c7a5a73699d34de137af by Alexander Semke.
Committed on 21/06/2023 at 20:06.
Pushed by asemke into branch 'release/2.10'.

[ascii import] when reading data, assume it's UTF8 encoded and not Latin1.

UTF8 is the commonly used encoding nowadays. We used it in the past and switched to Latin1 as a performance optimization,
which was clearly a very bad decision. We go back to UTF8 now. This won't solve the problems with UTF16 reported in
https://invent.kde.org/education/labplot/-/issues/541, and probably with other encodings that are not a subset of UTF8, but this needs to be
addressed differently if we want to support them properly in the future.
FIXED-IN: 2.10.1

Added the actual changes in AsciiFilter.cpp that were forgotten in the previous commit.

M  +15   -15   src/backend/datasources/filters/AsciiFilter.cpp
M  +31   -3    tests/import_export/ASCII/AsciiFilterTest.cpp
M  +1    -0    tests/import_export/ASCII/AsciiFilterTest.h
A  +3    -0    tests/import_export/ASCII/data/utf8_cyrillic.txt

https://invent.kde.org/education/labplot/-/commit/f0f65a46633b1fdf80c3c7a5a73699d34de137af
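
The commit message above notes that UTF16 files are not covered by this fix. One common way to handle such files is to look for a byte-order mark (BOM) before decoding. The sketch below illustrates that general technique in Python; it is an assumption for illustration and not how LabPlot handles (or plans to handle) these files, and the function name `sniff_encoding` is hypothetical.

```python
# Illustration only: a generic BOM check, not LabPlot's implementation.
import codecs

def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess an encoding from a leading byte-order mark, else return the default."""
    # Check the UTF-32 BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return default

# Python's "utf-16" codec writes a BOM when encoding and consumes it when decoding.
raw = "x,y\n1,2\n".encode("utf-16")
encoding = sniff_encoding(raw)
text = raw.decode(encoding)
print(encoding, text.splitlines())
```

This only works for files that actually start with a BOM. UTF-16 data without one would still fall through to the default, so a user-selectable encoding, as requested in the original report, remains the more robust option.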