Bug 442299 - Filelight on Windows ignoring folders with CJK characters on NTFS
Status: RESOLVED FIXED
Alias: None
Product: filelight
Classification: Applications
Component: general
Version: 20.12.2
Platform: Microsoft Windows Microsoft Windows
Importance: NOR normal
Target Milestone: ---
Assignee: Martin Sandsmark
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-11 10:08 UTC by Unknown
Modified: 2022-04-28 11:36 UTC

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Description Unknown 2021-09-11 10:08:40 UTC
SUMMARY

I am running Filelight 20.12.2, installed from the Microsoft Store on Windows 10, on a folder whose subfolders have CJK characters (in this case, Korean) in their names. Here is the folder listing. Note that a folder name with Nordic characters (Ófærð) is included as well.

Confession - 자백 (2019)/
Heartless City - 무정도시 (2013)/
Partners for Justice - 검법남녀 (2018-2019)/
Police University - 경찰수업 (2021)/
Run On - 런 온 (2020)/
Silicon Valley (2014)/
So I Married The Anti-Fan - 그래서 나는 안티팬과 결혼했다 (2021)/
Tell Me What You Saw - 본 대로 말하라 (2020)/
The Road - The Tragedy of One - 더 로드 1의 비극 (2021)/
Trapped (Ófærð) (2015)/
What's Wrong With Secretary Kim? 비서가 왜 그럴까 (2018)/

In Filelight's display of the space usage of the parent folder, every subfolder whose name contains Korean script is omitted entirely and does not factor into the analysis. The folder with Nordic characters is included correctly.

STEPS TO REPRODUCE
1. Take any NTFS volume. Create a directory on it, then create subdirectories whose names contain Korean characters.
2. Run Filelight on the volume or the upper directory.

OBSERVED RESULT

Any folders that contain Korean characters in their names are excluded from the analysis and view, as if they did not exist.

EXPECTED RESULT

Folders with Korean characters should get included normally.

SOFTWARE/OS VERSIONS
Windows: Windows 10 21H1 19043.1165 Education
macOS: -
Linux/KDE Plasma: - 
(available in About System)
KDE Plasma Version: n/a
KDE Frameworks Version: 5.79.0
Qt Version: 5.15.2

ADDITIONAL INFORMATION

The same problem occurs when the folder name contains Japanese characters, including mixed-script names such as "Some latin text 完全版".
Comment 1 Harald Sitter 2022-04-06 20:13:20 UTC
Is this still a problem with the latest release on the store?
Comment 2 Unknown 2022-04-12 07:09:16 UTC
(In reply to Harald Sitter from comment #1)
> Is this still a problem with the latest release on the store?

I currently have Filelight 21.12.3 with KDE Frameworks 5.91.0 and Qt 5.15.2 from the store, and entries with CJK characters in their names are still omitted from the scan.
Comment 3 Harald Sitter 2022-04-12 09:22:44 UTC
There is a bug somewhere in the iteration system where it doesn't handle Unicode properly. Haven't managed to find it yet though :((

Can confirm the issue.
Comment 4 Harald Sitter 2022-04-16 12:35:12 UTC
The wrapper API we use for directory iteration is garbage and not properly Unicode-aware. I think the most solid fix, both short and long term, is to port away from it and properly abstract the code paths for Windows and POSIX so we get solid iteration results on either platform. It does require lots of new code though, so that's a bit sad.
Comment 5 Harald Sitter 2022-04-28 11:36:14 UTC
Git commit e4c9db692acf2969ef14a927a842fa5edc657887 by Harald Sitter.
Committed on 28/04/2022 at 11:33.
Pushed by sitter into branch 'release/22.04'.

rebuild the iteration tech using better architecture

the previous approach just didn't cut it for windows.

the new code sports a forward iterator that fronts for a
platform-dependent walker object that encapsulates the iteration logic
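
As a rough illustration of the "iterator fronting a walker" idea described above, here is a minimal sketch. It is not the code from the commit: the names Entry, Walker, ListWalker and DirectoryIterator are hypothetical, the iterator is single-pass, and ListWalker is an in-memory stand-in for what would be a POSIX or Windows walker.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry type; the fields in the real directoryEntry.h may differ.
struct Entry {
    std::string name;        // UTF-8 encoded entry name
    std::uint64_t size = 0;  // occupied size in bytes
    bool isDir = false;
};

// Hypothetical walker interface: one implementation per platform.
class Walker {
public:
    virtual ~Walker() = default;
    // Returns the next entry, or std::nullopt once the directory is exhausted.
    virtual std::optional<Entry> next() = 0;
};

// In-memory stand-in so the sketch runs anywhere; a real walker would wrap
// the platform's directory enumeration API instead.
class ListWalker : public Walker {
public:
    explicit ListWalker(std::vector<Entry> entries) : m_entries(std::move(entries)) {}
    std::optional<Entry> next() override {
        if (m_pos >= m_entries.size()) {
            return std::nullopt;
        }
        return m_entries[m_pos++];
    }

private:
    std::vector<Entry> m_entries;
    std::size_t m_pos = 0;
};

// Single-pass iterator that only forwards to the walker it fronts.
class DirectoryIterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = Entry;
    using difference_type = std::ptrdiff_t;
    using pointer = const Entry *;
    using reference = const Entry &;

    DirectoryIterator() = default;  // the end iterator holds no walker
    explicit DirectoryIterator(std::shared_ptr<Walker> walker) : m_walker(std::move(walker)) {
        advance();
    }

    reference operator*() const { return *m_current; }
    pointer operator->() const { return &*m_current; }
    DirectoryIterator &operator++() {
        advance();
        return *this;
    }
    bool operator==(const DirectoryIterator &other) const { return m_walker == other.m_walker; }
    bool operator!=(const DirectoryIterator &other) const { return !(*this == other); }

private:
    void advance() {
        assert(m_walker);  // advancing the end iterator is a caller bug
        m_current = m_walker->next();
        if (!m_current) {
            m_walker.reset();  // become equal to the end iterator
        }
    }

    std::shared_ptr<Walker> m_walker;
    std::optional<Entry> m_current;
};
```

With this shape, calling code loops from DirectoryIterator(walker) to DirectoryIterator() and never sees which platform walker sits behind it.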

this looks and feels a lot like std::filesystem API but unfortunately we
cannot really use that API directly because I want this change to be
conservative enough to land in 22.04 as a bugfix for windows, also on
POSIX std::filesystem returns the st_size (size in bytes) whereas we
want the actual occupied blocks (st_blocks*size), and lastly it's also a
tad slower because of heavier abstraction
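
To make the st_size versus st_blocks distinction above concrete, here is a small POSIX-only sketch. The FileSizes struct and the querySizes helper are hypothetical names, and the 512-byte unit for st_blocks is an assumption that matches what Linux documents; other systems may use a different unit.

```cpp
#include <sys/stat.h>

#include <cstdint>
#include <optional>
#include <string>

// Hypothetical result type for this sketch.
struct FileSizes {
    std::uint64_t apparent;   // st_size: logical length of the file in bytes
    std::uint64_t allocated;  // st_blocks * 512: space actually occupied on disk
};

// Returns both sizes for a path, or std::nullopt if lstat() fails.
// lstat() is used so that symlinks are measured themselves, not their targets.
std::optional<FileSizes> querySizes(const std::string &path)
{
    struct stat st {};
    if (lstat(path.c_str(), &st) != 0) {
        return std::nullopt;
    }
    // Assumption: st_blocks counts 512-byte units, as documented for Linux.
    constexpr std::uint64_t blockUnit = 512;
    return FileSizes{static_cast<std::uint64_t>(st.st_size),
                     static_cast<std::uint64_t>(st.st_blocks) * blockUnit};
}
```

For sparse files the allocated value can be far below the apparent one, and for tiny files it is typically larger, which is why a disk usage tool would prefer the allocated figure.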

should we choose to go the std::filesystem route in the future anyway it
should be a trivial switch because of how similar the APIs are.

furthermore move to always convert from/to utf8. the QFile helpers
ultimately end up in the same code paths anyway, so it seems simpler to
just go with the utf8 variants directly (also on windows QFile somehow
produces bogus output for actual unicode characters)
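
For readers unfamiliar with the conversion being discussed: file names are kept as UTF-8 internally, while the Windows wide-character file APIs take UTF-16. The sketch below shows that conversion using only the standard library. The function name utf8ToUtf16 is hypothetical, and the actual commit presumably relies on Qt or platform helpers rather than a hand-written decoder like this one.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Converts UTF-8 to UTF-16, returning std::nullopt on malformed input.
std::optional<std::u16string> utf8ToUtf16(const std::string &in)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80) {
            cp = b0;
            len = 1;
        } else if ((b0 & 0xE0) == 0xC0) {
            cp = b0 & 0x1F;
            len = 2;
        } else if ((b0 & 0xF0) == 0xE0) {
            cp = b0 & 0x0F;
            len = 3;
        } else if ((b0 & 0xF8) == 0xF0) {
            cp = b0 & 0x07;
            len = 4;
        } else {
            return std::nullopt;  // stray continuation byte or invalid lead byte
        }
        if (i + len > in.size()) {
            return std::nullopt;  // truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                return std::nullopt;  // not a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogate code points and out-of-range values.
        if (cp < minForLen[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Hangul syllables such as those in the reported folder names are three bytes each in UTF-8 and a single code unit in UTF-16, so any code path that treats the bytes as a legacy 8-bit encoding will mangle them.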

the combined set of changes improves windows support substantially. it's
now correctly iterating unicode entries, and correctly displaying
unicode characters. iteration in general now has unit testing.

M  +13   -0    autotests/CMakeLists.txt
A  +129  -0    autotests/directoryIteratorTest.cpp     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]
A  +0    -0    autotests/iterator-tree.in/Con 자백/.keep
A  +1    -0    autotests/iterator-tree.in/bar
A  +0    -0    autotests/iterator-tree.in/foo/.keep
A  +7    -0    autotests/test-config.h.cmake
M  +12   -1    src/CMakeLists.txt
A  +14   -0    src/directoryEntry.h     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]
A  +4    -0    src/directoryIterator.cpp     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]
A  +66   -0    src/directoryIterator.h     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]
M  +2    -2    src/fileTree.cpp
M  +1    -2    src/fileTree.h
M  +25   -137  src/localLister.cpp
A  +105  -0    src/posixWalker.cpp     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]
A  +37   -0    src/posixWalker.h     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]
M  +1    -1    src/radialMap/map.cpp
A  +115  -0    src/windowsWalker.cpp     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]
A  +36   -0    src/windowsWalker.h     [License: GPL(3+eV) GPL(v3.0) GPL(v2.0)]

https://invent.kde.org/utilities/filelight/commit/e4c9db692acf2969ef14a927a842fa5edc657887