Open any epub files that contain arabic text then try to search any arabic word. Reproducible: Always Steps to Reproduce: 1. Open attach file 2. open search dialogue by crtl+f 3. type كلمة Actual Results: the text not found Expected Results: the word "كلمة" is in the doc
Created attachment 94770 [details] an arabic epub file
Confirmed with the epub provided and okular from git master. Picked highest version as no git in version field.
*** Bug 353301 has been marked as a duplicate of this bug. ***
I have been investigating this with ChatGPT's help: https://chat.openai.com/share/69484c6f-d974-4047-8d52-c92d6ea0ebf9 There seem to be two issues affecting Arabic and, apparently, other RTL languages as well. One of them is in TextDocumentUtils::calculateBoundingRect(), where the final if statement is incorrectly triggered for all Arabic text, causing all characters to be substituted with newlines. Trying to work around the issue with this patch, the TinyTextEntity's are created with the correct characters and the correct bounding rectangles, as far as I can see. But the search still seems to expect the TinyTextEntity's in left to right order. I can only find Arabic words if I type them backwards in the Find bar. diff --git a/core/textdocumentgenerator_p.h b/core/textdocumentgenerator_p.h index fb1a5e395..bf17e2e52 100644 --- a/core/textdocumentgenerator_p.h +++ b/core/textdocumentgenerator_p.h @@ -44,19 +44,23 @@ static void calculateBoundingRect(QTextDocument *document, int startPosition, in const QTextLine startLine = startLayout->lineForTextPosition(startPos); const QTextLine endLine = endLayout->lineForTextPosition(endPos); - const double x = startBoundingRect.x() + startLine.cursorToX(startPos); + double x = startBoundingRect.x() + startLine.cursorToX(startPos); const double y = startBoundingRect.y() + startLine.y(); - const double r = endBoundingRect.x() + endLine.cursorToX(endPos); + double r = endBoundingRect.x() + endLine.cursorToX(endPos); const double b = endBoundingRect.y() + endLine.y() + endLine.height(); const int offset = qRound(y) % qRound(pageSize.height()); - if (x > r) { // line break, so return a pseudo character on the start line + if (startLine.y() != endLine.y()) { // line break, so return a pseudo character on the start line rect = QRectF(x / pageSize.width(), offset / pageSize.height(), 3 / pageSize.width(), startLine.height() / pageSize.height()); page = -1; return; } + const auto direction = document->characterAt(startPosition).direction(); + if (direction == QChar::DirAL || direction == QChar::DirR) + std::swap(x, r); + page = qRound(y) / qRound(pageSize.height()); rect = QRectF(x / pageSize.width(), offset / pageSize.height(), (r - x) / pageSize.width(), (b - y) / pageSize.height()); }
Diggin some more into the issue: https://chat.openai.com/share/d2f74bc7-6c7b-437d-ac17-e71ae73f689b The problem now is the way TinyTextEntity's are sorted according to their positions from left to right. This seems to be crucial to the text selection logic, and cannot be changed without extensive modifications. I'm not sure about the right way to proceed from here. Some possibilities are: 1- Reorder the text in the Find bar before performing a search, with Qt's equivalent of fribidi_reorder_line(). 2- For text-based documents, delegate search and selection to the QTextDocument class itself. 3- Skip sorting the text entities, and adjust the search and selection methods accordingly. The first option is the least intrusive, while the second seems to offer greater possibilities, like handling the peculiarities of Arabic search and other languages: https://community.scripture.software.sil.org/t/search-arabic-text-without-tashkeel/882 https://help-nl.oclc.org/Discovery_and_Reference/WorldCat_Discovery/Release_notes/2023_release_notes/040WorldCat_Discovery_release_notes_Arabic_language_support QTextDocument doesn't seem to support these yet, but it's something I might be able to contribute to Qt.
(In reply to Saladin Shaban from comment #5) > Diggin some more into the issue: > https://chat.openai.com/share/d2f74bc7-6c7b-437d-ac17-e71ae73f689b > > The problem now is the way TinyTextEntity's are sorted according to their > positions from left to right. This seems to be crucial to the text selection > logic, and cannot be changed without extensive modifications. > > I'm not sure about the right way to proceed from here. Some possibilities > are: > > 1- Reorder the text in the Find bar before performing a search, with Qt's > equivalent of fribidi_reorder_line(). > > 2- For text-based documents, delegate search and selection to the > QTextDocument class itself. > > 3- Skip sorting the text entities, and adjust the search and selection > methods accordingly. > > The first option is the least intrusive, while the second seems to offer > greater possibilities, like handling the peculiarities of Arabic search and > other languages: > > https://community.scripture.software.sil.org/t/search-arabic-text-without- > tashkeel/882 > https://help-nl.oclc.org/Discovery_and_Reference/WorldCat_Discovery/ > Release_notes/2023_release_notes/ > 040WorldCat_Discovery_release_notes_Arabic_language_support > > QTextDocument doesn't seem to support these yet, but it's something I might > be able to contribute to Qt. There is https://invent.kde.org/graphics/okular/-/merge_requests/595 which claims to fix it. Though it likely introduces a significant memory leak. I don't know enough about any RTL language to be able to really test and work on it, but I have prepared https://invent.kde.org/graphics/okular/-/merge_requests/905 which should make it easier with the memory management on top of it.
(In reply to Sune Vuorela from comment #6) > > There is https://invent.kde.org/graphics/okular/-/merge_requests/595 which > claims to fix it. Though it likely introduces a significant memory leak. > > I don't know enough about any RTL language to be able to really test and > work on it, but I have prepared I tested !595 and I'm having similar issues to those reported by chfanzil in: https://phabricator.kde.org/D10455 > https://invent.kde.org/graphics/okular/-/merge_requests/905 which should > make it easier with the memory management on top of it. Thank you for that. I haven't used C++ in a while. All that memory management stuff looks scary to me. I need to get a few things off my hands, then I'll be able to investigate RTL issues more thoroughly in different backends.