Bug 353300 - Text search does not work in Arabic
Summary: Text search does not work in Arabic
Status: CONFIRMED
Alias: None
Product: okular
Classification: Applications
Component: EPub backend (show other bugs)
Version: 1.11.3
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Okular developers
URL:
Keywords: rtl
: 353301 (view as bug list)
Depends on:
Blocks:
 
Reported: 2015-09-29 04:56 UTC by Fahad Al-Saidi
Modified: 2024-01-28 04:10 UTC (History)
4 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
an arabic epub file (2.96 MB, application/epub+zip)
2015-09-29 04:57 UTC, Fahad Al-Saidi
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Fahad Al-Saidi 2015-09-29 04:56:47 UTC
Open any epub files that contain arabic text then try to search any arabic word.

Reproducible: Always

Steps to Reproduce:
1. Open attach file 
2. open search dialogue by crtl+f 
3. type كلمة

Actual Results:  
the text not found

Expected Results:  
the word "كلمة" is in the doc
Comment 1 Fahad Al-Saidi 2015-09-29 04:57:22 UTC
Created attachment 94770 [details]
an arabic epub file
Comment 2 Justin Zobel 2020-11-08 10:23:02 UTC
Confirmed with the epub provided and okular from git master.

Picked highest version as no git in version field.
Comment 3 Justin Zobel 2020-11-08 10:23:33 UTC
*** Bug 353301 has been marked as a duplicate of this bug. ***
Comment 4 Saladin Shaban 2024-01-06 16:11:56 UTC
I have been investigating this with ChatGPT's help:
https://chat.openai.com/share/69484c6f-d974-4047-8d52-c92d6ea0ebf9

There seem to be two issues affecting Arabic and, apparently, other RTL languages as well.

One of them is in TextDocumentUtils::calculateBoundingRect(), where the final if statement is incorrectly triggered for all Arabic text, causing all characters to be substituted with newlines.

Trying to work around the issue with this patch, the TinyTextEntity's are created with the correct characters and the correct bounding rectangles, as far as I can see. But the search still seems to expect the TinyTextEntity's in left to right order. I can only find Arabic words if I type them backwards in the Find bar.


diff --git a/core/textdocumentgenerator_p.h b/core/textdocumentgenerator_p.h
index fb1a5e395..bf17e2e52 100644
--- a/core/textdocumentgenerator_p.h
+++ b/core/textdocumentgenerator_p.h
@@ -44,19 +44,23 @@ static void calculateBoundingRect(QTextDocument *document, int startPosition, in
     const QTextLine startLine = startLayout->lineForTextPosition(startPos);
     const QTextLine endLine = endLayout->lineForTextPosition(endPos);
 
-    const double x = startBoundingRect.x() + startLine.cursorToX(startPos);
+    double x = startBoundingRect.x() + startLine.cursorToX(startPos);
     const double y = startBoundingRect.y() + startLine.y();
-    const double r = endBoundingRect.x() + endLine.cursorToX(endPos);
+    double r = endBoundingRect.x() + endLine.cursorToX(endPos);
     const double b = endBoundingRect.y() + endLine.y() + endLine.height();
 
     const int offset = qRound(y) % qRound(pageSize.height());
 
-    if (x > r) { // line break, so return a pseudo character on the start line
+    if (startLine.y() != endLine.y()) { // line break, so return a pseudo character on the start line
         rect = QRectF(x / pageSize.width(), offset / pageSize.height(), 3 / pageSize.width(), startLine.height() / pageSize.height());
         page = -1;
         return;
     }
 
+    const auto direction = document->characterAt(startPosition).direction();
+    if (direction == QChar::DirAL || direction == QChar::DirR)
+        std::swap(x, r);
+
     page = qRound(y) / qRound(pageSize.height());
     rect = QRectF(x / pageSize.width(), offset / pageSize.height(), (r - x) / pageSize.width(), (b - y) / pageSize.height());
 }
Comment 5 Saladin Shaban 2024-01-24 04:57:13 UTC
Diggin some more into the issue:
https://chat.openai.com/share/d2f74bc7-6c7b-437d-ac17-e71ae73f689b

The problem now is the way TinyTextEntity's are sorted according to their positions from left to right. This seems to be crucial to the text selection logic, and cannot be changed without extensive modifications.

I'm not sure about the right way to proceed from here. Some possibilities are:

1- Reorder the text in the Find bar before performing a search, with Qt's equivalent of fribidi_reorder_line().

2- For text-based documents, delegate search and selection to the QTextDocument class itself.

3- Skip sorting the text entities, and adjust the search and selection methods accordingly.

The first option is the least intrusive, while the second seems to offer greater possibilities, like handling the peculiarities of Arabic search and other languages:

https://community.scripture.software.sil.org/t/search-arabic-text-without-tashkeel/882
https://help-nl.oclc.org/Discovery_and_Reference/WorldCat_Discovery/Release_notes/2023_release_notes/040WorldCat_Discovery_release_notes_Arabic_language_support

QTextDocument doesn't seem to support these yet, but it's something I might be able to contribute to Qt.
Comment 6 Sune Vuorela 2024-01-24 07:45:20 UTC
(In reply to Saladin Shaban from comment #5)
> Diggin some more into the issue:
> https://chat.openai.com/share/d2f74bc7-6c7b-437d-ac17-e71ae73f689b
> 
> The problem now is the way TinyTextEntity's are sorted according to their
> positions from left to right. This seems to be crucial to the text selection
> logic, and cannot be changed without extensive modifications.
> 
> I'm not sure about the right way to proceed from here. Some possibilities
> are:
> 
> 1- Reorder the text in the Find bar before performing a search, with Qt's
> equivalent of fribidi_reorder_line().
> 
> 2- For text-based documents, delegate search and selection to the
> QTextDocument class itself.
> 
> 3- Skip sorting the text entities, and adjust the search and selection
> methods accordingly.
> 
> The first option is the least intrusive, while the second seems to offer
> greater possibilities, like handling the peculiarities of Arabic search and
> other languages:
> 
> https://community.scripture.software.sil.org/t/search-arabic-text-without-
> tashkeel/882
> https://help-nl.oclc.org/Discovery_and_Reference/WorldCat_Discovery/
> Release_notes/2023_release_notes/
> 040WorldCat_Discovery_release_notes_Arabic_language_support
> 
> QTextDocument doesn't seem to support these yet, but it's something I might
> be able to contribute to Qt.

There is https://invent.kde.org/graphics/okular/-/merge_requests/595 which claims to fix it. Though it likely introduces a significant memory leak.

I don't know enough about any RTL language to be able to really test and work on it, but I have prepared https://invent.kde.org/graphics/okular/-/merge_requests/905 which should make it easier with the memory management on top of it.
Comment 7 Saladin Shaban 2024-01-28 04:10:25 UTC
(In reply to Sune Vuorela from comment #6)
> 
> There is https://invent.kde.org/graphics/okular/-/merge_requests/595 which
> claims to fix it. Though it likely introduces a significant memory leak.
> 
> I don't know enough about any RTL language to be able to really test and
> work on it, but I have prepared

I tested !595 and I'm having similar issues to those reported by chfanzil in:
https://phabricator.kde.org/D10455

> https://invent.kde.org/graphics/okular/-/merge_requests/905 which should
> make it easier with the memory management on top of it.

Thank you for that.
I haven't used C++ in a while. All that memory management stuff looks scary to me.

I need to get a few things off my hands, then I'll be able to investigate RTL issues more thoroughly in different backends.