Bug 33855

Summary: Konqueror does not handle soft-hyphen (http://www.newscientist.com/)
Product: [Applications] konqueror Reporter: Alain Knaff <kde>
Component: khtml rendererAssignee: Konqueror Developers <konq-bugs>
Status: RESOLVED FIXED    
Severity: normal CC: andreaswieland, carlos-spam, esigra, hhielscher, odie
Priority: NOR    
Version: 3.0   
Target Milestone: ---   
Platform: RedHat Enterprise Linux   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description Alain Knaff 2001-10-18 11:45:31 UTC
(*** This bug was imported into bugs.kde.org ***)

Package:           khtml
Version:           3.0 (using KDE 2.2.1 -0.rh71.1.cups)
Severity:          normal
Installed from:    Red Hat Linux 7.1
Compiler:          gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-85)
OS:                Linux (i686) release 2.4.9-0.18
OS/Compiler notes: 

The Ascii character <AD> which looks like a hyphen is a "soft" hyphen and is used to indicate places in words where a line-break may occur (just like \- in LaTeX). It should only be displayed at the end of the line when a break does indeed occur. However konqueror unconditionnally displays it just as if it were a normal hyphen (2D).
(See also http://www.w3.org/TR/REC-html40/struct/text.html#h-9.3.3)

Currently konqueror doesn't seem to handle the soft hyphen.

However when fixing this please be aware that unfortunately the soft hyphen is misused in a number of places such as for example http://www.newscientist.com/news/news.jsp?id=ns99991443
where it is misused as a numeric minus sign (­255). Obviously suppressing its display in this instance would severly distort the meaning of the text. In order to prevent that from happening maybe <AD> should always be displayed after a space or before a digit.


(Submitted via bugs.kde.org)
(Called from KBugReport dialog)
Comment 1 Alain Knaff 2003-01-29 21:09:16 UTC
The problem is still present in KDE 3.1.0. 
Moreover, Konqueror does never break a word containing the soft hyphen. 
 
Example:  
 
test
Comment 2 esigra 2003-08-12 17:23:44 UTC
This is a confirmation that the bug still exists in KDE 3.1.3. 
Comment 3 Thiago Macieira 2003-08-13 13:38:03 UTC
And in HEAD (20030725). 
Comment 4 Luís Pedro Coelho 2003-09-08 13:43:42 UTC
*** Bug 52956 has been marked as a duplicate of this bug. ***
Comment 5 Sashmit Bhaduri 2003-10-26 23:03:09 UTC
and in 20031016 HEAD
Comment 6 Waldo Bastian 2004-01-14 14:10:32 UTC
kedit / Qt3.3 handles it correctly but khtml (KDE 3.2) fails.

Testcase: khtmltests/regression/tests/unsorted/33855.html
Comment 7 Stephan Kulow 2004-02-02 20:19:37 UTC
*** Bug 60053 has been marked as a duplicate of this bug. ***
Comment 8 Luciano Montanaro 2004-05-12 15:09:17 UTC
Dave Hyatt in his blog says he has fixed this for Safari.
So hopefully this will be fixed in khtml too, in a not too distant future.
See
http://weblogs.mozillazine.org/hyatt/archives/2004_03.html#005130
Comment 9 Claus Agerskov 2004-06-17 10:58:37 UTC
ASCII codes above 127 isn't real ASCII codes so in some extended ASCII tables 173 aka AD is soft hyphen in others it isn't.

So please referre to soft hyphen as either &shy;, &173; or &x0AD; like the W3C HTML 4.0 recommendation says:

http://www.w3.org/TR/REC-html40/struct/text.html#hyphenation

Right now I use Konqueror 3.2.2-4 Red Hat and the soft hyphens are shown as small spaces which doesn't break
Comment 10 Helge Hielscher 2004-06-17 20:09:26 UTC
The subject of soft hyphenation is not that easy. Please read http://www.cs.tut.fi/~jkorpela/shy.html
Comment 11 Luciano Montanaro 2004-06-18 22:35:44 UTC
I have read the above link, and I understand the issue may be complicated.
However, I have seen in the above article that:
- in unicode, the character is meant to be a discretonary hyphen
- HTML 4 recommends to assign to the character the same meaning.
- Mozilla and Netscape hide the character from the stream, partially ignoring
  the problem, but acting substantially in accordance with the HTML 4 spec.

I think we should either do what Mozilla is doing in this case, or if it is not too difficult, do the extra step of using the hyphen as a line-break hint.

To make still better, the document should be hyphenated on the fly, but that
is not possible unless the document language is known. Until then, at least
we would give authors a way to improve the document look and legibility.

If the feature is implemented, the hyphen must be ignored when searching in the text, according to the above document.


Comment 12 Wilbert Berendsen 2005-01-31 09:55:17 UTC
The &shy; (&#173;) still always displays as a normal hyphen in KDE 3.3.2
Comment 13 Maksim Orlovich 2005-03-27 01:57:16 UTC
*** Bug 102577 has been marked as a duplicate of this bug. ***
Comment 14 Tommi Tervo 2005-08-21 13:37:03 UTC
*** Bug 111212 has been marked as a duplicate of this bug. ***
Comment 15 Germain Garand 2005-10-08 22:19:54 UTC
SVN commit 468638 by ggarand:

Port soft-hyphen support from WebCore

BUG: 33855, 113900



 M  +8 -0      ChangeLog  
 M  +65 -4     rendering/bidi.cpp  
 M  +5 -1      rendering/render_text.cpp  
 M  +3 -0      rendering/render_text.h  


--- branches/KDE/3.5/kdelibs/khtml/ChangeLog #468637:468638
@@ -1,5 +1,13 @@
 2005-10-08  Germain Garand  <germain@ebooksfrance.org>
 
+        Port soft-hyphen support from WebCore
+
+        * rendering/bidi.cpp (findNextLineBreak): main skipping/breaking logic
+
+        * rendering/render_text.{cpp,h} (calcMinMaxWidth): account for soft hyphen
+
+2005-10-08  Germain Garand  <germain@ebooksfrance.org>
+
         Follow Mozilla/Opera in considering the Window object as an implementation of the AbstractView/ViewCSS 
         DOM2 interfaces. Return the Window object when available ; an AbstractView if not.
 
--- branches/KDE/3.5/kdelibs/khtml/rendering/bidi.cpp #468637:468638
@@ -420,6 +420,19 @@
         sLastBidiRun = startRun;
 }
 
+static void chopMidpointsAt(RenderObject* obj, uint pos)
+{
+    if (!sNumMidpoints) return;
+    BidiIterator* midpoints = smidpoints->data();
+    for (uint i = 0; i < sNumMidpoints; i++) {
+        const BidiIterator& point = midpoints[i];
+        if (point.obj == obj && point.pos == pos) {
+            sNumMidpoints = i;
+            break;
+        }
+    }
+}
+
 static void checkMidpoints(BidiIterator& lBreak, BidiState &bidi)
 {
     // Check to see if our last midpoint is a start point beyond the line break.  If so,
@@ -435,8 +448,16 @@
         if (currpoint == lBreak) {
             // We hit the line break before the start point.  Shave off the start point.
             sNumMidpoints--;
-            if (!endpoint.obj->style()->preserveWS())
+            if (!endpoint.obj->style()->preserveWS()) {
+                if (endpoint.obj->isText()) {
+                    // Don't shave a character off the endpoint if it was from a soft hyphen.
+                    RenderText* textObj = static_cast<RenderText*>(endpoint.obj);
+                    if (endpoint.pos+1 < textObj->length() &&
+                        textObj->text()[endpoint.pos+1].unicode() == SOFT_HYPHEN)
+                        return;
+                }
                 endpoint.pos--;
+            }
         }
     }
 }
@@ -1817,19 +1838,47 @@
             bool appliedStartWidth = pos > 0; // If the span originated on a previous line,
                                               // then assume the start width has been applied.
             bool appliedEndWidth = false;
+            bool nextIsSoftBreakable = false;
 
             while(len) {
                 bool previousCharacterIsSpace = currentCharacterIsSpace;
+                bool isSoftBreakable = nextIsSoftBreakable;
+                nextIsSoftBreakable = false;
                 const QChar c = str[pos];
                 currentCharacterIsSpace = c == ' ';
 
                 if (preserveWS || !currentCharacterIsSpace)
                     isLineEmpty = false;
 
+                // Check for soft hyphens.  Go ahead and ignore them.
+                if (c.unicode() == SOFT_HYPHEN && pos > 0) {
+                    nextIsSoftBreakable = true;
+                    if (!ignoringSpaces) {
+                        // Ignore soft hyphens
+                        BidiIterator endMid(0, o, pos-1);
+                        addMidpoint(endMid);
+                        
+                        // Add the width up to but not including the hyphen.
+                        tmpW += t->width(lastSpace, pos - lastSpace, f);
+                        
+                        // For whitespace normal only, include the hyphen.  We need to ensure it will fit
+                        // on the line if it shows when we break.
+                        if (o->style()->whiteSpace() == NORMAL)
+                            tmpW += t->width(pos, 1, f);
+                        
+                        BidiIterator startMid(0, o, pos+1);
+                        addMidpoint(startMid);
+                    }
+                    
+                    pos++;
+                    len--;
+                    lastSpace = pos; // Cheesy hack to prevent adding in widths of the run twice.
+                    continue;
+                }
 #ifdef APPLE_CHANGES    // KDE applies wordspacing differently
                 bool applyWordSpacing = false;
 #endif
-                if ( (preserveLF && c == '\n') || (autoWrap && isBreakable( str, pos, strlen )) ) {
+                if ( (preserveLF && c == '\n') || (autoWrap && (isBreakable( str, pos, strlen ) || isSoftBreakable)) ) {
                     if (ignoringSpaces) {
                         if (!currentCharacterIsSpace) {
                             // Stop ignoring spaces and begin at this
@@ -1880,8 +1929,12 @@
                         }
                     }
 
-                    if (w + tmpW > width && autoWrap) {
-                        goto end;
+                    if (autoWrap) {
+                        if (w+tmpW > width)
+                            goto end;
+                        else if ( (pos > 1 && str[pos-1].unicode() == SOFT_HYPHEN) )
+                            // Subtract the width of the soft hyphen out since we fit on a line.
+                            tmpW -= t->width(pos-1, 1, f);                        
                     }
 
                     if( preserveLF && *(str+pos) == '\n' ) {
@@ -2114,6 +2167,14 @@
         lBreak.increment(bidi);
     }
 
+    if (lBreak.obj && lBreak.pos >= 2 && lBreak.obj->isText()) {
+        // For soft hyphens on line breaks, we have to chop out the midpoints that made us
+        // ignore the hyphen so that it will render at the end of the line.
+        QChar c = static_cast<RenderText*>(lBreak.obj)->text()[lBreak.pos-1];
+        if (c.unicode() == SOFT_HYPHEN)
+            chopMidpointsAt(lBreak.obj, lBreak.pos-2);
+    }
+
     return lBreak;
 }
 
--- branches/KDE/3.5/kdelibs/khtml/rendering/render_text.cpp #468637:468638
@@ -1041,9 +1041,13 @@
             m_hasBeginWS = true;
         if ((isSpace || isNewline) && i == len-1)
             m_hasEndWS = true;
+        
+        if (i && c == SOFT_HYPHEN)
+            continue;
 
         int wordlen = 0;
-        while( i+wordlen < len && !(isBreakable( str->s, i+wordlen, str->l )) )
+        while( i+wordlen < len && (i+wordlen == 0 || str->s[i+wordlen].unicode() != SOFT_HYPHEN) &&
+               !(isBreakable( str->s, i+wordlen, str->l )) )
             wordlen++;
 
         if (wordlen)
--- branches/KDE/3.5/kdelibs/khtml/rendering/render_text.h #468637:468638
@@ -36,6 +36,9 @@
 class QPainter;
 class QFontMetrics;
 
+// Define a constant for soft hyphen's unicode value.
+#define SOFT_HYPHEN 173
+
 namespace khtml
 {
     class RenderText;