Bug 116512

Summary: [site-issue] russian "о" letter on gmail.com
Product: [Applications] konqueror Reporter: Ilya Petrov <ilya.muromec>
Component: khtmlAssignee: Konqueror Developers <konq-bugs>
Status: RESOLVED FIXED    
Severity: normal CC: anton.bugs, maksim, tdik123
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description Ilya Petrov 2005-11-16 19:11:42 UTC
Version:            (using KDE Devel)
OS:                Linux

konqueror cuting russian "о" letter in some places of gmail page. screenshot:
http://img352.imageshack.us/img352/2170/160lt.png
Comment 1 Thiago Macieira 2005-11-17 01:01:28 UTC
Can you give us a testcase?
Comment 2 Ilya Petrov 2005-11-17 01:45:16 UTC
I can`t creat it. Konqueror does`t save gmail inbox page. It saves only page with error about java-script. Also there are no error messages in terminal window. In non-java mode all shows normally.
Comment 3 Anton 2006-04-12 03:05:30 UTC
I can confirm the bug. It replaces some characters to '>' and '<'
as you can see from a screenshot.
Comment 4 Philip Rodrigues 2006-09-07 21:59:00 UTC
Does this still occur with a recent KDE version?
Comment 5 Ilya Petrov 2006-09-08 14:25:48 UTC
yes. with 3.5.4 from kubuntu dapper
Comment 6 Anton 2006-10-29 05:26:16 UTC
It's the same with 3.5.5.

Just to give you more details how to reproduce  it 'case it's too mush playing with gmail javascript to create a proper test case:

1. send the follow message to your gmail account:
"Мама мыла Рому"
2. open received message and press 'reply' button.
3.The quoted message will become:
"Ма<а <ыла ><у"
Comment 7 Anton 2006-10-29 05:44:53 UTC
one more thing:
you have to change browser identification to 'firefox 1.5.0.4' to avoid 'basic html view' of the gmail.

This bug is critical for me because it confuses people whom I reply to and screws text in each Russian email. So I have to use another browser.

Thanks.
Comment 8 Maksim Orlovich 2006-10-29 16:41:28 UTC
Given the nature of gmail, we'll likely have to hope that fixing some well-understood encoding bug will fix this... I have one guess as to what the issues may be, and will try to get it tested shortly..
Comment 9 Maksim Orlovich 2006-10-29 17:13:54 UTC
Cyrillic m is U+0x41C, Cyrillic 'o' is U+0x43E...  < is 3C, and > is 3E, so I have a strong suspicions that problems with unicode support in regexp's in KJS are responsible (but I may be overguessing, since I am working on fixing that bug)... Would any of you be able to test patches, in case I can't find a developer-tester? 
Comment 10 Maksim Orlovich 2006-10-29 19:45:40 UTC
*** Bug 134861 has been marked as a duplicate of this bug. ***
Comment 11 Maksim Orlovich 2006-10-29 23:39:33 UTC
It does appear to be that, so this may get fixed shortly if I don't go insane due to empty-to-something substitutions..
Comment 12 Anton 2006-10-30 03:15:25 UTC
Yes, I can test the patch. I'm a gentoo user with a developer background.
Let me know there can I get it.
Comment 13 Maksim Orlovich 2006-11-12 20:53:02 UTC
SVN commit 604411 by orlovich:

Fix Unicode support in RegExp handling, and also be more robust vs. embedded nulls.
Fixes problems with some cyrillic characters in gmail (#116512)
Makes google calendar somewhat work, though not fully (part #135445)
Also fixes #135246, #101398

BUG:135246
BUG:101398
BUG:116512
CCBUG:135445


 M  +149 -8    regexp.cpp  
 M  +24 -1     regexp.h  
 M  +2 -0      regexp_object.cpp  
 M  +28 -4     string_object.cpp  


--- branches/KDE/3.5/kdelibs/kjs/regexp.cpp #604410:604411
@@ -2,6 +2,8 @@
 /*
  *  This file is part of the KDE libraries
  *  Copyright (C) 1999-2001 Harri Porten (porten@kde.org)
+ *  Copyright (C) 2003,2004 Apple Computer, Inc.
+ *  Copyright (C) 2006      Maksim Orlovich (maksim@kde.org)
  *
  *  This library is free software; you can redistribute it and/or
  *  modify it under the terms of the GNU Lesser General Public
@@ -22,22 +24,35 @@
 #include "regexp.h"
 
 #include "lexer.h"
+#include <assert.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 
 using namespace KJS;
 
+RegExp::UTF8SupportState RegExp::utf8Support = RegExp::Unknown;
+
 RegExp::RegExp(const UString &p, int f)
-  : pat(p), flgs(f), m_notEmpty(false), valid(true)
+  : pat(p), flgs(f), m_notEmpty(false), valid(true), buffer(0), originalPos(0)
 {
+  // Determine whether libpcre has unicode support if need be..
+  if (utf8Support == Unknown) {
+    int supported;
+    pcre_config(PCRE_CONFIG_UTF8, (void*)&supported);
+    utf8Support = supported ? Supported : Unsupported;
+  }
+
   nrSubPatterns = 0; // determined in match() with POSIX regex.
 
   // JS regexps can contain Unicode escape sequences (\uxxxx) which
   // are rather uncommon elsewhere. As our regexp libs don't understand
   // them we do the unescaping ourselves internally.
+  // Also make sure to expand out any nulls as pcre_compile 
+  // expects null termination..
   UString intern;
-  if (p.find('\\') >= 0) {
+  const char* const nil = "\\x00";
+  if (p.find('\\') >= 0 || p.find(KJS::UChar('\0')) >= 0) {
     bool escape = false;
     for (int i = 0; i < p.size(); ++i) {
       UChar c = p[i];
@@ -52,7 +67,12 @@
           if (Lexer::isHexDigit(c0) && Lexer::isHexDigit(c1) &&
               Lexer::isHexDigit(c2) && Lexer::isHexDigit(c3)) {
             c = Lexer::convertUnicode(c0, c1, c2, c3);
-            intern += UString(&c, 1);
+            if (c.unicode() == 0) {
+                // Make sure to encode 0, to avoid terminating the string
+                intern += UString(nil);
+            } else {
+                intern += UString(&c, 1);
+            }
             i += 4;
             continue;
           }
@@ -62,6 +82,8 @@
       } else {
         if (c == '\\')
           escape = true;
+        else if (c == '\0')
+          intern += UString(nil);
         else
           intern += UString(&c, 1);
       }
@@ -81,8 +103,16 @@
   if (flgs & Multiline)
     pcreflags |= PCRE_MULTILINE;
 
-  pcregex = pcre_compile(intern.ascii(), pcreflags,
+  if (utf8Support == Supported)
+    pcreflags |= PCRE_UTF8;
+
+  // Fill our buffer with an encoded version, whether utf-8, or, 
+  // if PCRE is incapable, truncated.
+  prepareMatch(intern);
+
+  pcregex = pcre_compile(buffer, pcreflags,
 			 &perrormsg, &errorOffset, NULL);
+  doneMatch(); // Cleanup buffers
   if (!pcregex) {
 #ifndef NDEBUG
     fprintf(stderr, "KJS: pcre_compile() failed with '%s'\n", perrormsg);
@@ -128,6 +158,7 @@
 
 RegExp::~RegExp()
 {
+  doneMatch(); // Be 100% sure buffers are freed
 #ifdef HAVE_PCREPOSIX
   if (pcregex)
     pcre_free(pcregex);
@@ -137,8 +168,97 @@
 #endif
 }
 
+void RegExp::prepareUtf8(const UString& s)
+{
+  // Allocate a buffer big enough to hold all the characters plus \0
+  const int length = s.size();
+  buffer = new char[length * 3 + 1];
+
+  // Also create buffer for positions. We need one extra character in there,
+  // even past the \0 since the non-empty handling may jump one past the end
+  originalPos = new int[length * 3 + 2];
+
+  // Convert to runs of 8-bit characters, and generate indeces
+  // Note that we do NOT combine surrogate pairs here, as 
+  // regexps operate on them as separate characters
+  char *p      = buffer;
+  int  *posOut = originalPos;
+  const UChar *d = s.data();
+  for (int i = 0; i != length; ++i) {
+    unsigned short c = d[i].unicode();
+
+    int sequenceLen;
+    if (c < 0x80) {
+      *p++ = (char)c;
+      sequenceLen = 1;
+    } else if (c < 0x800) {
+      *p++ = (char)((c >> 6) | 0xC0); // C0 is the 2-byte flag for UTF-8
+      *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
+      sequenceLen = 2;
+    } else {
+      *p++ = (char)((c >> 12) | 0xE0); // E0 is the 3-byte flag for UTF-8
+      *p++ = (char)(((c >> 6) | 0x80) & 0xBF); // next 6 bits, with high bit set
+      *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
+      sequenceLen = 3;
+    }
+
+    while (sequenceLen > 0) {
+      *posOut = i;
+      ++posOut;
+      --sequenceLen;
+    }
+  }
+
+  bufferSize = p - buffer;
+
+  *p++ = '\0';
+
+  // Record positions for \0, and the fictional character after that.
+  *posOut     = length;
+  *(posOut+1) = length+1;
+}
+
+void RegExp::prepareASCII (const UString& s)
+{
+  originalPos = 0;
+
+  // Best-effort attempt to get something done
+  // when we don't have utf 8 available -- use 
+  // truncated version, and pray for the best 
+  CString truncated = s.cstring();
+  buffer = new char[truncated.size() + 1];
+  memcpy(buffer, truncated.c_str(), truncated.size());
+  buffer[truncated.size()] = '\0'; // For _compile use
+  bufferSize = truncated.size();
+}
+
+void RegExp::prepareMatch(const UString &s)
+{
+  delete[] originalPos; // Just to be sure..
+  delete[] buffer;
+  if (utf8Support == Supported)
+    prepareUtf8(s);
+  else
+    prepareASCII(s);
+
+#ifndef NDEBUG
+  originalS = s;
+#endif
+}
+
+void RegExp::doneMatch() 
+{
+  delete[] originalPos; originalPos = 0;
+  delete[] buffer;      buffer      = 0;
+}
+
 UString RegExp::match(const UString &s, int i, int *pos, int **ovector)
 {
+#ifndef NDEBUG
+  assert(s.data() == originalS.data()); // Make sure prepareMatch got called right..
+#endif
+  assert(valid);
+
   if (i < 0)
     i = 0;
   if (ovector)
@@ -151,14 +271,28 @@
     return UString::null;
 
 #ifdef HAVE_PCREPOSIX
-  CString buffer(s.cstring());
-  int bufferSize = buffer.size();
   int ovecsize = (nrSubPatterns+1)*3; // see pcre docu
   if (ovector) *ovector = new int[ovecsize];
   if (!pcregex)
     return UString::null;
 
-  if (pcre_exec(pcregex, NULL, buffer.c_str(), bufferSize, i,
+  int startPos;
+  int nextPos;
+
+  if (utf8Support == Supported) {
+    startPos = i;
+    while (originalPos[startPos] < i)
+      ++startPos;
+
+    nextPos = startPos;
+    while (originalPos[nextPos] < (i + 1))
+      ++nextPos;
+  } else {
+    startPos = i;
+    nextPos  = i + 1;
+  }
+
+  if (pcre_exec(pcregex, NULL, buffer, bufferSize, startPos,
                 m_notEmpty ? (PCRE_NOTEMPTY | PCRE_ANCHORED) : 0, // see man pcretest
                 ovector ? *ovector : 0L, ovecsize) == PCRE_ERROR_NOMATCH)
   {
@@ -172,7 +306,7 @@
       fprintf(stderr, "No match after m_notEmpty. +1 and keep going.\n");
 #endif
       m_notEmpty = 0;
-      if (pcre_exec(pcregex, NULL, buffer.c_str(), bufferSize, i+1, 0,
+      if (pcre_exec(pcregex, NULL, buffer, bufferSize, nextPos, 0,
                     ovector ? *ovector : 0L, ovecsize) == PCRE_ERROR_NOMATCH)
         return UString::null;
     }
@@ -181,6 +315,13 @@
   }
 
   // Got a match, proceed with it.
+  // But fix up the ovector if need be..
+  if (ovector && originalPos) {
+    for (unsigned c = 0; c < 2 * (nrSubPatterns + 1); ++c) {
+      if ((*ovector)[c] != -1)
+        (*ovector)[c] = originalPos[(*ovector)[c]];
+    }
+  }
 
   if (!ovector)
     return UString::null; // don't rely on the return value if you pass ovector==0
--- branches/KDE/3.5/kdelibs/kjs/regexp.h #604410:604411
@@ -51,16 +51,39 @@
     // RegExp.exec, so it has to store $1 etc.
     // bool test(const UString &s, int i = -1);
     unsigned int subPatterns() const { return nrSubPatterns; }
+    
+    //These methods should be called around the match of the same string..
+    void prepareMatch(const UString &s);
+    void doneMatch();
   private:
     const UString pat;
-    int flgs;
+    int flgs : 8;
     bool m_notEmpty;
     bool valid;
+    
+    // Cached encoding info...
+    char* buffer;
+    int*  originalPos;
+    int   bufferSize;
 
+    void prepareUtf8  (const UString& s);
+    void prepareASCII (const UString& s);
+#ifndef NDEBUG
+    UString originalS; // the original string, used for sanity-checking
+#endif
+
 #ifndef HAVE_PCREPOSIX
     regex_t preg;
 #else
     pcre *pcregex;
+    
+    enum UTF8SupportState {
+      Unknown,
+      Supported,
+      Unsupported
+    };
+    
+    static UTF8SupportState utf8Support;
 #endif
     unsigned int nrSubPatterns;
 
--- branches/KDE/3.5/kdelibs/kjs/regexp_object.cpp #604410:604411
@@ -120,7 +120,9 @@
     RegExpObjectImp* regExpObj = static_cast<RegExpObjectImp*>(exec->lexicalInterpreter()->builtinRegExp().imp());
     int **ovector = regExpObj->registerRegexp( re, s.value() );
 
+    re->prepareMatch(s.value());
     str = re->match(s.value(), i, 0L, ovector);
+    re->doneMatch();
     regExpObj->setSubPatterns(re->subPatterns());
 
     if (id == Test)
--- branches/KDE/3.5/kdelibs/kjs/string_object.cpp #604410:604411
@@ -294,6 +294,7 @@
     }
     RegExpObjectImp* regExpObj = static_cast<RegExpObjectImp*>(exec->interpreter()->builtinRegExp().imp());
     int **ovector = regExpObj->registerRegexp(reg, s);
+    reg->prepareMatch(s);
     UString mstr = reg->match(s, -1, &pos, ovector);
     if (id == Search) {
       result = Number(pos);
@@ -316,6 +317,7 @@
 	result = exec->lexicalInterpreter()->builtinArray().construct(exec, list);
       }
     }
+    reg->doneMatch();
     delete tmpReg;
     break;
   }
@@ -337,13 +339,17 @@
       else
         u3 = a1.toString(exec); // 2nd arg is the replacement string
 
+      UString out;
+      
       // This is either a loop (if global is set) or a one-way (if not).
+      reg->prepareMatch(s);
       do {
         int **ovector = regExpObj->registerRegexp( reg, s );
         UString mstr = reg->match(s, lastIndex, &pos, ovector);
         regExpObj->setSubPatterns(reg->subPatterns());
         if (pos == -1)
           break;
+          
         len = mstr.size();
 
         UString rstr;
@@ -381,12 +387,27 @@
           Object thisObj = exec->interpreter()->globalObject();
           rstr = o1.call( exec, thisObj, l ).toString(exec);
         }
-        lastIndex = pos + rstr.size();
-        s = s.substr(0, pos) + rstr + s.substr(pos + len);
-        //fprintf(stderr,"pos=%d,len=%d,lastIndex=%d,s=%s\n",pos,len,lastIndex,s.ascii());
+
+        // Append the stuff we skipped over to get to the match --
+        // that would be [lastIndex, pos) of the original..
+        if (pos != lastIndex)
+          out += s.substr(lastIndex, pos - lastIndex);
+
+        // Append the replacement..
+        out += rstr;
+        
+        lastIndex = pos + len; // Skip over the matched stuff...
       } while (global);
 
-      result = String(s);
+      // Append the rest of the string to the output...
+      if (lastIndex == 0 && out.size() == 0) // Don't copy stuff if nothing changed
+        out = s;
+      else
+        out += s.substr(lastIndex, s.size() - lastIndex);
+
+      reg->doneMatch();
+
+      result = String(out);
     } else { // First arg is a string
       u2 = a0.toString(exec);
       pos = s.find(u2);
@@ -433,8 +454,10 @@
     if (a0.type() == ObjectType && Object::dynamicCast(a0).inherits(&RegExpImp::info)) {
       Object obj0 = Object::dynamicCast(a0);
       RegExp reg(obj0.get(exec,"source").toString(exec));
+      reg.prepareMatch(s);
       if (s.isEmpty() && !reg.match(s, 0).isNull()) {
 	// empty string matched by regexp -> empty array
+        reg.doneMatch();
 	res.put(exec, lengthPropertyName, Number(0), DontDelete|ReadOnly|DontEnum);
 	break;
       }
@@ -454,6 +477,7 @@
 	  i++;
 	}
       }
+      reg.doneMatch();
     } else {
       u2 = a0.toString(exec);
       if (u2.isEmpty()) {
Comment 14 Anton 2006-11-13 00:21:48 UTC
Thank you very much, Maksim!
The bug is fixed.
I'll try push the patch to gentoo distro.
Btw, I couldn't copy-paste the patch from here because it's a bit broken.
For example, "@" replaced with " at " and i'm not sure about spaces and tabs also.
Comment 15 Anton 2007-01-11 14:03:42 UTC
I have an another issue with konqueror and gmail.
It seems the problem in AJAX, javascript and utf-8 encoding.
It took a screenshot and uploaded it here:
http://img145.imageshack.us/img145/3959/konqgoogleiglt7.png

Shell I open an another bug report?
Comment 16 Anton 2007-01-13 05:36:10 UTC
ok, the issue looks very different for me and I opened the new bug 140005
Cheers.