110597 – String.split(RegExp) with parens is not ECMA standards-conformant

Status:	RESOLVED UNMAINTAINED

Alias:	None

Product:	konqueror
Classification:	Applications
Component:	kjs
Version:	unspecified
Platform:	Debian testing Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Konqueror Developers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2005-08-12 00:31 UTC by lupin
Modified:	2012-06-18 19:55 UTC
CC List:	1 user

See Also:
Latest Commit:
Version Fixed In:

Attachments
Improve standards compliance of string.split() function + fixes for Regex when using POSIX regex.h (2.90 KB, patch) 2006-08-26 15:10 UTC, James Thorniley

Description lupin 2005-08-12 00:31:20 UTC

Version:            (using KDE 3.3.2)
Installed from:    Debian testing/unstable Packages
OS:                Linux

According to section 15.5.4.14 of
 
  http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf

String.split, if given a RegExp with parentheses, should intersperse the split parts of the string with the matches corresponding to each pair of parens. Konqueror doesn't do this.

For example, if you type these into Konqueror's JavaScript debugger, then

  'abc'.split(/b/) should (and does) give a,c

but

  'abc'.split(/(b)/) should give a,b,c but it gives a,c
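
For reference, the expected behaviour can be checked in any engine that implements this part of the standard. The snippet below is only a minimal sketch; the variable names are illustrative and not part of the report.

```javascript
// Plain separator: the matched text is dropped from the result.
var plain = 'abc'.split(/b/);

// Capturing parentheses: the captured text is inserted between the parts.
var captured = 'abc'.split(/(b)/);

// Arrays stringify with commas, which is the form quoted in this report.
console.log(String(plain));     // a,c
console.log(String(captured));  // a,b,c in a conforming engine
```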

Comment 1 Harri Porten 2005-09-17 10:55:33 UTC

Confirmed. I hope this issue can be resolved from within kjs itself. The regular expressions themselves are handled by a 3rd party library.

Comment 2 lupin 2005-09-17 14:41:14 UTC

There is a workaround for this bug, though I would not call it a fix. With the following code in place, you can call 'abc'.parenSplit(/(b)/) instead of 'abc'.split(/(b)/).

// String.prototype.parenSplit should do what ECMAscript says 
// String.prototype.split does, interspersing paren matches between
// the split elements

if (String('abc'.split(/(b)/))!='a,b,c') {
  // broken String.split, e.g. konq, IE
  String.prototype.parenSplit=function (re) {
    // in non-strict code `this` is a String wrapper object; use a primitive copy
    var s=String(this);
    var m=re.exec(s);
    if (!m) return [s];
    // without this, we have 
    // 'ab'.parenSplit(/a|(b)/) != 'ab'.split(/a|(b)/)
    for(var i=0; i<m.length; ++i) {
      if (typeof m[i]=='undefined') m[i]='';
    }
    return [this.substring(0,m.index)]
      .concat(m.slice(1))
      .concat(this.substring(m.index+m[0].length).parenSplit(re));
  };
} else {
  String.prototype.parenSplit=function (re) {return this.split(re);};
}
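
The recursive branch above calls itself once per match, and it recurses without bound when the regular expression can match the empty string. A non-recursive variant might look like the sketch below. The name parenSplitIter is hypothetical and not from this report, the sketch ignores the optional limit argument of split, and it does not try to reproduce the standard's special case for an empty input string.

```javascript
// Hypothetical helper (not part of the report): iterative split that keeps captures.
function parenSplitIter(str, re) {
  // Build a global copy so that exec() resumes from lastIndex on each call.
  var flags = 'g' + (re.ignoreCase ? 'i' : '') + (re.multiline ? 'm' : '');
  var g = new RegExp(re.source, flags);
  var out = [];
  var last = 0; // index just past the previous separator
  var m;
  while ((m = g.exec(str)) !== null) {
    if (m[0].length === 0) {
      g.lastIndex++; // step past an empty match so the loop terminates
      // An empty match at the previous split point or at the end splits nothing.
      if (m.index === last || m.index >= str.length) continue;
    }
    out.push(str.substring(last, m.index));
    // Append the captures; a group that did not participate stays undefined.
    for (var i = 1; i < m.length; i++) out.push(m[i]);
    last = m.index + m[0].length;
  }
  out.push(str.substring(last));
  return out;
}

console.log(String(parenSplitIter('abc', /(b)/)));
```

Unlike the workaround above, this sketch leaves non-participating groups as undefined rather than replacing them with empty strings; both forms look the same once the array is converted to a string.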

Comment 3 George Staikos 2005-10-29 18:59:12 UTC

Confirmed still a problem.

Comment 4 James Thorniley 2006-08-26 15:10:07 UTC

Created attachment 17512
Improve standards compliance of string.split() function + fixes for Regex when using POSIX regex.h

Hi, please consider this patch for fixing bug 110597 (for 3.5 branch, though it
should be possible to do something similar in trunk).

The string.split function inserts matched subpatterns into the result array as
per the ECMA standard.

This is easy with PCRE, but I had to change the regex code when compiled using
regex.h.

Firstly, the nrSubPatterns variable should equal the number of subpatterns, not
including the +1 for the whole pattern (as it did before).

Also, counting the subpatterns by checking pmatch[i].rm_so != -1 does not work,
because empty subpatterns will be ignored.

E.g. 'abcd'.split(/(x)?(b)/) goes horribly wrong as the current method of
counting subpatterns will decide there are zero, because the (x)? pattern does
not match (thus pmatch[1].rm_so == -1). Note the difference between that and
/(x?)(b)/, in which case the first subpattern would match, it would just be an
empty string.
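
The distinction between the two patterns can be seen from JavaScript itself. This is only an analogue of the pmatch entries described above (regex.h is a C API and is not shown here), and the variable names are illustrative.

```javascript
// (x)? does not take part in the match, so its slot is undefined.
var optionalGroup = /(x)?(b)/.exec('abcd');

// (x?) does take part in the match and captures the empty string.
var optionalInside = /(x?)(b)/.exec('abcd');

// Both results have one slot per subpattern, plus one for the whole match.
console.log(optionalGroup.length, optionalInside.length);
console.log(typeof optionalGroup[1], typeof optionalInside[1]);
```

In both cases the number of subpatterns is two, which is why counting only the slots that matched undercounts for the first pattern.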

The only way I could find to get round this was to write some code to count the
number of patterns by looking for '(' characters that are not escaped or inside
character classes. Not ideal but I cannot see an alternative.
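
The counting approach described above could be sketched roughly as follows. This is not the code from the attached patch: countSubPatterns is a hypothetical name, the sketch is written in JavaScript rather than in the language of kjs, and it follows JavaScript escaping rules. Treating "(?" as the start of a non-capturing construct is an additional assumption that the comment does not mention.

```javascript
// Hypothetical helper: count capturing groups by scanning the pattern source.
function countSubPatterns(source) {
  var count = 0;
  var inClass = false; // true while inside a [...] character class
  for (var i = 0; i < source.length; i++) {
    var c = source.charAt(i);
    if (c === '\\') { i++; continue; } // skip the escaped character
    if (inClass) {
      if (c === ']') inClass = false;
      continue; // parentheses inside a class are literal
    }
    if (c === '[') { inClass = true; continue; }
    // Assumption: "(?" opens a non-capturing construct such as (?: or (?=
    if (c === '(' && source.charAt(i + 1) !== '?') count++;
  }
  return count;
}

console.log(countSubPatterns('(x)?(b)')); // 2
```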

Hopefully everyone uses PCRE anyway - it doesn't seem that regex.h supports
utf-8 either.

James

Comment 5 Maksim Orlovich 2006-08-26 15:21:30 UTC

Hi, I'll try to take a look at your patch in a week or so, but I can't guarantee it. I am fairly sure I won't be able to get to it during the coming week, though maybe someone else will. And yes, it's pretty much expected that anyone not doing an embedded build would use libpcre.

Thank you for your contribution.

Comment 6 Myriam Schweingruber 2012-06-18 19:55:01 UTC

Message from the Bugsquad and Konqueror teams:
This bug is closed as outdated, as we do not have the manpower to maintain the KDE3 version anymore.
If you can still reproduce this issue with Konqueror 4.8.4 or later, please open a new report.
Thank you for your understanding.