Bug 298010 - Missing Unicode blocks; ordering should be improved
Summary: Missing Unicode blocks; ordering should be improved
Status: RESOLVED FIXED
Alias: None
Product: kcharselect
Classification: Applications
Component: general
Version: unspecified
Platform: Ubuntu Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Daniel Laidig
 
Reported: 2012-04-13 02:27 UTC by John Zaitseff
Modified: 2016-11-02 03:37 UTC

Attachments
Add missing Unicode blocks, rearrange some blocks (1.96 KB, patch)
2012-04-13 02:31 UTC, John Zaitseff
Diff from running kcharselect-generate-datafile.py (2.50 KB, patch)
2012-04-13 02:33 UTC, John Zaitseff
Add missing Unicode blocks, rearrange some blocks (3.17 KB, patch)
2014-07-02 02:02 UTC, John Zaitseff
Diff from running kcharselect-generate-datafile.py (3.17 KB, patch)
2014-07-02 02:03 UTC, John Zaitseff
Generated kcharselect-data (2.86 MB, application/octet-stream)
2014-07-02 02:04 UTC, John Zaitseff
Diff from running kcharselect-generate-datafile.py (4.10 KB, patch)
2014-07-02 02:11 UTC, John Zaitseff
Add missing Unicode blocks, rearrange some blocks (for KF5) (9.50 KB, patch)
2016-05-06 12:39 UTC, John Zaitseff
Generated kcharselect-data file (2.87 MB, application/octet-stream)
2016-05-06 12:40 UTC, John Zaitseff
Diff from running kcharselect-generate-datafile.py (7.25 KB, patch)
2016-05-06 12:43 UTC, John Zaitseff

Description John Zaitseff 2012-04-13 02:27:15 UTC
The current version of kcharselect (including that in the Git repository) has a number of Unicode blocks missing:

Arabic Extended-A
Meetei Mayek Extensions
Sundanese Supplement

The section name "Other Scripts" should be "American Scripts" to conform with the Unicode charts at http://www.unicode.org/charts/.

In addition, I would like the block "General Punctuation" to be listed first in "Symbols", as it is by far the most commonly used block.  Listing it first avoids having to select it from the secondary drop-down box, which saves at least a few keystrokes or mouse movements (these get annoying rather quickly).

I will attach a patch that fixes all of these problems.
Comment 1 John Zaitseff 2012-04-13 02:31:04 UTC
Created attachment 70357 [details]
Add missing Unicode blocks, rearrange some blocks

1. The following missing Unicode blocks are added:

Arabic Extended-A
Meetei Mayek Extensions
Sundanese Supplement

2. The section "Other Scripts" is renamed "American Scripts" to match the Unicode charts.

3. "General Punctuation" is listed first in the section "Symbols".

4. Various blocks in the section "Mathematical Symbols" are rearranged into alphabetical order.
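
The ordering described in items 3 and 4 can be sketched in Python as follows. This is only an illustration: the patch itself rearranges the list by hand, and the helper name order_blocks and its pinned parameter are hypothetical, not part of kcharselect-generate-datafile.py.

```python
def order_blocks(blocks, pinned=()):
    """Return block names in alphabetical order, with pinned names first.

    Hypothetical helper for illustration; the real patch edits the
    block list manually rather than sorting it in code.
    """
    pinned_present = [name for name in pinned if name in blocks]
    rest = sorted(name for name in blocks if name not in pinned_present)
    return pinned_present + rest


# A few blocks from the "Symbols" section, in their old order.
symbols = [
    "Braille Patterns",
    "Control Pictures",
    "Currency Symbols",
    "Dingbats",
    "Enclosed Alphanumerics",
    "General Punctuation",
    "Miscellaneous Symbols",
]

ordered = order_blocks(symbols, pinned=["General Punctuation"])
print(ordered)
```

Pinning keeps the most frequently used block at the top while the remaining blocks stay easy to scan alphabetically.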
Comment 2 John Zaitseff 2012-04-13 02:33:57 UTC
Created attachment 70358 [details]
Diff from running kcharselect-generate-datafile.py

Diff created by running the patched kcharselect-generate-datafile.py script on Unicode 6.1 data, as compared with the kdelibs Git repository.
Comment 3 John Zaitseff 2012-04-13 02:59:55 UTC
For those who want to fix the issues mentioned in this bug without waiting (and without even recompiling kcharselect), all that needs to be done is to place the new version of kcharselect-data in the appropriate directory.  I have successfully used the new version of kcharselect-data with KDE SC 4.8.2.  Of course, this means that one new string ("American Scripts") will not be localised.

To fix this bug on a temporary basis:

1. Download the new kcharselect-data from my website:

    http://www.zap.org.au/~john/misc/kcharselect-data

2. Place the downloaded file in your KDE apps directory:

    cp kcharselect-data ~/.kde/share/apps/kcharselect

(Your path may be slightly different.)  KCharSelect will now display the missing Unicode blocks.

3. Remember to remove your copy of kcharselect-data once the upstream version of the file is fixed.
Comment 4 John Zaitseff 2014-07-02 01:56:09 UTC
I have updated kcharselect-data for Unicode 7.0.  In particular, all missing BMP blocks have been added:

  Latin Extended-E
  Arabic Extended-A
  Meetei Mayek Extensions
  Myanmar Extended-B
  Sundanese Supplement
  Combining Diacritical Marks Extended

I have also renamed and rearranged the sections to more closely match those on the Unicode website at http://www.unicode.org/charts/ :

  European Alphabets => European Scripts
  Philippine Scripts => Indonesia and Oceania Scripts
  South East Asian Scripts => Southeast Asian Scripts
  Other Scripts => American Scripts

Comment 3 remains valid: users can update their kcharselect-data immediately, if desired.

It would be nice to add all the non-BMP characters to KCharSelect, but given that QChar is 16-bit only, I'm not sure how one would do that.  Even in Qt5, QChar is still 16-bit...
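
To illustrate why a 16-bit code unit is the obstacle here: in UTF-16, a character above U+FFFF is represented by two 16-bit units (a surrogate pair), so it cannot fit in a single 16-bit value. The following Python sketch shows the standard UTF-16 split; the function name to_utf16_units is made up for this example and is unrelated to any Qt or KDE API.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0xFFFF:
        # BMP character: fits in a single 16-bit unit.
        return [code_point]
    # Non-BMP character: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


bmp_units = to_utf16_units(0x2014)      # a General Punctuation character
emoji_units = to_utf16_units(0x1F600)   # a character outside the BMP
print([hex(u) for u in bmp_units], [hex(u) for u in emoji_units])
```

Any widget that stores exactly one 16-bit unit per character would therefore need to handle pairs of units (or full code points) before non-BMP blocks could be shown.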

I'm disappointed that my previous patch, created two years ago, was not applied.  Is KCharSelect being maintained?  Please apply this patch!
Comment 5 John Zaitseff 2014-07-02 01:59:54 UTC
Comment on attachment 70357 [details]
Add missing Unicode blocks, rearrange some blocks

diff -ruNa kcharselect.orig/kcharselect-generate-datafile.py kcharselect/kcharselect-generate-datafile.py
--- kcharselect.orig/kcharselect-generate-datafile.py	2014-07-02 09:35:18.516222690 +1000
+++ kcharselect/kcharselect-generate-datafile.py	2014-07-02 09:32:04.825658372 +1000
@@ -102,13 +102,14 @@
 
 # based on http://www.unicode.org/charts/
 sectiondata = '''
-SECTION European Alphabets
+SECTION European Scripts
 Basic Latin
 Latin-1 Supplement
 Latin Extended-A
 Latin Extended-B
 Latin Extended-C
 Latin Extended-D
+Latin Extended-E
 Latin Extended Additional
 Armenian
 Coptic
@@ -137,6 +138,7 @@
 SECTION Middle Eastern Scripts
 Arabic
 Arabic Supplement
+Arabic Extended-A
 Arabic Presentation Forms-A
 Arabic Presentation Forms-B
 Hebrew
@@ -144,6 +146,11 @@
 Samaritan
 Syriac
 
+SECTION Central Asian Scripts
+Mongolian
+Phags-pa
+Tibetan
+
 SECTION South Asian Scripts
 Bengali
 Common Indic Number Forms
@@ -156,6 +163,7 @@
 Limbu
 Malayalam
 Meetei Mayek
+Meetei Mayek Extensions
 Ol Chiki
 Oriya
 Saurashtra
@@ -166,33 +174,34 @@
 Thaana
 Vedic Extensions
 
-SECTION Philippine Scripts
-Buhid
-Hanunoo
-Tagalog
-Tagbanwa
-
-
-SECTION South East Asian Scripts
-Balinese
-Batak
-Buginese
+SECTION Southeast Asian Scripts
 Cham
-Javanese
 Kayah Li
 Khmer
 Khmer Symbols
 Lao
 Myanmar
 Myanmar Extended-A
+Myanmar Extended-B
 New Tai Lue
-Rejang
-Sundanese
 Tai Le
 Tai Tham
 Tai Viet
 Thai
 
+SECTION Indonesia and Oceania Scripts
+Balinese
+Batak
+Buginese
+Buhid
+Hanunoo
+Javanese
+Rejang
+Sundanese
+Sundanese Supplement
+Tagalog
+Tagbanwa
+
 SECTION East Asian Scripts
 Bopomofo
 Bopomofo Extended
@@ -220,23 +229,18 @@
 Yi Radicals
 Yi Syllables
 
-SECTION Central Asian Scripts
-Mongolian
-Phags-pa
-Tibetan
-
-SECTION Other Scripts
+SECTION American Scripts
 Cherokee
 Unified Canadian Aboriginal Syllabics
 Unified Canadian Aboriginal Syllabics Extended
 
 SECTION Symbols
+General Punctuation
 Braille Patterns
 Control Pictures
 Currency Symbols
 Dingbats
 Enclosed Alphanumerics
-General Punctuation
 Miscellaneous Symbols
 Miscellaneous Technical
 Optical Character Recognition
@@ -249,17 +253,17 @@
 Arrows
 Block Elements
 Box Drawing
-Supplemental Arrows-A
-Supplemental Arrows-B
 Geometric Shapes
 Letterlike Symbols
 Mathematical Operators
-Supplemental Mathematical Operators
 Miscellaneous Mathematical Symbols-A
 Miscellaneous Mathematical Symbols-B
 Miscellaneous Symbols and Arrows
 Number Forms
 Superscripts and Subscripts
+Supplemental Arrows-A
+Supplemental Arrows-B
+Supplemental Mathematical Operators
 
 SECTION Phonetic Symbols
 IPA Extensions
@@ -268,8 +272,9 @@
 Phonetic Extensions Supplement
 Spacing Modifier Letters
 
-SECTION Combining Diacritical Marks
+SECTION Combining Diacritics
 Combining Diacritical Marks
+Combining Diacritical Marks Extended
 Combining Diacritical Marks Supplement
 Combining Diacritical Marks for Symbols
 Combining Half Marks
@@ -284,7 +289,6 @@
 Specials
 Variation Selectors
 '''
-# TODO: rename "Other Scripts" to "American Scripts"
 
 categoryMap = { # same values as QChar::Category
     "Mn": 1,
@@ -533,7 +537,7 @@
 
     def getBlockList(self):
         return self.blockList
-    
+
     def getSectionList(self):
         return self.sectionList
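
For readers unfamiliar with the sectiondata format edited in the diff above, here is a minimal Python sketch of how such text could be turned into a section-to-blocks mapping. The function parse_sections is an assumption made for illustration; the actual parsing code in kcharselect-generate-datafile.py is not shown in this report and may differ.

```python
def parse_sections(sectiondata):
    """Map each 'SECTION <name>' header to the block names listed under it.

    Illustrative only; not taken from kcharselect-generate-datafile.py.
    """
    sections = {}
    current = None
    for line in sectiondata.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines merely separate sections
        if line.startswith("SECTION "):
            current = line[len("SECTION "):]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


sample = '''
SECTION American Scripts
Cherokee
Unified Canadian Aboriginal Syllabics
Unified Canadian Aboriginal Syllabics Extended

SECTION Central Asian Scripts
Mongolian
Phags-pa
Tibetan
'''

sections = parse_sections(sample)
for name, blocks in sections.items():
    print(name, "->", blocks)
```

Because the format is a flat list under each header, renaming a section or moving a block is a one-line change, which is why the patch is small.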
Comment 6 John Zaitseff 2014-07-02 02:02:16 UTC
Created attachment 87507 [details]
Add missing Unicode blocks, rearrange some blocks
Comment 7 John Zaitseff 2014-07-02 02:03:12 UTC
Created attachment 87508 [details]
Diff from running kcharselect-generate-datafile.py
Comment 8 John Zaitseff 2014-07-02 02:04:15 UTC
Created attachment 87509 [details]
Generated kcharselect-data
Comment 9 John Zaitseff 2014-07-02 02:11:04 UTC
Created attachment 87510 [details]
Diff from running kcharselect-generate-datafile.py
Comment 10 Christoph Feck 2014-07-21 15:04:32 UTC
I am not sure whether the patches require an updated Qt; in other words, whether KCharSelect can only support the Unicode version that Qt supports. Hopefully the KCharSelect maintainer will find some time to review the changes and provide some useful comments.

On the other hand, thinking about bug 142625, KCharSelect might need an overhaul anyway.

Again, thanks for the patches. I guess it would be useful if you applied for a KDE developer account.
Comment 11 John Zaitseff 2014-08-06 00:29:51 UTC
My patches do NOT require an updated Qt that can handle non-BMP characters.
Comment 12 John Zaitseff 2016-05-06 12:39:08 UTC
Created attachment 98809 [details]
Add missing Unicode blocks, rearrange some blocks (for KF5)

Updated for the Git version of kcharselect as at 2016-04-06 (for KF5).
Comment 13 John Zaitseff 2016-05-06 12:40:31 UTC
Created attachment 98810 [details]
Generated kcharselect-data file

Updated for KF5 and Unicode 8.0
Comment 14 John Zaitseff 2016-05-06 12:43:25 UTC
Created attachment 98811 [details]
Diff from running kcharselect-generate-datafile.py

The change generated by running the updated kcharselect-generate-datafile.py on data obtained from Unicode 8.0.  The resulting kcharselect-translation.cpp file needs to be placed in kwidgetsaddons/src (KF5).
Comment 15 John Zaitseff 2016-05-06 12:51:29 UTC
I am really disappointed that no one has applied these simple patches in over four years!  These patches are still very much necessary.

I have updated the patch for KF5 and Unicode 8.0.  The resulting data file can also be used with KDE4.  Only BMP characters (<= 0xFFFF) are included for compatibility.
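
As a rough illustration of the BMP restriction, the check amounts to keeping only blocks whose last code point is at most 0xFFFF. The Python sketch below uses a few well-known Unicode block ranges; the function bmp_only and the tuple layout are assumptions for this example, not the generator script's actual data structures.

```python
BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def bmp_only(blocks):
    """Keep (name, start, end) entries that lie entirely within the BMP."""
    return [block for block in blocks if block[2] <= BMP_MAX]


# (block name, first code point, last code point)
candidates = [
    ("Cherokee", 0x13A0, 0x13FF),
    ("Sundanese Supplement", 0x1CC0, 0x1CCF),
    ("General Punctuation", 0x2000, 0x206F),
    ("Emoticons", 0x1F600, 0x1F64F),  # outside the BMP, so it is dropped
]

kept = bmp_only(candidates)
print([name for name, _, _ in kept])
```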

====

For those who want to fix the issues mentioned in this bug without waiting (and without even recompiling kcharselect), all that needs to be done is to place the new version of kcharselect-data in the appropriate directory. I have successfully used the new version of kcharselect-data with the latest KF5. Of course, this means that one new string ("American Scripts") will not be localised.

To fix this bug on a temporary basis:

1. Download the new kcharselect-data from my website:

    http://www.zap.org.au/~john/misc/kcharselect-data

2. Place the downloaded file in the appropriate KF5 and KDE4 directories:

    mkdir -p ~/.local/share/kf5/kcharselect
    cp kcharselect-data ~/.local/share/kf5/kcharselect
    mkdir -p ~/.kde/share/apps/kcharselect
    cp kcharselect-data ~/.kde/share/apps/kcharselect

(Your paths may be slightly different.)  KCharSelect will now display the missing Unicode blocks.

3. Remember to remove your copies of kcharselect-data once the upstream versions of the file are fixed.  And I hope that will be soon!
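
For anyone who prefers to script step 2, the copy could be sketched in Python roughly as follows. The helper install_data_file is hypothetical, and the demonstration deliberately runs inside a temporary directory with a placeholder file, so it does not touch the real ~/.local or ~/.kde paths.

```python
import shutil
import tempfile
from pathlib import Path


def install_data_file(source, target_dirs):
    """Copy source into each target directory, creating directories as needed.

    Hypothetical helper; mirrors the mkdir -p / cp commands in step 2.
    """
    installed = []
    for directory in target_dirs:
        directory = Path(directory).expanduser()
        directory.mkdir(parents=True, exist_ok=True)
        installed.append(Path(shutil.copy(source, directory)))
    return installed


# Demonstration in a throwaway directory; nothing in $HOME is modified.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "kcharselect-data"
    source.write_bytes(b"placeholder")  # stand-in for the downloaded file
    targets = [
        Path(tmp) / "kf5" / "kcharselect",
        Path(tmp) / "kde4" / "kcharselect",
    ]
    installed = install_data_file(source, targets)
    all_present = all(path.is_file() for path in installed)
    print(all_present)
```

For real use, the targets would be the two directories named in step 2 (for example "~/.local/share/kf5/kcharselect" and "~/.kde/share/apps/kcharselect"), adjusted to your installation.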
Comment 16 Chris 2016-05-23 04:13:43 UTC
KCharSelect is pretty out of date, which is too bad. It also needs emoji support, which GNOME Character Map has. I feel strange using gucharmap on my Plasma desktop, but it's my only option right now.
Comment 17 Christoph Feck 2016-07-14 19:43:47 UTC
Git commit 9ba72a807a18da73c05e3e99f1c9799cf95f0c36 by Christoph Feck, on behalf of John Zaitseff.
Committed on 14/07/2016 at 19:41.
Pushed by cfeck into branch 'master'.

Add missing Unicode blocks; improve ordering
Reviewed by Christoph Feck

M  +78   -66   kcharselect-generate-datafile.py

http://commits.kde.org/kcharselect/9ba72a807a18da73c05e3e99f1c9799cf95f0c36
Comment 18 Christoph Feck 2016-11-02 03:37:50 UTC
Git commit deeb355fea88559dea8e36150db8f55f22c5a494 by Christoph Feck, on behalf of John Zaitseff.
Committed on 14/07/2016 at 19:41.
Pushed by bshah into branch 'master'.

Add missing Unicode blocks; improve ordering
Reviewed by Christoph Feck

M  +78   -66   kcharselect-generate-datafile.py

http://commits.kde.org/kwidgetsaddons/deeb355fea88559dea8e36150db8f55f22c5a494