Bug 426856 - File encoding is not always stored
Summary: File encoding is not always stored
Status: REOPENED
Alias: None
Product: KBibTeX
Classification: Applications
Component: Loading/saving files
Version: git (master)
Platform: Manjaro Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Thomas Fischer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-22 08:43 UTC by nobodyinperson
Modified: 2023-11-21 20:35 UTC

See Also:
Latest Commit:
Version Fixed In: 0.10.1


Description nobodyinperson 2020-09-22 08:43:39 UTC
SUMMARY

Latest development KBibTeX (kbibtex-git-r3368.52ab3ba1-1) does not store every encoding set in the file settings.

STEPS TO REPRODUCE

1. Create a file:

-----  encoding.bib  -------
@comment{x-kbibtex-encoding=us-ascii}

@article{encodingfail2020,
	author = {Author},
	journal = {Journal},
	title = {Encoding with special character $\mu$},
	year = {2020}
}
----------------------------



2. Open the file in KBibTeX
- The Encoding is correctly set to US-ASCII in the file settings
- The special escaped LaTeX character \mu is replaced with µ
3. Simulate changing something so that KBibTeX allows saving the file
4. Save the file

OBSERVED RESULT

The encoding line

@comment{x-kbibtex-encoding=us-ascii}

is removed from the file, so KBibTeX interprets the file as UTF-8 the next time it is opened. Subsequent open/save cycles are stable and produce the following file:

------------ encoding.bib --------------
@comment{x-kbibtex-encoding=utf-8}

@article{encodingfail2020,
	author = {Author},
	journal = {Journal},
	title = {Encoding with special character $\ensuremath{μ}$},
	year = {2020}
}
-----------------------------------------




EXPECTED RESULT


KBibTeX keeps the file in US-ASCII so there are no problems with old BibTeX versions that are incapable of handling Unicode.


SOFTWARE/OS VERSIONS
Up-to-date Manjaro, KBibTeX built from the kbibtex-git AUR package
Comment 1 nobodyinperson 2020-09-22 08:44:49 UTC
BTW kbibtex stable 0.9.2 does not exhibit this bug.
Comment 2 nobodyinperson 2020-09-22 08:49:07 UTC
Furthermore, if one resets the file encoding to US-ASCII by hand each time when opening the file, there is an \ensuremath{...} runaway on each open/save cycle, causing the file to blow up like this:


@article{encodingfail2020,
	author = {Author},
	journal = {Journal},
	title = {Encoding with special character $\ensuremath{\ensuremath{\ensuremath{\ensuremath{\mu}}}}$},
	year = {2020}
}
Comment 3 Thomas Fischer 2020-11-26 21:59:26 UTC
Sorry for the late response.

This bug report actually documents two problems:
1. The (mis)handling of "US-ASCII"  -- and --
2. Writing $\ensuremath{μ}$ instead of $\mu$.

For the first problem, I have decided to remove US-ASCII from the list of encodings as it is redundant. You effectively get US-ASCII if you choose either UTF-8 or LaTeX and restrict yourself to characters defined in US-ASCII. However, whereas US-ASCII cannot handle 'Ä' or 'Æ', both UTF-8 and 'LaTeX' can, each in its own way.

The second problem was mostly about the 'LaTeX encoder' failing to recognize that the dollar signs had already opened a math environment and that \ensuremath was therefore unnecessary. The issue with μ is more complex than it seems: on the LaTeX side there are \mu, \textmu, and \textmugreek, and on the Unicode side there are the Greek letter μ (U+03BC), the 'micro' symbol 'µ' (U+00B5), and some more special 'mu' symbols. Mapping between both sides is not obvious. KBibTeX's guesswork on the mapping has hopefully been improved now.
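
The math-mode point can be illustrated with a minimal Python sketch. This is not KBibTeX's actual encoder (which is written in C++); `encode_mu` is a hypothetical name, and treating every unescaped `$` as a math-mode toggle is a simplifying assumption.

```python
def encode_mu(text: str) -> str:
    """Replace Greek mu (U+03BC) with a LaTeX command (hypothetical helper).

    Inside $...$ the bare command is enough; outside math mode the
    command is wrapped in an ensuremath group.
    Assumption: every unescaped '$' toggles math mode.
    """
    out = []
    in_math = False
    prev = ""
    for ch in text:
        if ch == "$" and prev != "\\":
            in_math = not in_math
            out.append(ch)
        elif ch == "\u03bc":
            out.append("\\mu" if in_math else "\\ensuremath{\\mu}")
        else:
            out.append(ch)
        prev = ch
    return "".join(out)


title = "Encoding with special character $\u03bc$"
once = encode_mu(title)
twice = encode_mu(once)  # the text is already ASCII, so nothing is rewritten
print(once)
```

Because the second pass only ever touches the Unicode character itself, repeated open/save cycles cannot nest further \ensuremath groups in this sketch.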

Those changes have been integrated both into the 'master' code (not yet pushed at the time of writing) and a bugfix branch based on 'kbibtex/0.10'. The bugfix branch contains the minimum changes necessary to fix the bug; the 'master' changes include additional refactoring that does not belong in an almost-stable branch.
So, please check the bugfix branch first. I would like to refine the commits for the 'master' based on the feedback I receive here:
https://invent.kde.org/thomasfischer/kbibtex/commit/423a161dc5f44c8e7f0c873258dadc050f25acd6
Comment 4 Bug Janitor Service 2020-12-11 04:34:02 UTC
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
Comment 5 Thomas Fischer 2020-12-23 19:21:52 UTC
Git commit 1e649222ed54060eb561fcc5b70568ba7f6098fb by Thomas Fischer.
Committed on 23/12/2020 at 19:18.
Pushed by thomasfischer into branch 'kbibtex/0.10'.

Improving recognizing encoding of a BibTeX file to load

Drawing from commits in the 'master' branch in order to improve
recognizing BibTeX/BibLaTeX files' encoding.
- 8e473758e99f30cf3d61fa0b1: Guessing file's encoding based on bit patterns
- fba235cf5d0494b8189a1fca3: Refactoring FileImporterBibTeX
FIXED-IN: 0.10

One effect is that opened ASCII-only files are not directly
classified as UTF-8, but stay ASCII-encoded as far as possible by
classifying them as 'LaTeX'-encoded.

M  +2    -2    src/io/fileimporter.h
M  +177  -37   src/io/fileimporterbibtex.cpp
M  +2    -0    src/io/fileimporterbibtex.h

https://invent.kde.org/office/kbibtex/commit/1e649222ed54060eb561fcc5b70568ba7f6098fb
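
The classification described in this commit message can be sketched roughly as follows. This is a simplified Python illustration, not the code from fileimporterbibtex.cpp; the function name `guess_encoding` and the `"unknown"` label are assumptions, while `"latex"` and `"utf-8"` mirror the `x-kbibtex-encoding` values seen in this report.

```python
def guess_encoding(raw: bytes) -> str:
    """Hypothetical byte-pattern classifier for a .bib file's content."""
    # Pure 7-bit content stays ASCII on disk if it is treated as 'LaTeX'.
    if all(b < 0x80 for b in raw):
        return "latex"
    # Otherwise check whether the high bytes form valid UTF-8 sequences.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"  # assumption: some other 8-bit encoding
    return "utf-8"


print(guess_encoding(b"title = {Encoding with $\\mu$}"))
```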
Comment 6 neuroshock 2023-07-22 15:18:04 UTC
I'm still seeing the same problem. I'm using KBibTeX 0.10.0 on Fedora Linux 38.

When I save a .bib file in KBibTeX with "Encoding: LaTeX" in File Settings, and then close and reopen the .bib file in KBibTeX, it forgets the encoding and replaces it with UTF-8.

If I open the .bib file in a text editor, here's what I see:

-> On saving the .bib file with "Encoding: LaTeX" in file settings, and opening it up in a text editor, I can see that special characters are correctly written in LaTeX format. (ö becomes {\"o}, etc.). However, the first line of the .bib file is @comment{x-kbibtex-encoding=utf-8}. And if I close and reopen the .bib file in KBibTeX, it incorrectly identifies the encoding in File Settings as UTF-8. So if I save again in KBibTeX, the .bib file goes back to the wrong encoding. ({\"o} goes back to ö, etc.)

-> If I manually choose "Encoding: LaTeX" in KBibTeX and save again, and then open the .bib file in a text editor, and manually change the first line from @comment{x-kbibtex-encoding=utf-8} to @comment{x-kbibtex-encoding=latex}, and then re-open the .bib file in KBibTeX, it correctly identifies the encoding as LaTeX in File Settings.

-> But, when I save the .bib file in KBibTeX, that first line of the .bib file goes back to @comment{x-kbibtex-encoding=utf-8}, and the whole process repeats.

-> What this means is that every time I save my .bib file in KBibTeX, I have to manually choose "Encoding: LaTeX" each time. If I forget, then KBibTeX saves the file as UTF-8.

The solution would be for KBibTeX to make that first line @comment{x-kbibtex-encoding=latex} automatically if I've selected "Encoding: LaTeX" in File Settings.
Comment 7 nobodyinperson 2023-07-24 11:17:30 UTC
Thanks for bringing this up again. It's currently also a big pain point for me. It seems that KBibTeX v0.10.0 doesn't know how to encode some Unicode characters (non-breaking spaces, weird dashes, etc.) to LaTeX. When run from the terminal, these are the errors I get (a location in the file would be helpful):

```bash
kbibtex.io: Don't know how to encode Unicode char "0x00a0"
kbibtex.io: Don't know how to encode Unicode char "0x2010"
kbibtex.io: Don't know how to encode Unicode char "0x2010"
kbibtex.io: Don't know how to encode Unicode char "0x202f"
```

When I find-replace those characters in the file (in vim, do `:%s/\%u202f/ /g`, `:%s/\%u00a0/ /g`, `:%s/\%u2010/-/g`, etc.), KBibTeX is finally stable when saving the encoding again and stays at LaTeX encoding. 😮‍💨
Comment 8 neuroshock 2023-07-25 03:44:54 UTC
(In reply to nobodyinperson from comment #7)

You very much helped me here! I dug through my (large) .bib file in a text editor, and eventually found some characters that KBibTeX was having trouble converting to LaTeX. In my case, they were the characters α (alpha) and ’ (single right apostrophe). Once I replaced them with LaTeX-friendly characters, namely, {$\alpha$} and ', the problem disappeared, and KBibTeX is now stably saving with LaTeX encoding.

So the lesson here is to make sure there are no troublesome characters in the .bib file! Ideally KBibTeX would handle this problem more elegantly, either warning the user about bad characters, or at least not encoding the whole file as UTF-8.
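
For anyone hunting for such characters in a large file, a small Python sketch like the following lists every non-ASCII character with its position. It reports all non-ASCII characters, not only the ones KBibTeX cannot map, and `find_non_ascii` is just an illustrative name.

```python
import unicodedata


def find_non_ascii(text: str):
    """Yield (line, column, code point, Unicode name) per non-ASCII character."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "UNKNOWN")
                yield line_no, col_no, f"U+{ord(ch):04X}", name


sample = "@article{x,\n\ttitle = {The \u03b1 subunit\u2019s role}\n}"
hits = list(find_non_ascii(sample))
for hit in hits:
    print(*hit)
```

To check a real file, read it as UTF-8 and pass the contents in, e.g. `find_non_ascii(open("library.bib", encoding="utf-8").read())`, where `library.bib` stands for your own file.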

Thanks!
Comment 9 Thomas Fischer 2023-11-21 20:35:48 UTC
Comment #7 has the important point here: probably every one of your BibTeX files where you reported that KBibTeX switched from LaTeX encoding to UTF-8 was affected because KBibTeX did not know how to map a Unicode character to a LaTeX equivalent. Thus it was falling back to UTF-8 encoding in order to preserve the data, i.e. it is a feature, not a bug ;-)
This is for two reasons: First, the mapping is manually crafted and simply does not cover the thousands of characters and symbols that are in use. Second, for some symbols, no clear mapping is possible. One particular example is the Greek letter mu. Unicode knows U+00B5 (micro sign), U+03BC (Greek small letter mu), U+1D6CD (Mathematical bold small mu), and possibly others. On the LaTeX side you have \mu (in math mode), \textmu, \upmu, \muup, \textmugreek, and possibly others.

Anyhow, I added U+00A0, U+2010, and U+202F to the manual mapping, as those were mentioned earlier and seem most pressing. U+2010 and U+202F are "unidirectional", i.e. they will be mapped to a simple ASCII dash/minus/hyphen and '\,', respectively, and when again encoded to UTF-8 will stay ASCII minus or become U+2009, respectively.
If you want to add more manual mappings or update existing ones, please let me know, e.g. by commenting in this bug report and providing both Unicode number and corresponding LaTeX command.
The manual mapping is coded in src/io/encoderlatex.cpp, in case you want to look at the technical details.
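
The fallback behavior described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in src/io/encoderlatex.cpp: `LATEX_MAP` contains only a few entries taken from this report, and `encode_for_save` is a hypothetical name.

```python
# Illustrative table only; the real mapping is much larger.
LATEX_MAP = {
    "\u00f6": '{\\"o}',  # o with diaeresis, as seen in comment 6
    "\u2010": "-",       # hyphen -> ASCII hyphen
    "\u202f": "\\,",     # narrow no-break space -> thin space command
}


def encode_for_save(text: str, mapping: dict[str, str]) -> tuple[str, str]:
    """Return (text to write, encoding label) for a file set to 'LaTeX'."""
    unmapped = [ch for ch in text if ord(ch) > 0x7F and ch not in mapping]
    if unmapped:
        # No LaTeX equivalent known: keep the data intact and store UTF-8.
        return text, "utf-8"
    return "".join(mapping.get(ch, ch) for ch in text), "latex"


print(encode_for_save("K\u00f6ln", LATEX_MAP))
print(encode_for_save("K\u00f6ln \u03b1", LATEX_MAP))
```

In this sketch a single unmapped character (here α) switches the whole file to UTF-8, which matches the all-or-nothing behavior reported in comments 6 to 8.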
