Bug 426856 - File encoding is not always stored
Alias: None
Product: KBibTeX
Classification: Applications
Component: Loading/saving files
Version: git (master)
Platform: Manjaro Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Thomas Fischer
Depends on:
Reported: 2020-09-22 08:43 UTC by nobodyinperson
Modified: 2020-12-23 19:21 UTC (History)
0 users

See Also:
Latest Commit:
Version Fixed In: 0.10


Description nobodyinperson 2020-09-22 08:43:39 UTC

Latest development KBibTeX (kbibtex-git-r3368.52ab3ba1-1) does not store every encoding set in the file settings.


1. Create a file:

-----  encoding.bib  -------

	author = {Author},
	journal = {Journal},
	title = {Encoding with special character $\mu$},
	year = {2020}

2. Open the file in KBibTeX
- The Encoding is correctly set to US-ASCII in the file settings
- The special escaped LaTeX character \mu is replaced with µ
3. Simulate changing something so that KBibTeX allows saving the file
4. Save the file


The encoding line


Is removed from the file, causing future openings in KBibTeX to interpret the file as UTF-8. Future open/save cycles are stable, leading to the following file:

------------ encoding.bib --------------

	author = {Author},
	journal = {Journal},
	title = {Encoding with special character $\ensuremath{μ}$},
	year = {2020}


Expected: KBibTeX keeps the file in US-ASCII, so there are no problems with old BibTeX versions incapable of handling Unicode.

Up-to-date Manjaro, KBibTeX built from the kbibtex-git AUR package
Comment 1 nobodyinperson 2020-09-22 08:44:49 UTC
BTW kbibtex stable 0.9.2 does not exhibit this bug.
Comment 2 nobodyinperson 2020-09-22 08:49:07 UTC
Furthermore, if one resets the file encoding to US-ASCII by hand each time when opening the file, there is an \ensuremath{...} runaway on each open/save cycle, causing the file to blow up like this:

	author = {Author},
	journal = {Journal},
	title = {Encoding with special character $\ensuremath{\ensuremath{\ensuremath{\ensuremath{\mu}}}}$},
	year = {2020}
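
A minimal sketch of how such a runaway can arise, assuming (this is a guess, not KBibTeX's actual code) that loading turns \mu into μ and saving turns every μ into \ensuremath{\mu} without checking the surrounding context. The function names are hypothetical.

```python
def naive_decode(text: str) -> str:
    """Assumed load step: turn the LaTeX command into the Unicode character."""
    return text.replace(r"\mu", "μ")


def naive_encode(text: str) -> str:
    """Assumed save step: always wrap the command, ignoring existing wrappers."""
    return text.replace("μ", r"\ensuremath{\mu}")


def round_trips(title: str, cycles: int) -> str:
    """Apply `cycles` open/save cycles to a field value."""
    for _ in range(cycles):
        title = naive_encode(naive_decode(title))
    return title


# Each cycle adds one more \ensuremath{...} layer around \mu.
print(round_trips(r"$\mu$", 4))
```

Under these assumptions the encode step is not idempotent, so the field grows by one wrapper per cycle.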
Comment 3 Thomas Fischer 2020-11-26 21:59:26 UTC
Sorry for the late response.

This bug report actually documents two problems:
1. The (mis)handling of "US-ASCII"  -- and --
2. Writing $\ensuremath{μ}$ instead of $\mu$.

For the first problem, I have decided to remove US-ASCII from the list of encodings as it is redundant. You effectively have US-ASCII if you choose either UTF-8 or LaTeX (and restrict yourself to characters defined in US-ASCII). However, whereas US-ASCII cannot handle 'Ä' or 'Æ', both UTF-8 and 'LaTeX' can, each in its own way.
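
The redundancy can be checked with a short Python sketch (not KBibTeX code; the helper `can_encode` and the small LaTeX table are illustrative assumptions): ASCII-only text produces identical bytes under both codecs, while 'Ä' and 'Æ' need either UTF-8 or a LaTeX command.

```python
ascii_only = "Encoding with special character $\\mu$"

# Pure ASCII text yields exactly the same bytes as UTF-8.
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")


def can_encode(text: str, codec: str) -> bool:
    """Return True if `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Standard LaTeX spellings that stay within ASCII (illustrative subset).
latex_form = {"Ä": '{\\"A}', "Æ": "{\\AE}"}

print(same_bytes)
print(can_encode("Ä", "ascii"), can_encode("Ä", "utf-8"))
print(all(can_encode(cmd, "ascii") for cmd in latex_form.values()))
```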

The second problem was mostly that the 'LaTeX encoder' failed to recognize that the dollar signs had already opened a math environment, which made \ensuremath unnecessary. The issue with μ is more complex than it seems: on the LaTeX side there are \mu, \textmu, and \textmugreek, and on the Unicode side there are the Greek letter μ (U+03BC), the 'micro' symbol 'µ' (U+00B5), and some more special 'mu' symbols. The mapping between both sides is not obvious. KBibTeX's guesswork on the mapping has hopefully been improved now.
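
As a rough illustration only (a Python sketch, not the actual KBibTeX encoder; `MATH_SYMBOLS` and `encode_latex` are hypothetical names, and only Greek μ is mapped), an encoder can track whether it is between dollar signs and add \ensuremath only outside math mode. The last lines show why the two Unicode characters are easy to confuse.

```python
import unicodedata

# Hypothetical one-entry table; a real mapping would be much larger.
MATH_SYMBOLS = {"\u03bc": r"\mu"}  # GREEK SMALL LETTER MU


def encode_latex(text: str) -> str:
    """Replace mapped characters; wrap in \\ensuremath only outside $...$."""
    out = []
    in_math = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Copy a backslash and the next character unchanged,
            # so an escaped dollar sign does not toggle math mode.
            out.append(text[i:i + 2])
            i += 2
            continue
        if ch == "$":
            in_math = not in_math
            out.append(ch)
        elif ch in MATH_SYMBOLS:
            cmd = MATH_SYMBOLS[ch]
            out.append(cmd if in_math else r"\ensuremath{" + cmd + "}")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(encode_latex("Encoding with special character $\u03bc$"))
print(encode_latex("5 \u03bcm"))

# The two 'mu' code points are distinct characters ...
print(unicodedata.name("\u03bc"), "/", unicodedata.name("\u00b5"))
# ... but compatibility normalization folds the micro sign into Greek mu.
print(unicodedata.normalize("NFKC", "\u00b5") == "\u03bc")
```

Because text that already contains \mu passes through unchanged, repeated encoding is stable in this sketch.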

Those changes have been integrated both into the 'master' code (not yet pushed at the time of writing) and into a bugfix branch based on 'kbibtex/0.10'. The bugfix branch contains the minimum changes necessary to fix the bug; the 'master' changes include additional refactoring that does not belong in an almost-stable branch.
So, please check the bugfix branch first. I would like to refine the commits for the 'master' based on the feedback I receive here:
Comment 4 Bug Janitor Service 2020-12-11 04:34:02 UTC
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
Comment 5 Thomas Fischer 2020-12-23 19:21:52 UTC
Git commit 1e649222ed54060eb561fcc5b70568ba7f6098fb by Thomas Fischer.
Committed on 23/12/2020 at 19:18.
Pushed by thomasfischer into branch 'kbibtex/0.10'.

Improving recognizing encoding of a BibTeX file to load

Drawing from commits in the 'master' branch in order to improve
recognizing BibTeX/BibLaTeX files' encoding.
- 8e473758e99f30cf3d61fa0b1: Guessing file's encoding based on bit patterns
- fba235cf5d0494b8189a1fca3: Refactoring FileImporterBibTeX
FIXED-IN: 0.10

One effect is that opened ASCII-only files are not directly
classified as UTF-8, but stay ASCII-encoded as far as possible by
classifying them as 'LaTeX'-encoded.

M  +2    -2    src/io/fileimporter.h
M  +177  -37   src/io/fileimporterbibtex.cpp
M  +2    -0    src/io/fileimporterbibtex.h
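
The classification described in the commit message could look roughly like the following Python sketch. It is an assumption-laden illustration, not the code in fileimporterbibtex.cpp: the function names, the "unknown" fallback, and the simplified UTF-8 check (overlong forms and surrogates are not rejected) are all made up for this example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Check lead/continuation byte bit patterns (simplified UTF-8 test)."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # 0xxxxxxx: single-byte (ASCII)
            need = 0
        elif b >> 5 == 0b110:     # 110xxxxx: one continuation byte follows
            need = 1
        elif b >> 4 == 0b1110:    # 1110xxxx: two continuation bytes follow
            need = 2
        elif b >> 3 == 0b11110:   # 11110xxx: three continuation bytes follow
            need = 3
        else:
            return False
        tail = data[i + 1:i + 1 + need]
        # Every continuation byte must match 10xxxxxx.
        if len(tail) != need or any(t >> 6 != 0b10 for t in tail):
            return False
        i += 1 + need
    return True


def guess_encoding(data: bytes) -> str:
    """Classify raw file bytes; ASCII-only input stays 'LaTeX'-encoded."""
    if all(b < 0x80 for b in data):
        return "LaTeX"
    if looks_like_utf8(data):
        return "UTF-8"
    return "unknown"  # hypothetical fallback for legacy 8-bit encodings


print(guess_encoding(b"title = {Encoding with special character $\\mu$}"))
print(guess_encoding("title = {\u03bc}".encode("utf-8")))
print(guess_encoding(b"title = {\xb5}"))
```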