460315 – Sorting in output file

Bug 460315 - Sorting in output file

Summary: Sorting in output file

Status:	REOPENED

Alias:	None

Product:	KBibTeX
Classification:	Applications
Component:	Loading/saving files (other bugs)
Version First Reported In:	0.10
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Thomas Fischer

URL:
Keywords:

Depends on:
Blocks:	460316
	Show dependency tree / graph

Reported:	2022-10-12 14:36 UTC by nobodyinperson
Modified:	2022-10-22 20:52 UTC (History)
CC List:	0 users

See Also:
Latest Commit:	https://invent.kde.org/office/kbibtex/commit/eb510bdfdc46afaa520a9aa71c640c881ea6eaba
Version Fixed/Implemented In:	0.11
Sentry Crash Report:

Attachments
Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description nobodyinperson 2022-10-12 14:36:10 UTC

SUMMARY

It would be nice to have an option (or defaulting) to sorting the BibTeX output file by identifier and the keys as well.

STEPS TO REPRODUCE
1. open an unsorted BibTeX file with KBibTeX
2. save it@comment{x-kbibtex-encoding=utf-8}

OBSERVED RESULT

BibTeX file maintains original unsorted state

EXPECTED RESULT

BibTeX file is sorted alphabetically by identifier and keys within an entry as well

ADDITIONAL INFORMATION

Benefits from this:

- Easier diffing when tracking the BibTeX file
- Consistency when working with multiple people on a BibTeX file
- Takes the much of the arbitrariness away from the file formatting (same as strict code formatters like Python's black e.g.)

I guess this is just a couple of lines of sorting in fileexporterbibtex.cpp but I got really confused browsing through the code as I don't know any Qt... 😔

I don't know if people might get annoyed if KBibTeX sorts their files, maybe a checkbox in the File Settings widget could help. We would need a way to persist this setting though, maybe a preamble like `@comment{x-kbibtex-encoding=utf-8}` for the file encoding could be used for that (e.g. `@comment{x-kbibtex-sorting=ascending|descending}`).

Comment 1 nobodyinperson 2022-10-12 14:38:43 UTC

Wow seems you can't edit the bug description here... the second step-to-reproduce was obviously a copy-paste fail. 🤦

Comment 2 Thomas Fischer 2022-10-12 20:27:42 UTC

I agree that this is a useful feature. I was able to make a basic implementation, but I want to check it first with a fresh pair of eyes before pushing it to the Git repo.

Regarding the source code, you were on the right track. To have a sorted output, one has to look into FileExporterBibTeX::saveAsString(.., const File *). There, I introduced
const File *_bibtexfile = sortedByIdentifier ? File::sortByIdentifier(bibtexfile) : bibtexfile;
... and use a search&replace to make use of _bibtexfile instead of the original bibtexfile in the remainder of this function.
File::sortByIdentifier(..) is a new function that basically employs std::sort(..) with a custom comparison function to get the desired result.

To make it more useful, the code expands on FileImporterBibTeX::Private::Statistics to track if two subsequent entries' ids are in lexicographic order. When opening a BibTeX file, based on this counter it is guessed if this file is "sorted" or not. This information is tracked as a file property and then used when saving a file as described above.

The property can also toggled in the per-file properties widget. Activating it will only have an effect when writing. Deactivation has no practical effect, there is no shuffling in this case ;-)

Comment 3 nobodyinperson 2022-10-13 11:32:34 UTC

That sound absolutely awesome, thank you Thomas! Looking forward to 
trying it! 🙂

Am 12.10.22 um 22:27 schrieb Thomas Fischer:
> https://bugs.kde.org/show_bug.cgi?id=460315
>
> Thomas Fischer <fischer@unix-ag.uni-kl.de> changed:
>
>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>       Ever confirmed|0                           |1
>               Status|REPORTED                    |ASSIGNED
>
> --- Comment #2 from Thomas Fischer <fischer@unix-ag.uni-kl.de> ---
> I agree that this is a useful feature. I was able to make a basic
> implementation, but I want to check it first with a fresh pair of eyes before
> pushing it to the Git repo.
>
> Regarding the source code, you were on the right track. To have a sorted
> output, one has to look into FileExporterBibTeX::saveAsString(.., const File
> *). There, I introduced
> const File *_bibtexfile = sortedByIdentifier ?
> File::sortByIdentifier(bibtexfile) : bibtexfile;
> ... and use a search&replace to make use of _bibtexfile instead of the original
> bibtexfile in the remainder of this function.
> File::sortByIdentifier(..) is a new function that basically employs
> std::sort(..) with a custom comparison function to get the desired result.
>
> To make it more useful, the code expands on
> FileImporterBibTeX::Private::Statistics to track if two subsequent entries' ids
> are in lexicographic order. When opening a BibTeX file, based on this counter
> it is guessed if this file is "sorted" or not. This information is tracked as a
> file property and then used when saving a file as described above.
>
> The property can also toggled in the per-file properties widget. Activating it
> will only have an effect when writing. Deactivation has no practical effect,
> there is no shuffling in this case ;-)
>

Comment 4 Thomas Fischer 2022-10-14 20:57:20 UTC

(In reply to nobodyinperson from comment #3)
> That sound absolutely awesome, thank you Thomas! Looking forward to 
> trying it! 🙂

So, I had to fix some small issues here and there and added a test function as well.
You can see the current commit here:
https://invent.kde.org/thomasfischer/kbibtex/commit/658106cd9a5cb50320ab557f881a17c16da1b4a2

You can test the code by following the instructions here:
https://userbase.kde.org/KBibTeX/Development#Quick_Start_to_Run_KBibTeX_from_Git
with the adjusted script invocation:
bash run-kbibtex.sh 'https://invent.kde.org/thomasfischer/kbibtex.git' 'bugs/kde460315'

Comment 5 nobodyinperson 2022-10-15 23:38:55 UTC

Seems to work very well! Also very nice the fuzzy detection of the "sortedness" of a file! 👍 Thank you very much, Thomas!

Comment 6 Thomas Fischer 2022-10-16 12:32:02 UTC

(In reply to nobodyinperson from comment #5)
> Seems to work very well! Also very nice the fuzzy detection of the
> "sortedness" of a file! 👍 Thank you very much, Thomas!
Nice to hear ☺
I guess that means I can merge the change and close this bug report.

Comment 7 Thomas Fischer 2022-10-16 12:32:21 UTC

Git commit eb510bdfdc46afaa520a9aa71c640c881ea6eaba by Thomas Fischer.
Committed on 16/10/2022 at 12:27.
Pushed by thomasfischer into branch 'master'.

BibTeX file property: sort by identifier

A new property can be set for BibTeX files: whether elements (foremost
entries and macros) should be sorted by their identifier or key,
respectively.

This patch includes the following aspects to implement this feature:
- when importing a BibTeX file, the Statistics struct counts how often
  two consecutive entries/macros in an imported file have identifiers/
  keys in sorted order and how often not. Depending on the result, the
  SortedByIdentifier flag is set true or false.
- when exporting a BibTeX file and the SortedByIdentifier flag is set,
  the File list will be copied and then sorted by identifier and key,
  respectively before writing to disk.
- The file settings docklet and the properties dialog at "Save As"
  operations allow to set the SortedByIdentifier flag.
- Sorting is implemented as a static function in class File, in case it
  will be used in other cases like other exporters in the future.
- In KBibTeXDataTest, a new test ('sortFileByIdentifier') has been added
  to test the sorting functionality.

Unsetting the SortedByIdentifier flag has no immediate consequence,
there is no shuffling made.

M  +53   -1    src/config/preferences.cpp
M  +13   -1    src/config/preferences.h
M  +7    -0    src/config/preferences.json
M  +5    -0    src/data/element.cpp
M  +2    -0    src/data/element.h
M  +41   -0    src/data/file.cpp
M  +8    -0    src/data/file.h
M  +9    -0    src/gui/widgets/filesettingswidget.cpp
M  +1    -0    src/gui/widgets/filesettingswidget.h
M  +18   -5    src/io/fileexporterbibtex.cpp
M  +19   -3    src/io/fileimporterbibtex.cpp
M  +105  -0    src/test/kbibtexdatatest.cpp

https://invent.kde.org/office/kbibtex/commit/eb510bdfdc46afaa520a9aa71c640c881ea6eaba

Comment 8 Thomas Fischer 2022-10-17 18:20:28 UTC

I forgot to run one test that needs manual configuration. It turns out, the threshold to recognize whether a partially sorted file is to be sorted when writing it to disk was too low.
In file
src/io/fileimporterbibtex.cpp
at line 1297, I would like the multiplicator from 3 to 10.
This basically means that are for a file to be recognized as sorted, it must be already somewhat sorted, but to a larger extend than with the previous value of 3.
Please check if the value of 10 is still ok for you.

Comment 9 nobodyinperson 2022-10-18 12:17:40 UTC

10 still works for me 👍

Comment 10 Thomas Fischer 2022-10-22 20:52:48 UTC

Git commit f3f7b16ac93eab91033a20f7b7a227ca98dfd187 by Thomas Fischer.
Committed on 22/10/2022 at 20:52.
Pushed by thomasfischer into branch 'master'.

Making detection of sorting less sensitive

Previous commit eb510bdfdc46afaa520a9 introduced a heuristic to detect
whether entries and macros in a BibTeX file where (partially) sorted and
then tried to impose a complete sorting on the file when writing back
to disk.
As it turned out, the chosen threshold for detecting sorting was too
sensitive thus sorting bibliographies that are not meant to be (forcefully)
sorted, e.g. to keep the difference between two Git commits small.

This commit here decreases the sensitivity to a (still arbitrary) value
that seems to be a good compromise that still detects (partially) sorted
bibliographies, but does not errornously assess almost unsorted
bibliographies as sorted.

M  +1    -1    src/io/fileimporterbibtex.cpp

https://invent.kde.org/office/kbibtex/commit/f3f7b16ac93eab91033a20f7b7a227ca98dfd187