Bug 392877 - balooctl index command with large number of files corrupts db
Status: RESOLVED FIXED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: balooctl
Version: 5.44.0
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: baloo-bugs-null
Reported: 2018-04-08 14:06 UTC by Stefan Brüns
Modified: 2021-08-12 14:18 UTC

Description Stefan Brüns 2018-04-08 14:06:21 UTC
When running "balooctl index <folder>/*", (at least) the contents of the terms db are garbage afterwards.

Running "for f in <folder>/*; do balooctl index "$f" ; done" does not have this effect and even restores the db to a good state.

The latter obviously creates one write transaction per file, while the former uses a single write transaction for all files.
Comment 1 Michael Heidelbach 2018-04-15 16:01:09 UTC
(In reply to Stefan Brüns from comment #0)
> When running "balooctl index <folder>/*", (at least) the contents of the
> terms db are garbage afterwards.
I could not reproduce this. Maybe I overlooked the "garbage"; could you please specify what you saw?
balooctl index <folder>/* => balooshow -x b.epub >multi.txt
balooctl disable + enable
for f in <folder>/*; do balooctl index "$f" ; done => balooshow -x b.epub >single.txt

diff multi.txt single.txt 
14d13
< title: buddenbrooks einer familie verfall
15a15
> title: buddenbrooks einer familie verfall
Comment 2 Stefan Brüns 2018-05-29 23:47:47 UTC
Git commit e1d1b7e87ff1e8ce6a7e03ecdf2902322cb8624a by Stefan Brüns.
Committed on 29/05/2018 at 23:47.
Pushed by bruns into branch 'master'.

Avoid crash when reading corrupt data from document terms db

Summary:
The terms db contains terms, where each term is stored independently
(terminated with 0), or as a suffix to the previous term (terminated with
1).
In case of corrupted data, the first terminator seen may be a 1, which
leads to a crash when trying to access the previous term with
QVector<>::last().
Show a debug message, to give a hint about the bad data, which can be
fixed by reindexing the relevant file.
Related: bug 392878

Test Plan:
Corrupt the database
Run balooshow -x <affected file(s)>

Reviewers: #baloo, michaelh, ngraham, #frameworks, dhaumann

Reviewed By: dhaumann

Subscribers: dhaumann, kde-frameworks-devel, #frameworks

Tags: #frameworks, #baloo

Differential Revision: https://phabricator.kde.org/D12047

M  +5    -0    src/codecs/doctermscodec.cpp
M  +5    -1    src/engine/documentdb.cpp

https://commits.kde.org/baloo/e1d1b7e87ff1e8ce6a7e03ecdf2902322cb8624a
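
The term layout described in the commit summary above can be illustrated with a small sketch. This is Python rather than Baloo's C++, and it is not the code from doctermscodec.cpp: the function name decode_terms is hypothetical, it assumes that the bytes 0 and 1 never occur inside a term, and returning an empty list on bad input is a choice made here, since the report does not say what the real code returns.

```python
import sys
from typing import List


def decode_terms(data: bytes) -> List[bytes]:
    """Decode a term list as described in the commit summary.

    Hypothetical helper for illustration only, not Baloo's API.
    Assumption: bytes 0 and 1 are terminators and never part of a term.
    """
    terms: List[bytes] = []
    start = 0
    for pos, byte in enumerate(data):
        if byte > 1:
            continue  # ordinary term byte, keep scanning
        chunk = data[start:pos]
        start = pos + 1
        if byte == 0:
            # Terminator 0: the chunk is a complete, independent term.
            terms.append(chunk)
        elif terms:
            # Terminator 1: the chunk is a suffix of the previous term.
            terms.append(terms[-1] + chunk)
        else:
            # Terminator 1 with no previous term: the data is corrupt.
            # Report it instead of touching a non-existent last element.
            print("corrupt term data: suffix without a preceding term",
                  file=sys.stderr)
            return []
    # Bytes after the last terminator are ignored in this sketch.
    return terms
```

With this sketch, decode_terms(b"book\x00s\x01") gives [b"book", b"books"], while decode_terms(b"s\x01") prints a message to stderr and gives an empty list, which corresponds to the case where the first terminator seen is a 1.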
Comment 3 Nate Graham 2020-10-26 16:47:27 UTC
Is this 100% fixed now? Or is there still anything left to do?
Comment 4 tagwerk19 2021-08-11 21:45:02 UTC
(In reply to Nate Graham from comment #3)
> Is this 100% fixed now? Or is there still anything left to do?
I checked this on an index of 2,000,000 files (recorded Go games), using balooctl to clear the entries for 10,000 files:

    balooctl clear 2016-01*

and reindex them

    balooctl index 2016-01*

in one transaction. This was done on a system with constrained RAM, so the transaction filled the RAM and extended into swap. It completed, and the index was seemingly OK.

Can never be sure, but good enough?
Comment 5 Nate Graham 2021-08-12 14:18:18 UTC
Cool.