Bug 363269

Summary: Crash when projects contains *.txt file that is actually a binary file [cmListFileLexerSetToken, CMakeListsParser::readCMakeFile]
Product: [Applications] kdevelop Reporter: Maciej Cencora <m.cencora>
Component: Build tools: CMakeAssignee: kdevelop-bugs-null
Status: RESOLVED FIXED    
Severity: crash CC: aleixpol, asiasuppenesser, axel.kellermann, mail, NightNord
Priority: NOR Keywords: junior-jobs
Version: git master   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In: 5.1.2
Sentry Crash Report:
Attachments: Backtrace from ASAN
Example crashing file
Minimal UTF-16 file to trigger behavior
Proposal for patch

Description Maciej Cencora 2016-05-19 11:01:24 UTC
For some reason the project I'm working on have a few zip files with *.txt filename extension.

KDevelop is incorrectly trying to parse these as cmake files (LanguageController incorrectly detects it by matching only filename extension), and then cmake parser crashes in lexer.
Not all binary *.txt files causes the crash. Unfortunately I cannot attach the one that triggers the error.

*** Error in `/usr/bin/kdevelop': free(): invalid next size (fast): 0x00007fff900c8da0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x77725)[0x7ffff4dea725]
/lib/x86_64-linux-gnu/libc.so.6(+0x7ff4a)[0x7ffff4df2f4a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7ffff4df6abc]
/usr/lib/x86_64-linux-gnu/libkdevcmakecommon.so(+0xe34e)[0x7fffc2d9734e]
/usr/lib/x86_64-linux-gnu/libkdevcmakecommon.so(+0xc6db)[0x7fffc2d956db]
/usr/lib/x86_64-linux-gnu/libkdevcmakecommon.so(cmListFileLexer_Scan+0x37)[0x7fffc2d97863]
/usr/lib/x86_64-linux-gnu/libkdevcmakecommon.so(_ZN16CMakeListsParser13readCMakeFileERK7QString+0x242)[0x7fffc2d98733]
/usr/lib/x86_64-linux-gnu/qt5/plugins/kdevplatform/25/kdevcmakemanager.so(+0x46784)[0x7fffc2b42784]
/usr/lib/x86_64-linux-gnu/libKF5ThreadWeaver.so.5(_ZN12ThreadWeaver11IdDecorator3runE14QSharedPointerINS_12JobInterfaceEEPNS_6ThreadE+0x60)[0x7fffead6e650]
/usr/lib/x86_64-linux-gnu/libKF5ThreadWeaver.so.5(_ZN12ThreadWeaver8Executor3runERK14QSharedPointerINS_12JobInterfaceEEPNS_6ThreadE+0x50)[0x7fffead6ec90]
/usr/lib/x86_64-linux-gnu/libKF5ThreadWeaver.so.5(_ZN12ThreadWeaver3Job7executeERK14QSharedPointerINS_12JobInterfaceEEPNS_6ThreadE+0x40)[0x7fffead6d7e0]
/usr/lib/x86_64-linux-gnu/libKF5ThreadWeaver.so.5(_ZN12ThreadWeaver6Thread3runEv+0x8a)[0x7fffead6d28a]
/usr/lib/x86_64-linux-gnu/libQt5Core.so.5(+0xa584e)[0x7ffff556b84e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76fa)[0x7fffee3406fa]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7ffff4e79b5d]




Reproducible: Sometimes

Steps to Reproduce:
1. create cmake project
2. put there some zip files
3. rename the zip files to have *.txt extension
4. let the background parser do its job

Actual Results:  
Crash

Expected Results:  
No crash
Comment 1 Kevin Funk 2016-05-19 11:26:07 UTC
Debugging symbols please?
Comment 2 Maciej Cencora 2016-05-19 11:30:15 UTC
The attached backtrace already contains kdevelop's debugging symbols
Comment 3 Maciej Cencora 2016-05-19 11:37:12 UTC
/usr/lib/x86_64-linux-gnu/libkdevcmakecommon.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=b2ecb624afc81a60ca5fa413e6e9b62c47c9cfec, not stripped
/usr/lib/x86_64-linux-gnu/qt5/plugins/kdevplatform/25/kdevcmakemanager.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=41a85cabcd81f5ae083137f9e3295af248bf3450, not stripped

I can provide backtrace from gdb, but that will tell you the same that crash happened in cmListFileLexer_Scan
Comment 4 Milian Wolff 2016-05-24 21:54:52 UTC
you are missing the attachement, I think. I cannot see it at least.
Comment 5 Maciej Cencora 2016-05-25 10:40:51 UTC
Created attachment 99166 [details]
Backtrace from ASAN
Comment 6 Maciej Cencora 2016-05-25 10:55:36 UTC
The backtrace (generated by libc) is in the first comment.

Now that I wanted to get the backtrace in gdb, I get crashes in other (unrelated) places.
This is caused by heap corruption, that is detected usually in next free() or realloc() call that can happen in completely different code then the one that caused the memory corruption.

I've recompiled kdevelop with address sanitizer, and now I get reliable bactraces (attached).

Following added assertion in  "cmListFileLexerSetToken" fails:
assert(strlen(text) == length)
Comment 7 Milian Wolff 2016-05-25 11:29:51 UTC
thanks, that's much more helpful. Note that libc backtraces don't contain the useful debug information like line number + file. The ASAN build shows everything we need though, ty!
Comment 8 Milian Wolff 2016-05-25 11:32:51 UTC
Personally, I think we could resolve this issue in many ways. The best would probably be to do the expensive mime-type check inside the parse job's run method (i.e. in the background thread) to ensure we are not trying to parse binary files. That would guard us against this issue here, and would not lockup the UI thread by doing the mimetype check.
Comment 9 Milian Wolff 2016-05-25 11:34:21 UTC
Oh and if it's not clear: the crash is inside generated code (flex? bison? yacc?), so we'd need to fix these tools, or potentially the input file, to prevent that from happening.
Comment 10 Aleix Pol 2016-05-25 23:47:33 UTC
Git commit 95497835ee1c721648d34667dd4147d339e8e4c4 by Aleix Pol.
Committed on 25/05/2016 at 23:38.
Pushed by apol into branch '5.0'.

Update the cmake lexer to cmake 3.5.1

Also generate the file with a newer flex version.

M  +1    -1    projectmanagers/cmake/CMakeLists.txt
A  +2761 -0    projectmanagers/cmake/parser/cmListFileLexer.c     [License: GENERATED FILE]  *
D  +0    -2351 projectmanagers/cmake/parser/cmListFileLexer.cpp
M  +33   -24   projectmanagers/cmake/parser/cmListFileLexer.h
M  +224  -27   projectmanagers/cmake/parser/cmListFileLexer.in.l
M  +1    -1    projectmanagers/cmake/parser/cmakelistsparser.cpp

The files marked with a * at the end have a non valid license. Please read: http://techbase.kde.org/Policies/Licensing_Policy and use the headers which are listed at that page.


http://commits.kde.org/kdevelop/95497835ee1c721648d34667dd4147d339e8e4c4
Comment 11 Aleix Pol 2016-05-25 23:58:15 UTC
I'm trying to reproduce but I didn't manage. I randomly put a something.txt binary file but I didn't get a crash.

Am I missing something?
Comment 12 Maciej Cencora 2016-05-27 15:35:55 UTC
(In reply to Aleix Pol from comment #11)
> I'm trying to reproduce but I didn't manage. I randomly put a something.txt
> binary file but I didn't get a crash.
> 
> Am I missing something?

Make sure that KDevelop actually tries to parse the file. You can force it by renaming the filename extension to *.cmake.
If you still won't be able to repro this, I'll try to check what exact byte sequence triggers the crash.
Comment 13 Kevin Funk 2016-06-01 09:55:14 UTC
@Maciej: Well, have you been able to reproduce the issue with above patch applied?
Comment 14 Maciej Cencora 2016-06-02 14:54:00 UTC
Created attachment 99326 [details]
Example crashing file
Comment 15 Maciej Cencora 2016-06-02 14:57:24 UTC
Attached file causes a crash of kdevelop with ASAN enabled. Since it is only 200 bytes, It most probably won't crash a regular kdevelop build.

The crash still happens with Aleix parser upgrade.
Comment 16 Kevin Funk 2016-09-08 21:36:00 UTC
*** Bug 367841 has been marked as a duplicate of this bug. ***
Comment 17 Kevin Funk 2016-09-08 21:36:45 UTC
Some comments from the dup:

"""
I found the problem, and it's actually similar, but not the same as bug 363269. The code base being parsed contains a subfolder with example text in a couple of different asian languages, all stored as .txt files in UTF16LE. The cmake parser seems to not just read CMakeLists.txt and *.cmake, but also generally all files with the extension .txt. Apparently the encoding of the example files is handled correctly, but then the wide characters are given to the lexer that expects chars, hence the one byte buffer overflow reported (just my guess, didn't have a look at all the code involved).

If I delete all the UTF16 files, the whole code base is processed successfully.
"""
Comment 18 Night Nord 2016-09-09 13:41:41 UTC
The crash still happens on todays (09.09.2016) 5.0 branch with UTF-16 txt files. ASAN trace is the same as the one already attached. Without asan it will crash 100% times either during parsing or when saving the cache, so it pretty much blocks any work with such projects.

Strangely enough, I don't see why it's trying to parse random *.txt files (in my case it was 'utf-16.txt' file from reviewboard testdata) - a quick glance at the code says it should only try to parse 'CMakeFiles.txt' or '*.cmake'.
Comment 19 Axel Kellermann 2016-10-01 20:13:31 UTC
I had some time to kill today, so I took another look at why random .txt files get processed by the background parser. If I understand the code flow correctly, it happens like this:
- on startup the project folder is recursively scanned for all contained files
- every found file is wrapped in a ProjectFileItem (abstractfilemanagerplugin.cpp:234) and added to the projects file set
- the BackgroundParser then creates a ParseJob for every ProjectFileItem (in BackgroundParser::parseDocumentsInternal())
- in BackgroundParser::createParseJob(), the language to use for the current ProjectFileItem is queried by a call to m_languageController->languagesForUrl(qUrl)
- LanguageController::languagesForUrl() in turn checks an internal mimeTypeCache and fileExtensionCache for the language to use for the given file
- in my case, the fileExtensionCache contains an entry that resolves the extension txt to language CMake
- that means in backgroundparser.cpp:357 I get back a CMakeManager, that is then used to create a cmake parse job

Is that expected behaviour or some side effect that has to be fixed? I'm asking this because there are multiple cmake specific spots in the code that explicitly filter for "CMakeLists.txt" and "*.cmake", and that are circumvented by this implicit behaviour.
Comment 20 Sven Brauch 2016-10-01 20:36:08 UTC
Thanks for looking. You are indeed right -- .txt files are treated as cmake in some cases. I also poked around a bit, when it does not find a language for a file it looks at the contents, guesses the mime type, and caches that. So when you open CMakeLists.txt, after that all .txt files will be treated as CMake. That is a bug, and needs to be fixed. The relevant code is, I think, in languagecontroller.cpp:310ff.
Comment 21 Sven Brauch 2016-10-13 18:53:02 UTC
Git commit 77b83054f943b6c9bce0da178732f7992f7ada3b by Sven Brauch.
Committed on 13/10/2016 at 18:52.
Pushed by brauch into branch '5.0'.

Remove mime type <-> extension cache

The idea that because one file with extension X has mime type A,
determined by its contents, hence other files with extension X will
have the same mime type is just wrong. One common example where this
breaks in a spectacular way is CMakeLists.txt and the .txt extension.

I found the claim that looking into each file will make the application
unresponsive to be unfounded. QMimeType will only read the first 16kB to
guess the mime type, which takes less than a millisecond for each file.
A test project with three hundred 3 MB binary blobs still loads instantly.
If, in comparison, we parse one of the files as CMake erraneously, we
take multiple seconds.
Differential Revision: https://phabricator.kde.org/D3042

M  +1    -35   shell/languagecontroller.cpp

http://commits.kde.org/kdevplatform/77b83054f943b6c9bce0da178732f7992f7ada3b
Comment 22 Sven Brauch 2016-10-13 18:54:44 UTC
Note that while above commit should fix the crash in this case, this bug is still open, since it would be back if you would rename the file to "CMakeLists.txt" ... I think.
Comment 23 Kevin Funk 2016-11-17 09:36:32 UTC
*** Bug 372575 has been marked as a duplicate of this bug. ***
Comment 24 Axel Kellermann 2017-07-25 22:05:33 UTC
Created attachment 106865 [details]
Minimal UTF-16 file to trigger behavior
Comment 25 Axel Kellermann 2017-07-25 22:06:23 UTC
Created attachment 106866 [details]
Proposal for patch
Comment 26 Axel Kellermann 2017-07-25 22:06:45 UTC
I had another look at the lexer and I think I pinned down the problem with UTF-16 files in cmListFileLexer.c. Scanning the files works fine, but copying the scanned content into the token structure in cmListFileLexerSetToken() is where things go wrong. The code that copies the content uses the functions strcpy() and strdup() that are meant to be used only with zero-terminated character arrays. As we handle two-byte characters, where the most significant byte can be zero (see e.g. letter 'h' with 0x0068), it's possible that the text to be copied to token->text contains zero-bytes mid-string. In that case strdup() doesn't do what it's intended to do. It interprets the buffer as zero-terminated string and only duplicates it up to the first occurence of '\0'. At the same time the original buffer size is stored in token->length. This leads to out-of-bounds memory accesses later on.

I attached a simple UTF-16 file that reliably triggers the problem for me (363269_repro.txt) and a proposal for a fix that replaces the string functions with mallocs/memcpys (363269_proposal.patch). Maybe someone with more experience with the cmake parser/lexer could have a look at it.

Related questions: The patch fixes the crashes/ASAN aborts for me, but is the cmake parser really handling UTF-16 files correctly? Functions like cmListFileLexer_BOM() imply that it can handle all kinds of UTF formats, but at the same time the whole code seems to imply that it only works on zero terminated char arrays. Do we possibly need to update to a newer version of the lexer (which is generated from external sources, right?)?
Comment 27 Sven Brauch 2017-07-25 22:06:56 UTC
Does that still happen with KDevelop 5.1+? I thought I fixed it.
Comment 28 Sven Brauch 2017-07-25 22:08:12 UTC
Patch still looks sensible to me -- maybe use strncpy instead, and put it on phabricator.kde.org so somebody familiar with the code can have a look?
Comment 29 Axel Kellermann 2017-07-25 22:26:42 UTC
(In reply to Sven Brauch from comment #27)
> Does that still happen with KDevelop 5.1+? I thought I fixed it.

See comment #22. Random .txt files aren't scanned anymore, but .cmake files can still trigger the faulty behaviour. I conducted my tests with cmListFileLexer.c from master (last updated beginning of the week). But I extracted it into a small reproducer application to ease debugging, so if you fixed it somewhere in the cmake parser outside of cmListFileLexer.c, maybe we don't have to do anything.
Comment 30 Axel Kellermann 2017-07-25 22:29:23 UTC
(In reply to Sven Brauch from comment #28)
> Patch still looks sensible to me -- maybe use strncpy instead, and put it on
> phabricator.kde.org so somebody familiar with the code can have a look?

strncpy()/strndup() won't work either. Both copy up to n bytes, but also end processing on the occurence of the first '\0' (at least to my understanding).

I'll have a look at phabricator. Never used it before...
Comment 31 Sven Brauch 2017-07-25 22:53:19 UTC
Ah yes, if you trick the mimetype detection, you will still crash. You are also right about strncpy.

phabricator should be simple enough to use, just log in (with identity.kde.org), click differential, then click new diff and paste your diff.
Comment 32 Axel Kellermann 2017-07-26 11:40:03 UTC
(In reply to Sven Brauch from comment #31)
> Ah yes, if you trick the mimetype detection, you will still crash. You are
> also right about strncpy.
> 
> phabricator should be simple enough to use, just log in (with
> identity.kde.org), click differential, then click new diff and paste your
> diff.

OK, I added the diff to phabricator:
https://phabricator.kde.org/D6924
Comment 33 Sven Brauch 2017-07-28 07:26:50 UTC
Git commit 3fde9551d5870229d42cc1a61f152fc8d0876b71 by Sven Brauch, on behalf of Axel Kellermann.
Committed on 27/07/2017 at 22:28.
Pushed by brauch into branch '5.1'.

cmListFileLexer: use memcpy instead of strcpy to copy buffer contents to tokens

This fixes a crash when the input is garbage (e.g. binary file which
is not cmake code at all).
Differential Revision: https://phabricator.kde.org/D6924

M  +7    -3    projectmanagers/cmake/parser/cmListFileLexer.c

https://commits.kde.org/kdevelop/3fde9551d5870229d42cc1a61f152fc8d0876b71