Created attachment 70041: zipped file of a single exported message
Since 4.7, the SpamAssassin command "sa-learn --spam --mbox" cannot learn spam messages that kmail stores in mbox format. sa-learn always reports:
Learned tokens from 0 message(s) (0 message(s) examined)
sa-learn still works on older mbox files. I noticed some differences between former and current mbox exports: there are now two empty lines between two messages. Removing the empty lines did not solve the problem. The same error occurs in 4.8.0 and 4.8.1.
I confirm it.
Created attachment 71117: Message saved from kmail Version 1.13.6
This message was received and saved with KMail Version 1.13.6 under KDE 4.6.00 (4.6.0) "release 6". sa-learn --spam --mbox reports:
Learned tokens from 1 message(s) (1 message(s) examined)
Which is what we expect!
Created attachment 71118: Message saved from kmail Version 4.8.3
This message was received and stored with kmail2 Version 4.8.3. sa-learn --spam --mbox gives:
Learned tokens from 0 message(s) (0 message(s) examined)
Which we do not expect. Both sa-learn checks were run on the same computer with SpamAssassin 3.3.1!
There is a simple difference in the messages which causes the problem. kmail1 used the following format for the "From" line:
From thomas@arend-rhb.de Tue May 15 22:01:41 2012
kmail2 uses the format:
From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200
If I change the "From" line to the older format, sa-learn can learn the messages! The kmail1 format is not in accordance with RFC 822, while the kmail2 format is. Going by the standard, the problem is with SpamAssassin.
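As a stopgap, the manual edit described above can be scripted. The following is only a sketch in Python: the function name to_ctime_from_line and the regular expression are my own assumptions, not code from kmail or SpamAssassin. It keeps the wall-clock time as written and drops the UTC offset, which is what the hand-edited line does.

```python
import re
import time
from email.utils import parsedate_to_datetime

# Assumed pattern for the kmail2-style separator:
# "From <address> <RFC 822 style date with numeric offset>"
RFC822_FROM = re.compile(
    r"^From (\S+) (\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4})$"
)


def to_ctime_from_line(line: str) -> str:
    """Rewrite a kmail2-style From_ line to ctime style.

    Lines that do not look like a kmail2 separator are returned unchanged.
    """
    match = RFC822_FROM.match(line)
    if match is None:
        return line
    sender, date_text = match.groups()
    try:
        moment = parsedate_to_datetime(date_text)
    except (TypeError, ValueError):
        # Not a parsable date after all; leave the line alone.
        return line
    # timetuple() keeps the clock time as written in the line;
    # time.asctime() formats it without using the locale.
    return f"From {sender} {time.asctime(moment.timetuple())}"


print(to_ctime_from_line("From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200"))
```

Running each line of an exported mbox through such a function before calling sa-learn would only paper over the problem. It does not touch lines that are already in the old format.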
See the bug report filed with SpamAssassin, #6703 (https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703). See also http://tools.ietf.org/html/rfc4155, which states that "A comprehensive description of mbox database files on UNIX-like systems can be found at http://qmail.org./man/man5/mbox.html, which should be treated as mostly authoritative ..." man 5 mbox defines that the date time stamp of the From_ line should be in ctime format. Therefore I propose to switch back to the old date time stamp format of kmail (Version 1.x).
Comment from Mark Martinec 2012-05-16 13:42:01 UTC from https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703#c6
> When saving messages with kmail 1 the From Line has following format
> which is not iaw RFC 822:
> From info@ende-18-06.com Fri Jun 17 16:03:07 2011
> With kmail 2 the format is changed to the format which is iaw RFC 822
> From thomas@arend-rhb.de Tue, 15 May 2012 22:01:41 +0200
> which is not parsed correctly by sa-learn. sa-learn --spam reports: [...]
Oh, no, not yet another incompatible mbox format!!!
> Proposed to change the behavior in a way that the old malformed From lines
> and the new correct ones are parsed.
It is the other way around, the new one differs from everybody else. The format of the mbox file (along with its separator From_ lines) is *not* governed by RFC 822 or its successors. There is no formal standard for an mbox format, the RFC 4155 comes closest: http://tools.ietf.org/html/rfc4155
See also a Wikipedia article: http://en.wikipedia.org/wiki/Mbox
RFC 4155 says:
| a timestamp indicating the UTC date and time when the message
| was originally received, conformant with the syntax of the
| traditional UNIX 'ctime' output sans timezone (note that the
| use of UTC precludes the need for a timezone indicator);
This matches qmail docs: http://qmail.org/qmail-manual-html/man5/mbox.html and matches Postfix and sendmail's local delivery agent.
To accommodate the new incompatible format it seems that the two instances of a regexps in ArchiveIterator.pm need to be extended, or just relaxed. Not sure if the date would still be correctly parsed. Best would be to persuade kmail folks to back off the change!
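For illustration, a From_ line of the kind the RFC 4155 passage describes (UTC, ctime syntax, no timezone indicator) could be produced roughly like this. It is a minimal Python sketch and not the kmail code; from_line is a hypothetical helper, and the sender address and time are the ones from the example in this report.

```python
import calendar
import time


def from_line(sender: str, received_epoch: float) -> str:
    """Build an mbox From_ separator with a UTC, ctime-style timestamp."""
    # time.gmtime() yields UTC; time.asctime() does not use the locale,
    # so weekday and month names stay English on any system.
    stamp = time.asctime(time.gmtime(received_epoch))
    return f"From {sender} {stamp}"


# 22:01:41 +0200 on 15 May 2012 corresponds to 20:01:41 UTC.
epoch = calendar.timegm((2012, 5, 15, 20, 1, 41))
print(from_line("thomas@arend-rhb.de", epoch))
# prints: From thomas@arend-rhb.de Tue May 15 20:01:41 2012
```

Note that this follows the UTC wording of RFC 4155. The old kmail1 lines quoted above appear to carry local time, so the two are not identical in the hour field.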
confirmed in comment #1
The bug is still in 4.9.1
I changed the name of the bug to make it clear that this is a kmail2 issue. The mbox format kmail2 uses is incompatible with the kmail1 mbox format and is a violation of RFC 4155.
kmail 4.9.2: When a message has an unknown date (for example, spam messages that store the address in the date field), the From_ line uses the local, localized date-time format:
From user@example.com So. Okt 28 00:25:18 2012
SpamAssassin cannot detect messages in such an mbox.
The error is also present in kmail2 4.9.4.
Nice try to fix this bug, but you did it totally wrong and screwed things up more than ever. The From_ line now looks as follows:
From thomas@example.com Mi. Jun 26 21:26:17 2013
Why did you change the weekday to German? Is the month now also English?
I forgot to mention the version: 4.10.4
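To check an exported mbox for separator lines like the localized ones above, something along these lines may help. It is a rough Python sketch; the pattern CTIME_FROM and the function name are assumptions of mine, and this is not the regexp SpamAssassin uses in ArchiveIterator.pm. It also flags body lines that happen to start with "From ", so treat the result as a hint only.

```python
import re

DAYS = "Mon|Tue|Wed|Thu|Fri|Sat|Sun"
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Assumed strict form: "From <address> Www Mmm dd hh:mm:ss yyyy",
# English abbreviations only, day of month space padded.
CTIME_FROM = re.compile(
    rf"^From \S+ (?:{DAYS}) (?:{MONTHS}) [ \d]\d \d{{2}}:\d{{2}}:\d{{2}} \d{{4}}$"
)


def nonconforming_from_lines(mbox_text: str) -> list:
    """Return lines that start with 'From ' but do not match CTIME_FROM."""
    return [
        line
        for line in mbox_text.splitlines()
        if line.startswith("From ") and not CTIME_FROM.match(line)
    ]


sample = "\n".join([
    "From thomas@arend-rhb.de Tue May 15 22:01:41 2012",
    "Subject: old format",
    "",
    "From user@example.com So. Okt 28 00:25:18 2012",
    "Subject: localized weekday and month",
    "",
    "From thomas@example.com Mi. Jun 26 21:26:17 2013",
    "Subject: localized weekday",
])
for bad in nonconforming_from_lines(sample):
    print(bad)
```

With the sample above, the two lines with German weekday abbreviations are reported and the kmail1-style line is not.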
patch added https://git.reviewboard.kde.org/r/116975/
Git commit d1379fda35c9809913d5ab5432461ad5a8fe92d8 by Martin Koller.
Committed on 22/03/2014 at 12:14.
Pushed by mkoller into branch 'KDE/4.13'.
write "From " delimiter line with correct dateTime format
Fix writing dateTime field in the "From " delimiter line according to RFC4155
FIXED-IN: 4.13
REVIEW: 116975
M +6 -2 kmbox/mbox_p.cpp
http://commits.kde.org/kdepimlibs/d1379fda35c9809913d5ab5432461ad5a8fe92d8