Bug 74577 - anti spam wizard gets bogofilter rules completely wrong for newer bogofilters
Status: RESOLVED FIXED
Alias: None
Product: kmail
Classification: Applications
Component: general
Version: unspecified
Platform: Compiled Sources Linux
Importance: NOR normal
Target Milestone: ---
Assignee: kdepim bugs
Reported: 2004-02-08 16:16 UTC by Richard Smith
Modified: 2007-09-14 12:17 UTC

Description Richard Smith 2004-02-08 16:16:16 UTC
Version:            (using KDE Devel)
Installed from:    Compiled sources

The anti spam wizard gets the bogofilter rules completely wrong for recent versions of bogofilter (they changed the commandline options, breaking everything, a while back).

For the classify rule, it says "bogofilter -p -e" but "bogofilter -p -e -u" is better.

For the mark as not spam rule, it says "bogofilter -N" which unmarks the message as non-spam. It should say "bogofilter -S -n" which unmarks the message as spam and marks it as non-spam.

Similarly, "bogofilter -S" should be "bogofilter -N -s".

It really also needs two new rules for marking messages received prior to setting up the filter, which would do "bogofilter -s" (mark unmarked message as spam) and "bogofilter -n" (mark unmarked message as non-spam).

The commandline options changed as of version 0.11.0, released over 11 months ago. This can be detected via the output of bogofilter -V.
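
A minimal sketch of the version check suggested above, in Python. It assumes only that the captured output of bogofilter -V contains a dotted version number somewhere; the exact wording of that output is not taken from this report, and the sample strings in the usage note are hypothetical.

```python
import re

# Version at which, according to this report, the option meanings changed.
NEW_OPTIONS_VERSION = (0, 11, 0)


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as "1.2" so they compare as (1, 2, 0).
    return parts + (0,) * (3 - len(parts))


def uses_new_options(version_output):
    """True if the reported version is 0.11.0 or later."""
    return parse_version(version_output) >= NEW_OPTIONS_VERSION
```

With a hypothetical output line such as "bogofilter version 0.10.3.1", uses_new_options() returns False; with "bogofilter version 0.11.0" it returns True.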
Comment 1 Andreas Gungl 2004-02-08 20:01:08 UTC
Subject: Re:  New: anti spam wizard gets bogofilter rules completely wrong for newer bogofilters

Sorry, I don't understand your problem and I also can't see any parameter 
changes when I compare a 0.10.x bogofilter against the current manpage at 
http://bogofilter.sourceforge.net (incl. change log).

So, instead of telling us to change A to B, please explain in detail why you 
think the changes are necessary, what the intention behind those changes 
is, and what the benefit is compared to the old configuration.

On Sunday, 8 February 2004 16:16, Richard Smith wrote:
> The anti spam wizard gets the bogofilter rules completely wrong for
> recent versions of bogofilter (they changed the commandline options,
> breaking everything, a while back).
>
> For the classify rule, it says "bogofilter -p -e" but "bogofilter -p -e
> -u" is better.
>
> For the mark as not spam rule, it says "bogofilter -N" which unmarks the
> message as non-spam. It should say "bogofilter -S -n" which unmarks the
> message as spam and marks it as non-spam.
>
> Similarly, "bogofilter -S" should be "bogofilter -N -s".
>
> It really also needs two new rules for marking messages received prior to
> setting up the filter, which would do "bogofilter -s" (mark unmarked
> message as spam) and "bogofilter -n" (mark unmarked message as non-spam).
>
> The commandline options changed as of version 0.11.0, released over 11
> months ago. This can be detected via the output of bogofilter -V.

Comment 2 Richard Smith 2004-02-09 01:22:27 UTC
Subject: Re:  anti spam wizard gets bogofilter rules completely wrong for newer bogofilters

On Sunday 08 February 2004 7:01 pm, Andreas Gungl wrote:
> Sorry, I don't understand your problem and I also can't see any parameter
> changes when I compare a 0.10.x bogofilter against the current manpage at
> http://bogofilter.sourceforge.net (incl. change log).

From CHANGES-0.11:
	2003-02-27
	* Separate message registration options from unregistration
	  options.  '-s' and '-n' register messages and '-S' and '-N'
	  unregister them.  '-S' and '-n' may be used together, as can
	  '-N' and '-s'.

From the NEWS file, in the section for revision 0.11.0:

	* Separated message registration options from unregistration
	  options.  '-S' and '-N' have been changed and now just do
	  unregistration.  To move a message from one wordlist to the
	  other, use '-S -n' or '-N -s' (as appropriate)

> So, instead of telling us to change A to B, please explain in detail why you
> think the changes are necessary, what the intention behind those changes
> is, and what the benefit is compared to the old configuration.

I wasn't telling you to change anything, merely telling you your existing 
configuration is broken, and one way to fix it. From the fact it's broken, 
the intention behind my changes and the benefit compared to the old 
configuration should be obvious.

Here's a reiteration of what I said before:

Your 'mark as spam' option is "bogofilter -S". According to the bogofilter 
manpage (the one at the website you quoted):

    The -S option tells bogofilter to undo a prior registration of the same
    message as spam. If a message was incorrectly entered in the spam wordfile
    by '-n' or '-u' and you want to remove it from the spam wordfile and enter
    it in the non-spam wordfile, use options '-Sn'. If '-S' is used for a
    message that wasn't registered as spam, the counts will still be
    decremented.[1]

IOW, -S marks a previously spam-marked message as unknown.
Similarly, -N (your mark as non-spam option) marks a previously 
not-spam-marked message as unknown. Since these only decrease the hit counts 
for words, using them will never build up any knowledge in bogofilter at all 
(in fact, what they do is to unteach it the wrong thing). I hope you can now 
see why I called them 'completely wrong'.
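
To illustrate the point about counts, here is a toy model in Python. It is not bogofilter's real storage or scoring: the ToyWordlists class and its methods are made up for this sketch, and it lets counts go negative, which the real tool may or may not do. It only shows why unregistering a message that was never registered removes knowledge instead of adding it.

```python
from collections import Counter


class ToyWordlists:
    """Toy spam/non-spam word counts; purely illustrative."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def register(self, words, as_spam):
        # Corresponds to the idea behind -s (spam) or -n (non-spam).
        target = self.spam if as_spam else self.ham
        for word in words:
            target[word] += 1

    def unregister(self, words, as_spam):
        # Corresponds to the idea behind -S (spam) or -N (non-spam).
        target = self.spam if as_spam else self.ham
        for word in words:
            target[word] -= 1


lists = ToyWordlists()
# Message A is correctly registered as spam.
lists.register(["cheap", "pills"], as_spam=True)
# Message B was never registered, but is "marked as spam" with the
# unregister operation, as the wizard's old rule would do.
lists.unregister(["cheap", "offer"], as_spam=True)

print(lists.spam["cheap"], lists.spam["pills"], lists.spam["offer"])
```

In this model the count for "cheap" drops back to 0, so what was learned from message A about that word is lost, and "offer" ends up below zero even though it only ever appeared in spam.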

Now, regarding my proposed solution:

Adding the -u option to the options for the classify command causes bogofilter 
to automatically add the messages it filters into the category it decides 
they're in (spam or non-spam), so you only have to manually teach it if it 
makes mistakes.

Now, suppose you have a misclassified message.
If bogofilter said it's spam, and it's not, the correct commandline is 
"bogofilter -S -n" (taken from the manpage section above); this is the "mark 
as non-spam" option I suggested.
If bogofilter said it's not spam, and it is, the correct commandline is 
"bogofilter -N -s".

If, on the other hand, you have messages you want to teach bogofilter with 
that it hasn't classified, you need to call it with different arguments:
If a message is unclassified and it's spam, you want "bogofilter -s".
If a message is unclassified and it's not spam, you want "bogofilter -n".

There may be some clever way to have just a single mark-as-{not-,}spam action 
which works whether or not bogofilter's already classified a mail, but I 
can't think of a way to do that using KMail's filters alone.

Anyway, I hope this answers your questions.

[1] There's actually a typo in this item; where it says "by -n or -u" it means 
"by -s or -u", as is readily apparent from reading what -s and -n do.
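
The four cases above can be summarised in a short Python sketch. The function name training_flags and the way the previous state is passed in are assumptions made for illustration; as noted above, KMail's filters alone do not know whether a message was already registered, so a caller would have to track that itself.

```python
def training_flags(previous, actual):
    """Return the bogofilter options for retraining one message.

    previous: "spam", "ham", or None if the message was never registered.
    actual:   "spam" or "ham", the class the user says is correct.
    """
    register = {"spam": "-s", "ham": "-n"}
    unregister = {"spam": "-S", "ham": "-N"}

    if actual not in register:
        raise ValueError("actual must be 'spam' or 'ham'")
    if previous is not None and previous not in register:
        raise ValueError("previous must be 'spam', 'ham' or None")

    if previous == actual:
        # Already registered in the right class: nothing to do.
        return []

    flags = []
    if previous is not None:
        # Undo the earlier, wrong registration first.
        flags.append(unregister[previous])
    flags.append(register[actual])
    return flags
```

For a message registered as spam that is really non-spam this gives ["-S", "-n"]; for one registered as non-spam that is really spam, ["-N", "-s"]; for a message that was never registered, just ["-s"] or ["-n"].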

Comment 3 Andreas Gungl 2004-02-09 11:27:46 UTC
On Monday 09 February 2004 01:22, Richard Smith wrote:
> [...]

Richard, I've read the manpage again and now I understand the differences. 
Perhaps I missed them because English is not my native language and I read 
over the pages too quickly.

> Here's a reiteration of what I said before:
>
> Your 'mark as spam' option is "bogofilter -S". According to the
> bogofilter manpage (the one at the website you quoted):
>
>     The -S option tells bogofilter to undo a prior registration of the
> same message as spam. If a message was incorrectly entered in the spam
> wordfile by '-n' or '-u' and you want to remove it from the spam wordfile
> and enter it in the non-spam wordfile, use options '-Sn'. If '-S' is used
> for a message that wasn't registered as spam, the counts will still be
> decremented.[1]
>
> IOW, -S marks a previously spam-marked message as unknown.
> Similarly, -N (your mark as non-spam option) marks a previously
> not-spam-marked message as unknown. Since these only decrease the hit
> counts for words, using them will never build up any knowledge in
> bogofilter at all (in fact, what they do is to unteach it the wrong
> thing). I hope you can now see why I called them 'completely wrong'.

You're right.

I think that this parameter change is very unfortunate. It's not enough for 
the wizard to detect the programs; it seems we have to take the installed 
version into account too.
One could argue that the old version is already history, but e.g. the stable 
SuSE 8.2 distribution ships such an old version. I guess there are a lot 
of people using outdated Bogofilter versions.

> Now, regarding my proposed solution:
>
> Adding the -u option to the options for the classify command causes
> bogofilter to automatically add the messages it filters into the category
> it decides they're in (spam or non-spam), so you only have to manually
> teach it if it makes mistakes.
>
> Now, suppose you have a misclassified message.
> If bogofilter said it's spam, and it's not, the correct commandline is
> "bogofilter -S -n" (taken from the manpage section above); this is the
> "mark as non-spam" option I suggested.
> If bogofilter said it's not spam, and it is, the correct commandline is
> "bogofilter -N -s".
>
> If, on the other hand, you have messages you want to teach bogofilter
> with that it hasn't classified, you need to call it with different
> arguments: If a message is unclassified and it's spam, you want
> "bogofilter -s". If a message is unclassified and it's not spam, you want
> "bogofilter -n".
>
> There may be some clever way to have just a single mark-as-{not-,}spam
> action which works whether or not bogofilter's already classified a mail,
> but I can't think of a way to do that using KMail's filters alone.
>
> Anyway, I hope this answers your questions.

As you've stated yourself it's a problem to know for sure if a message was 
classified as spam or ham. You can't be really sure (in KMail too) even if 
you use the status icons for spam / ham and the -u option.
This makes the filtering pretty difficult compared to e.g. SpamAssassin.

I personally don't like to mark non-spam messages with an icon. More than 90% 
of messages would have it, making that information nearly useless. One 
argument for doing it could be that I can see whether a message has been 
classified. But again you can't be sure, because the flag could be set 
manually too.

My approach was to use Bogofilter for classification based on reliable 
training (by using the classification actions in KMail or any external 
process). That's why I didn't use -u.
The classification in KMail would have been made explicitly by the user, 
which keeps the statistics clean, but I do realize that the counting isn't 
perfect in this case either.

I tend to agree with you that "-p -e -u" together with "-S -n" and 
"-N -s" is the best solution we can achieve.

I'm going to find a way to differentiate versions of a given tool to be able 
to handle such changes in the meaning of the parameters.
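
One way such a differentiation could look, sketched in Python. The command pairs are the ones quoted in this report: the old pair is what the wizard generated originally, the new pair is the one agreed on above. The table layout and the name wizard_commands are hypothetical and do not reflect how the wizard's configuration file is actually structured.

```python
# Commands the wizard generated originally, per this report.
OLD_COMMANDS = {"spam": "bogofilter -S", "ham": "bogofilter -N"}

# Commands proposed in this report for bogofilter 0.11.0 and later.
NEW_COMMANDS = {"spam": "bogofilter -N -s", "ham": "bogofilter -S -n"}


def wizard_commands(version):
    """Pick the mark-as-spam / mark-as-ham commands for a version tuple."""
    if tuple(version) >= (0, 11, 0):
        return NEW_COMMANDS
    return OLD_COMMANDS
```

For example, wizard_commands((0, 10, 3, 1)) selects the old pair and wizard_commands((0, 11, 0)) the new one.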

Comment 4 Andreas Gungl 2004-02-11 20:38:11 UTC
I've modified the configuration file for the wizard so that the new options are now configured. I've verified that they don't cause an error with at least bogofilter 0.10.3.1, although the counting might be affected.
Dealing with different versions is pretty difficult, as we already know from the PGP support in KMail. So I'm treating this parameter change as an exception for now. If we encounter more problems like this, we can still implement an appropriate mechanism.
Comment 5 Mark Saward 2005-06-23 12:42:31 UTC
I've just started using kmail for the first time in years, having migrated from evolution.  I tried setting up the kmail spam filters and ran across a similar problem relating to the updates.

As Richard pointed out above, -N -s and -S -n only work for messages that have already been classified.  This meant that when I built up my spam database using the default kde filters of -Ns and -Sn, I wasn't actually building anything worthwhile.  I first needed to classify my emails using -s or -n alone and THEN correct them with -Ns or -Sn.  This is what Richard mentioned above.

In other words, kmail needs to handle two different filters:
1.  Marking as spam or ham of emails that have not already been marked - using -n and -s
2.  Marking/correcting spam or ham emails that have already been marked - using -Ns and -Sn

If this is not possible, there should at least be a warning or something, because the current behaviour will cause a great deal of confusion for people trying to set up spam filtering in kmail for the first time.

Hope that helps.
Comment 6 Mark Saward 2005-06-23 12:47:30 UTC
Additional note, I should mention more clearly that the version of kmail I'm using set the rules as -Ns and -Sn, so not the same as what Richard had.
Comment 7 Thomas McGuire 2007-08-02 22:34:39 UTC
SVN commit 695738 by tmcguire:

Change the filter commands for bogofilter.
The old behavior corrupted the bogofilter database because KMail unregistered
messages which were not registered with bogofilter in the first place.

With the new behavior, messages which are classified automatically are no
longer added to the bogofilter database.

For more details and a better explanation, see the bugreport and especially
the bogofilter mail archives (linked to from the bugreport).

BUG: 148211
CCBUG: 74577


 M  +3 -3      kmail.antispamrc  


--- trunk/KDE/kdepim/kmail/kmail.antispamrc #695737:695738
@@ -34,10 +34,10 @@
 Executable=bogofilter -V
 URL=http://bogofilter.sourceforge.net
 PipeFilterName=Bogofilter Check
-PipeCmdDetect=bogofilter -p -e -u
+PipeCmdDetect=bogofilter -p -e
 PipeCmdNoSpam=
-ExecCmdSpam=bogofilter -N -s
-ExecCmdHam=bogofilter -S -n
+ExecCmdSpam=bogofilter -s
+ExecCmdHam=bogofilter -n
 DetectionHeader=X-Bogosity
 DetectionPattern=(yes)|(spam\\b)
 DetectionPattern2=