Good starting numbers for spamassassins dcc


Good starting numbers for spamassassins dcc

by Michał Grzędzicki

Hello,

By default SpamAssassin uses 99999 as dcc_body_max, dcc_fuz1_max, and
dcc_fuz2_max, which is the same as DCC's "many".
This is only 1/6 of the messages.

I'm planning to add variable scoring to SpamAssassin's DCC.pm to make it
more useful (currently only messages with "many" reports are flagged).
I'm thinking of 40 reports getting 1/10 of the base score, rising to
10,000 reports (or "many"; where does that start?) getting the whole
base score. 500 reports could be treated as likely spam with 1/2 of the
base score. In between, maybe use two linear functions or one of higher
order.

The base score should be around 4/5 of the mark-as-spam score.
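
The proposal above could be sketched roughly as follows. This is only an illustration in Python, not code from DCC.pm: the function name `dcc_score` is hypothetical, the anchor points are the ones named above (40, 500, and 10,000 reports), and the example base score of 4.0 assumes a mark-as-spam score of 5.0.

```python
# Anchor points (report count, fraction of base score) from the proposal.
ANCHORS = [(40, 0.1), (500, 0.5), (10_000, 1.0)]


def dcc_score(reports: int, base_score: float) -> float:
    """Piecewise-linear score: two linear segments between three anchors."""
    if reports < ANCHORS[0][0]:
        return 0.0  # too few reports to score at all
    if reports >= ANCHORS[-1][0]:
        return base_score  # full base score at or above the last anchor
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= reports < x1:
            # Linear interpolation inside the current segment.
            fraction = y0 + (y1 - y0) * (reports - x0) / (x1 - x0)
            return base_score * fraction
    return base_score  # not reached; kept for completeness


if __name__ == "__main__":
    base = 4.0  # assumed: 4/5 of a mark-as-spam score of 5.0
    for count in (10, 40, 270, 500, 10_000):
        print(count, round(dcc_score(count, base), 3))
```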

What would be good thresholds for very unlikely spam, likely spam, and
surely spam?

I'm guessing this is the right ordering of the body, fuz1, and fuz2
checksums: body gets the most reports and fuz2 the fewest.
Is this right?

--
Michał Grzędzicki

_______________________________________________
DCC mailing list      DCC@...
http://www.rhyolite.com/mailman/listinfo/dcc

Re: Good starting numbers for spamassassins dcc

by Vernon Schryver

> From: Michał Grzędzicki <lazy@...>

> By default SpamAssassin uses 99999 as dcc_body_max, dcc_fuz1_max, and
> dcc_fuz2_max, which is the same as DCC's "many".
> This is only 1/6 of the messages.

Are you referring to the difference between 19% tagged as "many" and
the 51% with bulky counts according to the graphs for your server at
https://www.rhyolite.com/dcc/private/.... ?
That is a low value.  Are you doing DCC filtering after other filters?

> I'm planning to add variable scoring to SpamAssassin's DCC.pm to make it
> more useful (currently only messages with "many" reports are flagged).
> I'm thinking of 40 reports getting 1/10 of the base score, rising to
> 10,000 reports (or "many"; where does that start?) getting the whole
> base score. 500 reports could be treated as likely spam with 1/2 of the
> base score. In between, maybe use two linear functions or one of higher
> order.
>
> The base score should be around 4/5 of the mark-as-spam score.
>
> What would be good thresholds for very unlikely spam, likely spam, and
> surely spam?

I doubt that would help.  The DCC detects bulk email.  Spam is unsolicited
bulk email.  Mail messages that have been seen 100 or 10,000 times are
equally bulky, and neither is more likely to be spam.  Contrast Amazon
online order confirmations with Amazon advertisements.  Both are very
bulky, but only some of the Amazon advertisements are spam.

That is why I have always said the best way to use DCC is with per-user
whitelists.  Each user's whitelist indicates which streams of bulk mail
are solicited.
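
The distinction between "bulk" and "spam" can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the function `is_spam` and the example addresses are hypothetical, and the whitelist is keyed only by sender address here, which is not a description of the actual DCC whitelist file format.

```python
def is_spam(is_bulk: bool, sender: str, user_whitelist: set[str]) -> bool:
    """Spam is bulk mail that this particular user did not solicit."""
    return is_bulk and sender.lower() not in user_whitelist


if __name__ == "__main__":
    # Hypothetical per-user whitelists of solicited bulk senders.
    alice = {"orders@shop.example.com", "news@shop.example.com"}
    bob = {"orders@shop.example.com"}

    # The same bulky advertisement is solicited by Alice but not by Bob.
    print(is_spam(True, "news@shop.example.com", alice))  # False
    print(is_spam(True, "news@shop.example.com", bob))    # True
```

The report count plays no role in the decision: once a message is bulky, only the user's whitelist decides whether it is solicited.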


I think the SpamAssassin threshold of "many"/99999 is far too high.
The SpamAssassin conversion of "many" to 99999 is a kludge that should
not have been coded.
Instead, SpamAssassin should look for "bulk" in the X-DCC header,
and the dccifd or dccproc thresholds should tell dccifd or dccproc
whether to add "bulk".  See DCCM_REJECT_AT and DCCM_REJECT_AT
in /var/dcc/dcc_conf.  See also -c and -t in the dccproc and dccifd
man pages; -t could be added to DCCIFD_ARGS.
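
Checking for "bulk" instead of comparing counts might look like the following Python sketch. It is not SpamAssassin code. The function `parse_dcc_header` is hypothetical, and the header layout used in the example (server name and ID, a semicolon, then an optional "bulk" token and Name=count pairs) is an assumption about what dccifd or dccproc emit; check the headers your own installation produces.

```python
def parse_dcc_header(value: str) -> tuple[bool, dict[str, str]]:
    """Return (is_bulk, counts) from the value of an X-DCC header."""
    # Everything after the first ';' is assumed to hold the results.
    _, _, results = value.partition(";")
    tokens = results.split()
    is_bulk = "bulk" in tokens
    # Keep counts as strings, because a count may be the word "many".
    counts = dict(token.split("=", 1) for token in tokens if "=" in token)
    return is_bulk, counts


if __name__ == "__main__":
    sample = "mail.example.com 1234; bulk Body=many Fuz1=many Fuz2=3"
    print(parse_dcc_header(sample))
```

With this approach the threshold lives in the DCC client configuration, and the filter only has to test a single flag.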


> I'm guessing this is the right ordering of the body, fuz1, and fuz2
> checksums: body gets the most reports and fuz2 the fewest.
> Is this right?

If I understand the question, no.  All of the checksums are computed
on all mail messages, but only reports of the most bulky checksums are
flooded among DCC servers.  Body checksums are not at all fuzzy, and
so minimal personalizations can make each copy of spam have differing
DCC body checksums.


Vernon Schryver    vjs@...

Re: [ SPAM ] Re: Good starting numbers for spamassassins dcc

by Michał Grzędzicki


On 2009-05-02, at 15:39, Vernon Schryver wrote:

>> From: Michał Grzędzicki <lazy@...>
>
>> By default SpamAssassin uses 99999 as dcc_body_max, dcc_fuz1_max, and
>> dcc_fuz2_max, which is the same as DCC's "many".
>> This is only 1/6 of the messages.
>
> Are you referring to the difference between 19% tagged as "many" and
> the 51% with bulky counts according to the graphs for your server at
> https://www.rhyolite.com/dcc/private/.... ?
Yes. So trapped spam is "many" and likely spam is something around 10?


> That is a low value.  Are you doing DCC filtering after other filters?
Only some RBL blackholing and SPF checking is done before it.
I'm talking only about messages that get the DCC_CHECK score for
having "many" occurrences, and that is only 17,000 out of 102,000.
This is caused by SpamAssassin always requiring "many" reports.

>
>> I'm planning to add variable scoring to SpamAssassin's DCC.pm to make
>> it more useful (currently only messages with "many" reports are
>> flagged). I'm thinking of 40 reports getting 1/10 of the base score,
>> rising to 10,000 reports (or "many"; where does that start?) getting
>> the whole base score. 500 reports could be treated as likely spam with
>> 1/2 of the base score. In between, maybe use two linear functions or
>> one of higher order.
>>
>> The base score should be around 4/5 of the mark-as-spam score.
>>
>> What would be good thresholds for very unlikely spam, likely spam, and
>> surely spam?
>
> I doubt that would help.  The DCC detects bulk email.  Spam is unsolicited
> bulk email.  Mail messages that have been seen 100 or 10,000 times are
> equally bulky, and neither is more likely to be spam.  Contrast Amazon
> online order confirmations with Amazon advertisements.  Both are very
> bulky, but only some of the Amazon advertisements are spam.
>
> That is why I have always said the best way to use DCC is with per-user
> whitelists.  Each user's whitelist indicates which streams of bulk mail
> are solicited.
>
>
> I think the SpamAssassin threshold of "many"/99999 is far too high.
> The SpamAssassin conversion of "many" to 99999 is a kludge that should
> not have been coded.
> Instead, SpamAssassin should look for "bulk" in the X-DCC header,
> and the dccifd or dccproc thresholds should tell dccifd or dccproc
> whether to add "bulk".  See DCCM_REJECT_AT and DCCM_REJECT_AT
> in /var/dcc/dcc_conf.  See also -c and -t in the dccproc and dccifd
> man pages; -t could be added to DCCIFD_ARGS.

You are right. For now I will lower the requirements in SpamAssassin.
Maybe in the future I will add some "bonus" points for popular spam with
more than 1000 reports, to make sure it is flagged as spam even if other
filters didn't catch it.

DCC.pm checks for "X-DCC: bulk" only if it has been added upstream.
dcc_conf mentions 50 as the bulk mail count, so I will start with
something around 200 and a lower score value (most of the spam that gets
through already has some points from other checks).


Whitelisted emails aren't scanned at all, so I don't have to worry about
that at the DCC level.
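
The lowered threshold of around 200 mentioned above could be checked along these lines. This is a Python sketch under stated assumptions, not the logic in DCC.pm: the helper names are hypothetical, counts are assumed to arrive as strings, and "many" is treated as exceeding any numeric threshold.

```python
def count_value(raw: str) -> float:
    """Convert a reported count to a number; 'many' beats any threshold."""
    return float("inf") if raw == "many" else float(int(raw))


def over_threshold(counts: dict[str, str], threshold: int = 200) -> bool:
    """True if any checksum count reaches the threshold."""
    return any(count_value(raw) >= threshold for raw in counts.values())


if __name__ == "__main__":
    print(over_threshold({"Body": "1", "Fuz1": "250", "Fuz2": "3"}))  # True
    print(over_threshold({"Body": "1", "Fuz1": "199", "Fuz2": "3"}))  # False
```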


>
>
>> I'm guessing this is the right ordering of the body, fuz1, and fuz2
>> checksums: body gets the most reports and fuz2 the fewest.
>> Is this right?
>
> If I understand the question, no.  All of the checksums are computed
> on all mail messages, but only reports of the most bulky checksums are
> flooded among DCC servers.  Body checksums are not at all fuzzy, and
> so minimal personalizations can make each copy of spam have differing
> DCC body checksums.

I guess I wasn't clear about that, sorry.
I hope this time it will be clearer.
Are fuz1 and fuz2 computed from the same parts of the email (e.g.
sender, subject, X-Client + body), or does fuz2 take more headers? In
that case very similar spams could have the same body hash and the same
fuz1 but a different fuz2, because fuz2 takes into account the X-Client
header, which differs between the two spams. Or maybe they take the same
subset of the email (headers + body) but use a different fuzzing
algorithm (like omitting whitespace, ignoring case, etc. to ignore minor
differences between spams)?

If they use the same subset of headers + body, there's no point in
differentiating the thresholds for fuz1 and fuz2; and if fuz2 takes more
data as input, it should have a smaller threshold than fuz1.


--
Michał Grzędzicki


Re: [ SPAM ] Re: Good starting numbers for spamassassins dcc

by Vernon Schryver

> From: Michał Grzędzicki <lazy@...>


> DCC.pm checks for "X-DCC: bulk" only if it has been added upstream.

I think it is good to run DCC checks during the original SMTP transaction.
The best way is to let the MTA reject spam during the transaction.
Even if one cannot do that, dccifd or dccm can add X-DCC headers when
run as part of sendmail, postfix, or other MTAs.


> Are fuz1 and fuz2 computed from the same parts of the email (e.g.
> sender, subject, X-Client + body), or does fuz2 take more headers? In
> that case very similar spams could have the same body hash and the same
> fuz1 but a different fuz2, because fuz2 takes into account the X-Client
> header, which differs between the two spams. Or maybe they take the same
> subset of the email (headers + body) but use a different fuzzing
> algorithm (like omitting whitespace, ignoring case, etc. to ignore minor
> differences between spams)?

All three DCC checksums, body, fuz1, and fuz2, are computed on only
the message body starting after the blank line that ends the SMTP headers.
The fuzziness of the fuz1 and fuz2 checksums differs.
I will not say how they differ, although it is not a secret for anyone
willing to read the source.
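
The idea of checksumming only the body, with and without fuzziness, can be illustrated in Python. This is explicitly not DCC's algorithm: the normalization in `toy_fuzzy_checksum` (lowercasing and collapsing whitespace) and the use of SHA-256 are stand-ins chosen for the example, and all function names are hypothetical.

```python
import hashlib


def split_body(raw_message: str) -> str:
    """Return the text after the first blank line that ends the headers."""
    normalized = raw_message.replace("\r\n", "\n")
    _, _, body = normalized.partition("\n\n")
    return body


def exact_checksum(body: str) -> str:
    """Checksum of the body exactly as received (not fuzzy at all)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def toy_fuzzy_checksum(body: str) -> str:
    """Toy fuzzy checksum: ignore case and whitespace differences."""
    canonical = " ".join(body.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    msg_a = "Subject: offer\nX-Client: 1\n\nBuy now!  \n"
    msg_b = "Subject: deal\r\nX-Client: 2\r\n\r\nbuy  NOW!\r\n"
    body_a, body_b = split_body(msg_a), split_body(msg_b)
    # Headers differ but are ignored; only the bodies are compared.
    print(exact_checksum(body_a) == exact_checksum(body_b))
    print(toy_fuzzy_checksum(body_a) == toy_fuzzy_checksum(body_b))
```

In this example the two bodies differ only in case and spacing, so the exact checksums differ while the toy fuzzy checksums match.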

> If they use the same subset of headers + body, there's no point in
> differentiating the thresholds for fuz1 and fuz2; and if fuz2 takes more
> data as input, it should have a smaller threshold than fuz1.

I think the thresholds for the checksums should be the same.

Except for tiny messages and certain other cases, all three DCC
checksums are computed on message bodies.
However, only reports of bulky checksums are flooded, so your DCC
server is more likely to receive reports of fuzzy checksums than
simple "body" checksums.


Vernon Schryver    vjs@...

Re: Good starting numbers for spamassassins dcc

by Michał Grzędzicki


On 2009-05-02, at 18:54, Vernon Schryver wrote:

>> From: Michał Grzędzicki <lazy@...>
>
>
>> DCC.pm checks for "X-DCC: bulk" only if it has been added upstream.
>
> I think it is good to run DCC checks during the original SMTP transaction.
> The best way is to let the MTA reject spam during the transaction.
Yes, this is much better than deleting spam, but in our current config
with amavis it would be hard to do.


> Even if one cannot do that, dccifd or dccm can add X-DCC headers when
> run as part of sendmail, postfix, or other MTAs.
>
>> Are fuz1 and fuz2 computed from the same parts of the email (e.g.
>> sender, subject, X-Client + body), or does fuz2 take more headers? In
>> that case very similar spams could have the same body hash and the same
>> fuz1 but a different fuz2, because fuz2 takes into account the X-Client
>> header, which differs between the two spams. Or maybe they take the
>> same subset of the email (headers + body) but use a different fuzzing
>> algorithm (like omitting whitespace, ignoring case, etc. to ignore
>> minor differences between spams)?
>
> All three DCC checksums, body, fuz1, and fuz2, are computed on only
> the message body starting after the blank line that ends the SMTP headers.
> The fuzziness of the fuz1 and fuz2 checksums differs.
> I will not say how they differ, although it is not a secret for anyone
> willing to read the source.

OK, thank you for the clarification.

>
>> If they use the same subset of headers + body, there's no point in
>> differentiating the thresholds for fuz1 and fuz2; and if fuz2 takes
>> more data as input, it should have a smaller threshold than fuz1.
>
> I think the thresholds for the checksums should be the same.
>
> Except for tiny messages and certain other cases, all three DCC
> checksums are computed on message bodies.
> However, only reports of bulky checksums are flooded, so your DCC
> server is more likely to receive reports of fuzzy checksums than
> simple "body" checksums.


Thank you for all the answers.


--
Michał Grzędzicki
