|
|
|
|
Suggestion to developers
SpamAssassin is a really great product. But it is Perl-based and always checks every message against all of its rules. The volume of spam is constantly increasing, and so is the CPU and memory load that SA puts on servers. As an SA user, I would be happy to see the following in the next version:
1. An option to limit the number of rules run against each message. That is, once the accumulated score reaches required_score, stop further checking and treat the message as spam. I think not all users are really interested in gathering full statistics about every spam message.
2. Following from (1), it makes sense to sort all rules from lightweight to heavyweight (including the ones that require internet queries) and run the checks in that order.
This could lower SA's footprint. Thanks.
|
|
RE: Suggestion to developers
How would you account for negative-scoring rules? (If your message hits score=5, it may soon be score=-2 after a negative-scoring rule is applied.)
The most effective way I've found to lower the SA footprint is to limit the mail that gets to it by using some triage on the MTA side. SA as a standalone tool might benefit from some kind of triage functionality to kill messages immediately as per a "blacklist" rule. The blacklist rule(s) would be run against the messages before the normal ruleset was applied. If any of the blacklist rules were triggered, the message would be dropped without further scanning.

-----Original Message-----
From: Crocomoth [mailto:avp@...]
Sent: Wednesday, September 12, 2007 10:42 AM
To: users@...
Subject: Suggestion to developers
View this message in context: http://www.nabble.com/Suggestion-to-developers-tf4429767.html#a12637043
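The negative-scoring question in this reply can be illustrated with a toy model. This is only a sketch: the rule names, scores and threshold are made up for illustration, and nothing here reflects how SpamAssassin itself is implemented.

```python
# Toy model: each entry is (rule_name, score) for a rule that fired on a message.
REQUIRED_SCORE = 5.0  # assumed spam threshold for this sketch


def score_all(hits):
    """Sum every rule that fired, as a full scan would."""
    return sum(score for _, score in hits)


def score_early_exit(hits, threshold=REQUIRED_SCORE):
    """Stop as soon as the running total reaches the threshold."""
    total = 0.0
    for _, score in hits:
        total += score
        if total >= threshold:
            break  # later rules, including negative ones, are never seen
    return total


# Hypothetical rules, ordered so that the negative rule happens to come last.
hits = [
    ("HYPOTHETICAL_SPAMMY_SUBJECT", 3.0),
    ("HYPOTHETICAL_SPAMMY_BODY", 2.0),
    ("HYPOTHETICAL_TRUSTED_SENDER", -7.0),
]

full = score_all(hits)          # what a complete scan would report
early = score_early_exit(hits)  # what a naive cut-off would report
```

With these numbers the full scan ends below the threshold while the naive cut-off stops at the threshold, so the same message would be tagged differently depending on rule order.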
|
|
RE: Suggestion to developers
In order to implement something like this, you would need to know the order of rule processing (perhaps there is one, but I don't know it). You would need to be careful if you have rules that assign negative scores, which typically do so after other rules have already given positive ones. Every SA implementation would be unique, so SA would have to be modified to run some specific rule sets first, before any others (maybe it does now?), and you would then want to make certain your custom scores go into those files.
In my own implementation, I put my custom rules into a unique .cf file that I created so I can distinguish it from other rule sets. The "out-of-the-box" SA wouldn't run this file first (unless SA can be modified to read a designated file before it reads others).
|
|
RE: Suggestion to developers
Of course, this would not be simple to implement, but I think that as SA becomes heavier, developers will be forced to find ways of "scissoring".
To preserve negative scores, SA could run those rules first. And, while sorting, SA should take into account possible dependencies between rules: read all rules from all config files and build a forest of rule trees. I think SA does this anyway, and all custom rules will be included in the set of rules in memory. The sort order, for simplicity, could be from rules with a high score to ones with a low score. Even this could help greatly.
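The ordering proposed in this post (negative rules first, then positive rules from high score to low, with a cut-off at the threshold) might look roughly like the sketch below. The `Rule` type, the rule names and the substring tests are all hypothetical, and rule dependencies are ignored; this is not SpamAssassin code.

```python
from typing import Callable, List, NamedTuple, Tuple


class Rule(NamedTuple):
    name: str
    score: float
    matches: Callable[[str], bool]  # hypothetical test against the message text


def order_rules(rules: List[Rule]) -> List[Rule]:
    """Negative rules first, then positive rules from highest to lowest score."""
    negatives = [r for r in rules if r.score < 0]
    positives = sorted((r for r in rules if r.score >= 0),
                       key=lambda r: r.score, reverse=True)
    return negatives + positives


def scan(message: str, rules: List[Rule], threshold: float) -> Tuple[float, int]:
    """Return (score, number of rules evaluated), stopping at the threshold."""
    total = 0.0
    evaluated = 0
    for rule in order_rules(rules):
        evaluated += 1
        if rule.matches(message):
            total += rule.score
        # By the time a positive rule runs, every negative rule has already run.
        if rule.score >= 0 and total >= threshold:
            break
    return total, evaluated


RULES = [
    Rule("HYPOTHETICAL_SMALL", 0.5, lambda m: "click" in m),
    Rule("HYPOTHETICAL_BIG", 4.0, lambda m: "viagra" in m),
    Rule("HYPOTHETICAL_TRUSTED", -4.0, lambda m: "trusted" in m),
    Rule("HYPOTHETICAL_MID", 2.0, lambda m: "free" in m),
]

spam_result = scan("free viagra click", RULES, threshold=5.0)
ham_result = scan("trusted free viagra click", RULES, threshold=5.0)
```

In this toy run the first message stops after three of the four rules, while the second has to run all four because the negative rule keeps it under the threshold. Whether such an ordering saves time on real mail is a separate question that this sketch does not measure.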
|
|
|
RE: Suggestion to developers
I am not sure that messages will be dropped after a positive blacklist check. As far as I can see, SA just adds 100 points to the message and continues checking. And I am not sure about the order of rules in the checking process.
|
|
Re: Suggestion to developers
Crocomoth wrote:
> SpamAssassin is a really great product.
> But, it is perl-based and checks every message with a lot of (all) rules (,
> always!).
> Volume of spam is constantly increasing, as well as CPU and memory load that
> SA creates on servers.
> As a SA user, I would be happy to have the following possibility in the next
> version:
> 1. Add an option which will allow to limit number of rules run against every
> message. I.e., if the limit of spam points is reached to required_score,
> stop further checking and process the message as a spam.
> I think, not all users really interested in gathering all statistics about
> all spam messages.
> 2. According to (1), it makes sense to sort all rules from lightweight to
> heavyweight (including ones which require internet queries) and make
> checking in this order.
>
> This could allow to lower SA footprint.

SA 3.2.x already does this, you just need to know how. Read the docs on the shortcircuit plugin, and the "priority" option for rules:

Shortcircuit allows you to define when to "bail out"
http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html

And priority, documented in the "Rule definitions and privileged settings" section of the Conf manpage, allows you to tell SA what order to run rules in.
http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Conf.html#rule_definitions_and_privileged_settings

Note however that over-using priority on the rules can be detrimental to your performance, forcing SA to scan through the message many times.
|
|
|
|
|
Re: Suggestion to developers
Thank you for the very useful information. This method and plug-in could really make checking faster. But I have to say:
1. Using this method, the admin must understand that the fate of every message (for all users) will depend on a single rule. In some cases this looks insufficient, especially when the system is used by multiple users with quite different typical message content. So, in the default configuration, bayes may generate false positives.
2. I suspect that not every admin is smart enough, or has enough time, to develop his own rulesets with shortcircuit involved and get really good and reliable results. But he would be able to turn on an option in the config file and restart SA.
3. The method I proposed is not mutually exclusive with shortcircuit. They could work together.
Thanks.
|
|
Re: Suggestion to developers
Crocomoth wrote:
> Matt Kettler-3 wrote:
>
>> SA 3.2.x already does this, you just need to know how. Read the docs on
>> the shortcircuit plugin, and the "priority" option for rules:
>>
>> Shortcircuit allows you to define when to "bail out"
>> http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html
>
> Thank you for very useful information.
> This method and plug-in could really make checking faster.
> But, I have to say:
> 1. Using this method, admin must understand that the fate of every message
> (for all users) will depend from the single rule.

Not if you set it up properly. You can have multiple rules run with a very early priority (low number), then have another one run with a semi-early priority which does shortcircuiting. All of the "very early" rules will be involved in the decision to shortcircuit or not.

> In some cases, this looks
> like not enough, especially when the system is used by multiple users with
> quite different desired average message content. So, bayes may generate
> false positives, in default configuration.
> 2. I suspect that not every admin could be smart enough or have enough time
> to develop his own rulesets with shortcircuit involved to get really good
> and reliable results. But, he could be able to turn some option in config
> file and restart SA.

Agreed.

> 3. Method proposed by me is not mutually exclusive with shortcircuit. They
> could work together.

Yes, but the method you proposed is only feasible using these tools anyway. SA can't "auto-sort" the rules in any reasonable way without severely degrading performance, or risking serious miscategorization problems.

Trust me, the topic isn't new, and shortcircuit/priority is about the best you can do. You have to make those manual decisions. Now, it's possible for the devs to be the deciders, not the end-admins, but someone has to manually prioritize.
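The tiered setup described in this reply (several rules at a very early priority, then one combining rule at a semi-early priority that triggers the bail-out) can be modelled in a few lines. This is a toy sketch only: the dictionaries, rule names, priority numbers and substring tests are invented for illustration and are not SpamAssassin's configuration syntax or plugin API. The one detail taken from the thread is that a lower priority number runs earlier.

```python
def run_by_priority(rules, message):
    """Run rules in ascending priority; stop when a shortcircuiting rule hits.

    Returns (hits, stopped_early, names_of_rules_evaluated).
    """
    hits = {}
    evaluated = []
    # sorted() is stable, so rules sharing a priority keep their listed order.
    for rule in sorted(rules, key=lambda r: r["priority"]):
        evaluated.append(rule["name"])
        if rule["test"](message, hits):
            hits[rule["name"]] = rule["score"]
            if rule.get("shortcircuit", False):
                return hits, True, evaluated
    return hits, False, evaluated


# Hypothetical rules: two cheap early tests, one combining rule, one late test.
RULES = [
    {"name": "LATE_EXPENSIVE", "priority": 0, "score": 1.0,
     "test": lambda msg, hits: "click" in msg},
    {"name": "EARLY_A", "priority": -500, "score": 2.0,
     "test": lambda msg, hits: "viagra" in msg},
    {"name": "EARLY_B", "priority": -500, "score": 2.0,
     "test": lambda msg, hits: "lottery" in msg},
    # Fires only if both early rules hit, so the decision rests on more than one rule.
    {"name": "EARLY_COMBINED", "priority": -400, "score": 100.0, "shortcircuit": True,
     "test": lambda msg, hits: "EARLY_A" in hits and "EARLY_B" in hits},
]

hits, stopped, evaluated = run_by_priority(RULES, "viagra lottery click")
hits2, stopped2, evaluated2 = run_by_priority(RULES, "viagra click")
```

In the first call the late rule is never evaluated because the combining rule fires; in the second call only one early rule hits, so the scan runs to the end.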
|
|
Re: Suggestion to developers
Yes, but low-numbered rules may not generate any points, and the decision may depend on one rule anyway. This does not change anything. What is more (see (2), with which you agreed), in the default configuration this will be bayes, which generates only 3.5 points (not counting white/black lists, because in most cases they will not be set up properly). And I think the number of people who do not wish to reorder the standard rules will be much larger than the number of "semi-professional" admins.
But, as we can see, an option named "priority" exists. That means SA really does some kind of sorting. And, theoretically, a user can assign any priority to any rule and SA will still work as a stable product, won't it? The sort order may be: negative rules, then sorted positive common rules. Any user-defined rules, if they exist, should be checked after the negative ones and before the positive ones. Of course, sorting should be performed once, at load time. Or such a cut-off may work without any sorting; this is optional. Standard priorities could be enough, if they are set up.
Thank you. I just want to draw the developers' attention to this problem. Every other message here is about performance.
|
|
Re: Suggestion to developers
Crocomoth wrote:
> Matt Kettler-3 wrote:
>
>>> 1. Using this method, admin must understand that the fate of every
>>> message
>>> (for all users) will depend from the single rule.
>>>
>> Not if you set it up properly.. You can have multiple rules run with a
>> very early priority (low number), then have another one run with a
>> semi-early priority which does shortcircuiting. All of the "very early"
>> rules will be involved in the decision to shortcircuit or not.
>
> Yes, but low-numbered rules may not generate any points and the decision may
> depend from one rule anyways. This does not change anything. And what is
> more (see (2) with which you have agreed), in default configuration, this
> will be bayes which generates only 3.5 points (not taking into account
> white/black lists because they will not be set up properly in most cases).
> And, I think, number of persons not wishing to reorder standard rules will
> be much more than "semi-professional" admins.

True, but your automated method based on sorting them on "weight" would pretty much grind spamassassin to a screeching halt by increasing the average scan time due to forcing multiple passes through the message. Not to mention false positive problems if negative-scoring rules end up being considered "heavy" and don't get run.

Your idea essentially ruins any benefits of memory caching that SpamAssassin currently exploits. Right now, rules are run in groups based on what part of the message they need. This lends speed to spamassassin by allowing that portion of the message to already be in cache for all but the first rule in the group.

If you start jumping around all over the message for different rules, the processor memory cache quickly becomes full and pushes out parts that you're going to be looking at again. If you keep going back-and-forth header, body, header, body, header, body.. you wind up going out to RAM quite often, and that's painfully slow. (I don't care what high-speed dual-channel DDR2 memory setup you have, it's abysmally slow from the processor's perspective, generally 20 times slower than cache is.)

Sure, some messages will bail out faster, but most messages will take much longer to scan. How is that better?

I don't debate that the basic idea of having SA do this "automagically" would be a great thing. However, the reality of doing it efficiently is much trickier than you think.

At one point, one idea was to run all the negative scoring rules, and then run the positive scoring ones, and bail out if the score went over the spam threshold during the positive phase.

The end result of that test was abysmally slow, due to having to scan the message in two passes (negative header, negative body, positive header, positive body).

> Sort order may be: negative rules, sorted positive common rules. Any
> user-defined rules should be checked after negative ones and before
> positives, if exists. Of course, sorting should be performed once upon load
> procedure.

Tested, as mentioned above. Resulted in horrible performance due to over-sorting.

> Or, such a cut-off may work without any sorting; this is optional. Standard
> priorities could be enough, if they set up.

I'd agree there. SA could exploit priorities better in the default config, but this kind of thing needs to be done very carefully to avoid thrashing the processor cache. Any simple "sort by.." is going to result in terrible performance.
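The "multiple passes" argument can be made concrete by counting how often a given rule order forces a switch between message parts. This is a rough sketch under stated assumptions: the `Rule` type, the six rules, their scores and their "cost" values are invented, and counting part switches is only a crude proxy for the cache behaviour described above; it measures nothing about real hardware or about SpamAssassin's internals.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    name: str
    part: str    # which part of the message the rule reads: "header" or "body"
    score: float
    cost: int    # made-up relative "weight" of the rule


def count_part_switches(ordered):
    """Count how many times the scan has to move to a different message part."""
    switches = 0
    current = None
    for rule in ordered:
        if rule.part != current:
            switches += 1
            current = rule.part
    return switches


RULES = [
    Rule("A", "header", 2.0, 1),
    Rule("B", "body", 1.5, 2),
    Rule("C", "header", -3.0, 3),
    Rule("D", "body", -0.5, 4),
    Rule("E", "header", 1.0, 5),
    Rule("F", "body", 3.0, 6),
]

# Grouped by message part: all header rules, then all body rules.
by_part = sorted(RULES, key=lambda r: r.part == "body")
# Sorted purely by "weight", lightest first.
by_cost = sorted(RULES, key=lambda r: r.cost)
# Negative rules first, each phase grouped by part.
neg_first = sorted(RULES, key=lambda r: (r.score >= 0, r.part == "body"))

passes = {
    "by_part": count_part_switches(by_part),
    "by_cost": count_part_switches(by_cost),
    "neg_first": count_part_switches(neg_first),
}
```

For this made-up rule set, grouping by part needs 2 passes, the negative-first order needs 4 (negative header, negative body, positive header, positive body), and the pure sort by weight needs 6.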
|
|
Re: Suggestion to developers
Matt's generally nailed it. I would say that it should be easy enough to write a plugin which reorders rule priorities into a desired order, then implements the "have_shortcircuited" plugin hook to return 1 at the desired point... so if anyone feels like trying it out to see if they can make an auto-shortcircuiting plugin which outperforms base SpamAssassin over a mixed corpus of 50:50 nonspam and spam, go for it ;)

--j.
|
|
Re: Suggestion to developers
On 9/13/07, Justin Mason <jm@...> wrote:
> if anyone feels like trying it out to see if they can make an
> auto-shortcircuiting plugin which outperforms base SpamAssassin over a
> mixed corpus of 50:50 nonspam and spam, go for it ;)

I dunno about your mail, but if it outperformed base SA on a corpus of 20:80 ham:spam, that'd be worth it for what we end up filtering.

Of course, "outperform" means it also has to maintain the same (or a smaller) FP ratio, not just that it does the wrong thing faster.
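The acceptance criterion in this post (faster, and with the same or a smaller FP ratio) can be written down as a small check. This is a sketch with assumptions: the `RunResult` type and the numbers are hypothetical, and "FP ratio" is taken here to mean false positives divided by the number of ham messages in the corpus, which the thread does not spell out.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    seconds: float        # total scan time over the test corpus
    false_positives: int  # ham messages wrongly tagged as spam
    ham_total: int        # number of ham messages in the corpus

    @property
    def fp_ratio(self) -> float:
        return self.false_positives / self.ham_total


def outperforms(candidate: RunResult, baseline: RunResult) -> bool:
    """Faster AND no worse on false positives; speed alone does not count."""
    return (candidate.seconds < baseline.seconds
            and candidate.fp_ratio <= baseline.fp_ratio)


# Made-up numbers for a corpus of 2000 ham and 8000 spam messages (20:80).
baseline = RunResult(seconds=1000.0, false_positives=2, ham_total=2000)
fast_but_wrong = RunResult(seconds=600.0, false_positives=10, ham_total=2000)
fast_and_safe = RunResult(seconds=800.0, false_positives=2, ham_total=2000)

verdicts = (outperforms(fast_but_wrong, baseline),
            outperforms(fast_and_safe, baseline))
```

Here the faster run with more false positives is rejected and the slightly faster run with an unchanged FP ratio is accepted.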
|
|
RE: Suggestion to developers
On Wed, 12 Sep 2007, Jason Burzenski wrote:
> How would you account for negative scoring rules? (if your message hits
> score=5 it may soon be score=-2 after a negative scoring rule is
> applied).

It is stupid simple - run them first. :)

--
Michał Jęczalik, +48.603.64.62.97
INFONAUTIC, +48.33.487.69.04
|
|
Re: Suggestion to developers
I trust you. And probably any reordering may impact performance (the original ruleset is carefully tuned). Unfortunately, I don't know the order in which rules are processed (is it the load order established by the leading numbers in the config filenames?). But I see that shortcircuit does reorder some rules (bayes, whitelists and some others) and nothing dramatic happens. Moreover, this plug-in is recommended for use (in properly set up installations).
Of course, if we consider an abstract case where negative rules may occur in the body as well as in the header, in unpredictable quantity and order, and reordering is impossible, this idea has no right to live. But in reality we see that almost all negative rules concern the header, with the only exception being bayes. And this test (bayes) is moved to the top by shortcircuit (before all header tests), and this does not harm performance. I think this situation (all negatives come from the header) will be preserved in future versions of SA, because of the nature of email messages.
So I think it is possible to turn on a check of the collected points after [prioritized rules + header rules] (and inside the body rules), without any sorting if that is undesirable.