MIME parts and sa-learn

Hej,

I am thinking of a clever way to integrate spamassassin's sa-learn (bayesian classifier training program) into exim's ACLs. The intended approach is to pass the message which should be trained as a "message/rfc822" attachment (so original headers are preserved) to a specific address (e.g. sa-learn-spam@domain) at the server.

Therefore, the first thing I was looking at was the smtp_mime ACL, but it doesn't seem to be of much use besides filtering for regular expressions. If the "malware" condition were allowed, I could pass the attachment via a "cmdline" scanner to sa-learn, but according to the docs this isn't possible.

It is of course possible to pass the whole message to a script in the data ACL or in a transport and demime everything in the script, but I don't really like that much.
Another approach would be to demime into unique files in the mime ACL and read these files in a scanner/delivery script, but that's even worse, IMO.

I'm sure people are using spamassassin a lot out there, so can anyone here show a smarter way of integrating spam/ham learning? (Without using spamc or sth. else from the user side.)

TIA & br,
daniel

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
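For reference, the "demime everything in the script" variant described in the message above could look roughly like the sketch below. It is only a sketch under stated assumptions: the address names in `MODES`, the helper names `extract_rfc822_parts` and `learn`, and the idea that the script receives the raw message plus the recipient's local part are all hypothetical and not taken from the thread. It also assumes `sa-learn` is on the PATH and reads a single message from stdin when no file is given.

```python
import email
import subprocess
from typing import List

# Hypothetical mapping from the training address's local part to the
# sa-learn mode flag; the address names are assumptions for illustration.
MODES = {"sa-learn-spam": "--spam", "sa-learn-ham": "--ham"}


def extract_rfc822_parts(raw: bytes) -> List[bytes]:
    """Return each attached message/rfc822 part, re-serialised with its own headers."""
    found = []
    stack = [email.message_from_bytes(raw)]
    while stack:
        part = stack.pop()
        if part.get_content_type() == "message/rfc822":
            payload = part.get_payload()
            # A parsed message/rfc822 part holds a one-element list
            # containing the embedded message; skip malformed parts.
            if isinstance(payload, list) and payload:
                found.append(payload[0].as_bytes())
            continue  # do not descend into the attached message itself
        if part.is_multipart():
            # Reverse so that parts are visited in their original order.
            stack.extend(reversed(part.get_payload()))
    return found


def learn(raw: bytes, local_part: str, run=subprocess.run) -> int:
    """Feed every attached message to sa-learn; return how many were fed."""
    flag = MODES[local_part]  # KeyError for an unexpected address
    parts = extract_rfc822_parts(raw)
    for original in parts:
        # Assumption: sa-learn takes the message on stdin here.
        run(["sa-learn", flag], input=original, check=True)
    return len(parts)
```

A delivery script would read the message from stdin (`sys.stdin.buffer.read()`) and call `learn(raw, local_part)`. How exim hands over the message and the local part, for example through a pipe transport, is left out here on purpose. Stopping at the first `message/rfc822` level means attachments nested inside the forwarded message are learned as part of it, not as separate messages.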
Re: MIME parts and sa-learn

Daniel Tiefnig wrote:
> Hej,
>
> I am thinking of a clever way to integrate spamassassin's sa-learn
> (bayesian classifier training program) into exim's ACLs. The intended
> approach is to pass the message which should be trained as a
> "message/rfc822" attachment (so original headers are preserved) to a
> specific address (e.g. sa-learn-spam@domain) at the server.
>
> Therefore, the first thing I was looking at was the smtp_mime ACL, but
> it doesn't seem to be of much use besides filtering for regular
> expressions. If the "malware" condition were allowed, I could pass
> the attachment via a "cmdline" scanner to sa-learn, but according to the
> docs this isn't possible.
>
> It is of course possible to pass the whole message to a script in the
> data ACL or in a transport and demime everything in the script, but I
> don't really like that much.
> Another approach would be to demime into unique files in the mime ACL
> and read these files in a scanner/delivery script, but that's even
> worse, IMO.
>
> I'm sure people are using spamassassin a lot out there, so can anyone
> here show a smarter way of integrating spam/ham learning? (Without using
> spamc or sth. else from the user side.)
>
> TIA & br,
> daniel
>

CAVEAT: Take this as a 'contrarian' observation w/r/t auto-learning and local server spam/ham classifying in general.

- IMNSHO, trying to 'learn' spam/ham discrimination on a mixed-user server has two drawbacks:

-- It uses a great deal of machine resources compared to a multitude of simpler and more repeatable/predictable means of filtering.

-- It can be confused by per-user differences, not only as to what one user considers spam and another does not (quasi-legit adverts, supermarket, bookstore, airline and travel 'bargains' etc), but also as to the very nature of the traffic different users expect (active in social networks, retired vs active business contacts, family & friends vs professionals, et al).
So Spam-Bayes and friends can easily get it 'wrong' if applied system-wide, yet may need even greater resources if they are to be applied per-recipient - not easily done in the requisite DATA phase anyway - at least not as to rejection vs mere demerit scoring.

Conversely, Bayesian filtering seems to be at its best when applied in the end-user's MUA, where it is always 'per-recipient' specific, AND has at least 'momentary' access to a generally greater chunk of processing power than a server might be able to spare at busy times.

Next is the general 'need' to reinvent the classification anyway. It might have a better payoff to utilize SA for all EXCEPT Bayesian / 'learning' / AWL, and add, for example, DSPAM, wherein a broader global dataset of spam vs ham 'fingerprints' can be applied with less total effort than developing your own on the fly.

Either way, our experience has been that there is more than enough information available to identify the unwanted so as to not need either SA's Spam-Bayes or DSPAM. Messages that evade interception by simpler means are few enough to not justify the extra complexity - and maintenance - otherwise required, even when plentiful machine resources are on tap.

Note the relatively modest scores SA assigns by default even when SA-Spam-Bayes is used. Not really in the front lines of defense - though one can, of course, make it such.

Finally, to the extent that all other filters are working well, AND rejecting in-session, not just scoring and passing on, there can be a scarcity of spam on which to train Bayesian filtering. Carrying such traffic 'deeper' into DATA phase, so that Bayes can 'sniff' it to broaden its dataset, also adds workload when it could have been rejected earlier.

After extensive tests, including saving folders full of known-spam for training, we've given it up as too marginal to be useful (ditto greylisting), and have now had Spam-Bayes switched OFF for many years.

As said, a 'contrarian' viewpoint, so YMMV.

Bill
Re: MIME parts and sa-learn

W B Hacker wrote:
> - IMNSHO, trying to 'learn' spam/ham discrimination on a mixed-user
> server has two drawbacks:
>
> -- It uses a great deal of machine resources compared to a multitude
> of simpler and more repeatable/predictable means of filtering.

Which there are ...

> -- it can be confused by per-user differences, not only as to what
> one user considers spam and another does not

I didn't write that I'd like to share bayes data between users ... although I will. ;-) But my user base is very small (i.e. only a few employees) and maintains a close relationship. I wouldn't do that on a large-scale server with lots of different users. (Like the ISP systems I used to administer a lot, but no longer do.) And finally, I'm still experimenting a bit, and will disable bayes classifying if it doesn't work out well.

> So Spam-Bayes and friends can easily get it 'wrong' if applied
> system-wide, yet may need even greater resources if they are to be
> applied per-recipient - not easily done in the requisite DATA phase
> anyway - at least not as to rejection vs mere demerit scoring.

Well, this is a problem of content-based filtering in general.

> Conversely, Bayesian filtering seems to be at its best when applied
> in the end-user's MUA, where there it is always 'per-recipient'
> specific, AND has at least 'momentary' access to a generally greater
> chunk of processing power than a server might be able to spare at
> busy times.

Sure, but client-side filtering has other drawbacks, e.g. when using different clients like laptops, workstations, mobile phones ... Generally, it defeats the idea of IMAP-based mail access.

> Next is the general 'need' to reinvent the classification anyway. It
> might have a better payoff to utilize SA for all EXCEPT Bayesian /
> 'learning'/ AWL, and add, for example, DSPAM, wherein a broader
> global dataset of spam vs ham 'fingerprints' can be applied with less
> total effort than developing your own on the fly.

Thanks for the hint. I tried dspam a few years ago and liked it very much, but for the moment I don't want to maintain another spam filter besides spamassassin. At some point I might decide to replace spamassassin with dspam altogether, but I don't want to run two different scanners that will produce contradictory results.

> As said, a 'contrarian' viewpoint, so YMMV.

Thank you, your thoughts are much appreciated.

br,
daniel