Re: Spam PDF

John Rudd wrote:
> bgodette@... wrote:
>>> Actually, it didn't. The assertion is that if someone else hadn't seen this exact message first, then SA wouldn't have caught it.
>> No, the assertion is that if someone else hadn't seen prior abuse from the sending host first (not this exact message), then SA wouldn't have caught that particular message. That assertion happens to be true for the blacklists, and true for BAYES as well since it would have had to have seen headers (since the payload is vastly different) that look like this sending host in the recent past and been told that it was SPAM.
>
> Your assertion about bayes is not well supported. It might have been flagged by bayes for reasons that have _NOTHING_ to do with the received headers.

Follow along on this thought experiment. We have a spam run in progress from fresh zombies that have never sent to your system before. The payload of the spam run is new and does not match any historical spam because it's just a base64 encoded attachment (nothing for bayes to tokenize in the body of the message). If the bot adds no headers beyond the minimum (i.e. no User-Agent, X- headers, etc.), the only bayes fodder for such a message is the Received headers, From/To/Subject, and the MIME boundaries. You *will* not be getting a BAYES_90 or BAYES_99 from that.

>>> The PBL (which isn't spamtrap fed, it's collected from ISP published and/or contributed data) would have caught this based upon issues that have nothing at all to do with this message, and most likely nothing at all to do with this current round of spam. It would be based upon the host provider's policy that this host shouldn't send email to the internet.
>> Which means, some time in the past, for whatever reasons, that particular IP address did something against someone's policy to end up on that list. The important part being "in the past".
>
> No, it means that the ISP, or possibly net block user, told Spamhaus "it's an end user IP address, and not a mail server". There might be _NO_ previous abuse from that IP address, and they'll still be listed. The "policy" here is NOT the recipient's policy, it's the sending network owner's policy.

I think you're missing the point when I say "in the past" in relation to scoring vs blacklists. It doesn't matter why an IP is listed; at some point in the past that IP had to have been added before you can match against it.

The PBL contains more than just organization-submitted netblocks. It also includes ranges added by hand by Spamhaus for ranges that appear to be end-user IP space that they have received spam from, which IMO is probably the majority of that list at this point. That is why I assert that hitting PBL is based on prior abuse (spam sent to the Spamhaus PBL maintainers) more often than not. Fortunately IP addresses already in PBL tend to actually stay there since there's no automatic expiration, but it is still a reactionary list.

> It could have been recent abuse from an entirely different message batch. In other words, maybe that IP sent a standard stock scam yesterday, and today it sent the pdf spam ... and this person was the first one to receive that pdf spam message. No previous recipient of the same message. But they'll still be listed at spamcop.

And? That's still a late-receiver effect: this particular message scored X points because of what Y host did Z minutes ago, where Z could be days/weeks/months of minutes.

>> Explain how BAYES will have any matching tokens to work on if it's from a fresh, never before seen by your system, zombie and there's no message body other than the attachment? All you have to work with is headers which you've never seen before and MIME boundaries which you've never seen before.
>
> There are more headers than just the received headers. And, I honestly don't know whether or not an attachment's raw data is analyzed by bayes or not. My assumption is that it is.

Yes, there's From/To/Subject and the MIME boundary. To: is ambiguous since that's going to match both ham and spam. From:, Subject:, and the MIME boundary can be ambiguous, or worse, they'll match common ham in the case of site-wide BAYES. No other headers need be added. So you're left with the Received headers to make a spam/ham choice from, and your BAYES database hasn't ever seen this host before. Good luck with that.

And since I mentioned site-wide bayes: as I mentioned more than a year ago, the practice of using ham (e.g. common mailing lists) from the prior week as filler is becoming more and more common. The only solution to that for bayes is per-user bayes, which only works if your users have a method of training and actually use it correctly (unlike AOL users that use the junk button to "unsubscribe"). That pretty much excludes that option from being used by large installations.

>>> Just resting upon BAYES, BOTNET, and PBL, you're not "lucky to have caught the message because you're a late receiver". You've caught the message due to a combination of policy, misuse, and historical characteristics of spam in general being used to train your system.
>> All of which needs prior examples/reporting of messages similar to the one you're trying to detect, that's what "historical characteristics of spam" means.
>
> BOTNET does _NOT_ need prior reporting. And the prior reporting the PBL requires has nothing to do with abuse. Further, BAYES does not depend upon the received headers. But even if you're right about bayes, your claim that "all of which needs prior..." is at least 2/3 wrong, if not 3/3 wrong.

Yes, I failed to exclude BOTNET from that; it's the only score from the original message that started this that is solid. The reason is because BOTNET is proactive, all the others are either 100% reactionary or nearly so (PBL).

Really the only point I'm trying to make here is that saying "well X spam is caught here..." is rather pointless if all it's hitting on your system is blacklists, message hash systems (DCC/Razor/Pyzor/etc), and BAYES, and it wouldn't have been marked if you excluded them. If you have a spam like that, it is something that should really be looked at for some method of detection that doesn't rely on those reactionary methods. Those methods are really only good at detecting "old" spam and only have a limited probability of matching "new" spam as the mutation rate increases, which it is, in case you hadn't noticed.

Fortunately for this particular spam there is an existing proactive plugin, FuzzyOCR, that can be massaged to deal with PDF stock image spam.
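The reactive/proactive point can be checked mechanically: take the rules a message hit, drop the ones that depend on prior reports or training, and see whether what is left still crosses the threshold. The following is only a minimal sketch of that check. The rule names appear in this thread or are common SpamAssassin rule names, but the scores, the 5.0 threshold, and the `REACTIVE_RULES` grouping are illustrative assumptions, not anyone's real configuration and not a SpamAssassin API.

```python
# Hypothetical per-site scores; real values depend on local configuration.
RULE_SCORES = {
    "BAYES_80": 3.5,
    "BOTNET": 3.0,
    "RCVD_IN_PBL": 0.9,   # assumed value, for illustration only
    "DCC_CHECK": 2.2,
}

# Assumed grouping: rules that only fire because of earlier reports or training.
REACTIVE_RULES = {"BAYES_80", "RCVD_IN_PBL", "DCC_CHECK"}

THRESHOLD = 5.0  # assumed "required" score


def total(hits):
    """Sum the scores of the rules that fired."""
    return sum(RULE_SCORES[name] for name in hits)


def proactive_only(hits):
    """Keep only the rules that do not depend on prior sightings."""
    return [name for name in hits if name not in REACTIVE_RULES]


hits = ["BAYES_80", "BOTNET", "RCVD_IN_PBL"]

caught_with_everything = total(hits) >= THRESHOLD
caught_proactively = total(proactive_only(hits)) >= THRESHOLD

print("caught with all rules:      ", caught_with_everything)
print("caught by proactive rules:  ", caught_proactively)
```

With these made-up numbers the message is tagged only because of the reactive rules, which is the situation described above as worth a closer look.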
|
|
Re: Spam PDF

arni wrote:
> bgodette@... wrote:
>> Sounds more like "if we didn't rely on other people to have seen this particular abusive host before us and our learning system to have seen past examples of spam that looks a whole lot like this one from headers alone to detect this particular spam, we'd fail to catch it until we've trained our system and the abusive host has been reported to various lists".
>>
>> That's what makes policy (e.g. MTA checks, BOTNET) and behavior based detection work as well as it does, it's proactive instead of reactive.
>
> I have no spam that doesn't score at least BAYES_80. BAYES_80 is 3.5 points here, BOTNET is 3 points here, which makes 6.5 total and a bust.
>
> It doesn't have anything to do with being a late receiver, as I receive this spam on a whole lot of addresses and not just one. Please don't tell me you think I'm a late receiver on all of them.
>
> arni

No, all BAYES is saying is that you've received and trained spam in the past that has bits and pieces that look like this new spam. If a spammer reduces the number of tokens that can match negatively and does nothing else, they'll end up with a meaningless bayes score (right around BAYES_50). Add a bit of "likely to be trained as ham" content from a common mailing list from the day before, use that in combination with an image/attachment/short spam, and you've got a nice low bayes score. It works great against large site-wide bayes databases, not so much against per-user unless the user happens to be subscribed to whatever ham source the spammer is using.

<joke>Maybe we should train all our mailing lists as spam!</joke>
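The dilution effect described here can be illustrated with a toy calculation. This is a sketch using the textbook naive Bayes combination of per-token spam probabilities, which is an assumption made for simplicity and is not the combining method SpamAssassin actually uses. The token probabilities are invented for illustration.

```python
from math import prod


def combine(token_probs):
    """Naive Bayes combination of per-token spam probabilities.

    Each value is the (assumed) probability that a message containing
    that token is spam. Returns the combined spam probability.
    """
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


# Only ambiguous header/MIME tokens, nothing in the body to tokenize.
sparse = [0.50, 0.55, 0.45]

# The same message with filler that the database has learned as ham,
# e.g. text lifted from a popular mailing list.
with_ham_filler = sparse + [0.05, 0.10, 0.10]

# For contrast: the same headers plus a couple of strongly spammy tokens.
typical_spam = sparse + [0.95, 0.90]

for label, probs in [
    ("sparse headers only", sparse),
    ("with ham filler", with_ham_filler),
    ("typical spam", typical_spam),
]:
    print(f"{label:20s} -> {combine(probs):.3f}")
```

The sparse message lands at about 0.5 (no information either way), and a handful of ham-leaning tokens is enough to pull the result well below that, which is the site-wide bayes problem described above.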
|
|
Re: Spam PDF

bgodette@... wrote:
> John Rudd wrote:
> You *will* not be getting a BAYES_90 or BAYES_99 from that.

My first one got BAYES_80, without having seen that zombie/relay before. That's enough for 2 points.

> I think you're missing the point when I say "in the past" in relation to scoring vs blacklists. It doesn't matter why an IP is listed, at some point in the past that IP had to have been added before you can match against it.

It does matter, because it's not a "late receiver effect" unless someone, anyone, has received spam from that host before. And there's no relationship between "previous email from that host at all" and "being listed in the PBL".

> In the case of PBL it contains more than just organization submitted netblocks, it also includes ranges added by hand by Spamhaus for ranges that appear to be end-user IP space that they have received spam from, which IMO is probably majority of that list at this point.

Show me that the "that they have received spam from" part is how they built their list, and not just "that appear to be end-user IP space".

> And? That's still a late-receiver effect, this particular message scored X points because of what Y host did Z minutes ago, where Z could be days/weeks/months of minutes.

Any time I have seen the phrase "late receiver effect" used before, it is about the message itself, and not the relay. Thus, Razor, which is entirely about the message and not even remotely about the relay, is entirely "late receiver effect" based.

> Yes I failed to exclude BOTNET from that, it's the only score from the original message that started this that is solid. The reason is because BOTNET is proactive, all the others are either 100% reactionary or nearly so (PBL).

My first one was caught by Botnet, Bayes_80 (again, no previous pdf spam, and no previous activity from that relay), and UNIQUE_WORDS. Even if Botnet alone hadn't been enough, and only had a score of 3 ... _either_ of the other two would have been enough to push it up to 5.
|
|
Re: Spam PDF

Replying to bgodette@...:

I will use one of the best quotes ever created on the internet:

"You make your mouth full of technical bullshit when only facts talk"

By some random guy

;-) arni
|
|
Re: Spam PDF

John Rudd wrote:
> bgodette@... wrote:
>> John Rudd wrote:
>
>> You *will* not be getting a BAYES_90 or BAYES_99 from that.
>
> My first one got BAYES_80, without having seen that zombie/relay before. That's enough for 2 points.

Which only tells me it had more than just the PDF attachment, which is not what we're seeing here. You're also avoiding the point I was making by saying "hey this spam I got which has all this additional content for bayes to work with happened to score high". Well of course it did.

> It does matter, because it's not a "late receiver effect" unless someone, anyone, has received spam from that host before. And there's no relationship between "previous email from that host at all" and "being listed in the PBL".
>
> Show me that the "that they have received spam from" part is how they built their list, and not just "that appear to be end-user IP space".

"Additional IP address ranges are added and maintained by the Spamhaus PBL Team, particularly for networks which are not participating themselves (either because the ISP/block owner does not know about, is proving difficult to contact, or because of language difficulties), and where spam received from those ranges, rDNS and server patterns are consistent with end-user IP space which typically contain high concentrations of "botnet zombies", a major source of spam."

>> And? That's still a late-receiver effect, this particular message scored X points because of what Y host did Z minutes ago, where Z could be days/weeks/months of minutes.
>
> Any time I have seen the phrase "late receiver effect" used before, it is about the message itself, and not the relay. Thus, Razor, which is entirely about the message and not even remotely about the relay, is entirely "late receiver effect" based.

It applies to bayes as well. They both operate on matching hashes and coming to some sort of confidence factor for the new message based on historical data. The bayes system just happens to work at a finer grained level than Razor/Pyzor/DCC and isn't a distributed system.

The only "late receiver effect" you can experience with bayes, in the absence of content in the body of the message for it to work with, is from auto-training from a local spamtrap+greylisting. If there's additional content besides headers+attachment, bayes might work if you're using per-user. If you've got site-wide and some of your users subscribe to certain common mailing lists, well, you could just as easily end up with BAYES_00.

>> Yes I failed to exclude BOTNET from that, it's the only score from the original message that started this that is solid. The reason is because BOTNET is proactive, all the others are either 100% reactionary or nearly so (PBL).
>
> My first one was caught by Botnet, Bayes_80 (again, no previous pdf spam, and no previous activity from that relay), and UNIQUE_WORDS. Even if Botnet alone hadn't been enough, and only had a score of 3 ... _either_ of the other two would have been enough to push it up to 5.

So it hit UNIQUE_WORDS, which means it had more than just the attachment, so yeah, BAYES had something more to work with than just the headers. Consider yourself fortunate.
|
|
Re: Spam PDF

arni wrote:
> I will use one of the best quotes ever created on the internet:
>
> "You make your mouth full of technical bullshit when only facts talk"
>
> By some random guy
>
> ;-) arni

So you're saying you want my stock spam with mailing list filler? It's real and has been for a year or more, and it makes site-wide bayes useless against it.
|
|
Re: Spam PDF

bgodette@... wrote:
> John Rudd wrote:
>> bgodette@... wrote:
>>> John Rudd wrote:
>>> You *will* not be getting a BAYES_90 or BAYES_99 from that.
>> My first one got BAYES_80, without having seen that zombie/relay before. That's enough for 2 points.
>
> Which only tells me it had more than just the PDF attachment, which is not what we're seeing here. You're also avoiding the point I was making by saying "hey this spam I got which has all this additional content for bayes to work with happened to score high". Well of course it did.

There was nothing else in it. It was exactly like the other pdf spams that have been talked about, and exactly like the ones I've received since. It has _no_ body data aside from the attachment.

>> It does matter, because it's not a "late receiver effect" unless someone, anyone, has received spam from that host before. And there's no relationship between "previous email from that host at all" and "being listed in the PBL".
>>
>> Show me that the "that they have received spam from" part is how they built their list, and not just "that appear to be end-user IP space".
>
> "Additional IP address ranges are added and maintained by the Spamhaus PBL Team, particularly for networks which are not participating themselves (either because the ISP/block owner does not know about, is proving difficult to contact, or because of language difficulties), and where spam received from those ranges, rDNS and server patterns are consistent with end-user IP space which typically contain high concentrations of "botnet zombies", a major source of spam."

I'll concede that I didn't know that about the source of listing in the PBL.

>>> Yes I failed to exclude BOTNET from that, it's the only score from the original message that started this that is solid. The reason is because BOTNET is proactive, all the others are either 100% reactionary or nearly so (PBL).
>> My first one was caught by Botnet, Bayes_80 (again, no previous pdf spam, and no previous activity from that relay), and UNIQUE_WORDS. Even if Botnet alone hadn't been enough, and only had a score of 3 ... _either_ of the other two would have been enough to push it up to 5.
>
> So it hit UNIQUE_WORDS, which means it had more than just the attachment, so yeah BAYES had something more to work with than just the headers, consider yourself fortunate.

It had nothing in the body. Without seeing that relay before, both BAYES_80 and UNIQUE_WORDS caught it.

Excluding the attachment encoding itself, here's what it had:

Received: from [83.76.165.174] (HELO lmnht)
    by mail.rudd.cc (CommuniGate Pro SMTP 5.1.4 _community_)
    with SMTP id 1081873 for john@...; Wed, 27 Jun 2007 05:11:47 -0700
Received-SPF: none
    receiver=mail.rudd.cc; client-ip=83.76.165.174;
    envelope-from=wzou@...
Received: from [33.31.118.54] (helo=iyaty)
    by lmnht with smtp (Exim 4.66 (FreeBSD))
    id 1I4j0S-0003Q8-5s; Wed, 27 Jun 2007 14:12:06 +0200
Message-ID: <468253E7.2060801@...>
Date: Wed, 27 Jun 2007 14:11:19 +0200
From: Annabel Cleveland <wzou@...>
User-Agent: Thunderbird 1.5.0.12 (Windows/20070509)
MIME-Version: 1.0
To: john@...
Subject: Re: Cheque.22.pdf
Content-Type: multipart/mixed;
    boundary="------------040808030703010202050005"

--------------040808030703010202050005
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit



--------------040808030703010202050005
Content-Type: application/pdf;
    name="Cheque.22.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: inline;
    filename="Cheque.22.pdf"

[attachment data omitted]
--------------040808030703010202050005--
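For anyone who wants to confirm the "nothing in the body" claim on their own samples, a message of this shape can be walked with Python's standard `email` package. The sketch below rebuilds only the MIME skeleton shown above. The elided addresses and the Received headers are left out, and the base64 payload is a tiny placeholder standing in for the omitted attachment data, so this is an approximation of the message and not a copy of it.

```python
from email import message_from_string

# MIME boundary as given in the Content-Type header above.
BOUNDARY = "-" * 12 + "040808030703010202050005"

# Placeholder payload (assumption): base64 of "%PDF-1.4\n", standing in
# for the real attachment data that was omitted from the post.
PLACEHOLDER_PDF_B64 = "JVBERi0xLjQK"

RAW = (
    "Subject: Re: Cheque.22.pdf\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/mixed;\n"
    f' boundary="{BOUNDARY}"\n'
    "\n"
    f"--{BOUNDARY}\n"
    "Content-Type: text/plain; charset=windows-1252; format=flowed\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "\n"
    "\n"
    f"--{BOUNDARY}\n"
    "Content-Type: application/pdf;\n"
    ' name="Cheque.22.pdf"\n'
    "Content-Transfer-Encoding: base64\n"
    "Content-Disposition: inline;\n"
    ' filename="Cheque.22.pdf"\n'
    "\n"
    f"{PLACEHOLDER_PDF_B64}\n"
    f"--{BOUNDARY}--\n"
)

msg = message_from_string(RAW)

# Collect (content type, decoded bytes) for every leaf part.
parts = [
    (part.get_content_type(), part.get_payload(decode=True))
    for part in msg.walk()
    if not part.is_multipart()
]

# Everything a body tokenizer could see as text.
body_text = b"".join(
    payload for ctype, payload in parts if ctype == "text/plain"
).strip()

for ctype, payload in parts:
    print(ctype, len(payload), "bytes")
print("visible body text is empty:", body_text == b"")
```

The text/plain part contains only blank lines, so any body-based score has to come from the headers, the MIME structure, or the attachment itself.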
|
|
Re: Spam PDF

> It had nothing in the body. Without seeing that relay before, both BAYES_80 and UNIQUE_WORDS caught it.
>
> Excluding the attachment encoding itself, here's what it had:
>
> Received: from [83.76.165.174] (HELO lmnht)
>     by mail.rudd.cc (CommuniGate Pro SMTP 5.1.4 _community_)
>     with SMTP id 1081873 for john@...; Wed, 27 Jun 2007 05:11:47 -0700
> Received-SPF: none
>     receiver=mail.rudd.cc; client-ip=83.76.165.174;
>     envelope-from=wzou@...
> Received: from [33.31.118.54] (helo=iyaty)
>     by lmnht with smtp (Exim 4.66 (FreeBSD))
>     id 1I4j0S-0003Q8-5s; Wed, 27 Jun 2007 14:12:06 +0200
> Message-ID: <468253E7.2060801@...>
> Date: Wed, 27 Jun 2007 14:11:19 +0200
> From: Annabel Cleveland <wzou@...>
> User-Agent: Thunderbird 1.5.0.12 (Windows/20070509)
> MIME-Version: 1.0
> To: john@...
> Subject: Re: Cheque.22.pdf
> Content-Type: multipart/mixed;
>     boundary="------------040808030703010202050005"
>
> --------------040808030703010202050005
> Content-Type: text/plain; charset=windows-1252; format=flowed
> Content-Transfer-Encoding: 7bit
>
>
>
> --------------040808030703010202050005
> Content-Type: application/pdf;
>     name="Cheque.22.pdf"
> Content-Transfer-Encoding: base64
> Content-Disposition: inline;
>     filename="Cheque.22.pdf"
>
> [attachment data omitted]
> --------------040808030703010202050005--

I'm going to guess that UNIQUE_WORDS hit on the MIME definitions (the ratio is pretty bad actually, as there's only 4? non-unique).

As for BAYES, the score it would assign would depend entirely on how much ham with attachments (for all the common MIME stuff) and ham with PDF attachments (for the content-type) has been trained vs spam of the same. In our case it could be a lot; yours I don't know about, obviously. If I were to set up per-user BAYES for my accounts, I'm sure BAYES would work perfectly at catching this stuff, as I don't receive any ham with PDF attachments except once or twice a year, and it'd be working mostly off the content-type.
|
|
Re: Spam PDF

Yes, actually I do, just for the fun of it.

It would be nice if you could send 3 to 5 as attachments, including all headers, to the list or only to my address. Of course you'll want to choose spam with a low score to prove me wrong ;-)

arni
|
|
Re: Spam PDF

arni wrote:
> [snip snap]
> I looked for the lowest scoring email of the past 2 days (I don't save them longer), this is the one:
>
> X-Spam-Status: Yes, score=10.7 required=5.0 tests=BAYES_99,DCC_CHECK,
>     DKIM_POLICY_SIGNSOME,HTML_MESSAGE,LOGINHASH1,LOGINHASH2,MIME_HTML_MOSTLY
>     autolearn=no version=3.2.0
> X-Spam-Report:
>     * 5.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
>     * [score: 1.0000]
>     * 0.0 DKIM_POLICY_SIGNSOME Domain Keys Identified Mail: policy says domain
>     * signs some mails
>     * 0.0 MIME_HTML_MOSTLY BODY: Multipart message mostly text/html MIME
>     * 0.0 HTML_MESSAGE BODY: HTML included in message
>     * 1.5 LOGINHASH2 BODY: mail has been classified as spam @ unknown company,
>     * Germany
>     * 1.5 LOGINHASH1 BODY: mail has been classified as spam @ LogIn&Solutions
>     * AG, Germany
>     * 2.2 DCC_CHECK Listed in DCC (http://rhyolite.com/anti-spam/dcc/)
>
> Note that a well trained BAYES can already take these mails out on its own on my system.
>
> If you find your bayes to score really accurately, then it's a good idea to increase the scores. For me, bayes is fed from 2 spamtrap addresses with around 50 pieces of the finest spam every day. Doing this, bayes scores BAYES_99 on 99.5% of my remaining spam. I hardly ever see it score below BAYES_80, and that's just great.

I'm kind of new to spam ... and especially to how people use bayes.

So how many ham mails do you get per day? I'm wondering if I could do something to my system so bayes may score higher ....

I have read somewhere that the number of spam mails in bayes should be a lot higher than the number of ham mails ... is that true?

I ask because I'm doing spam scans for multiple domains ..

> So maybe training bayes better or increasing the score will put an end to this for you.
>
> arni

Any additional reading on bayes is welcome ...

// Mikael Syska
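The quoted report is easy to sanity-check: the per-rule scores should add up to the `score=10.7` in X-Spam-Status, and BAYES_99 alone (5.5) should already exceed `required=5.0`. A minimal Python sketch of that check follows. The regular expression is an ad hoc assumption about this report layout and not an official parser.

```python
import re

# Report lines as quoted above, with the mail quoting stripped.
REPORT = """\
* 5.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
* [score: 1.0000]
* 0.0 DKIM_POLICY_SIGNSOME Domain Keys Identified Mail: policy says domain
* signs some mails
* 0.0 MIME_HTML_MOSTLY BODY: Multipart message mostly text/html MIME
* 0.0 HTML_MESSAGE BODY: HTML included in message
* 1.5 LOGINHASH2 BODY: mail has been classified as spam @ unknown company,
* Germany
* 1.5 LOGINHASH1 BODY: mail has been classified as spam @ LogIn&Solutions
* AG, Germany
* 2.2 DCC_CHECK Listed in DCC (http://rhyolite.com/anti-spam/dcc/)
"""

REQUIRED = 5.0  # from "required=5.0" in X-Spam-Status

# A rule line is "* <score> <RULE_NAME> ..."; continuation lines do not match.
RULE_LINE = re.compile(r"^\*\s+(-?\d+\.\d+)\s+([A-Z0-9_]+)\b")

scores = {}
for line in REPORT.splitlines():
    match = RULE_LINE.match(line)
    if match:
        scores[match.group(2)] = float(match.group(1))

total = sum(scores.values())
bayes_alone_is_enough = scores["BAYES_99"] >= REQUIRED

print(f"rules parsed: {len(scores)}, total score: {total:.1f}")
print("BAYES_99 alone crosses the threshold:", bayes_alone_is_enough)
```

Note that BAYES_99 is worth 5.5 here only because this site raised the score, as the quoted post says.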
|
|
Re: Spam PDF

Mikael Syska wrote:
> Kind a new to spam ... and especially how people use bayes.
>
> So how many ham mails do you get per day ? wandering if I could do something to my system so bayes may score higher ....
>
> I have read some where that spam mails in bayes should be alot higher than ham mails ... is that true ?
>
> Cause I'm doing spam scans for multiple domains ..

Some stats on my bayes db:

0.000          0       4556          0  non-token data: nspam
0.000          0       1356          0  non-token data: nham
0.000          0     280877          0  non-token data: ntokens

I get about 20 ham and 150 spams per day (on my personal box). Bayes is only trained by spamtraps and autolearn.

arni
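To answer the spam-versus-ham ratio question for a given database, the three counters above are all that is needed. The lines look like the "non-token data" rows that `sa-learn --dump magic` prints, though the post does not say how they were produced, so treat that as an assumption. Here is a small Python sketch that parses rows of this shape and reports the trained spam-to-ham ratio. `parse_magic` is a hypothetical helper and not part of SpamAssassin.

```python
# Rows as posted above (whitespace collapsed).
DUMP = """\
0.000 0 4556 0 non-token data: nspam
0.000 0 1356 0 non-token data: nham
0.000 0 280877 0 non-token data: ntokens
"""


def parse_magic(dump):
    """Return {counter_name: value} for 'non-token data' rows.

    Assumes the value is the third whitespace-separated field and the
    counter name is the last one, as in the rows shown in this thread.
    """
    stats = {}
    for line in dump.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            stats[fields[6]] = int(fields[2])
    return stats


stats = parse_magic(DUMP)
ratio = stats["nspam"] / stats["nham"]

print(stats)
print(f"trained spam per trained ham: {ratio:.2f}")
```

For this database the ratio is roughly 3.4 trained spam per trained ham, compared with the roughly 7.5 to 1 daily mail mix mentioned in the post (150 spam to 20 ham).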