Spam PDF


Re: Spam PDF

by bgodette

John Rudd wrote:

> bgodette@... wrote:
>>> Actually, it didn't.  The assertion is that if someone else hadn't seen
>>> this exact message first, then SA wouldn't have caught it.
>> No, the assertion is that if someone else hadn't seen prior abuse from
>> the sending host first (not this exact message), then SA wouldn't have
>> caught that particular message. That assertion happens to be true for
>> the blacklists, and true for BAYES as well since it would have had to
>> have seen headers (since the payload is vastly different) that look like
>> this sending host in the recent past and been told that it was SPAM.
>
> Your assertion about bayes is not well supported.  It might have been
> flagged by bayes for reasons that have _NOTHING_ to do with the received
> headers.

Follow along on this thought experiment. We have a spam run in progress
from fresh zombies that have never sent to your system before. The
payload of the spam run is new and does not match any historical spam
because it's just a base64 encoded attachment (nothing for bayes to
tokenize in the body of the message). The only bayes fodder for such a
message is received headers, from/to/subject, and mime boundaries if the
bot does not add any additional headers beyond the minimum (in other words, no
user-agent, x-headers, etc). You *will* not be getting a BAYES_90 or
BAYES_99 from that.

>>> The PBL (which isn't spamtrap fed, it's collected from ISP published
>>> and/or contributed data) would have caught this based upon issues that
>>> have nothing at all to do with this message, and most likely nothing at
>>> all to do with this current round of spam.  It would be based upon the
>>> host provider's policy that this host shouldn't send email to the internet.
>> Which means, some time, in the past, for whatever reasons that
>> particular IP address did something against someone's policy to end up
>> on that list. The important part being "in the past".
>
> No, it means that the ISP, or possibly net block user, told Spamhaus
> "it's an end user IP address, and not a mail server".  There might be
> _NO_ previous abuse from that IP address, and they'll still be listed.
> The "policy" here is NOT the recipient's policy, it is the sending network
> owner's policy.

I think you're missing the point when I say "in the past" in relation to
scoring vs blacklists. It doesn't matter why an IP is listed, at some
point in the past that IP had to have been added before you can match
against it.

In the case of PBL it contains more than just organization submitted
netblocks, it also includes ranges added by hand by Spamhaus for ranges
that appear to be end-user IP space that they have received spam from,
which IMO is probably the majority of that list at this point. Which is why
I assert that hitting PBL is based on prior abuse (spam sent to Spamhaus
PBL maintainers) more often than not. Fortunately IP addresses already
in PBL tend to actually stay there since there's no automatic
expiration, but it is still a reactionary list.
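
To make the "it had to be added before you can match against it" point concrete: a DNSBL check is just a DNS lookup of a name built from the connecting IP. The sketch below only builds that name and does no network I/O. The function name is made up for illustration, and the address is a documentation-range placeholder, not a real sender.

```python
import ipaddress

def dnsbl_query_name(ip: str, zone: str = "pbl.spamhaus.org") -> str:
    """Build the DNS name a DNSBL client would look up for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

# 192.0.2.1 is a placeholder address reserved for documentation.
print(dnsbl_query_name("192.0.2.1"))
```

If the list operator has not added the address by the time you ask, the lookup simply returns nothing, whatever the host is about to do.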

> It could have been recent abuse from an entirely different message
> batch.  In other words, maybe that IP sent a standard stock scam
> yesterday, and today it sent the pdf spam ... and this person was the
> first one to receive that pdf spam message.  No previous recipient of
> the same message.  But they'll still be listed at spamcop.

And? That's still a late-receiver effect, this particular message scored
X points because of what Y host did Z minutes ago, where Z could be
days/weeks/months of minutes.

>> Explain how BAYES will have any matching tokens to work on if its from a
>> fresh, never before seen by your system, zombie and there's no message
>> body other than the attachment? All you have to work with is headers
>> which you've never seen before and MIME boundaries which you've never
>> seen before.
>
> There are more headers than just the received headers.  And, I honestly
> don't know whether or not an attachment's raw data is analyzed by bayes
> or not.  My assumption is that it is.

Yes, there's from/to/subject and the MIME boundary. To: is ambiguous
since that's going to match both ham and spam. From:, Subject:, and the
MIME boundary can be ambiguous or, worse, they'll match common ham in the
case of site-wide BAYES. No other headers need be added. So you're left
with received headers to make a spam/ham choice from, and your BAYES
database hasn't ever seen this host before. Good luck with that.

And since I mentioned site-wide bayes: as I mentioned more than a year
ago, the practice of using ham (e.g. common mailing lists) from the prior
week as filler is continuing to become more and more common. The only
solution to that for bayes is per-user bayes, which only works if your
users have a method of training and actually use it correctly (unlike
AOL users that use the junk button to "unsubscribe"). That pretty much
excludes that option from being used by large installations.

>>> Just resting upon BAYES, BOTNET, and PBL, you're not "lucky to have
>>> caught the message because you're a late receiver".  You've caught the
>>> message due to a combination of policy, misuse, and historical
>>> characteristics of spam in general being used to train your system.
>> All of which needs prior examples/reporting of messages similar to the
>> one you're trying to detect, that's what "historical characteristics of
>> spam" means.
>
> BOTNET does _NOT_ need prior reporting.  And the prior reporting the PBL
> require has nothing to do with abuse.  Further, BAYES does not depend
> upon the received headers.  But even if you're right about bayes, your
> claim that "all of which needs prior..." is at least 2/3 wrong, if not
> 3/3 wrong.

Yes I failed to exclude BOTNET from that, it's the only score from the
original message that started this that is solid. The reason is because
BOTNET is proactive, all the others are either 100% reactionary or
nearly so (PBL).

Really the only point I'm trying to make here is that saying "well X spam is
caught here..." is rather pointless if all it's hitting on your system is
blacklists, message hash systems (DCC/Razor/Pyzor/etc), and BAYES, and it
wouldn't have been marked if you excluded them. If you have a spam like
that, it is something that should really be looked at for some method of
detection that doesn't rely on those reactionary methods. Those methods
are really only good at detecting "old" spam and only have a limited
probability of matching "new" spam as the mutation rate increases, which
it is, in case you hadn't noticed.

Fortunately for this particular spam there is an existing proactive
plugin, FuzzyOCR, that can be massaged to deal with PDF stock image spam.

Re: Spam PDF

by bgodette

arni wrote:

>  bgodette@... wrote:
>>
>> Sounds more like "if we didn't rely on other people to have seen this
>> particular abusive host before us and our learning system to have seen
>> past examples of spam that looks a whole lot like this one from headers
>> alone to detect this particular spam, we'd fail to catch it until we've
>> trained our system and the abusive host has been reported to various lists".
>>
>> That's what makes policy (e.g. MTA checks, BOTNET) and behavior based
>> detection work as well as it does, it's proactive instead of reactive.
>>
>>  
> I have no spam that doesnt score at least BAYES_80 - BAYES_80 is 3.5
> points here, BOTNET is 3 points here, makes 6.5 total and a bust.
>
> Doesnt have anything to do with beeing a late reciever as i recieve this
> spam on a whole lot of addresses and not just one - please dont tell me
> you think i'm a late reciever on all.
>
> arni

No, all BAYES is saying is that you've received and trained spam in the past that
has bits and pieces that look like this new spam. If a spammer reduces
the amount of tokens that can match negatively and does nothing else
they'll end up with a meaningless bayes score (right around BAYES_50).
Add a bit of "likely to be trained as ham" bits from a common mailing
list from the day before, and use that in combination with an
image/attachment/short spam and you've got a nice low bayes score. Works
great against large site-wide bayes databases, not so much against
per-user unless the user happens to be subscribed to whatever ham source
the spammer is using. <joke>Maybe we should train all our mailing lists
as spam!</joke>
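
A rough illustration of the dilution effect described above, using plain naive-Bayes combining of per-token probabilities. This is a sketch only: SpamAssassin's real combiner differs in detail, and every probability below is invented for illustration.

```python
from math import prod

def combine(token_probs):
    """Plain naive-Bayes combination of per-token spam probabilities."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)

# Hypothetical per-token spam probabilities, invented for illustration.
unseen_headers = [0.5, 0.5, 0.5]  # tokens the database has never seen
spammy_tokens = [0.9, 0.8]        # tokens previously trained as spam
ham_filler = [0.1, 0.1, 0.1]      # mailing-list text trained as ham

print(combine(unseen_headers))                               # nothing to go on
print(combine(unseen_headers + spammy_tokens))               # some spam evidence
print(combine(unseen_headers + spammy_tokens + ham_filler))  # diluted by filler
```

Never-seen tokens leave the result at the midpoint, and a few hammy tokens are enough to drag an otherwise spammy message down.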

Re: Spam PDF

by John Rudd

bgodette@... wrote:
> John Rudd wrote:

> You *will* not be getting a BAYES_90 or
> BAYES_99 from that.

My first one got BAYES_80, without having seen that zombie/relay before.
  That's enough for 2 points.


> I think you're missing the point when I say "in the past" in relation to
> scoring vs blacklists. It doesn't matter why an IP is listed, at some
> point in the past that IP had to have been added before you can match
> against it.

It does matter, because it's not a "late receiver effect" unless
someone, anyone, has received spam from that host before.  And there's
no relationship between "previous email from that host at all" and
"being listed in the PBL".


> In the case of PBL it contains more than just organization submitted
> netblocks, it also includes ranges added by hand by Spamhaus for ranges
> that appear to be end-user IP space that they have received spam from,
> which IMO is probably the majority of that list at this point.

Show me that the "that they have received spam from" part is how they
built their list, and not just "that appear to be end-user IP space".


> And? That's still a late-receiver effect, this particular message scored
> X points because of what Y host did Z minutes ago, where Z could be
> days/weeks/months of minutes.

Any time I have seen the phrase "late receiver effect" used before, it
is about the message itself, and not the relay.  Thus, Razor, which is
entirely about the message and not even remotely about the relay, is
entirely "late receiver effect" based.


> Yes I failed to exclude BOTNET from that, it's the only score from the
> original message that started this that is solid. The reason is because
> BOTNET is proactive, all the others are either 100% reactionary or
> nearly so (PBL).

My first one was caught by Botnet, Bayes_80 (again, no previous pdf
spam, and no previous activity from that relay), and UNIQUE_WORDS.  Even
if Botnet alone hadn't been enough, and only had a score of 3 ...
_either_ of the other two would have been enough to push it up to 5.
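
The arithmetic behind that, as a small sketch. The score tables are the local values quoted in this thread (BOTNET at the hypothetical 3 points, BAYES_80 at 2 points in one setup and 3.5 in another); other sites will differ, and the helper name is made up for illustration.

```python
REQUIRED = 5.0  # spam threshold, as in "required=5.0"

# Local scores as quoted by two posters in this thread.
JOHN_SCORES = {"BOTNET": 3.0, "BAYES_80": 2.0}
ARNI_SCORES = {"BOTNET": 3.0, "BAYES_80": 3.5}

def verdict(hits, scores, required=REQUIRED):
    """Sum the scores of the rules that hit and compare with the threshold."""
    total = sum(scores[name] for name in hits)
    return total, total >= required

print(verdict(["BOTNET"], JOHN_SCORES))              # BOTNET alone at 3 points
print(verdict(["BOTNET", "BAYES_80"], JOHN_SCORES))  # pushed up to the threshold
print(verdict(["BOTNET", "BAYES_80"], ARNI_SCORES))  # arni's 6.5
```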


Re: Spam PDF

by arni-2

bgodette@... wrote:
> arni wrote:
>> bgodette@... wrote:
>>> Sounds more like "if we didn't rely on other people to have seen this
>>> particular abusive host before us and our learning system to have seen
>>> past examples of spam that looks a whole lot like this one from headers
>>> alone to detect this particular spam, we'd fail to catch it until we've
>>> trained our system and the abusive host has been reported to various lists".
>>>
>>> That's what makes policy (e.g. MTA checks, BOTNET) and behavior based
>>> detection work as well as it does, it's proactive instead of reactive.
>>
>> I have no spam that doesnt score at least BAYES_80 - BAYES_80 is 3.5
>> points here, BOTNET is 3 points here, makes 6.5 total and a bust.
>>
>> Doesnt have anything to do with beeing a late reciever as i recieve this
>> spam on a whole lot of addresses and not just one - please dont tell me
>> you think i'm a late reciever on all.
>>
>> arni
>
> No, all BAYES is saying is that you've received and trained spam in the past that
> has bits and pieces that look like this new spam. If a spammer reduces
> the amount of tokens that can match negatively and does nothing else
> they'll end up with a meaningless bayes score (right around BAYES_50).
> Add a bit of "likely to be trained as ham" bits from a common mailing
> list from the day before, and use that in combination with an
> image/attachment/short spam and you've got a nice low bayes score. Works
> great against large site-wide bayes databases, not so much against
> per-user unless the user happens to be subscribed to whatever ham source
> the spammer is using. <joke>Maybe we should train all our mailing lists
> as spam!</joke>

i will use one of the best quotes here that were ever created on the internet:

"You make your mouth full of technical bullshit when only facts talk"
                                                                        By some random guy

;-) arni

Re: Spam PDF

by bgodette

John Rudd wrote:
> bgodette@... wrote:
>> John Rudd wrote:
>
>> You *will* not be getting a BAYES_90 or
>> BAYES_99 from that.
>
> My first one got BAYES_80, without having seen that zombie/relay before.
>   That's enough for 2 points.

Which only tells me it had more than just the PDF attachment, which is
not what we're seeing here. You're also avoiding the point I was making
by saying "hey this spam I got which has all this additional content for
bayes to work with happened to score high". Well of course it did.

> It does matter, because it's not a "late receiver effect" unless
> someone, anyone, has received spam from that host before.  And there's
> no relationship between "previous email from that host at all" and
> "being listed in the PBL".
>
> Show me that the "that they have received spam from" part is how they
> built their list, and not just "that appear to be end-user IP space".
>

"Additional IP address ranges are added and maintained by the Spamhaus
PBL Team, particularly for networks which are not participating
themselves (either because the ISP/block owner does not know about, is
proving difficult to contact, or because of language difficulties), and
where spam received from those ranges, rDNS and server patterns are
consistent with end-user IP space which typically contain high
concentrations of "botnet zombies", a major source of spam."

>
>> And? That's still a late-receiver effect, this particular message scored
>> X points because of what Y host did Z minutes ago, where Z could be
>> days/weeks/months of minutes.
>
> Any time I have seen the phrase "late receiver effect" used before, it
> is about the message itself, and not the relay.  Thus, Razor, which is
> entirely about the message and not even remotely about the relay, is
> entirely "late receiver effect" based.

It applies to bayes as well. They both operate on matching hashes and
coming to some sort of confidence factor for the new message based on
historical data. The bayes system just happens to work at a finer
grained level than Razor/Pyzor/DCC and isn't a distributed system. The
only "late receiver effect" you can experience with bayes in the absence
of content in the body of the message for it to work with, is from
auto-training from a local spamtrap+greylisting. If there's additional
content besides headers+attachment bayes might work if you're using
per-user. If you've got site-wide and some of your users subscribe to
certain common mailing lists, well you could just as easily end up with
BAYES_00.

>> Yes I failed to exclude BOTNET from that, it's the only score from the
>> original message that started this that is solid. The reason is because
>> BOTNET is proactive, all the others are either 100% reactionary or
>> nearly so (PBL).
>
> My first one was caught by Botnet, Bayes_80 (again, no previous pdf
> spam, and no previous activity from that relay), and UNIQUE_WORDS.  Even
> if Botnet alone hadn't been enough, and only had a score of 3 ...
> _either_ of the other two would have been enough to push it up to 5.

So it hit UNIQUE_WORDS, which means it had more than just the
attachment, so yeah BAYES had something more to work with than just the
headers, consider yourself fortunate.

Re: Spam PDF

by bgodette

arni wrote:
> i will use one of the best quotes here that were ever created on the
> internet:
>
> "You make your mouth full of technical bullshit when only facts talk"
>                                                                        
> By some random guy
>
> ;-) arni

So you're saying you want my stock spam with mailing list filler? It's
real and has been for a year or more and makes site-wide bayes useless
against it.

Re: Spam PDF

by John Rudd

bgodette@... wrote:

> John Rudd wrote:
>> bgodette@... wrote:
>>> John Rudd wrote:
>>> You *will* not be getting a BAYES_90 or
>>> BAYES_99 from that.
>> My first one got BAYES_80, without having seen that zombie/relay before.
>>   That's enough for 2 points.
>
> Which only tells me it had more than just the PDF attachment, which is
> not what we're seeing here. You're also avoiding the point I was making
> by saying "hey this spam I got which has all this additional content for
> bayes to work with happened to score high". Well of course it did.

There was nothing else in it.  It was exactly like the other pdf spams
that have been talked about, and exactly like the ones I've received
since.  It has _no_ body data aside from the attachment.


>> It does matter, because it's not a "late receiver effect" unless
>> someone, anyone, has received spam from that host before.  And there's
>> no relationship between "previous email from that host at all" and
>> "being listed in the PBL".
>>
>> Show me that the "that they have received spam from" part is how they
>> built their list, and not just "that appear to be end-user IP space".
>>
>
> "Additional IP address ranges are added and maintained by the Spamhaus
> PBL Team, particularly for networks which are not participating
> themselves (either because the ISP/block owner does not know about, is
> proving difficult to contact, or because of language difficulties), and
> where spam received from those ranges, rDNS and server patterns are
> consistent with end-user IP space which typically contain high
> concentrations of "botnet zombies", a major source of spam."

I'll concede that I didn't know that about the source of listing in the PBL.


>>> Yes I failed to exclude BOTNET from that, it's the only score from the
>>> original message that started this that is solid. The reason is because
>>> BOTNET is proactive, all the others are either 100% reactionary or
>>> nearly so (PBL).
>> My first one was caught by Botnet, Bayes_80 (again, no previous pdf
>> spam, and no previous activity from that relay), and UNIQUE_WORDS.  Even
>> if Botnet alone hadn't been enough, and only had a score of 3 ...
>> _either_ of the other two would have been enough to push it up to 5.
>
> So it hit UNIQUE_WORDS, which means it had more than just the
> attachment, so yeah BAYES had something more to work with than just the
> headers, consider yourself fortunate.


It had nothing in the body.  Without seeing that relay before, both
BAYES_80 and UNIQUE_WORDS caught it.

Excluding the attachment encoding itself, here's what it had:

  Received: from [83.76.165.174] (HELO lmnht)
    by mail.rudd.cc (CommuniGate Pro SMTP 5.1.4 _community_)
    with SMTP id 1081873 for john@...; Wed, 27 Jun 2007 05:11:47 -0700
  Received-SPF: none
   receiver=mail.rudd.cc; client-ip=83.76.165.174;
   envelope-from=wzou@...
  Received: from [33.31.118.54] (helo=iyaty)
    by lmnht with smtp (Exim 4.66 (FreeBSD))
    id 1I4j0S-0003Q8-5s; Wed, 27 Jun 2007 14:12:06 +0200
  Message-ID: <468253E7.2060801@...>
  Date: Wed, 27 Jun 2007 14:11:19 +0200
  From: Annabel Cleveland <wzou@...>
  User-Agent: Thunderbird 1.5.0.12 (Windows/20070509)
  MIME-Version: 1.0
  To: john@...
  Subject: Re: Cheque.22.pdf
  Content-Type: multipart/mixed;
   boundary="------------040808030703010202050005"

  --------------040808030703010202050005
  Content-Type: text/plain; charset=windows-1252; format=flowed
  Content-Transfer-Encoding: 7bit



  --------------040808030703010202050005
  Content-Type: application/pdf;
   name="Cheque.22.pdf"
  Content-Transfer-Encoding: base64
  Content-Disposition: inline;
   filename="Cheque.22.pdf"

  [attachment data omitted]
  --------------040808030703010202050005--
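
For anyone who wants to check what a body tokenizer actually gets from a message shaped like this, here is a minimal sketch using Python's standard `email` package. The message is a trimmed copy of the one above (Received lines left out), and the single base64 line is a placeholder standing in for the omitted attachment data, not the real payload.

```python
from email import message_from_string

BOUNDARY = "------------040808030703010202050005"

# Trimmed copy of the message above; the base64 line is a placeholder
# standing in for the omitted attachment data.
RAW = "\n".join([
    "From: Annabel Cleveland <wzou@...>",
    "User-Agent: Thunderbird 1.5.0.12 (Windows/20070509)",
    "MIME-Version: 1.0",
    "To: john@...",
    "Subject: Re: Cheque.22.pdf",
    "Content-Type: multipart/mixed;",
    f' boundary="{BOUNDARY}"',
    "",
    f"--{BOUNDARY}",
    "Content-Type: text/plain; charset=windows-1252; format=flowed",
    "Content-Transfer-Encoding: 7bit",
    "",
    "",
    "",
    f"--{BOUNDARY}",
    "Content-Type: application/pdf;",
    ' name="Cheque.22.pdf"',
    "Content-Transfer-Encoding: base64",
    "Content-Disposition: inline;",
    ' filename="Cheque.22.pdf"',
    "",
    "JVBERi0xLjQK",
    f"--{BOUNDARY}--",
    "",
])

msg = message_from_string(RAW)
parts = list(msg.walk())[1:]  # skip the multipart container itself

# Concatenate every text/plain part and strip whitespace.
body_text = "".join(
    p.get_payload() for p in parts if p.get_content_type() == "text/plain"
).strip()
attachments = [p.get_filename() for p in parts if p.get_filename()]

print(repr(body_text))     # text available to a body tokenizer
print(attachments)         # attachment filenames
print(sorted(msg.keys()))  # header names left to work with
```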

Re: Spam PDF

by bgodette

> It had nothing in the body.  Without seeing that relay before, both
> BAYES_80 and UNIQUE_WORDS caught it.
>
> Excluding the attachment encoding itself, here's what it had:
>
>   Received: from [83.76.165.174] (HELO lmnht)
>     by mail.rudd.cc (CommuniGate Pro SMTP 5.1.4 _community_)
>     with SMTP id 1081873 for john@...; Wed, 27 Jun 2007 05:11:47 -0700
>   Received-SPF: none
>    receiver=mail.rudd.cc; client-ip=83.76.165.174;
>    envelope-from=wzou@...
>   Received: from [33.31.118.54] (helo=iyaty)
>     by lmnht with smtp (Exim 4.66 (FreeBSD))
>     id 1I4j0S-0003Q8-5s; Wed, 27 Jun 2007 14:12:06 +0200
>   Message-ID: <468253E7.2060801@...>
>   Date: Wed, 27 Jun 2007 14:11:19 +0200
>   From: Annabel Cleveland <wzou@...>
>   User-Agent: Thunderbird 1.5.0.12 (Windows/20070509)
>   MIME-Version: 1.0
>   To: john@...
>   Subject: Re: Cheque.22.pdf
>   Content-Type: multipart/mixed;
>    boundary="------------040808030703010202050005"
>
>   --------------040808030703010202050005
>   Content-Type: text/plain; charset=windows-1252; format=flowed
>   Content-Transfer-Encoding: 7bit
>
>
>
>   --------------040808030703010202050005
>   Content-Type: application/pdf;
>    name="Cheque.22.pdf"
>   Content-Transfer-Encoding: base64
>   Content-Disposition: inline;
>    filename="Cheque.22.pdf"
>
>   [attachment data omitted]
>   --------------040808030703010202050005--
>

I'm going to guess that UNIQUE_WORDS hit on the MIME definitions (the
ratio is pretty bad actually as there's only 4? non-unique). As for
BAYES, the score it would assign would depend entirely on how much ham
with attachments (for all the common MIME stuff) and ham with PDF
attachments (for the content-type) is trained vs spam of the same.

In our case it could be a lot, yours I don't know about obviously. If I
were to set up per-user BAYES for my accounts I'm sure BAYES would work
perfectly at catching this stuff as I don't receive any ham with PDF
attachments except once or twice a year, and it'd be working mostly off
the content-type.

Re: Spam PDF

by arni-2

bgodette@... wrote:
> arni wrote:
>> i will use one of the best quotes here that were ever created on the
>> internet:
>>
>> "You make your mouth full of technical bullshit when only facts talk"
>> By some random guy
>>
>> ;-) arni
>
> So you're saying you want my stock spam with mailing list filler? It's
> real and has been for a year or more and makes site-wide bayes useless
> against it.

yes, actually i do, just for the fun of it.
would be nice if you could send 3 to 5 as an attachment including all headers to the list or only my address.

of course you'll want to choose spam with a low score to prove me wrong ;-)

arni

Re: Spam PDF

by Mikael Syska

arni wrote:

> [snip snap]
> I looked for the lowest scoring email of the past 2 days (dont save
> them longer), this is the one:
>
> X-Spam-Status: Yes, score=10.7 required=5.0 tests=BAYES_99,DCC_CHECK,
> DKIM_POLICY_SIGNSOME,HTML_MESSAGE,LOGINHASH1,LOGINHASH2,MIME_HTML_MOSTLY
> autolearn=no version=3.2.0
> X-Spam-Report:
> *  5.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
> *      [score: 1.0000]
> *  0.0 DKIM_POLICY_SIGNSOME Domain Keys Identified Mail: policy says domain
> *       signs some mails
> *  0.0 MIME_HTML_MOSTLY BODY: Multipart message mostly text/html MIME
> *  0.0 HTML_MESSAGE BODY: HTML included in message
> *  1.5 LOGINHASH2 BODY: mail has been classified as spam @ unknown company,
> *       Germany
> *  1.5 LOGINHASH1 BODY: mail has been classified as spam @ LogIn&Solutions
> *      AG, Germany
> *  2.2 DCC_CHECK Listed in DCC (http://rhyolite.com/anti-spam/dcc/)
>
>
>  
> Note that already a well trained BAYES can take these mails out on its
> own on my system.
 > Bayes are good if its well trained
>
> If you find your bayes to score really accurately then it's a good idea to
> increase the scores. For me bayes is fed from 2 spamtrap addresses
> with around 50 pieces of the finest spam every day. Doing this, bayes
> scores BAYES_99 on 99.5% of my remaining spam - i hardly ever see it
> score below BAYES_80 and thats just great.
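
As a sanity check on the report quoted above, the per-rule scores can be pulled out and summed to confirm they match the score in the status header. A minimal sketch: the regular expression is an assumption about the line layout shown here, not an official parser.

```python
import re

# Report lines copied from the quoted message, without the quote markers.
REPORT = """\
*  5.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
*      [score: 1.0000]
*  0.0 DKIM_POLICY_SIGNSOME Domain Keys Identified Mail: policy says domain
*       signs some mails
*  0.0 MIME_HTML_MOSTLY BODY: Multipart message mostly text/html MIME
*  0.0 HTML_MESSAGE BODY: HTML included in message
*  1.5 LOGINHASH2 BODY: mail has been classified as spam @ unknown company,
*       Germany
*  1.5 LOGINHASH1 BODY: mail has been classified as spam @ LogIn&Solutions
*      AG, Germany
*  2.2 DCC_CHECK Listed in DCC (http://rhyolite.com/anti-spam/dcc/)
"""

# A rule line is "*", whitespace, a number, whitespace, then the rule name.
RULE_LINE = re.compile(r"^\*\s+(-?\d+(?:\.\d+)?)\s+([A-Z0-9_]+)\b")

def rule_scores(report):
    """Map rule name to score, skipping continuation lines."""
    scores = {}
    for line in report.splitlines():
        m = RULE_LINE.match(line)
        if m:
            scores[m.group(2)] = float(m.group(1))
    return scores

scores = rule_scores(REPORT)
total = round(sum(scores.values()), 1)
print(scores)
print(total)
```

Of that total, 5.5 comes from BAYES_99 alone, which is already over the required 5.0 at the score used on that system.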

Kind of new to spam ... and especially to how people use bayes.

So how many ham mails do you get per day? Wondering if I could do
something to my system so bayes may score higher ....

I have read somewhere that the number of spam mails in bayes should be
a lot higher than ham mails ... is that true?

Because I'm doing spam scans for multiple domains ..

>
> So maybe training bayes better or increasing the score will put an
> end to this for you.
>
> arni
>

Any additional reading on bayes is welcome ...

// Mikael Syska

Re: Spam PDF

by arni-2

Mikael Syska wrote:

> Kind of new to spam ... and especially to how people use bayes.
>
> So how many ham mails do you get per day? Wondering if I could do
> something to my system so bayes may score higher ....
>
> I have read somewhere that the number of spam mails in bayes should be
> a lot higher than ham mails ... is that true?
>
> Because I'm doing spam scans for multiple domains ..
>
>
my mail volume isnt high, i do it only for myself and some friends,

some stats on my bayes db:

0.000          0       4556          0  non-token data: nspam
0.000          0       1356          0  non-token data: nham
0.000          0     280877          0  non-token data: ntokens

i get about 20 ham and 150 spams per day (on my personal box) - bayes is
only learned by spamtraps and autolearn.
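
For the spam-to-ham question, the ratio can be read straight off those three lines, which appear to be in the layout printed by `sa-learn --dump magic`. A small sketch under that assumption: the column positions are taken from the lines quoted here, and the function name is made up.

```python
DUMP = """\
0.000          0       4556          0  non-token data: nspam
0.000          0       1356          0  non-token data: nham
0.000          0     280877          0  non-token data: ntokens
"""

def parse_counts(dump):
    """Pull the third column out of each 'non-token data' line."""
    counts = {}
    for line in dump.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            counts[fields[6]] = int(fields[2])
    return counts

counts = parse_counts(DUMP)
ratio = counts["nspam"] / counts["nham"]
print(counts)
print(round(ratio, 2))  # trained spam messages per trained ham message
```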

arni
