hostkarma/uribl_black disparity

by Alex-325

Hi,

Over the past few days I have been investigating more closely email
that wasn't tagged that I thought should have been, and vice-versa,
using various factors, such as URIBL_BLACK and JMF_W. I'm very
surprised that seemingly legitimate hosts are on the URIBL_BLACK list,
like receiveeweek.com.

Even more interesting are a number of FNs that hit both URIBL_BLACK
and JMF_W. I'm not sure which is correct in many cases, because they
are not always so cut-and-dried. For example, there was a Citibank
email (whitelisted) that happened to use an image server
(csnimages.com) that is in URIBL_BLACK.
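
A listing like this can be checked by hand with a DNS query. The sketch
below assumes the combined zone multi.uribl.com and URIBL's published
bitmask for the last octet of the answer (1 = query refused, 2 = black,
4 = grey, 8 = red). The function names uribl_decode and uribl_check are
made up for illustration, and dig must be installed.

```shell
#!/bin/sh

# Decode the last octet N of a multi.uribl.com answer (127.0.0.N).
# Bit values are assumed from URIBL's published scheme.
uribl_decode() {
  n=$1
  out=
  [ $((n & 1)) -ne 0 ] && out="$out refused"
  [ $((n & 2)) -ne 0 ] && out="$out black"
  [ $((n & 4)) -ne 0 ] && out="$out grey"
  [ $((n & 8)) -ne 0 ] && out="$out red"
  echo "${out# }"   # strip the leading space
}

# Hypothetical wrapper: query a domain and decode each A record returned.
# Prints nothing when the domain is not listed.
uribl_check() {
  for ip in $(dig +short "$1.multi.uribl.com" A); do
    echo "$1: $(uribl_decode "${ip##*.}")"
  done
}
```

Usage would be something like `uribl_check csnimages.com`. A result of
"refused" usually means the resolver in use is blocked by URIBL (large
public resolvers often are), not that the domain is listed.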

While I don't think that particular email should have been tagged as
spam, it's only an example, and I hoped someone would be interested
enough to check out a list I created with these types of disparities
I've had over the last day or so.

It's too long to include here, so I've created a pastebin for it:

http://pastebin.com/m4a1561b5

I realize this type of thing could happen for many reasons, not the
least of which is an otherwise-legitimate host that has been
compromised and is now being used to send spam. However, many on my
list are quite persistent, like blr-events.com and eturbonews.com, and
I have no idea whether they are legitimate or bogus.

Whatever the case, there are definitely mistakes, and I'd like to help
correct them.

Ideas appreciated. I'd be glad to gather more info if necessary.

Thanks
Alex

Re: hostkarma/uribl_black disparity

by Adam Katz-10

MySQL Student wrote:
> Over the past few days I have been investigating more closely email
> that wasn't tagged that I thought should have been, and
> vice-versa, using various factors, such as URIBL_BLACK and JMF_W.

Very interesting.

Here's a quick testing script (ymmv on log file syntax):

#########
#!/bin/sh

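# Helper: count spamd result lines matching rule $1 (and $2, in either
# order, if given). A non-empty $3 restricts the count to spam ("Y").
_sacount() {
  zgrep -h "spamd: result: ${3:+Y}" /var/log/mail.lo* \
         | grep -Ec "$1${2:+.*$2|$2.*$1}"
}

# Usage: sa_count RULE1 [RULE2]
# Counts messages marked as RULE1 (and RULE2 if given)
sa_count() {
  # Quote the arguments so that an empty RULE2 is not dropped, which
  # would shift "spam" into the RULE2 position.
  c=$(_sacount "$1" "$2")
  sc=$(_sacount "$1" "$2" spam)
  echo "Found $c ($sc spam) matching ${2:+both} $1${2:+ and $2}."
}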

sa_count RCVD_IN_HOSTKARMA_W URIBL_BLACK
sa_count RCVD_IN_DNSWL URIBL_BLACK
sa_count URIBL_BLACK
sa_count .   # show total numbers

#########

My output (note, I greylist):

Found 54 (11 spam) matching both RCVD_IN_HOSTKARMA_W and URIBL_BLACK.
Found 25 (16 spam) matching both RCVD_IN_DNSWL and URIBL_BLACK.
Found 1981 (1919 spam) matching  URIBL_BLACK.
Found 123273 (3791 spam) matching  ..


I don't have data on whether there were FPs or FNs involved.

(And yes, zgrep is perfectly content to deal with uncompressed files.)
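
One way to start gathering FP/FN data might be to print the individual
non-spam results that hit both rules, so they can be reviewed by hand.
This is only a sketch: the function name sa_overlap_ham is made up, and
it assumes the same log syntax as above, where non-spam results are
logged as "spamd: result: ." in place of "spamd: result: Y".

```shell
#!/bin/sh

# Usage: sa_overlap_ham RULE1 RULE2 < log lines
# Prints log lines for messages NOT marked as spam that hit both rules,
# in either order. Reads from stdin so it can be fed by zgrep or a file.
sa_overlap_ham() {
  grep 'spamd: result: \. ' \
    | grep -E "$1.*$2|$2.*$1"
}
```

For example:

  zgrep -h 'spamd: result:' /var/log/mail.lo* \
    | sa_overlap_ham RCVD_IN_HOSTKARMA_W URIBL_BLACK

Each line printed is either a false negative (if URIBL_BLACK is right)
or a correctly passed message (if the whitelist is right). Deciding
which still means looking at the message itself.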