why not doing a test that checks "name"-<email address> pairs

by aag_uk

Hi,

I'm pretty new to SpamAssassin, so maybe what I am saying is nonsense, or somebody else has already suggested it, or the test already exists and I just don't know how to configure it. Anyway, here is my question.

I've noticed that some spam messages are not marked as spam by SpamAssassin: the score is lower than the limit I've set (5.0), although these emails usually have some hints that suggest they are probably spam (score about 4.6). These messages are addressed to many people in my domain, but the names before the email addresses are random. To explain it more clearly, the recipient in the To field is something like this: "John" <user1@mydomain.com>. Very often the Cc field includes other recipients like: "Peter" <user2@mydomain.com>; "Mike" <user3@mydomain.com>; etc. The thing is that the email recipients (user1, user2, user3, ...) are real, they exist in my domain, but the names "Peter, John, Mike" have nothing to do with "user1, user2, user3"; they are picked randomly. Wouldn't it be interesting to have a test that checks the "user name-email address" pairs according to some settings?

Regards,

Alberto.

Re: why not doing a test that checks "name"-<email address> pairs

by John Hardin

On Fri, 17 Aug 2007, aag_uk wrote:

> These messages are addressed to many people in my domain but the
> names before the email address are random. To explain it more
> clearly, for example, the recipient in the TO field is something
> like this:  "John" <user1@...>. Very often the CC field
> includes other recipients like: "Peter" <user2@...>;
> "Mike" <user3@...>; etc... The thing is that the email
> recipients (user1, user2, user3,...) are real, they exist in my
> domain, but the names "Peter, John, Mike" have nothing to do with
> "user1, user2, user3", they are picked randomly.

(1) Check your MTA options. Some allow you to configure rejection of a
message after X number of invalid recipients are given.

(2) Consider a rule that adds a point if more than X names appear in
the TO: and/or CC: headers. Here are mine (20 is the limit):

describe TO_TOO_MANY To: too many recipients
header   TO_TOO_MANY To =~ /(?:,[^,]{1,80}){20}/
score    TO_TOO_MANY 1.50

describe CC_TOO_MANY Cc: too many recipients
header   CC_TOO_MANY Cc =~ /(?:,[^,]{1,80}){20}/
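
To see what the pattern in these two rules actually counts, here is a minimal Python sketch that applies the same regular expression outside SpamAssassin. The helper names and the example.com addresses are made up for illustration; only the pattern itself comes from the rules above.

```python
import re

# Same pattern as the To/Cc rules: 20 repetitions of
# "a comma followed by 1 to 80 non-comma characters".
TOO_MANY = re.compile(r"(?:,[^,]{1,80}){20}")


def too_many_recipients(header_value: str) -> bool:
    """Return True if the header value would trigger the rule."""
    return TOO_MANY.search(header_value) is not None


def make_header(n: int) -> str:
    """Build a hypothetical header value with n recipients."""
    return ", ".join(f'"User {i}" <user{i}@example.com>' for i in range(n))
```

Since 20 commas separate 21 entries, the rule first fires at 21 recipients rather than 20. A comma inside a display name (such as "Cobb, Jayne") also counts toward the total.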

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@...    FALaholic #11174     pgpk -a jhardin@...
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  A sword is never a killer, it is but a tool in the killer's hands.
                          -- Lucius Annaeus Seneca (Martial) 4BC-65AD
-----------------------------------------------------------------------
 8 days until The 1928th anniversary of the destruction of Pompeii


Re: why not doing a test that checks "name"-<email address> pairs

by Chris St. Pierre

On Fri, 17 Aug 2007, aag_uk wrote:

> I've noticed that some spam messages are not marked as spam by spamassassin (the
> score is lower than the limit I've set: 5.0. Those emails usually have some
> hints that suggest they are probably spam: score about 4.6). These messages
> are addressed to many people in my domain but the names before the email
> address are random. To explain it more clearly, for example, the recipient
> in the TO field is something like this:  "John" <user1@...>. Very
> often the CC field includes other recipients like: "Peter"
> <user2@...>; "Mike" <user3@...>; etc... The thing is that
> the email recipients (user1, user2, user3,...) are real, they exist in my
> domain, but the names "Peter, John, Mike" have nothing to do with "user1,
> user2, user3", they are picked randomly. Wouldn't it be interesting to have a
> test that checks the "user name-email address" pairs according to some
> settings?
That's an interesting idea, but it

a) is probably going to be quite resource-intensive;

b) requires LDAP, NIS, etc., so that SpamAssassin can have a clue
about your accounts;

c) requires competent fuzzy matching so that, when a user sends mail
to "Chris St. Pierre <stpierre@...>", it doesn't flag it
as spam because my "real name" is Christopher;

d) is prone to FPs, since it's the clients who add that name, and it
could be literally _anything_ ("chris", "some guy", "", etc.) without
being spam; and

e) is fairly site-specific and would require a fair amount of
configuration.

It might be an interesting plugin, but I think that the kind of
scoring I'd be comfortable doing for a plugin like that -- very low --
wouldn't be worth the tradeoff in CPU time, network traffic, etc.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University

Re: why not doing a test that checks "name"-<email address> pairs

by sm-7

At 13:58 17-08-2007, Chris St. Pierre wrote:
>That's an interesting idea, but it
>
>a) is probably going to be quite resource-intensive;

Not really.

>c) requires competent fuzzy matching so that, when a user sends mail
>to "Chris St. Pierre <stpierre@...>", it doesn't flag it
>as spam because my "real name" is Christopher;

That's the main problem.  There are also misspellings which are
difficult to catch.

>d) is prone to FPs, since it's the clients who add that name, and it
>could be literally _anything_ ("chris", "some guy", "", etc.) without
>being spam; and

It could be used for negative scoring when the client hits reply to
answer your message.  That would also let some spam through, though, as
some spammers use the real name.

Regards,
-sm


Re: why not doing a test that checks "name"-<email address> pairs

by hamann.w

>>
>> Hi,
>>
>> I'm pretty new to SpamAssassin and maybe what I am saying is nonsense or
>> somebody else has suggested this, or the test already exists but I don't
>> know how to configure it, anyway here is my question.
>>
>> I've noticed that some spam messages are not marked as spam by spamassassin (the
>> score is lower than the limit I've set: 5.0. Those emails usually have some
>> hints that suggest they are probably spam: score about 4.6). These messages
>> are addressed to many people in my domain but the names before the email
>> address are random. To explain it more clearly, for example, the recipient
>> in the TO field is something like this:  "John" <user1@...>. Very
>> often the CC field includes other recipients like: "Peter"
>> <user2@...>; "Mike" <user3@...>; etc... The thing is that
>> the email recipients (user1, user2, user3,...) are real, they exist in my
>> domain, but the names "Peter, John, Mike" have nothing to do with "user1,
>> user2, user3", they are picked randomly. Wouldn't it be interesting to have a
>> test that checks the "user name-email address" pairs according to some
>> settings?
>>
>> Regards,
>>
>> Alberto.

Hi,

You can do quite a few things to trap mail that is probably rubbish, but it may be extra
work.
I use a prefilter in line with forbidden-attachment and virus scanning,
but it could probably be written as a _personal_ plugin.
I like mail sent to just the plain email address, or in "user" <email> format written exactly
as I spell it. I collect mail from some other mailboxes, so of course the rule must know
about these other addresses as well.
For mail sent to my primary address (at a big ISP), I don't like to see another address in the
To or Cc.
The one that really caused work: I don't like mail where my address does not appear in
either To or Cc, unless the sender appears in a whitelist. You need to add mailing lists,
monthly password reminders from mailing lists, SourceForge addresses, and whatnot.

Wolfgang Hamann


Re: why not doing a test that checks "name"-<email address> pairs

by aag_uk


John D. Hardin wrote:

> (1) Check your MTA options. Some allow you to configure rejection of a
> message after X number of invalid recipients are given.
>
> (2) Consider a rule that adds a point if more than X names appear in
> the TO: and/or CC: headers. Here are mine (20 is the limit):
>
> describe TO_TOO_MANY To: too many recipients
> header   TO_TOO_MANY To =~ /(?:,[^,]{1,80}){20}/
> score    TO_TOO_MANY 1.50
>
> describe CC_TOO_MANY Cc: too many recipients
> header   CC_TOO_MANY Cc =~ /(?:,[^,]{1,80}){20}/

Thanks for your answer, but the spam I'm trying to identify is not about too many recipients; usually there are only 5 or 6, and they all contain correct email addresses. The thing is that some spammers make up the name that goes before the email address (e.g. "John Smith" <peter@mydomain.com>).

Re: why not doing a test that checks "name"-<email address> pairs

by aag_uk


>a) is probably going to be quite resource-intensive;

I don't really know; according to http://www.nabble.com/forum/ViewPost.jtp?post=12207486&framed=y
sm-7 says that it shouldn't be.


>b) requires LDAP, NIS, etc., so that SpamAssassin can have a clue
>about your accounts;
>c) requires competent fuzzy matching so that, when a user sends mail
>to "Chris St. Pierre <stpierre@nebrwesleyan.edu>", it doesn't flag it
>as spam because my "real name" is Christopher;
>d) is prone to FPs, since it's the clients who add that name, and it
>could be literally _anything_ ("chris", "some guy", "", etc.) without
>being spam; and

My idea was that you could have a list that links each recipient to the possible names that could be used (basically first name, surname and possibly a short name), not necessarily NIS or LDAP. As for fuzzy matching, I think it shouldn't be difficult to do. It's something like what Google does when you misspell something or enter something that is not "usual": it suggests another search and, in my opinion, its guess is usually very good.

>e) is fairly site-specific and would require a fair amount of
>configuration.
Well, maybe if you have thousands of users in your domain and you want to enter the name-recipient links (as I explained in the previous paragraph) for the first time, it will require a lot of work. In my case I have about 100 recipients and only from time to time do I have to add new ones, so that wouldn't be a problem.

>It might be an interesting plugin, but I think that the kind of
>scoring I'd be comfortable doing for a plugin like that -- very low --
>wouldn't be worth the tradeoff in CPU time, network traffic, etc.
I think it could add a low partial score, but the effect could be good because most of the emails I'm talking about are already quite suspicious; they usually match other tests (e.g. BAYES_99, which already adds a pretty high score).

Re: why not doing a test that checks "name"-<email address> pairs

by Kai Schaetzl

Aag_uk wrote on Fri, 17 Aug 2007 23:58:05 -0700 (PDT):

> >b) requires LDAP, NIS, etc., so that SpamAssassin can have a clue
> >about your accounts;
> >c) requires competent fuzzy matching so that, when a user sends mail
> >to "Chris St. Pierre <stpierre@...>", it doesn't flag it
> >as spam because my "real name" is Christopher;
> >d) is prone to FPs, since it's the clients who add that name, and it
> >could be literally _anything_ ("chris", "some guy", "", etc.) without
> >being spam; and
>
> My idea was that you could have a list that links each recipient to possible
> names that could be used (basically first name, surname and possibly a short
> name), not necessarily NIS or LDAP. About fuzzy matching I think it shouldn't
> be difficult to do. It's something like what Google does when you misspell
> something or enter something that is not "usual", it suggests you another
> search and, in my opinion,  its guess is usually very good.

You don't understand at all. What gets put in the comment is up to the sender.
They can put *everything* there and it's legit. You do not control it at all
and you do not send them a reply "please change my name in your addressbook to
xyz". It can be the name, a part of the name, several parts of the name,
reversed parts of the name, a company name in all its variations, an acronym,
misspelled, something like "Tony's brother", the email address, quoted or
bracketed in several ways, could be nothing - to name a few. Such a rule
would be prone to a huge number of FPs. It may work for you after a lot of
work, but not for others. It's not worth it.

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com




Re: why not doing a test that checks "name"-<email address> pairs

by aag_uk

>What gets put in the comment is up to the sender.
>They can put *everything* there and it's legit. You do not control it at all
>
I know it depends on the sender and everything is legit, but it is equally legit for me to send somebody an email about the stock market or a certain medicine, and it could still score high even though the message is perfectly normal.
It's true that you can put whatever you want there, but there are also some restrictions in practice. Let's say, for example, that all my users are Spanish, Italian, Russian... Then it's quite unlikely that somebody tags any of my users with names like Jack, Peter, or John (which is the case in 99% of the spam).


Re: why not doing a test that checks "name"-<email address> pairs

by sm-7

At 23:58 17-08-2007, aag_uk wrote:
> >a) is probably going to be quite resource-intensive;
>
>I don't really know, according to

Compared to all the checks performed on a message, it isn't.

>My idea was that you could have a list that links each recipient to possible
>names that could be used (basically first name, surname and possibly a short
>name), not necessarily NIS or LDAP. About fuzzy matching I think it shouldn't
>be difficult to do. It's something like what Google does when you misspell
>something or enter something that is not "usual", it suggests you another
>search and, in my opinion,  its guess is usually very good.

That's not how "names" work in practice.  It may
take more than a lookup in your system database.

It's not difficult but it requires some work to
understand the naming conventions.  That may not
be possible in a heterogeneous environment.  The
fuzzy matching is not that easy.  Once you get
into that, you turn the process into a resource intensive one.

>well, maybe if you have thousands of users in your domain and you want to
>enter the names-recipient links (as I explained in the previous paragraph)
>for the first time, it will require a lot of work. In my case I have about
>100 recipients and from time to time I have to add new ones; so, that
>wouldn't be a problem.

It's only a name/recipient link if we make an
assumption about the "display name".  Once this
becomes a general rule, it will be circumvented.

I already have one case where this rule would
have the opposite of the intended effect.

Regards,
-sm


Re: why not doing a test that checks "name"-<email address> pairs

by Kai Schaetzl

Aag_uk wrote on Sat, 18 Aug 2007 03:33:49 -0700 (PDT):

> it's quite unlikely that somebody tags any of
> my users

As I said, it may work for you, but it will not work for the majority of SA
users. The whole effort and the FPs would not be worth it. If you don't
believe that, start coding.




Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com




Re: why not doing a test that checks "name"-<email address> pairs

by Chip M.

Alberto, your reasoning is correct, based on my experience of actually
implementing and using such a system, albeit in a small scale environment.
As "sm" points out, it is particularly useful as a "pass" rule for exact
matches to your users' actual email client "real name"s.

I've implemented this as part of a qmail filter that runs after SA.  As
I've mentioned in other posts, I'm in a shared web hosting environment, and
have no control over SA, so designed my filter to complement the great
strengths of SA, and fill in the holes that are created by a limited
environment.  Just over twenty domains use my filter, and we all share
data, so as to improve everyone's killrates.

I have no idea how practical this would be as an SA plugin, and am
Perl-illiterate, so I merely describe how I have approached it.  More than
a year ago, I started using _VERY_ crude general header based (To/Cc
checking) real name "pass" rules, then in March of 2007 I added an explicit
"RealName" virtual header so as to allow more powerful rules, including
"match not" type penalty rules.


* Main Issues: *
        - generating a list of account specific real names (preferably automatically)
        - real-time extraction of the correct "real name"
        - some "real names" have been compromised, and should receive MUCH lower
          pass scores
        - some account names are inherently poorly suited to real name pass rules
          (e.g. "jayne.cobb" since all words in the real name also appear as words
          in the account part - "jcobb" is a better form)
        - some senders transpose real name parts (e.g. "Cobb, Jayne" in place of
          "Jayne Cobb")
        - some senders use cutesy nicknames or other tricks (e.g. "Hero of Canton"
          in place of "Jayne Cobb")
        - some senders (particularly Bulkers) use the complete account name as the
          real name, and should not be scored normally
          (e.g. "jcobb@..." jcobb@...)


* PREP: Semi-automatic Real Name Data Generation: *
I'm just-a-programmer, not a sysadmin, so don't know how a typical pipeline
works, however, if it's practical, automatic real name extraction should be
fairly straightforward.  Just write something that you can temporarily
plug in _AFTER_ SA, and which extracts the account & real name pair from
everything which passes SA, accumulates the frequencies, and picks the most
often occurring real name(s) for each account (I usually limit this to one
or two).  Include an option for human inspection, mainly for cases where
there is no clear cut winner.  In my experience, the majority of accounts
can be generated automatically, however it's wise to inspect all
possibilities.  That's manageable for small companies (less than 20), and
shouldn't be too bad for low 100s.  The collector app only needs to be run
for a week or so.  New users could be added manually.
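
The collector described above could be sketched roughly as follows in Python. This is only an illustration, not the actual filter: the function names, the example.com domain and the idea of feeding it raw To/Cc header values are all assumptions.

```python
from collections import Counter, defaultdict
from email.utils import getaddresses


def collect_real_names(header_values, domain):
    """Count how often each display name is used for each local account.

    header_values: iterable of raw To/Cc header values taken from mail
    that already passed the spam filter. domain: the local domain.
    """
    counts = defaultdict(Counter)
    for value in header_values:
        for name, addr in getaddresses([value]):
            addr = addr.lower()
            name = " ".join(name.split()).lower()  # normalise case and spacing
            if name and addr.endswith("@" + domain):
                counts[addr][name] += 1
    return counts


def top_names(counts, limit=2):
    """Keep the `limit` most frequent real names per account."""
    return {addr: [name for name, _ in c.most_common(limit)]
            for addr, c in counts.items()}
```

Dumping `counts` with its frequencies, rather than only `top_names`, gives the human-inspection option for accounts that have no clear winner.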

It took me much less than five minutes to generate such a data list AND all
matching rules for the last person to join my Team (18 accounts, one week
of data), and my tool merely dumps the per account RealNames with
frequencies.  A slicker tool could make this VERY practical for larger
userbases.

Maintenance and verification would probably be an utter pain for anything
in the 1000s, so best to let us small and nimble types prove its efficacy. :)

There is anecdotal evidence that Hotmail may be doing something with real
name based rules, although there are reports that it's a somewhat suboptimal
implementation.  I speculate that they could easily pull the real name
straight out of each user's settings.


* Plugin: Real Name Extraction: *
An actual SA plugin would need to use the SMTP Recipient (or most reliable
Delivered-To account name) to pick out the matching account from the To or
Cc headers, then pull out its real name.  There should also be some
facility for associating external aliases with accounts (e.g. a redirected
ISP account).  If it FAILS to find a matching account, _ALL_ other real
name tests should be skipped or return false.
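
As a rough illustration of that extraction step (in Python rather than as a real SA plugin), the sketch below looks up the recipient in the To and Cc headers and returns its display name. The function name, the `aliases` parameter and the convention of returning `None` for "account not found" are assumptions made for this example.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_real_name(raw_message, recipient, aliases=()):
    """Return the display name used for `recipient` in To/Cc.

    Returns None when the account is not found in either header, which
    is the signal to skip every other real name test. Returns "" when
    the account is present but has no display name.
    """
    wanted = {recipient.lower()} | {a.lower() for a in aliases}
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    for name, addr in getaddresses(headers):
        if addr.lower() in wanted:
            return " ".join(name.split())  # collapse stray whitespace
    return None
```

Keeping "not found" (`None`) distinct from "empty" (`""`) matters, because only the latter should feed the "empty" test.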


* Plugin: Real Name Testing: *
If it does find a matching account, three main real name based tests can be
performed:  empty, match, match not.

It's probably easier to understand how these work with a sample, so let's
say we have a user whose account is "jcobb@...", the real
name in his email client is "Jayne Cobb", and an automatic real name
collector has shown that occasionally he receives important email that uses
the real name "Hero of Canton".  Somewhere, we would construct two data
lists specific to his account, that would look something like this:
        realname_full  = jayne cobb, hero of canton
        realname_words = jayne, cobb, hero, canton

The generic real name "match" test would only trigger if the extracted real
name exactly (case-insensitive) matched either "jayne cobb" or "hero of
canton", and the "match not" test would only trigger if NONE of the four
words "jayne, cobb, hero, canton" was found anywhere in the real name.
It's feasible to do "soft" matching, instead of word boundary based
matching (my code allows either).

Here are some examples:
        jcobb@...
        "Jayne Cobb" jcobb@...
        "Jayen Cobb" jcobb@...
        "Peter Petrelli" jcobb@...
The first triggers an "empty" test, but none of the other types of tests.
The second triggers an exact "match" pass rule.
The third has a misspelling so it fails an exact "match" pass rule, AND it
also fails a "match not" penalty rule because one of the words ("Cobb")
does match.  In other words, it receives ZERO total real name points.
The fourth triggers a "match not" penalty rule, because NO words match.
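
The four cases can be reproduced with a small Python sketch of the three tests. It uses word-boundary matching only, and the two sets simply mirror the `realname_full` and `realname_words` lists from the sample; the function name and the "skip"/"neutral" labels are invented for illustration.

```python
import re

# Per-account data, mirroring the sample lists above.
REALNAME_FULL = {"jayne cobb", "hero of canton"}
REALNAME_WORDS = {"jayne", "cobb", "hero", "canton"}


def classify(real_name):
    """Classify an extracted real name for one account.

    real_name is None when the account was not found in To/Cc,
    in which case every real name test is skipped.
    """
    if real_name is None:
        return "skip"
    normalized = " ".join(real_name.lower().split())
    if not normalized:
        return "empty"
    if normalized in REALNAME_FULL:
        return "match"          # exact, case-insensitive pass rule
    words = set(re.findall(r"\w+", normalized))
    if words.isdisjoint(REALNAME_WORDS):
        return "match_not"      # no known word at all: penalty rule
    return "neutral"            # partial overlap: zero real name points
```

With this data, "Jayen Cobb" comes out as "neutral" because "cobb" is still a known word, while "Peter Petrelli" comes out as "match_not".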

By using a LIST of acceptable individual words in the "match not" rule,
there's no need to mess about with fuzzy matching.  It is still possible
for a fuzzy misfire to occur, however so far I have not seen any actual FPs
caused by them (in more than half a million human+machine reviewed emails).
Our only FPs have contained words that were wildly off, so fuzzy matching
would have made no difference.  As always, careful scoring is appropriate,
and your mileage may vary.  A fuzzy matching option might be more suited to
a later version of a plugin.


* Scoring Notes: *
I generally score the "empty" test either not at all or fairly low (0.5).
I find it's most helpful as a bonus penalty in compound/meta rules, for
example, I give many attachments (zip, PDF, or any image) a small to medium
score if the real name is empty.

I score the "match" rule between -0.51 and -4.59, depending on whether the
real name has been compromised (one of our users gets a lot of ED spam sent
from Russia with his correct real name), and whether that person has
critical "pass" needs.  I have found it to be an EXCELLENT means of
preventing FPs, particularly during times when I'm tinkering with stuff to
fight an emerging threat, and make a dumb mistake.  :)

I score the "match not" rules typically in the 1.02 to 3.06 range (default
of 2.60).  FPs have been extremely low, with most being unimportant
bulk/junk type mail.

One weakness in my own filter is the lack of metas.  If an SA Real Name
plugin were developed, it would be more powerful, since it could be used to
reject specific attachment types that also triggered a "match not" test.
That level of control is more suited to a small business, but it sure is
nice to have. :)


* Efficacy and Performance Notes: *
Since I rolled this out last March, these tests alone immediately improved
my users' typical killrates from about 99.40% to 99.75% (three of us are
now at 100.00%), with a significant decrease in FPs.  Those levels have
been maintained, even during a period when many emerging threats have
driven down our SA rates (again, using a very constrained SA setup).

I have no feel for the SA system performance issues.  In my case, I do all
the "simple" (fast) tests first, then exit if the score is high enough, and
only then do DNS tests.  My general impression is that my overall
performance is higher, because on average these tests avoid more tests than
the time they consume.

Bottom line, I think these can be very effective for a smallish
environment.  Granted, I really need to write some code to extract precise
stats.  I am confident of the beneficial effect on FPs, because I check ALL
of those by hand.
        - "Chip"



Re: why not doing a test that checks "name"-<email address> pairs

by hamann.w

Kai Schätzl wrote:

>>
>> You don't understand at all. What gets put in the comment is up to the sender.
>> They can put *everything* there and it's legit. You do not control it at all
>> and you do not send them a reply "please change my name in your addressbook to
>> xyz". It can be the name, a part of the name, several parts of the name,
>> reversed parts of the name, a company name in all its variations, an acronym,
>> misspelled, something like "Tony's brother", the email address, quoted or
>> bracketed in several ways, could be nothing - to name a few. Such a rule
>> would be prone to a huge number of FPs. It may work for you after a lot of
>> work, but not for others. It's not worth it.
>>

While it is up to senders to make up display names, in worthwhile mail I usually see either
- no display name at all
- the name exactly as I spell it (from replies)
- the name parts rearranged from a web form submission
If someone decides to put "Idiot" as a display name, I take the liberty of not reading it.
Maybe some people really get mail sent to "Daddy" or whatever.
As others have pointed out, checking display names is a personal thing ... and it seems
to work with the mail I receive.

Wolfgang Hamann