[Bug 6155] New: generate new scores for 3.3.0 release

by Bugzilla from bugzilla-daemon@bugzilla.spamassassin.org

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

           Summary: generate new scores for 3.3.0 release
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: blocker
          Priority: P1
         Component: Score Generation
        AssignedTo: dev@...
        ReportedBy: jm@...


Here's a ticket to track this release work item.

Do we actually need to do this, though, since we have Daryl's code generating
scores weekly from nightly mass-check results?

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

--- Comment #1 from Justin Mason <jm@...>  2009-07-31 09:56:38 PST ---
(In reply to comment #0)
> Do we actually need to do this, though, since we have Daryl's code generating
> scores weekly from nightly mass-check results?

Well, we need to fix that, actually. It seems to be broken.

--- Comment #2 from Justin Mason <jm@...>  2009-08-14 13:30:20 PST ---
This time around, I think I'll scrap the confusing differentiation between
nightly mass-check result submission rsync accounts and "submit" accounts.
Anyone object?

I'm going to try a test run of the evolver based on nightly mass-check logs,
btw.

--- Comment #3 from Justin Mason <jm@...>  2009-08-14 13:31:47 PST ---
http://wiki.apache.org/spamassassin/RescoreMassCheck is the procedure, as in
previous releases.

fwiw, we have 1022294 spams and 271617 hams in our nightly corpora, currently.

--- Comment #4 from Mark Martinec <Mark.Martinec@...>  2009-08-17 13:15:01 PST ---
Created an attachment (id=4517)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4517)
Ignore missing support for ADSP in old versions of Mail::DKIM.

--- Comment #5 from Justin Mason <jm@...>  2009-08-17 13:27:40 PST ---
(In reply to comment #4)
> Created an attachment (id=4517)
>  --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4517)
> Ignore missing support for ADSP in old versions of Mail::DKIM.

wrong bug I suspect! ;)

--- Comment #6 from Warren Togami <wtogami@...>  2009-08-17 14:25:33 PST ---
Is there still time to add more nightlies for this rescoring?  There is another
major Japanese user that is very close to joining.

How important is this rescoring?  Do nightlies help to rescore the sa-update
scores?

--- Comment #7 from Justin Mason <jm@...>  2009-08-17 15:21:17 PST ---
ok, I think I've ironed out a couple of issues.  Let's see what people think of
these sample scores:

http://taint.org/x/2009/gen-set0-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set1-5.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set2-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set3-5.0-5.0-500-ga_scores


here are the test results against the "test" fold for each scoreset:

gen-set0-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26453  99.07%
# Correctly spam:      83369  81.53%
# False positives:       249  0.93%
# False negatives:     18882  18.47%
# TCR(l=50): 3.263469  SpamRecall: 81.534%  SpamPrec: 99.702%


gen-set1-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26646  99.79%
# Correctly spam:     100943  98.72%
# False positives:        56  0.21%
# False negatives:      1308  1.28%
# TCR(l=50): 24.890701  SpamRecall: 98.721%  SpamPrec: 99.945%


gen-set2-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26485  99.19%
# Correctly spam:      84218  82.36%
# False positives:       217  0.81%
# False negatives:     18033  17.64%
# TCR(l=50): 3.540179  SpamRecall: 82.364%  SpamPrec: 99.743%


gen-set3-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...

# SUMMARY for threshold 5.0:
# Correctly non-spam:  26662  99.85%
# Correctly spam:     100964  98.74%
# False positives:        40  0.15%
# False negatives:      1287  1.26%
# TCR(l=50): 31.107697  SpamRecall: 98.741%  SpamPrec: 99.960%

Yes, set0 and set2 are terrible.  This is pretty much what happened last time,
too; our ruleset is pretty crappy nowadays without network rules active.  But
the net rule results are very good!  However, I think I need to look into the
local rule GA runs if possible.
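
The summary figures above can be reproduced from the four raw counts in each
block. This is a minimal sketch, assuming TCR(l=50) is the total cost ratio
N_spam / (l * FP + FN); that formula is inferred from the numbers in this
comment (it reproduces all four TCR values) rather than taken from the masses
scripts. The function and variable names are illustrative only.

```python
def summarize(ham_ok, spam_ok, fp, fn, lam=50):
    """Derive the summary metrics from raw counts.

    Assumes at least one error (fp or fn non-zero), otherwise the TCR
    denominator would be zero.
    """
    n_spam = spam_ok + fn                  # all spam in the test fold
    n_ham = ham_ok + fp                    # all ham in the test fold
    return {
        "tcr": n_spam / (lam * fp + fn),   # assumed total cost ratio
        "recall": spam_ok / n_spam,        # SpamRecall
        "precision": spam_ok / (spam_ok + fp),  # SpamPrec
        "fp_rate": fp / n_ham,
        "fn_rate": fn / n_spam,
    }


# Counts copied from the four result blocks above:
# (correctly non-spam, correctly spam, false positives, false negatives)
results = {
    "set0": (26453, 83369, 249, 18882),
    "set1": (26646, 100943, 56, 1308),
    "set2": (26485, 84218, 217, 18033),
    "set3": (26662, 100964, 40, 1287),
}

for name, counts in results.items():
    s = summarize(*counts)
    print(f"{name}: TCR={s['tcr']:.6f} "
          f"recall={s['recall']:.3%} precision={s['precision']:.3%} "
          f"FP={s['fp_rate']:.2%} FN={s['fn_rate']:.2%}")
```

With l=50, one false positive costs as much as fifty false negatives, which
is why set3's drop from 56 to 40 FPs moves its TCR noticeably above set1's
even though the recall figures are nearly identical.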

Bug 5270 is the 3.2.0 rescoring run, for reference.

Spamhaus will be happy to see a much improved score for RCVD_IN_PBL ;)

gen-set1-5.0-5.0-500-ga_scores:score RCVD_IN_PBL                    2.596
gen-set3-5.0-5.0-500-ga_scores:score RCVD_IN_PBL                    2.411
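
The two lines above look like `grep` output of the form `file:score RULE value`.
For comparing a rule across scoresets, a small parser along these lines may
help. It is a sketch that assumes exactly that one-value-per-line layout; the
names `parse_score_line` and `LINE_RE` are hypothetical and not part of
SpamAssassin.

```python
import re

# file name, a colon, then "score RULE_NAME value"
LINE_RE = re.compile(
    r"^(?P<file>[^:]+):score\s+(?P<rule>\S+)\s+(?P<value>-?\d+(?:\.\d+)?)\s*$"
)


def parse_score_line(line):
    """Return (file, rule, score) for one grep-style score line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"not a score line: {line!r}")
    return m.group("file"), m.group("rule"), float(m.group("value"))


lines = [
    "gen-set1-5.0-5.0-500-ga_scores:score RCVD_IN_PBL                    2.596",
    "gen-set3-5.0-5.0-500-ga_scores:score RCVD_IN_PBL                    2.411",
]

for line in lines:
    fname, rule, score = parse_score_line(line)
    print(f"{rule:<16} {score:6.3f}  ({fname})")
```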

--- Comment #8 from Justin Mason <jm@...>  2009-08-17 16:05:38 PST ---
Created an attachment (id=4518)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4518)
sample new scores, as diff

Here are the results of a GA run for each set.  Please shout about any and
all issues you spot (there are a few, I think, e.g. the ACCESSDB score
leakage, which should probably be ignored by the masses scripts).

--- Comment #9 from Warren Togami <wtogami@...>  2009-08-17 20:46:17 PST ---
http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
90% FP rate for Japanese
http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
52% FP rate for Japanese
http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
44% FP rate for Japanese

All three of these rules do very poorly with Japanese mail, and the total %
SPAM is lower than the % FP.  Yet the GA scores are rather high since we don't
have a statistically significant amount of Japanese mail in the corpus.
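
To illustrate the point about statistical weight: when one language makes up
only a small share of the ham corpus, a rule can misfire on most of that
language's mail and still show a very low pooled FP rate, so a score optimizer
that only sees pooled counts has little reason to penalize it. The counts below
are made up for illustration and are not taken from the ruleqa pages; the
function name is hypothetical.

```python
def hit_rates(corpora):
    """corpora maps a label to (rule_hits, total_messages).

    Returns the per-corpus hit rates and the pooled hit rate.
    """
    per_corpus = {name: hits / total for name, (hits, total) in corpora.items()}
    pooled_hits = sum(hits for hits, _ in corpora.values())
    pooled_total = sum(total for _, total in corpora.values())
    return per_corpus, pooled_hits / pooled_total


# HYPOTHETICAL ham counts: a large English corpus and a small Japanese one.
ham = {
    "english": (50, 250_000),
    "japanese": (900, 1_000),
}

per_corpus, pooled = hit_rates(ham)
for name, rate in per_corpus.items():
    print(f"{name:<9} FP rate: {rate:.2%}")
print(f"pooled    FP rate: {pooled:.2%}")
```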

What language are the SPAM hits?  Perhaps many are examples of identifying
foreign languages instead of determining if it is ham or spam?

Bug #6149 is related to this problem.

I am attempting to convince Japanese, Chinese and Korean users to join the
nightly masscheck, but it is very difficult.

--- Comment #10 from Justin Mason <jm@...>  2009-08-18 01:15:46 PST ---
(In reply to comment #9)

> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese
>
> All three of these rules do very poorly with Japanese mail, and the total %
> SPAM is lower than the % FP.  Yet the GA scores are rather high since we don't
> have a statistically significant amount of Japanese mail in the corpus.
>
> What language are the SPAM hits?  Perhaps many are examples of identifying
> foreign languages instead of determining if it is ham or spam?
>
> Bug #6149 is related to this problem.

I plan to fix that, alright.

> I am attempting to convince Japanese, Chinese and Korean users to join the
> nightly masscheck, but it is very difficult.

BTW, you could also take copies of their mail samples and add them to your own
corpora, in effect acting as a proxy for them.  that's easier for them than
setting up all the infrastructure.  (I thought you were already doing this ;)

You may need to be able to ask them if a mail _really_ is ham, down the line,
though, so it needs to remain a two-way arrangement.

--- Comment #11 from Warren Togami <wtogami@...>  2009-08-18 19:24:47 PST ---
> BTW, you could also take copies of their mail samples and add them to your own
> corpora, in effect acting as a proxy for them.  that's easier for them than
> setting up all the infrastructure.  (I thought you were already doing this ;)

I have 3 English and 3 Japanese users in my corpus at the moment.  One
additional Japanese user, rio, is hopefully starting nightly masschecks
tonight.  He is doing his own masschecks.

> You may need to be able to ask them if a mail _really_ is ham, down the line,
> though, so it needs to remain a two-way arrangement.

I asked them very carefully to avoid mis-classification.  This is part of the
difficulty of getting more volunteers, aside from the privacy worries.

I look forward to seeing the effect of the fix in Bug #6149 on the next
masscheck.  I asked one of my users to pick a few dozen real-world sample
messages that trigger the three rules in Comment #9 for the test suite.

--- Comment #12 from Warren Togami <wtogami@...>  2009-08-19 07:30:23 PST ---
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese

http://ruleqa.spamassassin.org/20090819-r805703-n/TVD_SPACE_RATIO/detail
0% FP rate for that particular Japanese user
http://ruleqa.spamassassin.org/20090819-r805703-n/PLING_QUERY/detail
0% FP rate for that particular Japanese user (Huh?  You changed this rule too?)
http://ruleqa.spamassassin.org/20090819-r805703-n/__GAPPY_SUBJECT/detail
44% FP rate for that particular Japanese user

--- Comment #13 from Warren Togami <wtogami@...>  2009-08-19 08:01:28 PST ---
http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
0% FP rate

Oops, wrong one?

--- Comment #14 from Justin Mason <jm@...>  2009-08-19 15:42:09 PST ---
(In reply to comment #13)
> http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
> 0% FP rate
>
> Oops, wrong one?

Yep, __GAPPY_SUBJECT is likely to have FPs; GAPPY_SUBJECT avoids them.

--- Comment #15 from Warren Togami <wtogami@...>  2009-08-19 18:52:38 PST ---
Looks good, looking forward to the next test scores.


Some questions...

How important is this rescoring?

Will future nightly masschecks help to rescore the sa-update scores?

Should I bother to continue recruiting more masscheck participants after this
rescore?

--- Comment #16 from Justin Mason <jm@...>  2009-08-25 13:54:54 PST ---
(In reply to comment #15)
> How important is this rescoring?
> Will future nightly masschecks help to rescore the sa-update scores?

the base ruleset (non-sandbox rules) won't change scores, so this is important.
For nightly masschecks, the only scores affected will be those of sandbox
rules.  So only about 1/2 of the ruleset, I'd reckon.

> Should I bother to continue recruiting more masscheck participants after this
> rescore?

No, I think as long as they provide results for the rescore, that's the most
important thing.

Has anyone had inspiration about the reason for the bad set0 results? (I
haven't looked yet)

--- Comment #17 from Warren Togami <wtogami@...>  2009-08-26 00:48:59 PST ---
> the base ruleset (non-sandbox rules) won't change scores, so this is important.
> For nightly masschecks, the only scores affected will be those of sandbox
> rules.  So only about 1/2 of the ruleset, I'd reckon.

I am curious, do you remember the original reason for this design decision?

Might there be value in making the entire ruleset scores affected by nightly
masschecks?

--- Comment #18 from Justin Mason <jm@...>  2009-08-26 01:58:53 PST ---
(In reply to comment #17)
> > the base ruleset (non-sandbox rules) won't change scores, so this is important.
> > For nightly masschecks, the only scores affected will be those of sandbox
> > rules.  So only about 1/2 of the ruleset, I'd reckon.
>
> I am curious, do you remember the original reason for this design decision?
>
> Might there be value in making the entire ruleset scores affected by nightly
> masschecks?

iirc, the risk is that a small set of corpora (e.g. a few people take a week
off) could cause the entire ruleset to be skewed incorrectly.  This way at
least only the most recent (sandbox) rules would be affected, so it's a bit
safer.

It's also faster to generate the scores, but this isn't so much of an issue
now, as our main machine is quite beefy...

There may have been other reasons, too, but I can't find the mails :(

--- Comment #19 from Warren Togami <wtogami@...>  2009-08-26 18:11:50 PST ---
> iirc, the risk is that a small set of corpora (e.g. a few people take a week
> off) could cause the entire ruleset to be skewed incorrectly.  This way at
> least only the most recent (sandbox) rules would be affected, so it's a bit
> safer.

> It's also faster to generate the scores, but this isn't so much of an issue
> now, as our main machine is quite beefy...

> There may have been other reasons, too, but I can't find the mails :(

I feel like we have too little diversity in the type and number of ham
contributors.  This rescoring would be a big improvement over our scores from
two years ago, and we definitely should do it.

But after 3.3.0 I would like to learn how I can become more involved in order
to revamp the score update process.

* I'd like to learn how to operate the GA.
* I want to continue recruiting other nightly masscheck participants.  I want
to recruit contributors of non-English languages and non-technical users.
* I am thinking about writing a toolkit (in RPM and DEB packages) that would
make it easier for participants to join masschecks.  The current documented
process is very unclear and confusing, and I want to clean this up as well.

With more diversity in masscheck participants, perhaps we can do a complete
rescoring more often than every 2 years.
