[Bug 6155] New: generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
Summary: generate new scores for 3.3.0 release
Product: Spamassassin
Version: unspecified
Platform: Other
OS/Version: All
Status: NEW
Severity: blocker
Priority: P1
Component: Score Generation
AssignedTo: dev@...
ReportedBy: jm@...

Here's a ticket to track this release work item.

Do we actually need to do this, though, since we have Daryl's code generating scores weekly from nightly mass-check results?

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
--- Comment #1 from Justin Mason <jm@...> 2009-07-31 09:56:38 PST ---
(In reply to comment #0)
> Do we actually need to do this, though, since we have Daryl's code generating
> scores weekly from nightly mass-check results?

well, we need to fix that, actually. it seems to be broken.
--- Comment #2 from Justin Mason <jm@...> 2009-08-14 13:30:20 PST ---
This time around, I think I'll scrap the confusing differentiation between nightly mass-check result submission rsync accounts and "submit" accounts. Anyone object?

I'm going to try a test run of the evolver based on nightly mass-check logs, btw.
--- Comment #3 from Justin Mason <jm@...> 2009-08-14 13:31:47 PST ---
http://wiki.apache.org/spamassassin/RescoreMassCheck is the procedure, as in previous releases.

fwiw, we have 1022294 spams and 271617 hams in our nightly corpora, currently.
--- Comment #4 from Mark Martinec <Mark.Martinec@...> 2009-08-17 13:15:01 PST ---
Created an attachment (id=4517)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4517)
Ignore missing support for ADSP in old versions of Mail::DKIM.
--- Comment #5 from Justin Mason <jm@...> 2009-08-17 13:27:40 PST ---
(In reply to comment #4)
> Created an attachment (id=4517)
> --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4517) [details]
> Ignore missing support for ADSP in old versions of Mail::DKIM.

wrong bug I suspect! ;)
--- Comment #6 from Warren Togami <wtogami@...> 2009-08-17 14:25:33 PST ---
Is there still time to add more nightlies for this rescoring? There is another major Japanese user that is very close to joining.

How important is this rescoring?
Do nightlies help to rescore the sa-update scores?
--- Comment #7 from Justin Mason <jm@...> 2009-08-17 15:21:17 PST ---
ok, I think I've ironed out a couple of issues. Let's see what people think of these sample scores:

http://taint.org/x/2009/gen-set0-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set1-5.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set2-2.0-5.0-500-ga_scores
http://taint.org/x/2009/gen-set3-5.0-5.0-500-ga_scores

here are the test results against the "test" fold for each scoreset:

gen-set0-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26453  99.07%
# Correctly spam:      83369  81.53%
# False positives:       249   0.93%
# False negatives:     18882  18.47%
# TCR(l=50): 3.263469  SpamRecall: 81.534%  SpamPrec: 99.702%

gen-set1-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26646  99.79%
# Correctly spam:     100943  98.72%
# False positives:        56   0.21%
# False negatives:      1308   1.28%
# TCR(l=50): 24.890701  SpamRecall: 98.721%  SpamPrec: 99.945%

gen-set2-2.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26485  99.19%
# Correctly spam:      84218  82.36%
# False positives:       217   0.81%
# False negatives:     18033  17.64%
# TCR(l=50): 3.540179  SpamRecall: 82.364%  SpamPrec: 99.743%

gen-set3-5.0-5.0-500-ga/test
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  26662  99.85%
# Correctly spam:     100964  98.74%
# False positives:        40   0.15%
# False negatives:      1287   1.26%
# TCR(l=50): 31.107697  SpamRecall: 98.741%  SpamPrec: 99.960%

Yes, set0 and set2 are terrible. This is pretty much what happened last time, too; our ruleset is pretty crappy nowadays without network rules active. But the net rule results are very good! However I think I need to look into the local rule GA runs if possible.

Bug 5270 is the 3.2.0 rescoring run, for reference.

Spamhaus will be happy to see a much improved score for RCVD_IN_PBL ;)

gen-set1-5.0-5.0-500-ga_scores:score RCVD_IN_PBL 2.596
gen-set3-5.0-5.0-500-ga_scores:score RCVD_IN_PBL 2.411
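For readers checking the figures in comment #7, here is a minimal sketch of how the summary lines can be reproduced from the four counts in each block. The formulas are an assumption inferred from the posted numbers (they do reproduce them), not taken from the masses scripts themselves: TCR is taken as total spam divided by (l times false positives plus false negatives), with l=50. The names `FoldCounts`, `tcr`, `spam_recall` and `spam_precision` are illustrative and not part of SpamAssassin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldCounts:
    """Message counts from one '# SUMMARY' block (illustrative type)."""
    ham_ok: int      # "Correctly non-spam"
    spam_ok: int     # "Correctly spam"
    false_pos: int   # ham scored as spam
    false_neg: int   # spam scored as ham


def tcr(c: FoldCounts, lam: float = 50.0) -> float:
    """Total cost ratio, assuming TCR = n_spam / (lam * FP + FN)."""
    n_spam = c.spam_ok + c.false_neg
    cost = lam * c.false_pos + c.false_neg
    # A perfect filter has zero cost; report that as infinity.
    return float("inf") if cost == 0 else n_spam / cost


def spam_recall(c: FoldCounts) -> float:
    """Fraction of all spam that was caught."""
    return c.spam_ok / (c.spam_ok + c.false_neg)


def spam_precision(c: FoldCounts) -> float:
    """Fraction of messages flagged as spam that really were spam."""
    return c.spam_ok / (c.spam_ok + c.false_pos)


# Counts copied from the set0 and set3 blocks in comment #7.
set0 = FoldCounts(ham_ok=26453, spam_ok=83369, false_pos=249, false_neg=18882)
set3 = FoldCounts(ham_ok=26662, spam_ok=100964, false_pos=40, false_neg=1287)

for name, counts in (("set0", set0), ("set3", set3)):
    print(f"{name}: TCR(l=50)={tcr(counts):.6f} "
          f"SpamRecall={spam_recall(counts):.3%} "
          f"SpamPrec={spam_precision(counts):.3%}")
```

With l=50 a single false positive costs as much as 50 missed spams, which is why set3's drop from 249 to 40 false positives moves the TCR so much more than the recall figures alone would suggest.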
--- Comment #8 from Justin Mason <jm@...> 2009-08-17 16:05:38 PST ---
Created an attachment (id=4518)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4518)
sample new scores, as diff

Here are the results of running a GA run for each set. Please shout about any and all issues you spot (and there are a few, I think, e.g. the ACCESSDB score leakage, which should probably be ignored by the masses scripts).
--- Comment #9 from Warren Togami <wtogami@...> 2009-08-17 20:46:17 PST ---
http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
90% FP rate for Japanese
http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
52% FP rate for Japanese
http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
44% FP rate for Japanese

All three of these rules do very poorly with Japanese mail, and the total % SPAM is lower than the % FP. Yet the GA scores are rather high since we don't have a statistically significant amount of Japanese mail in the corpus.

What language are the SPAM hits? Perhaps many are examples of identifying foreign languages instead of determining if it is ham or spam?

Bug #6149 is related to this problem.

I am attempting to convince Japanese, Chinese and Korean users to join the nightly masscheck, but it is very difficult.
--- Comment #10 from Justin Mason <jm@...> 2009-08-18 01:15:46 PST ---
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese
>
> All three of these rules do very poorly with Japanese mail, and the total %
> SPAM is lower than the % FP. Yet the GA scores are rather high since we don't
> have a statistically significant amount of Japanese mail in the corpus.
>
> What language are the SPAM hits? Perhaps many are examples of identifying
> foreign languages instead of determining if it is ham or spam?
>
> Bug #6149 is related to this problem.

I plan to fix that, alright.

> I am attempting to convince Japanese, Chinese and Korean users to join the
> nightly masscheck, but it is very difficult.

BTW, you could also take copies of their mail samples and add them to your own corpora, in effect acting as a proxy for them. that's easier for them than setting up all the infrastructure. (I thought you were already doing this ;)

You may need to be able to ask them if a mail _really_ is ham, down the line, though, so it needs to remain a two-way arrangement.
--- Comment #11 from Warren Togami <wtogami@...> 2009-08-18 19:24:47 PST ---
> BTW, you could also take copies of their mail samples and add them to your own
> corpora, in effect acting as a proxy for them. that's easier for them than
> setting up all the infrastructure. (I thought you were already doing this ;)

I have 3 English and 3 Japanese users in my corpus at the moment. One additional Japanese user, rio, is starting nightly masscheck, hopefully tonight. He is doing his own masschecks.

> You may need to be able to ask them if a mail _really_ is ham, down the line,
> though, so it needs to remain a two-way arrangement.

I asked them to classify very carefully to avoid mis-classification. This is part of the difficulty of getting more volunteers, aside from the privacy worries.

I look forward to seeing the effect of the fix in Bug #6149 on the next masscheck. I asked one of my users to pick a few dozen real-world sample messages that trigger the three rules in Comment #9 for the test suite.
--- Comment #12 from Warren Togami <wtogami@...> 2009-08-19 07:30:23 PST ---
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese

http://ruleqa.spamassassin.org/20090819-r805703-n/TVD_SPACE_RATIO/detail
0% FP rate for that particular Japanese user
http://ruleqa.spamassassin.org/20090819-r805703-n/PLING_QUERY/detail
0% FP rate for that particular Japanese user
(Huh? You changed this rule too?)
http://ruleqa.spamassassin.org/20090819-r805703-n/__GAPPY_SUBJECT/detail
44% FP rate for that particular Japanese user
--- Comment #13 from Warren Togami <wtogami@...> 2009-08-19 08:01:28 PST ---
http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
0% FP rate

Oops, wrong one?
--- Comment #14 from Justin Mason <jm@...> 2009-08-19 15:42:09 PST ---
(In reply to comment #13)
> http://ruleqa.spamassassin.org/20090819-r805703-n/GAPPY_SUBJECT/detail
> 0% FP rate
>
> Oops, wrong one?

yep, __GAPPY_SUBJECT is likely to have FPs; GAPPY_SUBJECT avoids them.
--- Comment #15 from Warren Togami <wtogami@...> 2009-08-19 18:52:38 PST ---
Looks good, looking forward to the next test scores.

Some questions...

How important is this rescoring?
Will future nightly masschecks help to rescore the sa-update scores?

Should I bother to continue recruiting more masscheck participants after this rescore?
--- Comment #16 from Justin Mason <jm@...> 2009-08-25 13:54:54 PST ---
(In reply to comment #15)
> How important is this rescoring?
> Will future nightly masschecks help to rescore the sa-update scores?

the base ruleset (non-sandbox rules) won't change scores, so this is important. For nightly masschecks, the only scores affected will be those of sandbox rules. So only about 1/2 of the ruleset, I'd reckon.

> Should I bother to continue recruiting more masscheck participants after this
> rescore?

No, I think as long as they provide results for the rescore, that's the most important thing.

Has anyone had inspiration about the reason for the bad set0 results? (I haven't looked yet)
--- Comment #17 from Warren Togami <wtogami@...> 2009-08-26 00:48:59 PST ---
> the base ruleset (non-sandbox rules) won't change scores, so this is important.
> For nightly masschecks, the only scores affected will be those of sandbox
> rules. So only about 1/2 of the ruleset, I'd reckon.

I am curious, do you remember the original reason for this design decision?

Might there be value in making the entire ruleset's scores affected by nightly masschecks?
--- Comment #18 from Justin Mason <jm@...> 2009-08-26 01:58:53 PST ---
(In reply to comment #17)
> > the base ruleset (non-sandbox rules) won't change scores, so this is important.
> > For nightly masschecks, the only scores affected will be those of sandbox
> > rules. So only about 1/2 of the ruleset, I'd reckon.
>
> I am curious, do you remember the original reason for this design decision?
>
> Might there be value in making the entire ruleset scores affected by nightly
> masschecks?

iirc, the risk is that a small set of corpora (e.g. a few people take a week off) could cause the entire ruleset to be skewed incorrectly. This way at least only the most recent (sandbox) rules would be affected, so it's a bit safer.

It's also faster to generate the scores, but this isn't so much of an issue now, as our main machine is quite beefy...

There may have been other reasons, too, but I can't find the mails :(
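To make the design in comments #16 and #18 concrete, here is a rough sketch of the idea only, not of the actual score-generation code: nightly results may overwrite scores for sandbox rules, while base-ruleset scores stay frozen until the next full rescore. The function name `merge_nightly_scores`, the rule name `T_HYPOTHETICAL_SANDBOX_RULE` and every score except the RCVD_IN_PBL value quoted in comment #7 are made up for illustration.

```python
def merge_nightly_scores(base_scores, nightly_scores, sandbox_rules):
    """Return a new score table in which only sandbox rules take nightly values.

    base_scores    -- dict of rule name -> score from the last full rescore
    nightly_scores -- dict of rule name -> score from the nightly run
    sandbox_rules  -- set of rule names allowed to change between releases
    """
    merged = dict(base_scores)  # copy so the release scores are not mutated
    for rule, score in nightly_scores.items():
        if rule in sandbox_rules:
            merged[rule] = score
        # Non-sandbox rules are ignored, so a skewed nightly corpus
        # cannot shift the base ruleset.
    return merged


# Illustrative inputs; only the RCVD_IN_PBL base score comes from the thread.
base = {"RCVD_IN_PBL": 2.596, "T_HYPOTHETICAL_SANDBOX_RULE": 1.0}
nightly = {"RCVD_IN_PBL": 0.5, "T_HYPOTHETICAL_SANDBOX_RULE": 1.8}
sandbox = {"T_HYPOTHETICAL_SANDBOX_RULE"}

merged = merge_nightly_scores(base, nightly, sandbox)
print(merged)
```

The trade-off is the one described above: a thin or unbalanced week of nightly corpora can only disturb the sandbox rules, at the price of base scores going stale between full rescoring runs.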
--- Comment #19 from Warren Togami <wtogami@...> 2009-08-26 18:11:50 PST ---
> iirc, the risk is that a small set of corpora (e.g. a few people take a week
> off) could cause the entire ruleset to be skewed incorrectly. This way at
> least only the most recent (sandbox) rules would be affected, so it's a bit
> safer.
> It's also faster to generate the scores, but this isn't so much of an issue
> now, as our main machine is quite beefy...
> There may have been other reasons, too, but I can't find the mails :(

I feel like we have too little diversity in the type and number of ham contributors. This rescoring would be a big improvement over our scores from two years ago and we definitely should do it. But after 3.3.0 I would like to learn how I can become more involved in order to revamp the score update process.

* I'd like to learn how to operate the GA.
* I want to continue recruiting other nightly masscheck participants. I want to recruit contributors of non-English languages and non-technical users.
* I am thinking about writing a toolkit (in RPM and DEB packages) that would make it easier for participants to join masschecks. The current documented process is very unclear and confusing, and I want to clean this up as well.

With more diversity in masscheck participants, perhaps we can do a complete rescoring more often than every 2 years.