|
|
[Bug 6046] New: Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

Summary: Add BayesStore backend using direct calls to BerkeleyDB
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P3
Component: Libraries
AssignedTo: dev@...
ReportedBy: mdorman@...

This is intended to be a high-speed, high-concurrency local BayesStore
implementation. I think it succeeds in all of those respects.

In benchmarking using an updated (but largely unchanged in procedure)
masses/bayes-testing/benchmark setup, against a corpus of 6000 messages
each of ham and spam, it beats all of the other back-ends I tested in all
operations. The smallest gains are against the DB_File back-end, where it
generally shows a margin of 10%-20%, while it scores upwards of 3x the
speed of the SQL back-ends.

I have not yet run this code in production. I intend to start that this
week. I would certainly appreciate more eyes on the code.

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

Justin Mason <jm@...> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.2.6                       |3.3.0

--- Comment #1 from Justin Mason <jm@...> 2009-01-19 06:57:41 PST ---
aiming at 3.3.0. 3.2.6 is unlikely to see any new features ;)
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #2 from Michael Alan Dorman <mdorman@...> 2009-01-22 05:52:16 PST ---
Created an attachment (id=4427)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4427)
The patch adding BayesStore/BDB.pm

I've still not quite got this in production use yet, but on benchmarks it
thrashes all of the other options pretty soundly. I'll try to add those
figures after a quick re-run.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #3 from Mark Martinec <Mark.Martinec@...> 2009-01-22 06:12:19 PST ---
> I've still not quite got this in production use yet, but on benchmarks it
> thrashes all of the other options pretty soundly. I'll try to add those
> figures after a quick re-run.

What does benchmarking say on auto-purging the database?
In my experience, lengthy auto expiration runs was the single
and most pressing reason to abandon bdb and switch our Bayes
to SQL.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #4 from Michael Alan Dorman <mdorman@...> 2009-01-22 06:31:56 PST ---
(In reply to comment #3)
> What does benchmarking say on auto-purging the database?
> In my experience, lengthy auto expiration runs was the single
> and most pressing reason to abandon bdb and switch our Bayes
> to SQL.

Well we don't have to do the dance of move-and-rebuild that DBM does, in
part because we are able to create a secondary index on atime that makes it
very easy to estimate whether we would expire too many tokens in a given
run. So I would expect it to be as efficient as the SQL back-ends in that
respect.

Still, it's a good question that would not be answered out of the box by
the benchmark code, so I've changed the local.cf files in the benchmark
directory to remove bayes_auto_expire 0, and I'll re-run and see what the
results look like.

I'm curious, which SQL back-end are you using? And if it's MySQL, do you
have any performance tuning tips? I've been a little startled to find that
PgSQL is actually outperforming MySQL on my benchmarks (given their reps)
but then I know how to tune PgSQL well; I'm more ignorant about MySQL.
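To illustrate the idea in comment #4: when tokens are also indexed by access time, the number of tokens an expiry run would remove can be found with a range lookup instead of a scan of the whole database. The following is a minimal Python sketch of that idea only. It is not the code in BayesStore/BDB.pm, and `AtimeIndex`, `count_expirable`, `safe_to_expire` and the `max_fraction` policy are hypothetical names and assumptions introduced for illustration.

```python
import bisect


class AtimeIndex:
    """Toy stand-in for a secondary index that orders tokens by atime."""

    def __init__(self):
        self._atimes = []   # access times, kept sorted
        self._tokens = []   # tokens, kept parallel to _atimes

    def add(self, token, atime):
        # Insert while keeping both lists ordered by atime.
        pos = bisect.bisect_right(self._atimes, atime)
        self._atimes.insert(pos, atime)
        self._tokens.insert(pos, token)

    def count_expirable(self, cutoff):
        # Number of tokens with atime strictly older than the cutoff,
        # found by binary search rather than by visiting every token.
        return bisect.bisect_left(self._atimes, cutoff)

    def expire(self, cutoff):
        # Remove and return the tokens older than the cutoff.
        n = self.count_expirable(cutoff)
        expired = self._tokens[:n]
        del self._atimes[:n]
        del self._tokens[:n]
        return expired


def safe_to_expire(index, cutoff, total, max_fraction=0.5):
    # Hypothetical policy: refuse a run that would drop too large a share
    # of the database. The threshold is an assumption, not SA's rule.
    return index.count_expirable(cutoff) <= total * max_fraction
```

A caller would check `safe_to_expire(...)` first and only then call `expire(...)`, so the cost of deciding whether to run is independent of the number of tokens that stay.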
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #5 from Mark Martinec <Mark.Martinec@...> 2009-01-22 07:31:13 PST ---
> Well we don't have to do the dance of move-and-rebuild that DBM does, in part
> because we are able to create a secondary index on atime that makes it very
> easy to estimate whether we would expire too many tokens in a given run. So I
> would expect it to be as efficient as the SQL back-ends in that respect.

Good to hear.

> Still, it's a good question that would not be answered out of the box by the
> benchmark code, so I've changed the local.cf files in the benchmark directory
> to remove bayes_auto_expire 0, and I'll re-run and see what the results look
> like.

Thanks!

> I'm curious, which SQL back-end are you using?

Mail::SpamAssassin::BayesStore::MySQL

> And if it's MySQL, do you have any performance tuning tips?

Initially not. Later I added (/etc/my.cnf) :

  [mysqld]
  bind=127.0.0.1
  key_buffer_size=60M
  innodb_buffer_pool_size=384M
  innodb_log_buffer_size=6M
  innodb_flush_log_at_trx_commit=0
  max_connections=60

based on some tips from:
  http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/

This is with MySQL 5.1.24 InnoDB (current size 5.5 GB), on FreeBSD, SA 3.3;
currently bayes_token has 1M records, bayes_seen has 26M records
(I know, I need to ditch bayes_seen and start it from scratch).

Initially I used MyISAM and Mail::SpamAssassin::BayesStore::SQL,
which would get me in trouble every now and then, requiring
a REPAIR TABLE. Now with InnoDB and the dedicated BayesStore::MySQL
it never again got me into trouble in two years.

> I've been a little startled to find that PgSQL is
> actually outperforming MySQL on my benchmarks (given their reps)
> but then I know how to tune PgSQL well; I'm more ignorant about MySQL.

I was running Bayes for a while on PostgreSQL 8.2 using
Mail::SpamAssassin::BayesStore::PgSQL, but the SELECT ... IN (...)
with a large set of tokens in the IN-set was quite slow, much
slower than with MySQL.

I'm still using PostgreSQL for everything else except Bayes,
i.e. for AWL, and for amavisd-new SQL logging / pen pals database,
which is quite large and outperforms MySQL, especially on purging.
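For readers unfamiliar with the query shape mentioned in comment #5, the sketch below shows a token lookup that fetches many tokens in one `SELECT ... IN (...)` statement. It is written in Python against an in-memory SQLite database purely for illustration: SQLite stands in for MySQL or PostgreSQL, the three-column `bayes_token` table is a simplified assumption rather than the shipped SpamAssassin schema, and `fetch_token_counts` and `demo_connection` are hypothetical helpers. It says nothing about the relative speed of the real back-ends.

```python
import sqlite3


def fetch_token_counts(conn, tokens):
    """Look up many tokens in one statement using SELECT ... IN (...)."""
    if not tokens:
        return {}
    # One bound placeholder per token; the IN-set grows with the message.
    placeholders = ", ".join("?" for _ in tokens)
    sql = (
        "SELECT token, spam_count, ham_count FROM bayes_token "
        f"WHERE token IN ({placeholders})"
    )
    rows = conn.execute(sql, list(tokens)).fetchall()
    return {token: (spam, ham) for token, spam, ham in rows}


def demo_connection():
    # In-memory SQLite stands in for the real server; the schema is a
    # simplified assumption for this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token ("
        "token TEXT PRIMARY KEY, spam_count INTEGER, ham_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO bayes_token VALUES (?, ?, ?)",
        [("lottery", 40, 1), ("meeting", 2, 35), ("invoice", 10, 12)],
    )
    return conn
```

Tokens that are not in the table are simply absent from the returned dictionary, so the caller can treat them as never seen.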
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #6 from Michael Alan Dorman <mdorman@...> 2009-01-22 09:39:41 PST ---
So, running masses/bayes-testing/benchmark (with auto-expire enabled)
across each back-end on the same corpus on an otherwise idle machine
produces the following results:

Benchmarking BDB
real    14m19.809s
user    10m27.823s
sys     0m42.495s

Benchmarking PostgreSQL
real    44m23.183s
user    8m15.167s
sys     0m30.978s

Benchmarking MySQL
real    44m18.552s
user    11m2.257s
sys     1m6.612s

Benchmarking SDBM
real    25m14.481s
user    9m56.833s
sys     1m1.748s

Benchmarking DB_File
real    41m6.284s
user    12m33.203s
sys     0m57.692s

I won't pretend that MySQL is the best-tuned MySQL install ever, though I
have done some. I may try tweaking that a bit based on Mark Martinec's
suggestions and re-run.

Anyway, that's based on the patch I've included here. I would appreciate
any comments or suggestions anyone has about the code.
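The wall-clock ratios implied by the figures in comment #6 can be worked out with a few lines of Python. This is only arithmetic on the "real" times quoted above; `to_seconds` and `ratios_vs` are hypothetical helper names, and the sketch assumes every time is in the shell's `NmS.SSSs` format.

```python
import re

# "real" (wall-clock) times as reported in comment #6.
REAL_TIMES = {
    "BDB": "14m19.809s",
    "PostgreSQL": "44m23.183s",
    "MySQL": "44m18.552s",
    "SDBM": "25m14.481s",
    "DB_File": "41m6.284s",
}


def to_seconds(stamp):
    """Convert the shell `time` format, e.g. '14m19.809s', to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time format: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def ratios_vs(baseline, times=REAL_TIMES):
    """Return how many times longer each back-end took than the baseline."""
    base = to_seconds(times[baseline])
    return {name: to_seconds(t) / base for name, t in times.items()}


if __name__ == "__main__":
    for name, ratio in sorted(ratios_vs("BDB").items(), key=lambda kv: kv[1]):
        print(f"{name:<12}{ratio:.2f}x")
```

On these figures the two SQL back-ends and DB_File each took roughly three times as long as BDB, and SDBM a little under twice as long. Note that this run had auto-expire enabled, which comment #11 identifies as the reason for the large gap against DB_File.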
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #7 from Justin Mason <jm@...> 2009-01-23 01:59:22 PST ---
wow. those are great numbers ;)
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #8 from Quanah Gibson-Mount <quanah@...> 2009-01-23 09:47:35 PST ---
Just as an aside, is there any way to make it so the bayes table can be
shared when using BDB? I'm guessing not, but figured it doesn't hurt to
ask. ;)
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #9 from Michael Alan Dorman <mdorman@...> 2009-01-23 10:32:34 PST ---
> Just as an aside, is there any way to make it so the bayes table can be shared
> when using BDB? I'm guessing not, but figured it doesn't hurt to ask. ;)

Technically, I believe it is possible. My DB4.6 reference material
discusses an RPC-based client-server mechanism.

That is, however, so far beyond my expertise, I wouldn't really be sure
where to begin.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #10 from Michael Alan Dorman <mdorman@...> 2009-01-23 10:42:09 PST ---
(In reply to comment #7)
> wow. those are great numbers ;)

Yeah, they're so great that I have to admit I wonder if I've got a bug
somewhere.

Specifically, bdb and db_file are very close right up until the long
single-threaded slogs through large mailboxes at the end of the benchmark,
at which point bdb pulls out a 5x-6x speed differential. That seems to me
wildly improbable.

I think the code is pretty straightforward; I'd love to have other eyes
looking at it.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #11 from Michael Alan Dorman <mdorman@...> 2009-01-23 11:18:45 PST ---
> Specifically, bdb and db_file are very close right up until the long
> single-threaded slogs through large mailboxes at the end of the benchmark, at
> which point bdb pulls out a 5x-6x speed differential. That seems to me wildly
> improbable.

Oh, wait, I forgot, this was with auto_expire enabled, so it's not
surprising that it takes longer. With auto_expire disabled, bdb is a little
faster...but only a little.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #12 from Mark Martinec <Mark.Martinec@...> 2009-02-06 04:14:47 PST ---
I intended to try out the module, but was sidetracked by a perl5.8.9
problem, which crashes the interpreter on FreeBSD when it tries to
compile the code autogenerated from rules as soon as some module
with threading like BerkeleyDB is brought in:

  perl -MBerkeleyDB body_0.pm
  Bus error: 10 (core dumped)

I filed a perl bug report #63054:
  http://rt.perl.org/rt3/Public/Bug/Display.html?id=63054
  Perl_peep recursion exceeds stack in 5.8.9 eval-ing a long program
but I'm not yet sure how to proceed. Looks like I need to downgrade
to 5.8.8 or upgrade to 5.10.0, which is a bit of a pain when most
Perl modules are installed from FreeBSD ports.

A workaround could be if SA produced smaller code sections,
the 83600 lines of code for body rules at prio=0 is a lot
(using some SARE and ZDF rules), perl5.8.9 can survive about
two thirds of that.
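The workaround floated in comment #12, emitting the generated rule code in smaller sections so that no single compile unit grows too large, amounts to a simple chunking step. The sketch below shows that kind of chunking in Python for illustration only. It is not how SpamAssassin generates or compiles its rule code, and `split_into_sections` and its `max_lines` limit are hypothetical.

```python
def split_into_sections(snippets, max_lines):
    """Group code snippets into sections of at most max_lines lines each.

    A snippet is never split; a snippet that is longer than max_lines on
    its own ends up in a section by itself.
    """
    sections, current, current_lines = [], [], 0
    for snippet in snippets:
        n = snippet.count("\n") + 1          # lines in this snippet
        if current and current_lines + n > max_lines:
            sections.append(current)         # close the full section
            current, current_lines = [], 0
        current.append(snippet)
        current_lines += n
    if current:
        sections.append(current)
    return sections
```

Each returned section would then be compiled as a separate unit, keeping every unit below whatever size the interpreter can handle.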
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #13 from Justin Mason <jm@...> 2009-02-06 04:34:14 PST ---
(In reply to comment #12)
> I intended to try out the module, but was sidetracked by a perl5.8.9
> problem, which crashes the interpreter on FreeBSD when it tries to
> compile the code autogenerated from rules as soon as some module
> with threading like BerkeleyDB is brought in:

:(

> A workaround could be if SA produced smaller code sections,
> the 83600 lines of code for body rules at prio=0 is a lot
> (using some SARE and ZDF rules), perl5.8.9 can survive about
> two thirds of that.

yeah, we can do that I think. can you open a separate SA bug for this?
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #14 from Mark Martinec <Mark.Martinec@...> 2009-02-06 05:22:28 PST ---
> > A workaround could be if SA produced smaller code sections,
> > the 83600 lines of code for body rules at prio=0 is a lot
>
> yeah, we can do that I think. can you open a separate SA bug for this?

Ok, opened Bug 6060.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #15 from Mark Martinec <Mark.Martinec@...> 2009-02-06 05:32:10 PST ---
Now back to the original topic.

The module looks alright, but needs testing. I have a few minor
coding/style comments, but that can be addressed later or during
importing it.

Michael, is the attached code the latest that you have, or have
you done any changes later? Do we have a licensing agreement
from Michael? I'd suggest we go on with importing the module -
- as it is an optional backend module what is important is that
we have an intention of adopting it and that it compiles cleanly,
any minor bugs can be fixed along the way.
|
|
Re: [Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB

> The module looks alright, but needs testing. I have a few minor
> coding/style comments, but that can be addressed later or during
> importing it.

Yeah, I tried to keep to the style, but I have to admit a couple of things
go against deeply-ingrained habits, so I wouldn't be surprised if I missed
some stuff.

> Michael, is the attached code the latest that you have, or have
> you done any changes later?

I haven't done any more development at this point. Though as I noted in
another mail, some more development may be warranted---it did not perform
as expected when I set it up in production. I anticipate this being
relatively minor to correct, since it seems to work well in testing.

> Do we have a licensing agreement from Michael?

You should---I emailed a scan of a completed and signed form to
secretary@... more than a week ago, and there should also be a corporate
assignment for me from ironicdesign.com that was faxed in even before that.

If either of those are missing, please let me know---we didn't get any ack
for either.

> I'd suggest we go on with importing the module -
> - as it is an optional backend module what is important is that
> we have an intention of adopting it and that it compiles cleanly,
> any minor bugs can be fixed along the way.

That would be very cool. I certainly intend to keep working on it, but I
would also appreciate any input others have.

Mike.
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #16 from Justin Mason <jm@...> 2009-02-06 06:18:27 PST ---
(In reply to comment #15)
> Michael, is the attached code the latest that you have, or have
> you done any changes later? Do we have a licensing agreement
> from Michael? I'd suggest we go on with importing the module -
> - as it is an optional backend module what is important is that
> we have an intention of adopting it and that it compiles cleanly,
> any minor bugs can be fixed along the way.

yep, the CLA is sorted; http://people.apache.org/~jim/committers.html .
so it's ready to go!

Mark, if you want to import it into SVN, go right ahead... I'm +1
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #17 from Mark Martinec <Mark.Martinec@...> 2009-02-09 06:42:37 PST ---
> Mark, if you want to import it into SVN, go right ahead... I'm +1

Imported a new Bayes backend M::S::BayesStore::BDB, along with its
benchmarks and tests (from Bug 6046 attachment); author: Michael Alan Dorman

Sending        MANIFEST
Adding         lib/Mail/SpamAssassin/BayesStore/BDB.pm
Sending        masses/bayes-testing/benchmark/README
Adding         masses/bayes-testing/benchmark/helper/bdb
Adding         masses/bayes-testing/benchmark/helper/bdb/cleardb
Adding         masses/bayes-testing/benchmark/helper/bdb/dbsize
Adding         masses/bayes-testing/benchmark/helper/bdb/setup
Adding         masses/bayes-testing/benchmark/tests/bdb
Adding         masses/bayes-testing/benchmark/tests/bdb/site
Adding         masses/bayes-testing/benchmark/tests/bdb/site/init.pre
Adding         masses/bayes-testing/benchmark/tests/bdb/site/local.cf
Adding         masses/bayes-testing/benchmark/tests/bdb/user_prefs
Adding         t/bayesbdb.t
Transmitting file data ..........
Committed revision 742522
( https://svn.apache.org/viewcvs.cgi?view=rev&rev=742522 ).

Quench the warning:
  BerkeleyDB::db_version" used only once: possible typo at t/bayesbdb.t
Prevent bayesbdb.t from aborting:
  Can't use an undefined value as an ARRAY reference at
  ../blib/lib/Mail/SpamAssassin/BayesStore/BDB.pm line 683

Sending        lib/Mail/SpamAssassin/BayesStore/BDB.pm
Sending        t/bayesbdb.t
Transmitting file data ..
Committed revision 742535
( https://svn.apache.org/viewcvs.cgi?view=rev&rev=742535 ).
|
|
[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

--- Comment #18 from Mark Martinec <Mark.Martinec@...> 2009-02-09 11:10:46 PST ---
some rather cosmetic changes to BDB.pm:
wrap long lines;
remove some redundant parentheses;
TAB -> SP in POD sections (doesn't come out nice);
uncomment most dbg calls for the time being,
change debug area id 'BDB:' -> 'bayes:' for consistency with other backends;
changed 'assert'-like idiom to: 'assert_condition or die'
(previously a mix of 'negative_condition and die' and
'positive_cond or die' was used);
test status of BerkeleyDB methods for zero as per docs,
instead of testing for boolean;
avoid idiom 'my $v=expr or die/dbg' separating assignment and a test
('my $x=0 or die' yields: 'Found = in conditional',
while 'my($x)=0 or die' never dies)

Sending        lib/Mail/SpamAssassin/BayesStore/BDB.pm
Transmitting file data .
Committed revision 742680
( https://svn.apache.org/viewcvs.cgi?view=rev&rev=742680 ).