[Bug 6046] New: Add BayesStore backend using direct calls to BerkeleyDB

by Bugzilla from bugzilla-daemon@bugzilla.spamassassin.org

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6046

           Summary: Add BayesStore backend using direct calls to BerkeleyDB
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P3
         Component: Libraries
        AssignedTo: dev@...
        ReportedBy: mdorman@...


This is intended to be a high-speed, high-concurrency local BayesStore
implementation.  I think it succeeds in all of those respects.

In benchmarking using an updated (but largely unchanged in procedure)
masses/bayes-testing/benchmark setup, against a corpus of 6000 messages each of
ham and spam, it beats all of the other back-ends I tested in all operations.

The smallest gains are against the DB_File back-end, where it generally shows a
margin of 10%-20%, while it scores upwards of 3x the speed of the SQL
back-ends.

I have not yet run this code in production.  I intend to start that this week.
I would certainly appreciate more eyes on the code.


--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB

Justin Mason <jm@...> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.2.6                       |3.3.0




--- Comment #1 from Justin Mason <jm@...>  2009-01-19 06:57:41 PST ---
aiming at 3.3.0.  3.2.6 is unlikely to see any new features ;)


--- Comment #2 from Michael Alan Dorman <mdorman@...>  2009-01-22 05:52:16 PST ---
Created an attachment (id=4427)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4427)
The patch adding BayesStore/BDB.pm

I've still not quite got this in production use yet, but on benchmarks it
thrashes all of the other options pretty soundly.  I'll try to add those
figures after a quick re-run.


--- Comment #3 from Mark Martinec <Mark.Martinec@...>  2009-01-22 06:12:19 PST ---
> I've still not quite got this in production use yet, but on benchmarks it
> thrashes all of the other options pretty soundly.  I'll try to add those
> figures after a quick re-run.

What does benchmarking say about auto-purging the database?
In my experience, lengthy auto-expiration runs were the single
most pressing reason to abandon bdb and switch our Bayes
to SQL.


--- Comment #4 from Michael Alan Dorman <mdorman@...>  2009-01-22 06:31:56 PST ---
(In reply to comment #3)
> What does benchmarking say about auto-purging the database?
> In my experience, lengthy auto-expiration runs were the single
> most pressing reason to abandon bdb and switch our Bayes
> to SQL.

Well we don't have to do the dance of move-and-rebuild that DBM does, in part
because we are able to create a secondary index on atime that makes it very
easy to estimate whether we would expire too many tokens in a given run.  So I
would expect it to be as efficient as the SQL back-ends in that respect.
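
As a rough illustration of why a secondary index on atime helps, here is a minimal Python sketch. It is not the code from BDB.pm: the class, its method names and the `max_fraction` limit are all made up for this example, and a sorted in-memory list stands in for the BerkeleyDB secondary index. The point is only that, with tokens ordered by atime, counting how many would expire is a binary search rather than a full scan.

```python
import bisect

class TokenStore:
    """Toy token store with a secondary index ordered by access time."""

    def __init__(self):
        self.atime_by_token = {}  # primary data: token -> atime
        self.atime_index = []     # secondary index: sorted (atime, token) pairs

    def touch(self, token, atime):
        """Insert a token or update its atime, keeping the index in sync."""
        old = self.atime_by_token.get(token)
        if old is not None:
            pos = bisect.bisect_left(self.atime_index, (old, token))
            del self.atime_index[pos]
        self.atime_by_token[token] = atime
        bisect.insort(self.atime_index, (atime, token))

    def count_older_than(self, cutoff):
        """Count tokens with atime < cutoff using binary search only."""
        return bisect.bisect_left(self.atime_index, (cutoff,))

    def expire_older_than(self, cutoff, max_fraction=0.5):
        """Expire old tokens, but refuse if too large a share would go.

        max_fraction is a hypothetical safety limit for this sketch.
        """
        n = self.count_older_than(cutoff)
        if not self.atime_index or n / len(self.atime_index) > max_fraction:
            return 0
        for _, token in self.atime_index[:n]:
            del self.atime_by_token[token]
        del self.atime_index[:n]
        return n
```

The estimate (`count_older_than`) never touches the primary data, which is the property described above: the decision to expire can be made before any token is rewritten.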

Still, it's a good question that would not be answered out of the box by the
benchmark code, so I've changed the local.cf files in the benchmark directory
to remove bayes_auto_expire 0, and I'll re-run and see what the results look
like.

I'm curious, which SQL back-end are you using?  And if it's MySQL, do you have
any performance tuning tips?  I've been a little startled to find that PgSQL is
actually outperforming MySQL on my benchmarks (given their reputations), but
then I know how to tune PgSQL well; I'm more ignorant about MySQL.


--- Comment #5 from Mark Martinec <Mark.Martinec@...>  2009-01-22 07:31:13 PST ---
> Well we don't have to do the dance of move-and-rebuild that DBM does, in part
> because we are able to create a secondary index on atime that makes it very
> easy to estimate whether we would expire too many tokens in a given run.  So I
> would expect it to be as efficient as the SQL back-ends in that respect.

Good to hear.

> Still, it's a good question that would not be answered out of the box by the
> benchmark code, so I've changed the local.cf files in the benchmark directory
> to remove bayes_auto_expire 0, and I'll re-run and see what the results look
> like.

Thanks!

> I'm curious, which SQL back-end are you using?

Mail::SpamAssassin::BayesStore::MySQL

> And if it's MySQL, do you have any performance tuning tips?

Initially, none. Later I added the following to /etc/my.cnf:

[mysqld]
bind=127.0.0.1
key_buffer_size=60M
innodb_buffer_pool_size=384M
innodb_log_buffer_size=6M
innodb_flush_log_at_trx_commit=0
max_connections=60

based on some tips from:
http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/

This is with MySQL 5.1.24 InnoDB (current size 5.5 GB), on FreeBSD,
SA 3.3; currently bayes_token has 1M records, bayes_seen has 26M records
(I know, I need to ditch bayes_seen and start it from scratch).

Initially I used MyISAM and Mail::SpamAssassin::BayesStore::SQL,
which would get me in trouble every now and then, requiring a
REPAIR TABLE. Since switching to InnoDB and the dedicated
BayesStore::MySQL, it has not caused me any trouble in two years.

> I've been a little startled to find that PgSQL is
> actually outperforming MySQL on my benchmarks (given their reputations),
> but then I know how to tune PgSQL well; I'm more ignorant about MySQL.

I was running Bayes for a while on PostgreSQL 8.2 using
Mail::SpamAssassin::BayesStore::PgSQL, but the SELECT ... IN (...)
with a large set of tokens in the IN-set was quite slow,
much slower than with MySQL.
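
For readers unfamiliar with the query shape being discussed, here is a small Python sketch of a token lookup using SELECT ... IN (...). It is only an illustration: sqlite3 stands in for PostgreSQL or MySQL, the column names are assumptions rather than the schema used by the SpamAssassin back-ends, and the batching is a choice made for this example, not something proposed in this thread.

```python
import sqlite3

def fetch_token_counts(conn, tokens, batch_size=500):
    """Look up (spam_count, ham_count) for many tokens with batched IN queries."""
    found = {}
    tokens = list(tokens)
    for start in range(0, len(tokens), batch_size):
        batch = tokens[start:start + batch_size]
        # One "?" placeholder per token, so values are bound, not interpolated.
        placeholders = ",".join("?" * len(batch))
        sql = ("SELECT token, spam_count, ham_count FROM bayes_token "
               f"WHERE token IN ({placeholders})")
        for token, spam, ham in conn.execute(sql, batch):
            found[token] = (spam, ham)
    return found

# Tiny in-memory table so the sketch can be run as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes_token ("
             "token TEXT PRIMARY KEY, spam_count INTEGER, ham_count INTEGER)")
conn.executemany("INSERT INTO bayes_token VALUES (?, ?, ?)",
                 [("viagra", 40, 1), ("meeting", 2, 35)])
```

Tokens that are not in the table are simply absent from the returned dictionary, so the caller can treat them as unseen.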

I'm still using PostgreSQL for everything else except Bayes,
i.e. for AWL, and for amavisd-new SQL logging / pen pals database,
which is quite large and outperforms MySQL, especially on purging.


--- Comment #6 from Michael Alan Dorman <mdorman@...>  2009-01-22 09:39:41 PST ---
So, running masses/bayes-testing/benchmark (with auto-expire enabled) across
each back-end on the same corpus on an otherwise idle machine produces the
following results:

Benchmarking BDB

real    14m19.809s
user    10m27.823s
sys     0m42.495s

Benchmarking PostgreSQL

real    44m23.183s
user    8m15.167s
sys     0m30.978s

Benchmarking MySQL

real    44m18.552s
user    11m2.257s
sys     1m6.612s

Benchmarking SDBM

real    25m14.481s
user    9m56.833s
sys     1m1.748s

Benchmarking DB_File

real    41m6.284s
user    12m33.203s
sys     0m57.692s
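
To put the wall-clock ("real") figures above on a common scale, the following Python sketch converts them to seconds and expresses each back-end relative to BDB. The times are copied from the listing above; the helper names are just for this example.

```python
import re

# "real" times from the benchmark run above (auto-expire enabled).
REAL_TIMES = {
    "BDB": "14m19.809s",
    "PostgreSQL": "44m23.183s",
    "MySQL": "44m18.552s",
    "SDBM": "25m14.481s",
    "DB_File": "41m6.284s",
}

def to_seconds(stamp):
    """Convert a `time`-style stamp such as '14m19.809s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

def relative_to(baseline, times):
    """Return each back-end's wall-clock time as a multiple of the baseline."""
    base = to_seconds(times[baseline])
    return {name: to_seconds(t) / base for name, t in times.items()}

ratios = relative_to("BDB", REAL_TIMES)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<11}{ratio:.2f}x")
```

On these figures both SQL back-ends take roughly 3.1 times as long as BDB, DB_File about 2.9 times, and SDBM about 1.8 times.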

I won't pretend that this is the best-tuned MySQL install ever, though I have
done some tuning.  I may try tweaking it a bit based on Mark Martinec's
suggestions and re-run.

Anyway, that's based on the patch I've included here.  I would appreciate any
comments or suggestions anyone has about the code.


--- Comment #7 from Justin Mason <jm@...>  2009-01-23 01:59:22 PST ---
wow.  those are great numbers ;)


--- Comment #8 from Quanah Gibson-Mount <quanah@...>  2009-01-23 09:47:35 PST ---
Just as an aside, is there any way to make it so the bayes table can be shared
when using BDB?  I'm guessing not, but figured it doesn't hurt to ask. ;)


--- Comment #9 from Michael Alan Dorman <mdorman@...>  2009-01-23 10:32:34 PST ---
> Just as an aside, is there any way to make it so the bayes table can be shared
> when using BDB?  I'm guessing not, but figured it doesn't hurt to ask. ;)

Technically, I believe it is possible.  My DB4.6 reference material discusses
an RPC-based client-server mechanism.

That is, however, so far beyond my expertise, I wouldn't really be sure where
to begin.


--- Comment #10 from Michael Alan Dorman <mdorman@...>  2009-01-23 10:42:09 PST ---
(In reply to comment #7)
> wow.  those are great numbers ;)

Yeah, they're so great that I have to admit I wonder if I've got a bug
somewhere.

Specifically, bdb and db_file are very close right up until the long
single-threaded slogs through large mailboxes at the end of the benchmark, at
which point bdb pulls out a 5x-6x speed differential.  That seems to me wildly
improbable.

I think the code is pretty straightforward; I'd love to have other eyes looking
at it.


--- Comment #11 from Michael Alan Dorman <mdorman@...>  2009-01-23 11:18:45 PST ---
> Specifically, bdb and db_file are very close right up until the long
> single-threaded slogs through large mailboxes at the end of the benchmark, at
> which point bdb pulls out a 5x-6x speed differential.  That seems to me wildly
> improbable.

Oh, wait, I forgot: this was with auto_expire enabled, so it's not surprising
that it takes longer.  With auto_expire disabled, bdb is a little faster... but
only a little.


--- Comment #12 from Mark Martinec <Mark.Martinec@...>  2009-02-06 04:14:47 PST ---
I intended to try out the module, but was sidetracked by a perl5.8.9
problem, which crashes the interpreter on FreeBSD when it tries to
compile the code autogenerated from rules as soon as some module
with threading like BerkeleyDB is brought in:

perl -MBerkeleyDB body_0.pm
Bus error: 10 (core dumped)

I filed a perl bug report #63054:

  http://rt.perl.org/rt3/Public/Bug/Display.html?id=63054
  Perl_peep recursion exceeds stack in 5.8.9 eval-ing a long program

but I'm not yet sure how to proceed. Looks like I need to
downgrade to 5.8.8 or upgrade to 5.10.0, which is a bit of a
pain when most Perl modules are installed from FreeBSD ports.

A workaround could be if SA produced smaller code sections,
the 83600 lines of code for body rules at prio=0 is a lot
(using some SARE and ZDF rules), perl5.8.9 can survive about
two thirds of that.
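
The idea of producing smaller code sections can be sketched as follows. This is an illustration of the general approach only, written in Python rather than Perl; it is not SpamAssassin's code generator, and the function name and the line budget are assumptions. Rule bodies are packed greedily into sections so that no section exceeds a line limit, and splits happen only at rule boundaries.

```python
def pack_rules(rule_bodies, max_lines):
    """Group rule bodies into sections of at most max_lines lines each.

    A single rule longer than max_lines still gets a section of its own,
    because this sketch never splits inside a rule.
    """
    if max_lines < 1:
        raise ValueError("max_lines must be positive")
    sections, current, used = [], [], 0
    for body in rule_bodies:
        n = body.count("\n") + 1  # number of lines in this rule body
        if current and used + n > max_lines:
            sections.append(current)
            current, used = [], 0
        current.append(body)
        used += n
    if current:
        sections.append(current)
    return sections
```

Each resulting section could then be compiled separately, keeping every compilation unit well below the size at which the interpreter fails.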


--- Comment #13 from Justin Mason <jm@...>  2009-02-06 04:34:14 PST ---
(In reply to comment #12)
> I intended to try out the module, but was sidetracked by a perl5.8.9
> problem, which crashes the interpreter on FreeBSD when it tries to
> compile the code autogenerated from rules as soon as some module
> with threading like BerkeleyDB is brought in:

:(

> A workaround could be if SA produced smaller code sections,
> the 83600 lines of code for body rules at prio=0 is a lot
> (using some SARE and ZDF rules), perl5.8.9 can survive about
> two thirds of that.

yeah, we can do that I think.  can you open a separate SA bug for this?


--- Comment #14 from Mark Martinec <Mark.Martinec@...>  2009-02-06 05:22:28 PST ---
> > A workaround could be if SA produced smaller code sections,
> > the 83600 lines of code for body rules at prio=0 is a lot
>
> yeah, we can do that I think.  can you open a separate SA bug for this?

Ok, opened Bug 6060.


--- Comment #15 from Mark Martinec <Mark.Martinec@...>  2009-02-06 05:32:10 PST ---
Now back to the original topic.

The module looks alright, but needs testing. I have a few minor
coding/style comments, but that can be addressed later or during
importing it.

Michael, is the attached code the latest that you have, or have
you done any changes later? Do we have a licensing agreement
from Michael? I'd suggest we go on with importing the module -
- as it is an optional backend module what is important is that
we have an intention of adopting it and that it compiles cleanly,
any minor bugs can be fixed along the way.


Re: [Bug 6046] Add BayesStore backend using direct calls to BerkeleyDB

by Michael Alan Dorman-3

> The module looks alright, but needs testing. I have a few minor
> coding/style comments, but that can be addressed later or during
> importing it.

Yeah, I tried to keep to the style, but I have to admit a couple of
things go against deeply-ingrained habits, so I wouldn't be surprised
if I missed some stuff.

> Michael, is the attached code the latest that you have, or have
> you done any changes later?

I haven't done any more development at this point.  Though as I noted
in another mail, some more development may be warranted---it did not
perform as expected when I set it up in production.  I anticipate this
being relatively minor to correct, since it seems to work well in
testing.

> Do we have a licensing agreement from Michael?

You should---I emailed a scan of a completed and signed form to
secretary@... more than a week ago, and there should also be a
corporate assignment for me from ironicdesign.com that was faxed in
even before that.

If either of those is missing, please let me know---we didn't get any
acknowledgement for either.

> I'd suggest we go on with importing the module -
> - as it is an optional backend module what is important is that
> we have an intention of adopting it and that it compiles cleanly,
> any minor bugs can be fixed along the way.

That would be very cool.  I certainly intend to keep working on it, but
I would also appreciate any input others have.

Mike.

--- Comment #16 from Justin Mason <jm@...>  2009-02-06 06:18:27 PST ---
(In reply to comment #15)
> Michael, is the attached code the latest that you have, or have
> you done any changes later? Do we have a licensing agreement
> from Michael? I'd suggest we go on with importing the module -
> - as it is an optional backend module what is important is that
> we have an intention of adopting it and that it compiles cleanly,
> any minor bugs can be fixed along the way.

yep, the CLA is sorted; http://people.apache.org/~jim/committers.html .
so it's ready to go!

Mark, if you want to import it into SVN, go right ahead... I'm +1


--- Comment #17 from Mark Martinec <Mark.Martinec@...>  2009-02-09 06:42:37 PST ---
> Mark, if you want to import it into SVN, go right ahead... I'm +1

Imported a new Bayes backend M::S::BayesStore::BDB, along
with its benchmarks and tests (from Bug 6046 attachment);
author: Michael Alan Dorman
Sending        MANIFEST
Adding         lib/Mail/SpamAssassin/BayesStore/BDB.pm
Sending        masses/bayes-testing/benchmark/README
Adding         masses/bayes-testing/benchmark/helper/bdb
Adding         masses/bayes-testing/benchmark/helper/bdb/cleardb
Adding         masses/bayes-testing/benchmark/helper/bdb/dbsize
Adding         masses/bayes-testing/benchmark/helper/bdb/setup
Adding         masses/bayes-testing/benchmark/tests/bdb
Adding         masses/bayes-testing/benchmark/tests/bdb/site
Adding         masses/bayes-testing/benchmark/tests/bdb/site/init.pre
Adding         masses/bayes-testing/benchmark/tests/bdb/site/local.cf
Adding         masses/bayes-testing/benchmark/tests/bdb/user_prefs
Adding         t/bayesbdb.t
Transmitting file data ..........
Committed revision 742522 ( https://svn.apache.org/viewcvs.cgi?view=rev&rev=742522 ).


Quench the warning:
  Name "BerkeleyDB::db_version" used only once: possible typo at t/bayesbdb.t
Prevent bayesbdb.t from aborting:
  Can't use an undefined value as an ARRAY reference at
  ../blib/lib/Mail/SpamAssassin/BayesStore/BDB.pm line 683
Sending        lib/Mail/SpamAssassin/BayesStore/BDB.pm
Sending        t/bayesbdb.t
Transmitting file data ..
Committed revision 742535 ( https://svn.apache.org/viewcvs.cgi?view=rev&rev=742535 ).


--- Comment #18 from Mark Martinec <Mark.Martinec@...>  2009-02-09 11:10:46 PST ---
some rather cosmetic changes to BDB.pm: wrap long lines; remove some
redundant parentheses; TAB -> SP in POD sections (doesn't come out
nice); uncomment most dbg calls for the time being, change debug
area id 'BDB:' -> 'bayes:' for consistency with other backends;
changed 'assert'-like idiom to: 'assert_condition or die' (previously
a mix of 'negative_condition and die' and 'positive_cond or die'
was used); test status of BerkeleyDB methods for zero as per docs,
instead of testing for boolean; avoid idiom 'my $v=expr or die/dbg'
separating assignment and a test ('my $x=0 or die' yields:
'Found = in conditional', while 'my($x)=0 or die' never dies)

Sending        lib/Mail/SpamAssassin/BayesStore/BDB.pm
Transmitting file data .
Committed revision 742680 ( https://svn.apache.org/viewcvs.cgi?view=rev&rev=742680 ).
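
One of the changes above, testing a status for zero instead of for truth, is easy to get backwards, so here is a small Python sketch of the convention. It does not use the BerkeleyDB module: `db_get`, `DB_OK` and `DB_NOTFOUND` are hypothetical stand-ins for a call that returns 0 on success and a nonzero code on failure.

```python
DB_OK = 0        # success, following the "zero means success" convention
DB_NOTFOUND = 1  # hypothetical nonzero failure code for this sketch

def db_get(table, key, out):
    """Hypothetical status-returning lookup: appends the value to `out`."""
    if key in table:
        out.append(table[key])
        return DB_OK
    return DB_NOTFOUND

def fetch_or_raise(table, key):
    """Compare the status with zero explicitly rather than testing truthiness."""
    out = []
    status = db_get(table, key, out)
    if status != 0:
        raise KeyError(f"lookup of {key!r} failed with status {status}")
    return out[0]
```

With this convention a plain truth test reads the wrong way round, since the success value 0 is false; comparing against zero makes the intent explicit.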

