|
View:
New views
10 Messages
—
Rating Filter:
Alert me
|
|
|
[PATCH] tsearch parser inefficiency if text includes urls or emailsHi,
While playing around/evaluating tsearch I notices that to_tsvector is obscenely slow for some files. After some profiling I found that this is due using a seperate TSParser in p_ishost/p_isURLPath in wparser_def.c. If a multibyte encoding is in use TParserInit copies the whole remaining input and converts it to wchar_t or pg_wchar - for every email or protocol prefixed url in the the document. Which obviously is bad. I solved the issue by having a seperate TParserCopyInit/TParserCopyClose which reuses the the already converted strings of the original TParser - only at different offsets. Another approach would be to get rid of the separate parser invocations - requiring a bunch of additional states. This seemed more complex to me, so I wanted to get some feedback first. Without patch: andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog'; ───────────────────────────────────────────────────────────────────────────────────────────────────── ... (1 row) Time: 5835.676 ms With patch: andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog'; ───────────────────────────────────────────────────────────────────────────────────────────────────── ... (1 row) Time: 395.341 ms Ill cleanup the patch if it seems like a sensible solution... Is this backpatch-worthy? Andres PS: I let the additional define in for the moment so that its easier to see the performance differences. [reuse-strings-in-tparser-recursion.patch] diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 301c1eb..14757d9 100644 *** a/src/backend/tsearch/wparser_def.c --- b/src/backend/tsearch/wparser_def.c *************** *** 24,29 **** --- 24,30 ---- /* Define me to enable tracing of parser behavior */ /* #define WPARSER_TRACE */ + #define WPARSER_REUSE /* Output token categories */ *************** TParserInit(char *str, int len) *** 328,333 **** --- 329,358 ---- return prs; } + static TParser * + TParserCopyInit(const TParser const* orig, size_t offset, int len) + { + TParser *prs = (TParser *) palloc0(sizeof(TParser)); + + prs->charmaxlen = orig->charmaxlen; + prs->usewide = orig->usewide; + prs->lenstr = len; + + prs->str = orig->str + offset; + prs->pgwstr = orig->pgwstr + offset; + prs->wstr = orig->wstr + offset; + + + prs->state = newTParserPosition(NULL); + prs->state->state = TPS_Base; + + #ifdef WPARSER_TRACE + fprintf(stderr, "parsing \"%.*s\"\n", len, orig->str + offset); + #endif + + return prs; + } + static void TParserClose(TParser *prs) { *************** TParserClose(TParser *prs) *** 345,351 **** --- 370,388 ---- if (prs->pgwstr) pfree(prs->pgwstr); #endif + pfree(prs); + } + static void + TParserCopyClose(TParser *prs) + { + while (prs->state) + { + TParserPosition *ptr = prs->state->prev; + + pfree(prs->state); + prs->state = ptr; + } pfree(prs); } *************** p_isignore(TParser *prs) *** 617,623 **** static int p_ishost(TParser *prs) { ! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); int res = 0; tmpprs->wanthost = true; --- 654,664 ---- static int p_ishost(TParser *prs) { ! #ifdef WPARSER_REUSE ! TParser *tmpprs = TParserCopyInit(prs, prs->state->posbyte, prs->lenstr - prs->state->posbyte); ! #else ! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); ! #endif int res = 0; tmpprs->wanthost = true; *************** p_ishost(TParser *prs) *** 631,637 **** --- 672,682 ---- prs->state->charlen = tmpprs->state->charlen; res = 1; } + #ifdef WPARSER_REUSE + TParserCopyClose(tmpprs); + #else TParserClose(tmpprs); + #endif return res; } *************** p_ishost(TParser *prs) *** 639,645 **** static int p_isURLPath(TParser *prs) { ! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); int res = 0; tmpprs->state = newTParserPosition(tmpprs->state); --- 684,694 ---- static int p_isURLPath(TParser *prs) { ! #ifdef WPARSER_REUSE ! TParser *tmpprs = TParserCopyInit(prs, prs->state->posbyte, prs->lenstr - prs->state->posbyte); ! #else ! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); ! #endif int res = 0; tmpprs->state = newTParserPosition(tmpprs->state); *************** p_isURLPath(TParser *prs) *** 654,660 **** --- 703,713 ---- prs->state->charlen = tmpprs->state->charlen; res = 1; } + #ifdef WPARSER_REUSE + TParserCopyClose(tmpprs); + #else TParserClose(tmpprs); + #endif return res; } -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new versionOn Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> While playing around/evaluating tsearch I notices that to_tsvector is > obscenely slow for some files. After some profiling I found that this is > due using a seperate TSParser in p_ishost/p_isURLPath in wparser_def.c. If > a multibyte encoding is in use TParserInit copies the whole remaining > input and converts it to wchar_t or pg_wchar - for every email or protocol > prefixed url in the the document. Which obviously is bad. > > I solved the issue by having a seperate TParserCopyInit/TParserCopyClose > which reuses the the already converted strings of the original TParser - > only at different offsets. > > Another approach would be to get rid of the separate parser invocations - > requiring a bunch of additional states. This seemed more complex to me, so > I wanted to get some feedback first. > > Without patch: > andres=# SELECT to_tsvector('english', document) FROM document WHERE > filename = '/usr/share/doc/libdrm-nouveau1/changelog'; > > ────────────────────────────────────────────────────────────────────────── > ─────────────────────────── ... > (1 row) > > Time: 5835.676 ms > > With patch: > andres=# SELECT to_tsvector('english', document) FROM document WHERE > filename = '/usr/share/doc/libdrm-nouveau1/changelog'; > > ────────────────────────────────────────────────────────────────────────── > ─────────────────────────── ... > (1 row) > > Time: 395.341 ms > > Ill cleanup the patch if it seems like a sensible solution... issue? Andres [0001-Fix-TSearch-inefficiency-because-of-repeated-copying.patch] From cbdeb0bb636f3b7619d0a3019854809ea5565dac Mon Sep 17 00:00:00 2001 From: Andres Freund <andres@...> Date: Sun, 8 Nov 2009 16:30:42 +0100 Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings --- src/backend/tsearch/wparser_def.c | 63 ++++++++++++++++++++++++++++++++++-- 1 files changed, 59 insertions(+), 4 deletions(-) diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 301c1eb..7bbd826 100644 --- a/src/backend/tsearch/wparser_def.c +++ b/src/backend/tsearch/wparser_def.c @@ -328,6 +328,41 @@ TParserInit(char *str, int len) return prs; } +/* + * As an alternative to a full TParserInit one can create a + * TParserCopy which basically is a normally TParser without a private + * copy of the string - instead it uses the one from another TParser. + * This is usefull because at some places TParsers are created + * recursively and the repeated copying around of the strings can + * cause major inefficiency. + * Obviously one may not close the original TParser before the copy. + */ +static TParser * +TParserCopyInit(const TParser const* orig) +{ + TParser *prs = (TParser *) palloc0(sizeof(TParser)); + + prs->charmaxlen = orig->charmaxlen; + prs->usewide = orig->usewide; + prs->lenstr = orig->lenstr - orig->state->posbyte; + + prs->str = orig->str + orig->state->posbyte; + if(orig->pgwstr) + prs->pgwstr = orig->pgwstr + orig->state->poschar; + if(orig->wstr) + prs->wstr = orig->wstr + orig->state->poschar; + + prs->state = newTParserPosition(NULL); + prs->state->state = TPS_Base; + +#ifdef WPARSER_TRACE + fprintf(stderr, "parsing copy \"%.*s\"\n", len, str); +#endif + + return prs; +} + + static void TParserClose(TParser *prs) { @@ -350,6 +385,26 @@ TParserClose(TParser *prs) } /* + * See TParserCopyInit + */ +static void +TParserCopyClose(TParser *prs) +{ + while (prs->state) + { + TParserPosition *ptr = prs->state->prev; + + pfree(prs->state); + prs->state = ptr; + } +#ifdef WPARSER_TRACE + fprintf(stderr, "closing parser copy"); +#endif + pfree(prs); +} + + +/* * Character-type support functions, equivalent to is* macros, but * working with any possible encodings and locales. Notes: * - with multibyte encoding and C-locale isw* function may fail @@ -617,7 +672,7 @@ p_isignore(TParser *prs) static int p_ishost(TParser *prs) { - TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); + TParser *tmpprs = TParserCopyInit(prs); int res = 0; tmpprs->wanthost = true; @@ -631,7 +686,7 @@ p_ishost(TParser *prs) prs->state->charlen = tmpprs->state->charlen; res = 1; } - TParserClose(tmpprs); + TParserCopyClose(tmpprs); return res; } @@ -639,7 +694,7 @@ p_ishost(TParser *prs) static int p_isURLPath(TParser *prs) { - TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); + TParser *tmpprs = TParserCopyInit(prs); int res = 0; tmpprs->state = newTParserPosition(tmpprs->state); @@ -654,7 +709,7 @@ p_isURLPath(TParser *prs) prs->state->charlen = tmpprs->state->charlen; res = 1; } - TParserClose(tmpprs); + TParserCopyClose(tmpprs); return res; } -- 1.6.5.12.gd65df24 -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new versionOn Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote: > > While playing around/evaluating tsearch I notices that to_tsvector is > > obscenely slow for some files. After some profiling I found that this is > > due using a seperate TSParser in p_ishost/p_isURLPath in wparser_def.c. If > > a multibyte encoding is in use TParserInit copies the whole remaining > > input and converts it to wchar_t or pg_wchar - for every email or protocol > > prefixed url in the the document. Which obviously is bad. > > > > I solved the issue by having a seperate TParserCopyInit/TParserCopyClose > > which reuses the the already converted strings of the original TParser - > > only at different offsets. > > > > Another approach would be to get rid of the separate parser invocations - > > requiring a bunch of additional states. This seemed more complex to me, so > > I wanted to get some feedback first. > > > > Without patch: > > andres=# SELECT to_tsvector('english', document) FROM document WHERE > > filename = '/usr/share/doc/libdrm-nouveau1/changelog'; > > > > ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? > > ????????????????????????????????????????????????????????????????????????????????? ... > > (1 row) > > > > Time: 5835.676 ms > > > > With patch: > > andres=# SELECT to_tsvector('english', document) FROM document WHERE > > filename = '/usr/share/doc/libdrm-nouveau1/changelog'; > > > > ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? > > ????????????????????????????????????????????????????????????????????????????????? ... > > (1 row) > > > > Time: 395.341 ms > > > > Ill cleanup the patch if it seems like a sensible solution... > As nobody commented here is a corrected (stupid thinko) and cleaned up > version. Anyone cares to comment whether I am the only one thinking this is an > issue? > > Andres +1 As a user of tsearch, I can certainly appreciate to speed-up in parsing -- more CPU for everyone else. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new versionOn Sunday 08 November 2009 17:41:15 Kenneth Marshall wrote:
> On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote: > > As nobody commented here is a corrected (stupid thinko) and cleaned up > > version. Anyone cares to comment whether I am the only one thinking this > > is an issue? > > Andres > +1 > As a user of tsearch, I can certainly appreciate to speed-up in parsing -- > more CPU for everyone else. Please note that this is mostly an issue when using rather long documents including either email addresses (ie. an @) or links with protocol prefixes (like ?+://) - so it might not give you personally a benefit :-( Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
Re: tsearch parser inefficiency if text includes urls or emails - new versionAndres Freund <andres@...> wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote: >> While playing around/evaluating tsearch I notices that to_tsvector >> is obscenely slow for some files. After some profiling I found that >> this is due using a seperate TSParser in p_ishost/p_isURLPath in >> wparser_def.c. If a multibyte encoding is in use TParserInit copies >> the whole remaining input and converts it to wchar_t or pg_wchar - >> for every email or protocol prefixed url in the the document. Which >> obviously is bad. >> >> I solved the issue by having a seperate TParserCopyInit/ >> TParserCopyClose which reuses the the already converted strings of >> the original TParser - only at different offsets. >> >> Another approach would be to get rid of the separate parser >> invocations - requiring a bunch of additional states. This seemed >> more complex to me, so I wanted to get some feedback first. >> Without patch: >> Time: 5835.676 ms >> With patch: >> Time: 395.341 ms > As nobody commented here is a corrected (stupid thinko) and cleaned > up version. I've taken this one for review, and have taken a preliminary look. It is in context format, applies cleanly, and passes "make check". Next I read through the code, and have the same question that Andres posed 12 days ago. His patch massively reduces the cost of the parser recursively calling itself for some cases, and it seems like the least invasive way to modify the parser to solve this performance problem; but it does beg the question of why a state machine like this should recursively call itself when it hits certain states. The patch is mitigating the penalty for what seems to me to be an existing ugly kludge. Is it acceptable to address the problem in this way, or should the much more invasive work be done to eliminate the kludge? (Note: I personally would much rather see the performance penalty addressed this way, and a TODO added for the more invasive work, than to leave this alone for the next release if there's nobody willing to tackle the problem at a more fundamental level.) If nobody has a contrary opinion, I'll proceed with the review of this patch and add something to the TODO page for the more invasive work. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
Re: tsearch parser inefficiency if text includes urls or emails - new versionKevin Grittner wrote:
> (Note: I personally would much rather see the performance > penalty addressed this way, and a TODO added for the more invasive > work, than to leave this alone for the next release if there's nobody > willing to tackle the problem at a more fundamental level.) +1 -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
Re: tsearch parser inefficiency if text includes urls or emails - new versionOn Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
> It is in context format, applies cleanly, and passes "make check". Unfortunately the latter is not saying much - I had a bug there and it was not found by the regression tests. Perhaps I should take a stab and add at least some more... > It is in context format, applies cleanly, and passes "make check". > Next I read through the code, and have the same question that Andres > posed 12 days ago. His patch massively reduces the cost of the parser > recursively calling itself for some cases, and it seems like the least > invasive way to modify the parser to solve this performance problem; > but it does beg the question of why a state machine like this should > recursively call itself when it hits certain states. I was wondering about that as well. I am not completely sure but to me it looks like its just done to reduce the amount of rules and states. I have to say that that code is not exactly clear and well documented... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
|
|
|
Re: tsearch parser inefficiency if text includes urls or emails - new versionOn Saturday 14 November 2009 15:33:00 Kevin Grittner wrote:
> Andres Freund wrote: > > On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote: > >> It is in context format, applies cleanly, and passes "make check". > > > > Unfortunately the latter is not saying much - I had a bug there and > > it was not found by the regression tests. Perhaps I should take a > > stab and add at least some more... > > Sounds like a good idea. The one thing to avoid is anything with a > long enough run time to annoy those that run it many times a day. Strange. It was a bug involving uninitialized data so probably the regression tests where just "lucky". > >> It is in context format, applies cleanly, and passes "make check". > >> Next I read through the code, and have the same question that > >> Andres posed 12 days ago. His patch massively reduces the cost of > >> the parser recursively calling itself for some cases, and it seems > >> like the least invasive way to modify the parser to solve this > >> performance problem; but it does beg the question of why a state > >> machine like this should recursively call itself when it hits > >> certain states. > > I was wondering about that as well. I am not completely sure but to > > me it looks like its just done to reduce the amount of rules and > > states. > I'm assuming that's the reason, but didn't dig deep enough to be sure. > I suspect to be really sure, I'd have to set it up without the > recursion and see what breaks. I can't imagine it would be anything > which couldn't be fixed by adding enough states; but perhaps they ran > into something where these types would require so many new states that > the recursion seemed like the lesser of evils. > > I have to say that that code is not exactly clear and well > > documented... > Yeah. I was happy with the level of documentation that you added with > your new code, but what was there before is mighty thin. If you > gleaned enough information while working on it to feel comfortable > adding documentation anywhere else, that would be a good thing. It definitely would be a good thing. But that would definitely be seperate patch. But I fear my current leel of knowledge is sufficient and also I am not sure if I can make myself interested enough in that part. > So far the only vote is to proceed with the mitigation, which was my > preference, and apparently yours -- so I guess we're at 3 to 0 in > favor of that. I'll mark the patch as "Waiting on Author" so you can > add any comments and regression tests you feel are appropriate. > > By the way, I found one typo in the comments -- it should by useful, > not usefull. Ok, will update. > I liked what I saw so far, but need to spend more time desk-checking > for correctness, testing to confirm that it doesn't change results, > and confirming the performance improvement. Thanks again for your reviewing! Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
|
|
Re: tsearch parser inefficiency if text includes urls or emails - new versionAndres Freund <andres@...> wrote:
> On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote: >> Andres Freund wrote: >> > I had a bug there and it was not found by the regression tests. >> > Perhaps I should take a stab and add at least some more... >> >> Sounds like a good idea. > Hm. There actually are tests excercising the part where I had a > bug... Strange. It was a bug involving uninitialized data so > probably the regression tests where just "lucky". OK. I won't be looking for extra tests. >> > I have to say that that code is not exactly clear and well >> > documented... >> Yeah. I was happy with the level of documentation that you added >> with your new code, but what was there before is mighty thin. If >> you gleaned enough information while working on it to feel >> comfortable adding documentation anywhere else, that would be a >> good thing. > It definitely would be a good thing. But that would definitely be > seperate patch. But I fear my current leel of knowledge is > sufficient and also I am not sure if I can make myself interested > enough in that part. Fair enough. I won't be looking for new comments for the old code. >> By the way, I found one typo in the comments -- it should by >> useful, not usefull. > Ok, will update. Given how trivial that is, I'm putting this back in "Needs Review" status, and resuming my review work. Barring surprises, I should wrap this up whenever I can free up a two or three hours. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@...) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |
| Free embeddable forum powered by Nabble | Forum Help |