|
View:
New views
13 Messages
—
Rating Filter:
Alert me
|
|
|
Caching merged diffsHi all,
In Differ, merged diffs are repeatedly calculated whenever we iterate over changes. The merged versions of diffs can theoretically differ, as the underlying text is passed in on each call. However, this text always ends up being the raw file text, and we always update our diffs whenever this text changes. As such, we can cache the merged version of our diffs by disallowing the effectively unused texts argument. I'm attaching two patches, the first of which implements this caching of merged diffs, and the second of which provides a short-circuit for two-way comparisons. On two-way diffs, these patches provide a very slight speed improvement; on three-way diffs, the speed-up is easily measurable -- around 5% in a few quick tests based on scrolling through a three-way source file comparison. Comments appreciated. cheers, Kai _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs> In Differ, merged diffs are repeatedly calculated whenever we iterate
> over changes. The merged versions of diffs can theoretically differ, > as the underlying text is passed in on each call. However, this text > always ends up being the raw file text, and we always update our diffs > whenever this text changes. As such, we can cache the merged version > of our diffs by disallowing the effectively unused texts argument. > > I'm attaching two patches, the first of which implements this caching > of merged diffs, and the second of which provides a short-circuit for > two-way comparisons. On two-way diffs, these patches provide a very > slight speed improvement; on three-way diffs, the speed-up is easily > measurable -- around 5% in a few quick tests based on scrolling > through a three-way source file comparison. Thats a good idea. I've semi-deliberately not changed any of the diffing code in ages because I've always intended to rewrite it with an asynchronous interface. I think this is the only way to get acceptable performance on large files/large change blocks. It's reasonable then to put the diff computation & most of the i/o in a subprocess, which would make the ui much more responsive. Stephen. _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffsHi,
Maybe I overlooked something, but it seems that passing 2 arguments to seq_sequences_iter() is not needed. Patch attached. Cheers, Piotr Kai Willadsen wrote: > Hi all, > > In Differ, merged diffs are repeatedly calculated whenever we iterate > over changes. The merged versions of diffs can theoretically differ, > as the underlying text is passed in on each call. However, this text > always ends up being the raw file text, and we always update our diffs > whenever this text changes. As such, we can cache the merged version > of our diffs by disallowing the effectively unused texts argument. > > I'm attaching two patches, the first of which implements this caching > of merged diffs, and the second of which provides a short-circuit for > two-way comparisons. On two-way diffs, these patches provide a very > slight speed improvement; on three-way diffs, the speed-up is easily > measurable -- around 5% in a few quick tests based on scrolling > through a three-way source file comparison. > > Comments appreciated. > > cheers, > Kai > > ------------------------------------------------------------------------ > > _______________________________________________ > meld-list mailing list > meld-list@... > http://mail.gnome.org/mailman/listinfo/meld-list From 7946d01f45df9af20fb15d559fefee9b88d7d2de Mon Sep 17 00:00:00 2001 From: Piotr Piastucki <the_leech@...> Date: Tue, 14 Jul 2009 18:46:47 +0200 Subject: [PATCH] Remove texts from set_sequences_iter arguments --- diffutil.py | 4 ++-- filediff.py | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/diffutil.py b/diffutil.py index 3da1c1a..08b8470 100644 --- a/diffutil.py +++ b/diffutil.py @@ -248,7 +248,7 @@ class Differ(object): for c in self._auto_merge(using, texts): yield c - def set_sequences_iter(self, sequences, texts): + def set_sequences_iter(self, sequences): assert 0 <= len(sequences) <= 3 self.diffs = [[], []] self.num_sequences = len(sequences) @@ -260,6 +260,6 @@ class Differ(object): while work.next() == None: yield None self.diffs[i] = matcher.get_difference_opcodes() - self._update_merge_cache(texts) + self._update_merge_cache(sequences) yield 1 diff --git a/filediff.py b/filediff.py index 5e9441b..acc1546 100644 --- a/filediff.py +++ b/filediff.py @@ -591,7 +591,7 @@ class FileDiff(melddoc.MeldDoc, gnomeglade.Component): yield _("[%s] Computing differences") % self.label_text panetext = [self._filter_text(p) for p in panetext] lines = map(lambda x: x.split("\n"), panetext) - step = self.linediffer.set_sequences_iter(lines, self._get_texts()) + step = self.linediffer.set_sequences_iter(lines) while step.next() == None: yield 1 -- 1.6.0.4 _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs2009/7/15 Piotr Piastucki <leech.miranda@...>:
> Hi, > > Maybe I overlooked something, but it seems that passing 2 arguments to > seq_sequences_iter() is not needed. > Patch attached. I could swear that I had a reason for passing in text, but I can't imagine what it was. Applied, thanks. Kai _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs2009/7/11 Stephen Kennedy <stevek@...>:
>> In Differ, merged diffs are repeatedly calculated whenever we iterate >> over changes. The merged versions of diffs can theoretically differ, >> as the underlying text is passed in on each call. However, this text >> always ends up being the raw file text, and we always update our diffs >> whenever this text changes. As such, we can cache the merged version >> of our diffs by disallowing the effectively unused texts argument. >> >> I'm attaching two patches, the first of which implements this caching >> of merged diffs, and the second of which provides a short-circuit for >> two-way comparisons. On two-way diffs, these patches provide a very >> slight speed improvement; on three-way diffs, the speed-up is easily >> measurable -- around 5% in a few quick tests based on scrolling >> through a three-way source file comparison. > > Thats a good idea. I've semi-deliberately not changed any of the diffing > code in ages because I've always intended to rewrite it with an asynchronous > interface. I think this is the only way to get acceptable performance on > large files/large change blocks. > > It's reasonable then to put the diff computation & most of the i/o in a > subprocess, which would make the ui much more responsive. I couldn't agree more. This caching patch has the benefit that it drops us to set_sequences_iter() and change_sequence() as the only interface points that modify internal state. Turning these into asynchronous methods shouldn't be *too* hard... Kai _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs2009/7/11 Stephen Kennedy <stevek@...>:
>> It's reasonable then to put the diff computation & most of the i/o in a >> subprocess, which would make the ui much more responsive. >> Regarding ui responsiveness, I am sending a patch that tries to improve performance of inline highlighting. Test case: 1) run "meld filediff.py meldapp.py" 2) type something in either of the panes 3) wait.... :) I know it is not a real-life example, but it shows exactly what the issue is. Patch: I added a simple cache that stores replace chunks. The cache is updated and checked in _update_highlighting(). Chunks that did not change are not taken into account and as a result inline highlighting becomes somewhat incremental. Moreover I could not find any place where "delete line", "conflict line" or "replace line" tags are set so I changed this piece of code and remove_tag() is called for "inline line" only. Comments welcome. Cheers, Piotr From 7e85c89f9831f5cc5280d75dc0ffae4ae66db234 Mon Sep 17 00:00:00 2001 From: Piotr Piastucki <the_leech@...> Date: Fri, 17 Jul 2009 17:29:29 +0200 Subject: [PATCH] Inline highlighting performance improvements --- filediff.py | 92 +++++++++++++++++++++++++++++++++++++---------------------- 1 files changed, 58 insertions(+), 34 deletions(-) diff --git a/filediff.py b/filediff.py index acc1546..145d061 100644 --- a/filediff.py +++ b/filediff.py @@ -91,6 +91,7 @@ class FileDiff(melddoc.MeldDoc, gnomeglade.Component): self._sync_vscroll_lock = False self._sync_hscroll_lock = False self.linediffer = diffutil.Differ() + self._inline_cache = [] for text in self.textview: text.set_wrap_mode( self.prefs.edit_wrap_lines ) for buf in self.textbuffer: @@ -622,47 +623,70 @@ class FileDiff(melddoc.MeldDoc, gnomeglade.Component): mgr.clear() def _update_highlighting(self): - for b in self.textbuffer: - taglist = ["delete line", "conflict line", "replace line", "inline line"] - table = b.get_tag_table() - for tagname in taglist: - tag = table.lookup(tagname) + alltexts = [b for b in self._get_texts(raw=1)] + alltags = [b.get_tag_table().lookup("inline line") for b in self.textbuffer] + clearchunks = len(self._inline_cache) != 0 + if not clearchunks: + for b, tag in zip(self.textbuffer, alltags): b.remove_tag(tag, b.get_start_iter(), b.get_end_iter() ) + newcache = [] + startlines = [0] * 3 for chunk in self.linediffer.all_changes(): for i,c in enumerate(chunk): if c and c[0] == "replace": bufs = self.textbuffer[1], self.textbuffer[i*2] - #tags = [b.get_tag_table().lookup("replace line") for b in bufs] - starts = [b.get_iter_at_line(l) for b,l in zip(bufs, (c[1],c[3])) ] - text1 = "\n".join( self._get_texts(raw=1)[1 ][c[1]:c[2]] ).encode("utf16") - text1 = struct.unpack("%iH"%(len(text1)/2), text1)[1:] - textn = "\n".join( self._get_texts(raw=1)[i*2][c[3]:c[4]] ).encode("utf16") - textn = struct.unpack("%iH"%(len(textn)/2), textn)[1:] - - tags = [b.get_tag_table().lookup("inline line") for b in bufs] - # For very long sequences, bail rather than trying a very slow comparison - inline_limit = 8000 # arbitrary constant - if len(text1) > inline_limit and len(textn) > inline_limit: - ends = [b.get_iter_at_line(l) for b, l in zip(bufs, (c[2], c[4]))] - for i in range(2): - bufs[i].apply_tag(tags[i], starts[i], ends[i]) - continue - - matcher = difflib.SequenceMatcher(None, text1, textn) - #print "<<<\n%s\n---\n%s\n>>>" % (text1, textn) - back = (0,0) - for o in matcher.get_opcodes(): - if o[0] == "equal": - if (o[2]-o[1] < 3) or (o[4]-o[3] < 3): - back = o[4]-o[3], o[2]-o[1] + text1 = alltexts[1][c[1]:c[2]] + textn = alltexts[i*2][c[3]:c[4]] + cacheitem = (i, c, text1, textn) + newcache.append(cacheitem) + notincache = not cacheitem in self._inline_cache + if clearchunks: + # clean all previous non-replace chunks + this chunk if it was not found in the cache + bufs[1].remove_tag(alltags[i*2], get_iter_at_line_or_eof(bufs[1], startlines[i*2]), get_iter_at_line_or_eof(bufs[1], c[3 + int(notincache)])) + bufs[0].remove_tag(alltags[1], get_iter_at_line_or_eof(bufs[0], startlines[1]), get_iter_at_line_or_eof(bufs[0], c[1 + int(notincache)])) + startlines[i*2] = c[4] + startlines[1] = c[2] + + if notincache: + tags = alltags[1], alltags[i*2] + #tags = [b.get_tag_table().lookup("replace line") for b in bufs] + starts = [get_iter_at_line_or_eof(b, l) for b,l in zip(bufs, (c[1],c[3])) ] + text1 = "\n".join(alltexts[1][c[1]:c[2]] ).encode("utf16") + text1 = struct.unpack("%iH"%(len(text1)/2), text1)[1:] + textn = "\n".join(alltexts[i*2][c[3]:c[4]] ).encode("utf16") + textn = struct.unpack("%iH"%(len(textn)/2), textn)[1:] + + # For very long sequences, bail rather than trying a very slow comparison + inline_limit = 8000 # arbitrary constant + if len(text1) > inline_limit and len(textn) > inline_limit: + ends = [get_iter_at_line_or_eof(b, l) for b, l in zip(bufs, (c[2], c[4]))] + for i in range(2): + bufs[i].apply_tag(tags[i], starts[i], ends[i]) continue - for i in range(2): - s,e = starts[i].copy(), starts[i].copy() - s.forward_chars( o[1+2*i] - back[i] ) - e.forward_chars( o[2+2*i] ) - bufs[i].apply_tag(tags[i], s, e) + + matcher = difflib.SequenceMatcher(None, text1, textn) + #print "<<<\n%s\n---\n%s\n>>>" % (text1, textn) back = (0,0) - yield 1 + for o in matcher.get_opcodes(): + if o[0] == "equal": + if (o[2]-o[1] < 3) or (o[4]-o[3] < 3): + back = o[4]-o[3], o[2]-o[1] + continue + for i in range(2): + s,e = starts[i].copy(), starts[i].copy() + s.forward_chars( o[1+2*i] - back[i] ) + e.forward_chars( o[2+2*i] ) + bufs[i].apply_tag(tags[i], s, e) + back = (0,0) + yield 1 + + # clean up all tailing lines + if clearchunks: + for b, tag, start in zip(self.textbuffer, alltags, startlines): + b.remove_tag(tag, get_iter_at_line_or_eof(b, start), b.get_end_iter() ) + self._inline_cache = newcache + yield 1 + def on_textview_expose_event(self, textview, event): if self.num_panes == 1: -- 1.6.0.4 _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs2009/7/18 Piotr Piastucki <leech.miranda@...>:
> 2009/7/11 Stephen Kennedy <stevek@...>: >>> >>> It's reasonable then to put the diff computation & most of the i/o in a >>> subprocess, which would make the ui much more responsive. >>> > > Regarding ui responsiveness, I am sending a patch that tries to improve > performance of inline highlighting. > Test case: > 1) run "meld filediff.py meldapp.py" > 2) type something in either of the panes > 3) wait.... :) > I know it is not a real-life example, but it shows exactly what the issue > is. > Patch: > I added a simple cache that stores replace chunks. The cache is updated and > checked in _update_highlighting(). Chunks that did not change are not taken > into account and as a result inline highlighting becomes somewhat > incremental. for a simpler approach though (patch attached), and just cached the SequenceMatcher calls; I'm pretty sure that that's the only really expensive part of the updating. My version is missing cache-clearing, but that's fairly easy to add. > Moreover I could not find any place where "delete line", "conflict line" or > "replace line" tags are set so I changed this piece of code and remove_tag() > is called for "inline line" only. Right - that's a separate issue, and should be fixed regardless. I believe that of the tags we currently set up, only edited and inline are used. One could imagine that tagging the whole buffer appropriately might be useful in some way... but I'm not really sure how. Kai [0001-Cache-inline-diff-results-to-avoid-rediffing-unchang.patch] From 2ffcbe2c770582a315e50a6541859b5846b18fce Mon Sep 17 00:00:00 2001 From: Kai Willadsen <kai.willadsen@...> Date: Fri, 17 Jul 2009 12:58:39 +1000 Subject: [PATCH] Cache inline-diff results to avoid rediffing unchanged blocks --- filediff.py | 17 +++++++++++++++-- 1 files changed, 15 insertions(+), 2 deletions(-) diff --git a/filediff.py b/filediff.py index acc1546..e4ccdff 100644 --- a/filediff.py +++ b/filediff.py @@ -46,6 +46,19 @@ gdk = gtk.gdk MASK_SHIFT, MASK_CTRL = 1, 2 +class CachedSequenceMatcher(object): + def __init__(self): + self.cache = {} + + def __call__(self, text1, textn): + try: + return self.cache[(text1, textn)] + except KeyError: + matcher = difflib.SequenceMatcher(None, text1, textn) + opcodes = matcher.get_opcodes() + self.cache[(text1, textn)] = opcodes + return opcodes + def get_iter_at_line_or_eof(buffer, line): if line >= buffer.get_line_count(): return buffer.get_end_iter() @@ -91,6 +104,7 @@ class FileDiff(melddoc.MeldDoc, gnomeglade.Component): self._sync_vscroll_lock = False self._sync_hscroll_lock = False self.linediffer = diffutil.Differ() + self.cached_match = CachedSequenceMatcher() for text in self.textview: text.set_wrap_mode( self.prefs.edit_wrap_lines ) for buf in self.textbuffer: @@ -648,10 +662,9 @@ class FileDiff(melddoc.MeldDoc, gnomeglade.Component): bufs[i].apply_tag(tags[i], starts[i], ends[i]) continue - matcher = difflib.SequenceMatcher(None, text1, textn) #print "<<<\n%s\n---\n%s\n>>>" % (text1, textn) back = (0,0) - for o in matcher.get_opcodes(): + for o in self.cached_match(text1, textn): if o[0] == "equal": if (o[2]-o[1] < 3) or (o[4]-o[3] < 3): back = o[4]-o[3], o[2]-o[1] -- 1.6.0.6 _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffsKai Willadsen wrote:
> Amusingly enough, I looked at exactly the same issue yesterday. I went > for a simpler approach though (patch attached), and just cached the > SequenceMatcher calls; I'm pretty sure that that's the only really > expensive part of the updating. My version is missing cache-clearing, > but that's fairly easy to add. > Hi, I did a quick test (run meld filediff.py meldapp.py, add a single character) and measured the time taken by _update_highlighting(): meld 1.3 - 1160ms my patch - 12 ms Kai's patch - 52ms No matter which patch is applied the performance gain is *huge*, however, Kai's code looks much better :) Cheers, Piotr _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs2009/7/18 Piotr Piastucki <leech.miranda@...>:
> I did a quick test (run meld filediff.py meldapp.py, add a single character) > and measured the time taken by _update_highlighting(): > meld 1.3 - 1160ms > my patch - 12 ms > Kai's patch - 52ms I'd be interested to know how you did these tests. I've had real problems getting reliable results when testing performance improvements. > No matter which patch is applied the performance gain is *huge*, however, > Kai's code looks much better :) Cheers. I've attached a second-pass patch with very simple cache eviction, and the results seem to be pretty promising. I'm sure it could be tweaked, but there's no doubt that it's a huge improvement. Kai _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffsKai Willadsen wrote: > I'd be interested to know how you did these tests. I've had real > problems getting reliable results when testing performance > improvements. > > I just measured the time taken by _update_highlighting(). This time it was quite easy as the results are 100% reproducible, which is not always the case. E.g. sometimes I need to add some "warm-up" calls as the first executions of methods tend to be much slower. One thing I have just noticed about my patch - it is a bit more flexible when the highlighting code changes (remember http://bugzilla.gnome.org/show_bug.cgi?id=587410 ?) as caching does not rely on the sequence matching algorithm. Regards, Piotr _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs2009/7/20 Piotr Piastucki <leech.miranda@...>:
> One thing I have just noticed about my patch - it is a bit more flexible > when the highlighting code changes (remember > http://bugzilla.gnome.org/show_bug.cgi?id=587410 ?) as caching does not rely > on the sequence matching algorithm. It's true that my method would need to be updated slightly for changes to the highlighting code, but it wouldn't be difficult to do so. In fact, a second cache for storing differences between word-level tokens might be a huge benefit in some fairly common cases. (I'm thinking of comparisons that consist of lots of common prefix changes, like log files in which only dates or file paths differ.) One downside to your method is that chunk position is part of the cache. The most obvious issue with this is that any change that inserts or deletes lines causes lots of rediffing. For example, instead of typing a single character in your "meld filediff.py meldapp.py" example, just hit enter. I'm going to look at whether we can combine parts of your approach to avoid rehighlighting at all with my simpler caching mechanism. cheers, Kai _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffsOn Mon, Jul 20, 2009 at 2:44 AM, Kai Willadsen <kai.willadsen@...> wrote:
Yes, that's why I am fully in favour of your approach :) Just wanted to mention one rather minor (as the highlighting code should not change often) disadvantage. Regards, Piotr _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
|
|
Re: Caching merged diffs2009/7/20 Piotr Piastucki <leech.miranda@...>:
> On Mon, Jul 20, 2009 at 2:44 AM, Kai Willadsen <kai.willadsen@...> > wrote: >> >> One downside to your method is that chunk position is part of the >> cache. The most obvious issue with this is that any change that >> inserts or deletes lines causes lots of rediffing. For example, >> instead of typing a single character in your "meld filediff.py >> meldapp.py" example, just hit enter. > > Yes, that's why I am fully in favour of your approach :) > Just wanted to mention one rather minor (as the highlighting code should not > change often) disadvantage. I tested with a few different cases, and there are certainly benefits to doing as little work as possible. For example, in the case of tokenising the input text, just the splitting process can take a while on large change blocks... and my approach doesn't skip that. I've reworked your approach, and integrated my caching with it. I had to change the cache eviction, and while it has some flaws, I think it's pretty good without making the SequenceMatcher calls async or getting overcomplicated. Patches attached. Kai _______________________________________________ meld-list mailing list meld-list@... http://mail.gnome.org/mailman/listinfo/meld-list |
| Free embeddable forum powered by Nabble | Forum Help |