Wikipedia meets git

Hallo,

I have gotten the Wikipedia article for Kosovo into git. It is fast, distributed, highly compressed, redundant, branchable and usable.

The blame function will show you who edited what version.

Here is the blame view of the up-to-date Kosovo article:
http://github.com/h4ck3rm1k3/KosovoWikipedia/blame/master/Wiki/Kosovo/article.xml

I have checked in all the code to produce this here:
https://code.launchpad.net/~jamesmikedupont/+junk/wikiatransfer

thanks,
mike
_______________________________________________
foundation-l mailing list
foundation-l@...
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
|
|
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 2:55 PM, jamesmikedupont@... <jamesmikedupont@...> wrote:
> Hallo,
> I have gotten the wikipedia article for Kosovo in git.
> It is fast, distributed, highly compressed, redundant, branchable and usable.
>
> The blame function will show you who edited what version.
>
> Here Blame on the up to date kosovo article!
> http://github.com/h4ck3rm1k3/KosovoWikipedia/blame/master/Wiki/Kosovo/article.xml
>
> I have checked in all the code to produce this here :
> https://code.launchpad.net/~jamesmikedupont/+junk/wikiatransfer

It is cool that you get the complete history.

But it's a bit uncool that it's about 14 MB when the article is 100 KB; understandable given that the expanded uncompressed history is about 337 MB...

I repacked the repository using

git-pack-objects --progress --window=40000 --depth=40000 --compression=9 --all --delta-base-offset

(git-repack doesn't repack, really)

And now have

4168915 2009-10-15 16:12 KosovoWikipedia-ae859bbf9446ddcde4b17e09c99c28dcf594da89.pack

which is more reasonable.

The number of revisions to a single article is a little bit outside of the normal usage of git. ;)
|
|
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 10:16 PM, Gregory Maxwell <gmaxwell@...> wrote:
> It is cool that you get the complete history.
>
> But it's a bit uncool that its about 14mbytes when the article is
> 100k; understandable given that the expanded uncompressed history is
> about 337mbytes...

I have the uncompressed history here at 550 MB:

du -h history/
556M history/

If I bzip this, it is

29M 2009-10-15 22:35 total.tar.bz

14 MB is still smaller, and the upload is faster!

> The number of revisions to a single article is a little bit outside of
> the normal usage of git. ;)

There are ways to optimize all of this. Most users will not want to download the full history. This is just one day of work using git; we will be able to optimize all of this.

I will be able to find other examples of large repositories:
http://laserjock.wordpress.com/2008/05/09/bzr-git-and-hg-performance-on-the-linux-tree/

mike
|
|
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 4:38 PM, jamesmikedupont@... <jamesmikedupont@...> wrote:
> There are ways to optimize all of this. Most users will not want to
> download the full history.

Then why are you using git?
|
|
Re: Wikipedia meets git

On Thu, Oct 15, 2009 at 11:33 PM, Gregory Maxwell <gmaxwell@...> wrote:
> On Thu, Oct 15, 2009 at 4:38 PM, jamesmikedupont@... <jamesmikedupont@...> wrote:
>> There are ways to optimize all of this. Most users will not want to
>> download the full history.
>
> Then why are you using git?

I am not most users. I am using git because I think it is the best way forward to implement many of the ideas discussed in the strategy wiki.
|
|
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 6:40 AM, jamesmikedupont@... <jamesmikedupont@...> wrote:
> On Thu, Oct 15, 2009 at 11:33 PM, Gregory Maxwell <gmaxwell@...> wrote:
>> On Thu, Oct 15, 2009 at 4:38 PM, jamesmikedupont@... <jamesmikedupont@...> wrote:
>>> There are ways to optimize all of this. Most users will not want to
>>> download the full history.
>>
>> Then why are you using git?
>
> I am not most users. I am using git because I think it is the best way
> forward to implement many of the ideas discussed in the strategy wiki.

If you want only the last 3 revisions checked out, it takes about 10 seconds and produces 300k of data.

git clone --depth 3 git://github.com/h4ck3rm1k3/KosovoWikipedia.git

du -h gittest/
252K gittest/

Log file:

Initialized empty Git repository in /home_data2/2009/10/KosovoWikipedia/gittest/KosovoWikipedia/.git/
remote: Counting objects: 21, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 21 (delta 3), reused 20 (delta 3)
Receiving objects: 100% (21/21), 40.98 KiB, done.
Resolving deltas: 100% (3/3), done.
|
|
Re: Wikipedia meets git

>> On Thu, Oct 15, 2009 at 11:33 PM, Gregory Maxwell <gmaxwell@...> wrote:
>>> Then why are you using git?

It turns out there are a few wikis built on top of git:

1. git-wiki
   http://atonie.org/2008/02/git-wiki
   http://github.com/jeffbski/git-wiki
   git-wiki is a wiki that relies on git to keep pages' history and Sinatra to serve them. (ruby)
   Supports these markups:
   * Creole = Creole is a Creole-to-HTML converter for Creole, the lightweight markup language (http://wikicreole.org/).
   * Markdown = Discount Markdown Processor for Ruby http://github.com/rtomayko/rdiscount
   * Textile = RedCloth is a module for using the Textile markup language in Ruby. http://redcloth.org/

2. gitit
   http://hackage.haskell.org/cgi-bin/hackage-scripts/package/gitit
   gitit: Wiki using happstack, git or darcs, and pandoc. (haskell)

3. ikiwiki
   http://ikiwiki.info/
   Ikiwiki is a wiki compiler.
   http://ikiwiki.info/ikiwiki/formatting/

4. wigit: the php git wiki
   http://el-tramo.be/software/wigit
|
|
Re: Wikipedia meets git

This is very awesome. I am in the early stages of trying to scope out a small side project to do a mediawiki <-> git bridge; it is very challenging. Being able to download the complete edit history in this fashion is extremely useful.

Thank you very much for sharing this work.

-Josh

--
I am running a marathon for the Leukemia & Lymphoma Society. Can you help me reach my fundraising goals? Visit http://pages.teamintraining.org/ma/pfchangs10/joshuagay
|
|
Re: Wikipedia meets git

That is pretty cool. But wouldn't it make more sense to have a more fine-grained blame, like the one in wikitrust, down to the character level?

cheers,
denny
|
|
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 9:45 AM, Denny Vrandecic <denny.vrandecic@...> wrote:
> That is pretty cool. But wouldn't it make more sense to have a more
> fine-grained blame, like the one in wikitrust, down to the character
> level?

I don't know all these wiki tools, but if the feature is missing from git, then adding it will benefit all projects using git.

My fascination with using a real distributed version control system is that it provides the features that we are missing from MediaWiki.

We can use standard tools to do good things, and not have to reinvent the world all the time.

We don't need to have a centralized repository and only one point of view. Using a real VCS means that we can have multiple hosts, multiple points of view and a failsafe system.

My next steps are to work on the reader tool for creating LaTeX output and espeak output of the articles; I am adding Unicode character support right now. I would like to get that up to speed, to use PDF / audio rendering of the articles.

I will continue to just work with selected articles and improve the import feature. It should be easy to have an import tool fed by an RSS feed for some articles that imports them on a regular basis.

Mike
|
|
Re: Wikipedia meets git

Just another pointer: here is a distributed MediaWiki system developed at INRIA. I haven't looked into it too deeply yet, but their evaluation looked very promising.

<http://m3p.gforge.inria.fr/pmwiki/pmwiki.php>

Best,
denny
|
|
Re: Wikipedia meets git

> On Oct 16, 2009, at 10:30, jamesmikedupont@... wrote:

I have made two simple vlogs about what I did and why:

http://www.youtube.com/watch?v=jc9jo1ZFLqk
http://www.youtube.com/watch?v=7WfRuEuvIso

Mike
|
|
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 9:45 AM, Denny Vrandecic <denny.vrandecic@...> wrote:
> That is pretty cool. But wouldn't it make more sense to have a more
> fine-grained blame, like the one in wikitrust, down to the character
> level?

Can you please provide some example pages of wikitrust? They seem to be AWOL:

"In the meantime, you can look at our list of colored pages,"
http://wikitrust.soe.ucsc.edu/index.php/Colored_pages -> Page not found

Thanks,
mike
|
|
Re: Wikipedia meets git

Hoi,
After a minute of googling I find http://wikitrust.soe.ucsc.edu/home .. I am sure it is there for you as well.
Thanks,
GerardM
|
|
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 2:08 PM, Gerard Meijssen <gerard.meijssen@...> wrote:
> Hoi,
> After a minute of googling I find http://wikitrust.soe.ucsc.edu/home .. I am
> sure it is there for you as well.

Yes, the page is there, and it seems to be a good idea. I am only missing some HTML pages that show what a word-level blame looks like. The colorized pages are missing.

On this page:
http://wikitrust.soe.ucsc.edu/home
it says:
"In the meantime, you can look at our list of colored pages, or look at screenshots of English Wikipedia pages analyzed by WikiTrust."

and the colored pages link to
http://wikitrust.soe.ucsc.edu/index.php/Colored_pages
which is missing....

mike
|
|
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 12:45 AM, jamesmikedupont@...
> if you want only the last 3 revisions checked out , it takes about 10
> seconds and produces 300k of data.

10 seconds? That's horrible. Have you tried using svn?
|
|
Re: Wikipedia meets git

I did not mean that literally; let me check the exact time for you: 1.258s

time git clone --depth 3 git://github.com/h4ck3rm1k3/KosovoWikipedia.git
Initialized empty Git repository in /home_data2/2009/10/KosovoWikipedia/gittest2/KosovoWikipedia/.git/
remote: Counting objects: 21, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 21 (delta 3), reused 20 (delta 3)
Receiving objects: 100% (21/21), 40.99 KiB, done.
Resolving deltas: 100% (3/3), done.

real 0m1.258s
user 0m0.024s
sys 0m0.024s
|
|
Re: Wikipedia meets git

On Fri, Oct 16, 2009 at 10:31 AM, Anthony <wikimail@...> wrote:
> On Fri, Oct 16, 2009 at 12:45 AM, jamesmikedupont@...
>> if you want only the last 3 revisions checked out , it takes about 10
>> seconds and produces 300k of data.
>
> 10 seconds? That's horrible. Have you tried using svn?

On a reasonably fast network it actually takes only about 10 seconds to pull the entire edit history from his repo. It would take less if the history had been repacked as I described, but that kind of tight repacking makes it take longer when you only want a portion of the history.

Still, many of the neat things that can be done by having the article in git are only possible if you have the complete history. For example, generating a blame map needs the entire history.

It would be nice if the git archival format was more efficient for the kinds of changes made in Wikipedia articles. Source code tends to have short lines, and changes tend to touch a significant portion of the lines, while edits on Wikipedia are far more likely to change only part of a very long line (really, a paragraph).... so working with line-level deltas is efficient for source code while inefficient for Wikipedia data.

On this repository a

git fast-export --all | lzma -9

produces a 900kbyte output (505783 bytes if you want to be silly and use PAQ8HP12, which is pretty much the state of the art for English text, instead of LZMA). These methods don't provide fast random access, but it's still clear that there is a lot of room for improvement. ;) I'm not sure if anyone is working on improved compression for git for these kinds of documents.

Getting the entire history of a frequently edited article like this down to ~1-2 MB is roughly the point where I think it's reasonable for someone doing continued non-trivial work on the article to fetch the entire history and thus gain access to functionality that needs most of the history.
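The point about long lines can be illustrated with a rough sketch in Python using the standard difflib module. This is only an illustration under assumptions: it models a diff at the granularity of text units and is not how git's pack format computes its deltas, the sample paragraph is invented, and `changed_bytes` and `split_words` are hypothetical helper names. It compares how much new text a diff has to store when a paragraph is one long "line" versus when it is split into words.

```python
import difflib
import re


def changed_bytes(old_units, new_units):
    """Count characters of new text stored by a diff over the given units.

    Hypothetical cost model: every unit that is inserted or replaced in the
    new version is stored in full; unchanged units cost nothing.
    """
    matcher = difflib.SequenceMatcher(a=old_units, b=new_units, autojunk=False)
    total = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            total += sum(len(unit) for unit in new_units[j1:j2])
    return total


def split_words(text):
    """Split text into words, each keeping its trailing whitespace."""
    return re.findall(r"\S+\s*", text)


# An invented wiki-style paragraph stored as a single long line.
old = ("Kosovo is a region in the Balkans. Its capital is Pristina. "
       "It borders Albania.\n")
new = ("Kosovo is a region in the Balkans. Its capital and largest city "
       "is Pristina. It borders Albania.\n")

line_cost = changed_bytes(old.splitlines(keepends=True),
                          new.splitlines(keepends=True))
word_cost = changed_bytes(split_words(old), split_words(new))

print("line-level diff stores", line_cost, "characters")
print("word-level diff stores", word_cost, "characters")
```

With line-level units the whole paragraph counts as changed, while with word-level units only the three inserted words do. The cost model is deliberately crude; a real format would also pay for the bookkeeping needed to describe where each unit goes.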
|
|
Re: Wikipedia meets git

On Sat, Oct 17, 2009 at 10:18 AM, Gregory Maxwell <gmaxwell@...> wrote:
> On Fri, Oct 16, 2009 at 10:31 AM, Anthony <wikimail@...> wrote:
>> On Fri, Oct 16, 2009 at 12:45 AM, jamesmikedupont@...
>>> if you want only the last 3 revisions checked out , it takes about 10
>>> seconds and produces 300k of data.
>>
>> 10 seconds? That's horrible. Have you tried using svn?
>
> Still, much of the neat things that can be done by having the article
> in git are only possible if you have the complete history, for
> example: generating a blame map needs the entire history.

Yes, and if you just want to view and edit then you need only one revision. If you want to do more, you can pull the history.

> It would be nice if the git archival format was more efficient for the
> kinds of changes made in Wikipedia articles: Source code changes tends
> to have short lines and changes tend to change a significant portion
> of the lines, while edits on Wikipedia are far more likely to change
> only part of a very long line (really, a paragraph).... so working
> with line level deltas is efficient for source code while inefficient
> for Wikipedia data.

I have started to work on the blame code to bring it down to the char level and to learn about it. I am willing to invest some time to learn how to make git better for WMF. It is much more interesting than hacking PHP code.

Also, I have been able to use the mw-render code on the git archive. You can see the results of the new version of my reader script here (2 hours of reading the full article):
http://www.archive.org/details/KosovoWikipediaArticlesVideo

I am thinking of storing the Wikipedia articles in the intermediate XML parse tree format from mw-render, if that would help the diff tools.

Another idea would be to allow editing of the articles with OpenOffice, for example, and provide traceability in the document structure back to the original article. It could be marked up with blame information; even more, the blame information could be embedded in each word with an XML attribute. That would allow for exact tracking of where the edits come from.

mike
|
|
Re: Wikipedia meets git

On Sat, Oct 17, 2009 at 4:40 AM, jamesmikedupont@... <jamesmikedupont@...> wrote:
>> It would be nice if the git archival format was more efficient for the
>> kinds of changes made in Wikipedia articles: Source code changes tends
>> to have short lines and changes tend to change a significant portion
>> of the lines, while edits on Wikipedia are far more likely to change
>> only part of a very long line (really, a paragraph).... so working
>> with line level deltas is efficient for source code while inefficient
>> for Wikipedia data.
>
> I have started to work on the blame code
> to bring it down to the char level and learn about it.

Char level would probably make it too inefficient to merge deltas. Treating a period followed by a space as a line separator would probably be more efficient.

The key to efficiency is to use skip deltas, though. You build a binary tree so accessing any revision requires the application of only log(n) deltas. I asked whether or not you tried svn, because svn already uses skip deltas.

Is the idea that the entire file would need to be transferred over the Internet, though? If so, I guess you wouldn't want to use skip deltas: they greatly speed up access to early revisions, but at a slight space penalty.
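As a minimal sketch of the skip-delta idea, assume the base-selection rule usually described for Subversion-style skip deltas: revision n is stored as a delta against the revision obtained by clearing the lowest set bit of n, and revision 0 is stored in full. The function names are hypothetical, the revision number 54 is an arbitrary example, and this is not svn's actual code. It shows why rebuilding any revision needs only about log2(n) deltas.

```python
def skip_delta_base(rev):
    """Return the revision that `rev` is stored as a delta against.

    Assumed rule: clear the lowest set bit of the revision number.
    """
    if rev <= 0:
        raise ValueError("revision 0 is stored in full and has no base")
    return rev & (rev - 1)


def delta_chain(rev):
    """List the revisions read when rebuilding `rev`, ending at revision 0."""
    chain = [rev]
    while rev > 0:
        rev = skip_delta_base(rev)
        chain.append(rev)
    return chain


chain = delta_chain(54)
print(chain)                                 # revisions visited for revision 54
print(len(chain) - 1, "deltas to apply")     # one delta per hop in the chain
```

Each hop removes one set bit from the revision number, so the number of deltas equals the number of 1 bits in n, which never exceeds the bit length of n. The trade-off mentioned above is visible here too: a delta against a base many revisions back is usually larger than a delta against the immediate predecessor, which is where the space penalty comes from.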