Thorough RevlogNG Format Specification

Hi everyone,

I am writing a C++ program that reads RevlogNG files (e.g. 00changelog.i) directly, and I could not find an exhaustive RevlogNG format specification. I have several questions:

* Looking at revlog.py, I found that the data is sometimes compressed, and that you can tell by looking at the first char of the data. [1]
  So I used zlib to decompress the data when the first char was 'x', but I did not understand what the 'u' compression type was for.

* I found that the changelog data should provide the following pieces of information, in this order: "nodeid, user, date, branch, modified files, comment" [2]
  But, when decompressing, I sometimes get some bytes instead of the committer name, and I did not manage to find out what these bytes stand for exactly.
  I found that the first one was often one of 0x53 0x55 0x57 0x47 0x37, which looks like a bitfield (?), and that 0x55 seemed to mean that the user was the same as in the previous node.
  The same thing also sometimes happens between modified file names.
  Am I on the right track? Is there documentation somewhere describing this compression?

* About the first bytes of the files: I found that the first four bytes are used to store the RevlogNG version and some flags [3].
  I think that the first word (big-endian) is for the version and the second (big-endian) is for the flags, and that one flag is "inline" [4].
  I also found that "inline" means that a small RevlogNG index and its data are packed into a single .i file, instead of a .d file for data and a .i file for headers. [5]

Could all these pieces of information that I found here and there be added to the RevlogNG wiki page, in order to ease future work?

Thank you for your great work.

William.

[1] http://hg.kublai.com/mercurial/main/file/1de5ebfa5585/mercurial/revlog.py#l100
[2] http://mercurial.selenic.com/wiki/ChangeSet
[3] http://mercurial.selenic.com/wiki/RevlogNG
[4] http://www.selenic.com/pipermail/mercurial/2006-April/007648.html
[5] http://www.selenic.com/pipermail/mercurial/2006-March/007359.html

_______________________________________________
Mercurial mailing list
Mercurial@...
http://selenic.com/mailman/listinfo/mercurial
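
For the question about the first four bytes, here is a minimal Python sketch of how they might be decoded. It assumes the layout used by revlog.py of that era: a single big-endian 32-bit word whose low 16 bits hold the format version and whose high 16 bits hold the flags, with the inline flag at bit 16. The names parse_revlog_header, is_inline and FLAG_INLINE_DATA are made up for this illustration and are not Mercurial's own.

```python
import struct

REVLOG_V0 = 0               # original revlog format
REVLOG_NG = 1               # RevlogNG
FLAG_INLINE_DATA = 1 << 16  # assumed position of the "inline" flag


def parse_revlog_header(first_bytes):
    """Split the first 4 bytes of a .i file into (version, flags)."""
    (word,) = struct.unpack(">I", first_bytes[:4])
    version = word & 0xFFFF   # low 16 bits
    flags = word & ~0xFFFF    # high 16 bits, left in place
    return version, flags


def is_inline(flags):
    """True when index entries and data chunks share the .i file."""
    return bool(flags & FLAG_INLINE_DATA)


if __name__ == "__main__":
    version, flags = parse_revlog_header(b"\x00\x01\x00\x01")
    print(version, is_inline(flags))
```

If that assumption holds, the flags occupy the first two bytes of the file and the version the last two, which is the reverse of the order guessed in the message above, so it is worth double-checking against revlog.py.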

Re: Thorough RevlogNG Format Specification

On Wed, Nov 4, 2009 at 19:05, William Ledoux <william.ledoux@...> wrote:
> * I found by looking at revlog.py, that sometimes, the data may be
> compressed, and that you could notice it by looking at the first char of the
> data. [1]
> So I used zlib to decompress the data when the first char was 'x', but I did
> not understand what the 'u' compression type was for.
>
> * I found that the changelog data should provide the following pieces of
> information in this order: "nodeid, user, date, branch, modified files,
> comment" [2]
> But, when decompressing, I sometimes have some bytes instead of the
> committer name, and I did not manage to find what these bytes are exactly
> standing for.
> I found that the first one was often one of 0x53 0x55 0x57 0x47 0x37 which
> seems like a bitfield (?), and that 0x55 seemed to mean that the user was the
> same as the previous node.
> The same thing also happens sometimes between modified file names.
> Am I on the right track? Is there a documentation somewhere describing this
> compression?

It sounds like you're missing the part where it's diffing. If the base revision is set in the header, it doesn't store a complete copy, but just a (binary) diff against the given base revision.

> * About the first bytes of the files, I found that the four bytes are used
> to store the RevlogNG version, and some flags [3].
> I think that the first word (bigendian) is for the version and the second
> (bigendian) is for the flags, and that one flag is "inline" [4]
> I also found that "inline" means that small RevlogNG index and data are
> compacted into a single .i file instead of a .d file for data and a .i file
> for headers. [5]
>
> Could all these pieces of information that I found here and there be added
> to the RevlogNG Wiki page in order to ease some future work?

Go right ahead! Make an account for yourself on the wiki, a complete spec would be helpful.

Cheers,

Dirkjan
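
The base revision mentioned in this reply is stored in the per-revision index entry. As a rough Python sketch, assuming the 64-byte RevlogNG entry layout that revlog.py expresses as the struct format ">Qiiiiii20s12x" (a 48-bit offset plus 16 bits of flags, compressed length, uncompressed length, base revision, link revision, two parent revisions, then a 20-byte node id padded to 32 bytes), the index could be walked like this. The Entry field names and the parse_index function are illustrative only.

```python
import struct
from collections import namedtuple

# Assumed RevlogNG index entry layout, 64 bytes, big-endian.
INDEX_ENTRY = struct.Struct(">Qiiiiii20s12x")

Entry = namedtuple(
    "Entry",
    "offset flags comp_len uncomp_len base_rev link_rev p1 p2 node",
)


def parse_index(data, inline):
    """Return the list of index entries found in the bytes of a .i file."""
    entries = []
    pos = 0
    while pos + INDEX_ENTRY.size <= len(data):
        fields = INDEX_ENTRY.unpack_from(data, pos)
        offset_flags = fields[0]
        if not entries:
            # The first entry's offset is always 0, so its leading bytes
            # are reused for the version/flags header: mask them away.
            offset_flags &= 0xFFFF
        entries.append(
            Entry(offset_flags >> 16, offset_flags & 0xFFFF, *fields[1:])
        )
        pos += INDEX_ENTRY.size
        if inline:
            # Inline revlogs store each data chunk right after its entry.
            pos += fields[1]
    return entries
```

In this sketch an entry whose base_rev equals its own revision number would hold a full (possibly compressed) text, and any other entry would hold a delta.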

Re: Thorough RevlogNG Format Specification

On Wed, Nov 04, 2009 at 07:05:16PM +0100, William Ledoux wrote:
> Hi everyone,
>
> I am doing a C++ program that directly reads the RevlogNG files (eg.
> 00changelog.i) and I did not find an exhaustive RevlogNG Format
> Specification. I have several questions:
>
> * I found by looking at revlog.py, that sometimes, the data may be
> compressed, and that you could notice it by looking at the first char of the
> data. [1]
> So I used zlib to decompress the data when the first char was 'x', but I
> did not understand what the 'u' compression type was for.

u+text -> text is already uncompressed
x+text -> decompress(x+text) with zlib
\0+text -> \0+text is already uncompressed

>
> * I found that the changelog data should provide the following pieces of
> information in this order: "nodeid, user, date, branch, modified files,
> comment" [2]
> But, when decompressing, I sometimes have some bytes instead of the
> committer name, and I did not manage to find what these bytes are
> exactly standing for.
> I found that the first one was often one of 0x53 0x55 0x57 0x47 0x37 which
> seems like a bitfield (?), and that 0x55 seemed to mean that the user was the
> same as the previous node.
> The same thing also happens sometimes between modified file names.
> Am I on the right track? Is there a documentation somewhere describing this
> compression?

Hum, do you handle the binary patching? The closest thing I could think of is the dict found between the date and the modified files.

>
> * About the first bytes of the files, I found that the four bytes are used
> to store the RevlogNG version, and some flags [3].
> I think that the first word (bigendian) is for the version and the second
> (bigendian) is for the flags, and that one flag is "inline" [4]
> I also found that "inline" means that small RevlogNG index and data are
> compacted into a single .i file instead of a .d file for data and a .i file
> for headers. [5]

Unsure what your question is.

>
> Could all these pieces of information that I found here and there be added
> to the RevlogNG Wiki page in order to ease some future work?

Please feel free to expand/correct the RevlogNG page when appropriate, it's a wiki. If you're unsure, you can ask for a review here or on IRC.

regards,

Benoit

-- 
:wq
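
The three rules for the first byte given in this reply can be written down directly. This is only a sketch in Python using the standard zlib module; the function name decompress_chunk and the choice to raise ValueError on an unknown marker are assumptions of this example, not Mercurial's API.

```python
import zlib


def decompress_chunk(chunk):
    """Decode one stored revlog chunk according to its first byte."""
    if not chunk:
        return chunk                   # empty chunk: nothing to decode
    marker = chunk[:1]
    if marker == b"\0":
        return chunk                   # stored as-is, marker is part of the data
    if marker == b"x":
        return zlib.decompress(chunk)  # 'x' is the first byte of the zlib stream
    if marker == b"u":
        return chunk[1:]               # uncompressed, drop the 'u' marker
    raise ValueError("unknown compression marker %r" % marker)
```

Note the asymmetry: 'u' is a prefix that gets stripped, while '\0' and 'x' belong to the payload itself (a chunk starting with '\0' is kept whole, and the 'x' is the header byte that zlib expects).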

Re: Thorough RevlogNG Format Specification

On Wed, Nov 4, 2009 at 10:20 PM, Dirkjan Ochtman <dirkjan@...> wrote:

Oh, thanks for the hint!
I thought that diffs were used only for the file revlogs (.i/.d), but not in the changelog itself.

So we are trying to obtain a content 'C' from a diff buffer 'D' and a base 'B'.
We have three 32-bit numbers (x y z), then a bunch of text, then again three 32-bit numbers, etc.

Again, I was not able to find any documentation about this bdiff format on the wiki. Maybe I searched badly...
I am not familiar with Python, so I was not able to find the Python code that is responsible for reconstructing the file. Could you help me with this?

I thought I had figured out by myself what the three numbers (x y z) were, but when I tried to get the log of the TortoiseHg repository, it turned out that I had made mistakes.

Here is what I tried:
X: the absolute offset of the next data of B that must not be in C
Y: the absolute offset of the next data of B that must be in C
Z: the length of the next bunch of text in D (I am pretty sure that this one is right)

For example:
>> ( 0 41 41)
I'll do: copy 0 chars from B to C, then copy the next 41 chars of D to C
>> (70 168 30)
copy chars 41 to 70 of B to C, then copy the next 30 chars of D to C
etc.

It worked fine for the first 42 revisions, but after that things went wrong from time to time.
So I am asking for help or pointers about the meaning of these three 32-bit numbers!

On Wed, Nov 4, 2009 at 10:20 PM, Dirkjan Ochtman <dirkjan@...> wrote:

I saw "page immutable", so I did not even try to create an account and modify it. I'll do it as soon as I am confident in my understanding of the format!

Thank you,

William.
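
On the three 32-bit numbers: going by my reading of mercurial/mpatch.c, each hunk is (start, end, length) as big-endian integers followed by length bytes of new text, meaning "replace bytes start..end of B with that text". A minimal Python sketch under that assumption follows; apply_delta is a made-up name, and this version handles a single delta only.

```python
import struct

HUNK_HEADER = struct.Struct(">III")  # start, end, length (12 bytes)


def apply_delta(base, delta):
    """Apply one binary delta D to a base text B and return the result C."""
    out = []
    last = 0  # first byte of B not yet copied or skipped
    pos = 0   # read position inside the delta
    while pos < len(delta):
        start, end, length = HUNK_HEADER.unpack_from(delta, pos)
        pos += HUNK_HEADER.size
        out.append(base[last:start])           # unchanged part of B
        out.append(delta[pos:pos + length])    # replacement text from D
        pos += length
        last = end                             # B[start:end] is dropped
    out.append(base[last:])                    # tail of B after the last hunk
    return b"".join(out)
```

This agrees with the X/Y/Z reading in the message above. Two details that the description does not spell out, and that may be worth checking when results go wrong, are the final copy of the rest of B after the last hunk and which text is used as B when several deltas are stacked.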

Re: Thorough RevlogNG Format Specification

On Thu, 2009-11-05 at 18:54 +0100, William Ledoux wrote:
> Oh, thanks for the hint !

More hints here:

http://mercurial.selenic.com/wiki/Presentations?action=AttachFile&do=view&target=ols-mercurial-paper.pdf

-- 
http://selenic.com : development and support for Mercurial and Linux

Re: Thorough RevlogNG Format Specification

On Thu, Nov 5, 2009 at 7:02 PM, Matt Mackall <mpm@...> wrote:

Thank you, I already stumbled upon this one ;)
I based my parsing on it at first, but then I discovered that you have since moved to RevlogNG (so the revlog index described on page 5 is outdated).
But it does not answer my questions about the details of rebuilding the changelog content from the diff.

Re: Thorough RevlogNG Format Specification

On Thu, Nov 5, 2009 at 18:54, William Ledoux <william.ledoux@...> wrote:
> Oh, thanks for the hint !
> I thought that diff was used only for files .i .d but not in the changelog
> itself.
> So we are trying to obtain a content 'C' from a diff buffer 'D' and a base
> 'B'
> We have 3 32bit numbers: (x y z), then a bunch of text, then again 3 32 bit
> numbers, etc...

So is there a reason you're not reading the code? It, of course, has the best specification by far. Look in mercurial/bdiff.c, for example.

Cheers,

Dirkjan

Re: Thorough RevlogNG Format Specification

On Thu, Nov 05, 2009 at 07:17:23PM +0100, William Ledoux wrote:
> On Thu, Nov 5, 2009 at 7:02 PM, Matt Mackall <mpm@...> wrote:
>
> > On Thu, 2009-11-05 at 18:54 +0100, William Ledoux wrote:
> > More hints here:
> >
> > http://mercurial.selenic.com/wiki/Presentations?action=AttachFile&do=view&target=ols-mercurial-paper.pdf
> >
>
> Thank you, I already stumbled upon this one ;)
> I based my parsing on this at first, but after I discovered that you jumped
> to revlogNG since. (then the revlog index described page 5 is outdated)
> But it does not answer to my questions about the details of re-building the
> changelog content from the diff.

The diff algorithm didn't change from the first version, it's the same format. Basically you need to retrieve the base version and the delta chain, then the patching is quite simple. There's a C version in mercurial/mpatch.c if it helps.

regards,

Benoit

-- 
:wq
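
The "base version plus delta chain" step described in this reply can be sketched as follows in Python. It assumes that each delta applies to the text of the revision just before it, so the chain runs over every revision from the base up to the wanted one, and it assumes the (start, end, length) hunk layout and the first-byte compression markers discussed earlier in the thread. The function names and the base_revs/chunks lists are hypothetical stand-ins for data read from the index and data files; later Mercurial versions may pick a different delta parent, which this sketch ignores.

```python
import struct
import zlib


def decompress_chunk(chunk):
    """Decode a stored chunk by its first byte ('\\0', 'x' or 'u')."""
    if not chunk or chunk[:1] == b"\0":
        return chunk
    if chunk[:1] == b"x":
        return zlib.decompress(chunk)
    if chunk[:1] == b"u":
        return chunk[1:]
    raise ValueError("unknown compression marker %r" % chunk[:1])


def apply_delta(base, delta):
    """Replace base[start:end] with each hunk's text, in order."""
    out, last, pos = [], 0, 0
    while pos < len(delta):
        start, end, length = struct.unpack_from(">III", delta, pos)
        pos += 12
        out.append(base[last:start])
        out.append(delta[pos:pos + length])
        pos += length
        last = end
    out.append(base[last:])
    return b"".join(out)


def rebuild_revision(rev, base_revs, chunks):
    """Rebuild the full text of `rev`.

    base_revs[r] is the base revision recorded in r's index entry and
    chunks[r] is the raw stored chunk for revision r.
    """
    base = base_revs[rev]
    text = decompress_chunk(chunks[base])        # full text at the chain base
    for r in range(base + 1, rev + 1):           # then every delta up to rev
        text = apply_delta(text, decompress_chunk(chunks[r]))
    return text
```

Applying every delta directly to the base text, instead of to the result of the previous step, would give correct output for short chains and wrong output later on, so that is one thing to rule out if only some revisions come out garbled.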