Questions about fields in wiki dumps

I am working with some enwiki-{YYYYMMDD}-stub-meta-history.xml dumps and
wanted to get clarification on how certain fields of the articles can change:

1. What action will make an article get a new pageId? Is it only
   move/rename, a redirect, or a deletion and recreation, or are there other
   ways this could happen? Can any of these changes be detected from the
   stub-meta-history.xml files?

2. Is it possible for just one particular revision of an article to be
   deleted, maybe due to a copyright violation? If so, is just the content of
   the revision deleted or would this include all the data associated with
   it, so that the revision would not even appear in the
   stub-meta-history.xml file?

3. Are pageIds recycled? If a page is deleted, could its id number be used
   for a completely new page in the future?

Thanks,
Jeff

--
Jeff Kubina
http://google.com/profiles/jeff.kubina
_______________________________________________
Wikitech-l mailing list
Wikitech-l@...
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Questions about fields in wiki dumps

2009/11/10 Jeff Kubina <jeff.kubina@...>:
> I am working with some enwiki-{YYYYMMDD}-stub-meta-history.xml dumps and
> wanted to get clarification on how certain fields of the articles can
> change:
>
> 1. What action will make an article get a new pageId? Is it
> only move/rename, a redirect, or a deletion and recreation, or are there
> other ways this could happen? Can any of these changes be detected from the
> stub-meta-history.xml files?
>
When a page is moved, it'll change its name but keep its pageid. A
redirect will be created at the old name with a new pageid.

> 2. Is it possible for just one particular revision of an article to be
> deleted, maybe due to a copyright violation? If so, is just the content of
> the revision deleted or would this include all the data associated with it,
> so that the revision would not even appear in the stub-meta-history.xml
> file?
>
Yes. In this case, any trace of the revision ever having existed is
gone from the dumps, AFAIK.

> 3. Are pageIds recycled? If a page is deleted, could its id number be used
> for a completely new page in the future?
>
No, pageids are never recycled.

Roan Kattouw (Catrope)

Re: Questions about fields in wiki dumps

On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina <jeff.kubina@...> wrote:
> 1. What action will make an article get a new pageId? Is it
> only move/rename, a redirect, or a deletion and recreation, or are there
> other ways this could happen? Can any of these changes be detected from the
> stub-meta-history.xml files?

Normal deletion/undeletion, moving, or similar things will not create
a new page_id. However, there are a couple of things to be aware of:

1) In the old days, deleting an article and recreating it would assign
it a new page_id. This hasn't been true for several years.

2) It's still possible to get revisions associated with a different
page_id than they were originally written for, by deleting a page,
moving another page over it, and undeleting one or more revisions.

> 2. Is it possible for just one particular revision of an article to be
> deleted, maybe due to a copyright violation? If so, is just the content of
> the revision deleted or would this include all the data associated with it,
> so that the revision would not even appear in the stub-meta-history.xml
> file?

Yes, an individual revision can be deleted. There are at least three
different ways to do this, last I checked. I would expect that the
old ways (oversight, and delete+selective undelete) would leave no
traces at all in the dump, while the new way (rev_deleted) might only
suppress certain fields. I'm not sure offhand, though.

> 3. Are pageIds recycled? If a page is deleted, could its id number be used
> for a completely new page in the future?

No. page_ids are handed out in strictly increasing order.
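For anyone who wants to look for case 2) in their own data, here is a rough Python sketch that compares two stub dumps and reports revisions that appear under a different page id in the newer one. It assumes the usual `<page>`/`<id>`/`<revision>` layout shown elsewhere in this thread; the function names (`revision_owners`, `moved_revisions`) and the demo data (including page id 999) are made up for illustration and are not part of any MediaWiki tool.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def revision_owners(source):
    """Map each revision id to the (page_id, title) it is listed under."""
    owners = {}
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = page_id = None
        rev_ids = []
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = int(child.text)
            elif name == "revision":
                # The revision id is a direct child of <revision>;
                # the contributor's id is nested one level deeper.
                for sub in child:
                    if local(sub.tag) == "id":
                        rev_ids.append(int(sub.text))
                        break
        for rev_id in rev_ids:
            owners[rev_id] = (page_id, title)
        elem.clear()  # drop the page's children once processed
    return owners


def moved_revisions(old_source, new_source):
    """Return {rev_id: (old_owner, new_owner)} where the page id differs."""
    old = revision_owners(old_source)
    new = revision_owners(new_source)
    return {
        rev: (old[rev], new[rev])
        for rev in old.keys() & new.keys()
        if old[rev][0] != new[rev][0]
    }


if __name__ == "__main__":
    # Hypothetical miniature dumps; page id 999 is invented for the demo.
    old_dump = io.BytesIO(
        b"<mediawiki><page><title>AmericanSamoa</title><id>6</id>"
        b"<revision><id>233188</id></revision></page></mediawiki>"
    )
    new_dump = io.BytesIO(
        b"<mediawiki><page><title>American_Samoa</title><id>999</id>"
        b"<revision><id>233188</id></revision></page></mediawiki>"
    )
    print(moved_revisions(old_dump, new_dump))
```

Both arguments can be file names or open binary file objects. Note that this keeps every revision id in memory, so for a full enwiki history dump it would probably need to be restricted to a subset of pages or backed by an on-disk store.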

Re: Questions about fields in wiki dumps

"Aryeh Gregor" <Simetrical+wikilist@...> wrote in message
news:7c2a12e20911100759s1ba211b0k6ef6cb076449be37@......
> On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina <jeff.kubina@...>
> wrote:
>> 2. Is it possible for just one particular revision of an article to be
>> deleted, maybe due to a copyright violation? If so, is just the content
>> of the revision deleted or would this include all the data associated
>> with it, so that the revision would not even appear in the
>> stub-meta-history.xml file?
>
> Yes, an individual revision can be deleted. There are at least three
> different ways to do this, last I checked. I would expect that the
> old ways (oversight, and delete+selective undelete) would leave no
> traces at all in the dump, while the new way (rev_deleted) might only
> suppress certain fields. I'm not sure offhand, though.

IIRC, any revision that has any of the rev_deleted bitfields set will be
excluded from dumps. Don't quote me on that....

--HM

Re: Questions about fields in wiki dumps

Thanks for the help, but I'm still a bit confused about this case: in
enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of
6, as shown below. But in enwiki-20090914-stub-meta-history.xml it has an
id of 23741520 <http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa>,
with only the last edit history entry. So what happened? Is this an example
of a delete, then restore with a new id? Why are the older revisions missing,
or does a restore only restore the latest revision?

XML from enwiki-20090714-stub-meta-history.xml for AmericanSamoa:

  <page>
    <title>AmericanSamoa</title>
    <id>6</id>
    <redirect />
    <revision>
      <id>233188</id>
      <timestamp>2001-01-19T01:12:51Z</timestamp>
      <contributor>
        <ip>office.bomis.com</ip>
      </contributor>
      <comment>*</comment>
      <text id="233188" />
    </revision>
    <revision>
      <id>15898942</id>
      <timestamp>2002-02-25T15:43:11Z</timestamp>
      <contributor>
        <ip>Conversion script</ip>
      </contributor>
      <minor/>
      <comment>Automated conversion</comment>
      <text id="15898942" />
    </revision>
    <revision>
      <id>18063795</id>
      <timestamp>2005-07-03T11:14:17Z</timestamp>
      <contributor>
        <username>Docu</username>
        <id>8029</id>
      </contributor>
      <minor/>
      <comment>adding to cur_id=6 {{R from CamelCase}}</comment>
      <text id="18058393" />
    </revision>
    <revision>
      <id>133180191</id>
      <timestamp>2007-05-24T14:41:33Z</timestamp>
      <contributor>
        <username>Ngaiklin</username>
        <id>4477979</id>
      </contributor>
      <minor/>
      <comment>Robot: Automated text replacement (-\[\[(.*?[\:|\|])*?(.+?)\]\] +\g&lt;2&gt;)</comment>
      <text id="132462505" />
    </revision>
    <revision>
      <id>133452270</id>
      <timestamp>2007-05-25T17:12:06Z</timestamp>
      <contributor>
        <username>Gurch</username>
        <id>241822</id>
      </contributor>
      <minor/>
      <comment>Revert edit(s) by [[Special:Contributions/Ngaiklin|Ngaiklin]] to last version by [[Special:Contributions/Docu|Docu]]</comment>
      <text id="132732979" />
    </revision>
  </page>

Thanks,
Jeff

--
Jeff Kubina
http://google.com/profiles/jeff.kubina

Re: Questions about fields in wiki dumps

On Tue, Nov 10, 2009 at 1:15 PM, Happy-melon <happy-melon@...> wrote:
>
> "Aryeh Gregor" <Simetrical+wikilist@...> wrote in message
> news:7c2a12e20911100759s1ba211b0k6ef6cb076449be37@......
>> On Tue, Nov 10, 2009 at 7:52 AM, Jeff Kubina <jeff.kubina@...>
>> wrote:
>>> 2. Is it possible for just one particular revision of an article to be
>>> deleted, maybe due to a copyright violation? If so, is just the content
>>> of the revision deleted or would this include all the data associated
>>> with it, so that the revision would not even appear in the
>>> stub-meta-history.xml file?
>>
>> Yes, an individual revision can be deleted. There are at least three
>> different ways to do this, last I checked. I would expect that the
>> old ways (oversight, and delete+selective undelete) would leave no
>> traces at all in the dump, while the new way (rev_deleted) might only
>> suppress certain fields. I'm not sure offhand, though.
>
> IIRC, any revision that has any of the rev_deleted bitfields set will be
> excluded from dumps. Don't quote me on that....

I'm not sure what the criteria actually are, but I recall encountering
a dump entry where the editor's name had been suppressed (missing in
the revision) but where the revision text itself was present. (I had
an analysis script choke on this, since up to that time I had assumed
every revision would have valid contributor information attached to
it.)

-Robert Rohde

Re: Questions about fields in wiki dumps

Happy-melon wrote:
>> Yes, an individual revision can be deleted. There are at least three
>> different ways to do this, last I checked. I would expect that the
>> old ways (oversight, and delete+selective undelete) would leave no
>> traces at all in the dump, while the new way (rev_deleted) might only
>> suppress certain fields. I'm not sure offhand, though.
>
> IIRC, any revision that has any of the rev_deleted bitfields set will be
> excluded from dumps. Don't quote me on that....
>
> --HM

They will appear with a deleted="deleted" attribute, so the content of
the suppressed fields isn't available, but that of the other fields is.
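A parser can be made tolerant of this with a few lines. The Python sketch below treats any child of `<revision>` carrying deleted="deleted" as suppressed and returns None for fields that are suppressed or simply absent. It assumes the attribute sits on the child elements themselves (for example `<contributor deleted="deleted" />`), which is not spelled out in this thread; `parse_revision` and the dictionary layout are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace, e.g. '{http://...}id' -> 'id'."""
    return tag.rsplit("}", 1)[-1]


def parse_revision(rev):
    """Read one <revision> element without assuming every field exists.

    Suppressed or missing fields come back as None; the names of
    suppressed child elements are collected under 'suppressed'.
    """
    fields = {
        "id": None,
        "timestamp": None,
        "user": None,
        "comment": None,
        "suppressed": [],
    }
    for child in rev:
        name = local(child.tag)
        # Assumption: suppression is marked on the child element itself.
        if child.get("deleted") == "deleted":
            fields["suppressed"].append(name)
            continue
        if name == "id":
            fields["id"] = int(child.text)
        elif name == "timestamp":
            fields["timestamp"] = child.text
        elif name == "comment":
            fields["comment"] = child.text
        elif name == "contributor":
            # A contributor is either <username> (plus <id>) or <ip>.
            for sub in child:
                if local(sub.tag) in ("username", "ip"):
                    fields["user"] = sub.text
    return fields
```

Downstream code can then check `fields["user"] is None` instead of failing on a revision that has no usable contributor information.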

Re: Questions about fields in wiki dumps

On Tue, Nov 10, 2009 at 2:48 PM, Jeff Kubina <jeff.kubina@...> wrote:
> Thanks for the help, but I'm still a bit confused about this case: in
> enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of
> 6, as shown below. But in enwiki-20090914-stub-meta-history.xml it has an
> id of 23741520 <http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa>,
> with only the last edit history entry. So what happened? Is this an example
> of a delete, then restore with a new id? Why are the older revisions missing,
> or does a restore only restore the latest revision?

I assume the Page ID answer lies with whatever the hell Graham87 was
doing here in July:

http://en.wikipedia.org/w/index.php?title=Special:Log&page=AmericanSamoa

Also, if you use a URL GET, such as you have above, it only gives the
most recent revision. You can uncheck the "Include only the current
revision" box at Special:Export if you want to get additional
revisions from the online form.

-Robert Rohde

Re: Questions about fields in wiki dumps

Jeff Kubina wrote:
> Thanks for the help, but I'm still a bit confused about this case: in
> enwiki-20090714-stub-meta-history.xml the AmericanSamoa page has a pageId of
> 6, as shown below. But in enwiki-20090914-stub-meta-history.xml it has an
> id of 23741520 <http://en.wikipedia.org/wiki/Special:Export/AmericanSamoa>,
> with only the last edit history entry. So what happened? Is this an example
> of a delete, then restore with a new id? Why are the older revisions missing,
> or does a restore only restore the latest revision?

See http://en.wikipedia.org/w/index.php?title=Special:Log&page=AmericanSamoa

There was quite a bit of deletion, move, and undeletion trickery to move
the first revision in the XML (the one from office.bomis.com) into the
history of American_Samoa:
http://en.wikipedia.org/w/index.php?title=American_Samoa&oldid=233188

It seems the AmericanSamoa page was recreated with a new page id during that.

There's another id oddity on that page. The office.bomis.com edit is from
January 2001 and has revid 233188, yet the diff links (wrongly) list as its
previous revision one from July 2002 with revid 205006. It is listed as
previous because 205006 < 233188. The older revision has the newer (higher)
revid because originally only the current versions of articles were imported
from UseModWiki (those tagged as from "Conversion script"). Older edits
like this one were imported later, after the 205006 edit was made.
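The practical upshot for anyone processing a page history is to order revisions by timestamp rather than by revid. A small Python sketch of that idea follows; the helper names are invented for the example, and the July 2002 timestamp is a placeholder, since only the month is given above.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse a dump timestamp such as '2001-01-19T01:12:51Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def chronological(revisions):
    """Sort (rev_id, timestamp) pairs by timestamp, then by rev_id."""
    return sorted(revisions, key=lambda r: (parse_ts(r[1]), r[0]))


def previous_revision(revisions, rev_id):
    """Return the pair that precedes rev_id in time, or None if it is first."""
    ordered = chronological(revisions)
    ids = [r[0] for r in ordered]
    i = ids.index(rev_id)
    return ordered[i - 1] if i > 0 else None


revs = [
    (205006, "2002-07-01T00:00:00Z"),  # placeholder: only "July 2002" is known
    (233188, "2001-01-19T01:12:51Z"),  # timestamp taken from the dump excerpt
]
```

Sorting `revs` by revid would put 205006 first, whereas `chronological(revs)` puts the January 2001 edit first, so `previous_revision(revs, 233188)` is None and the predecessor of 205006 is the 2001 edit.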

Re: Questions about fields in wiki dumps

--- On Wed, 11/11/09, Robert Rohde <rarohde@...> wrote:
> I'm not sure what the criteria actually are, but I recall encountering
> a dump entry where the editor's name had been suppressed (missing in
> the revision) but where the revision text itself was present. (I had
> an analysis script choke on this, since up to that time I had assumed
> every revision would have valid contributor information attached to
> it.)

Yes, that case actually forced updates to some parsers, mine included,
since they were not written to expect empty fields on revisions
(especially the rev_user field).

F.