Datamining infoboxes

Infoboxes in Wikipedia often contain information which is quite useful outside Wikipedia but can be surprisingly difficult to data-mine.

I would like to find all Wikipedia pages that use Template:Infobox_Language and parse the parameters iso3 and fam1...fam15.

But my attempts to find such pages using either the Toolserver's Wikipedia database or the MediaWiki API have not been fruitful. In particular, SQL queries on the templatelinks table are intractably slow. Why are there no keys on tl_from or tl_title?

Andrew Dunbar (hippietrail)

--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net
_______________________________________________
Wikitech-l mailing list
Wikitech-l@...
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
|
|
Re: Datamining infoboxes

> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?

How are you planning to get the template parameters? Have I missed a recent schema change?

I'd be interested in following your progress. I'm not extracting infobox data, but parameters of the coordinate template. Maybe a similar approach could be interesting for you:

The coordinate template stuffs all its parameters into an external link (which can easily be obtained from the externallinks table). Creating dummy links containing parameters for some infoboxes could be one way of making the data available for automatic extraction (yes, it's a hack, but I'd prefer better suggestions over flames).

The link could actually be made useful: it could point to a query page for the data in these infoboxes.

[[User:Dschwen]]
|
|
Re: Datamining infoboxes

2009/10/22 Daniel Schwen <lists@...>:
>> particular, SQL queries on the templatelinks table are intractably
>> slow. Why are there no keys on tl_from or tl_title?
>
> How are you planning to get the template parameters? Have I missed a
> recent schema change?

I've been trying to parse the wikitext of section 0 with a minimal parser that uses just the tokens {{ }} {{{ and }}} but it already has problems when it sees }}}}

> I'd be interested in following your progress. I'm not extracting
> infobox data, but parameters of the coordinate template. Maybe a
> similar approach could be interesting for you:
>
> The coordinate template stuffs all its parameters into an external
> link (which can easily be obtained from the externallinks table).
> Creating dummy links containing parameters for some infoboxes could be
> one way of making the data available for automatic extraction (yes,
> it's a hack, but I'd prefer better suggestions over flames).
>
> The link could actually be made useful, it could point to a query page
> for the data in these infoboxes.

The template and parameters I'm interested in don't generate any such external links and probably couldn't very easily...

But I have just discovered the rvgeneratexml parameter to action=query&prop=revisions. This includes a <part> field for each template parameter with a <name> and a <value> for each...

Andrew Dunbar (hippietrail)
|
|
Re: Datamining infoboxes

This discussion brings to mind several historical threads.

I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infobox-(key-value) sets...

On Thu, Oct 22, 2009 at 11:13 PM, Andrew Dunbar <hippytrail@...> wrote:
> 2009/10/22 Daniel Schwen <lists@...>:
>>> particular, SQL queries on the templatelinks table are intractably
>>> slow. Why are there no keys on tl_from or tl_title?
>>
>> How are you planning to get the template parameters? Have I missed a
>> recent schema change?
>
> I've been trying to parse the wikitext of section 0 with a minimal
> parser that uses just the tokens {{ }} {{{ and }}} but it already has
> problems when it sees }}}}
>
>> I'd be interested in following your progress. I'm not extracting
>> infobox data, but parameters of the coordinate template. Maybe a
>> similar approach could be interesting for you:
>>
>> The coordinate template stuffs all its parameters into an external
>> link (which can easily be obtained from the externallinks table).
>> Creating dummy links containing parameters for some infoboxes could be
>> one way of making the data available for automatic extraction (yes,
>> it's a hack, but I'd prefer better suggestions over flames).
>>
>> The link could actually be made useful, it could point to a query page
>> for the data in these infoboxes.
>
> The template and parameters I'm interested in don't generate any such
> external links and probably couldn't very easily...
>
> But I have just discovered the rvgeneratexml parameter to
> action=query&prop=revisions
> This includes a <part> field for each template parameter with a <name>
> and a <value> for each...
>
> Andrew Dunbar (hippietrail)
>
>> [[User:Dschwen]]

--
-george william herbert
george.herbert@...
|
|
Re: Datamining infoboxes

George Herbert wrote:
> This discussion brings to mind several historical threads.
>
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

I don't know anybody on the data side at Metaweb anymore, but I know that they did something like that to import a lot of structured Wikipedia data into their Freebase project. They publish some sort of data dump here:

http://download.freebase.com/wex/

Perhaps they'd be willing to open-source their parser.

William
|
|
Re: Datamining infoboxes

2009/10/23 Andrew Dunbar <hippytrail@...>:
> But my attempts to find such pages using either the Toolserver's
> Wikipedia database or the Mediawiki API have not been fruitful. In
> particular, SQL queries on the templatelinks table are intractably
> slow. Why are there no keys on tl_from or tl_title?

There are:

CREATE UNIQUE INDEX /*i*/tl_from ON /*_*/templatelinks (tl_from,tl_namespace,tl_title);
CREATE UNIQUE INDEX /*i*/tl_namespace ON /*_*/templatelinks (tl_namespace,tl_title,tl_from);

It's just that tl_title is always coupled with tl_namespace because that's how you should be using it (tl_namespace=10 for the template namespace). Note that the former index can be used as an index on (tl_from) as well.

Roan Kattouw (Catrope)
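The point about composite indexes can be tried out locally. The following is only a rough sketch: it uses Python's built-in sqlite3 module rather than the MySQL server the Toolserver ran, so the planner output is SQLite's and not what MySQL would print, and the cut-down table definition and the helper name `query_plan` are assumptions made for illustration. It shows that a lookup constrained on (tl_namespace, tl_title) can seek into the index, while a lookup on tl_title alone has to scan.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable text is the fourth column of every plan row.
    return " / ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down stand-in for the MediaWiki table (assumed column types).
    CREATE TABLE templatelinks (
        tl_from      INTEGER NOT NULL DEFAULT 0,
        tl_namespace INTEGER NOT NULL DEFAULT 0,
        tl_title     TEXT    NOT NULL DEFAULT ''
    );
    CREATE UNIQUE INDEX tl_from
        ON templatelinks (tl_from, tl_namespace, tl_title);
    CREATE UNIQUE INDEX tl_namespace
        ON templatelinks (tl_namespace, tl_title, tl_from);
""")

# Both leading columns of the tl_namespace index are constrained.
with_namespace = query_plan(
    conn,
    "SELECT tl_from FROM templatelinks "
    "WHERE tl_namespace = 10 AND tl_title = 'Infobox_Language'",
)

# Only tl_title is constrained, and no index starts with that column.
title_only = query_plan(
    conn,
    "SELECT tl_from FROM templatelinks WHERE tl_title = 'Infobox_Language'",
)

print("namespace + title:", with_namespace)
print("title only:       ", title_only)
```

The first plan should begin with SEARCH and the second with SCAN, which is the same leftmost-prefix behaviour described above for tl_namespace and tl_title.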
|
|
Re: Datamining infoboxes

On Fri, Oct 23, 2009 at 08:37, George Herbert <george.herbert@...> wrote:
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

That's what DBpedia is doing. The extracted data can be found here, in N-Triples and CSV format:

http://wiki.dbpedia.org/Downloads

The entries in the row labelled 'Infoboxes' are files that contain the extracted values of all template properties in each page of a Wikipedia instance. For large Wikipedias like en, the unzipped files are pretty big (several GB).

Most of the extraction code can be found in these PHP classes:

https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/InfoboxExtractor.php
https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/infobox/

Christopher
|
|
Re: Datamining infoboxes

Hi Hippietrail!

What do you mean by "intractably slow"? Just how fast must it be?

If I do
http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
it says (on one given try) that it was served in 0,047 seconds. How long can it take to read them all? A few minutes?

Seems to me that time would be swamped by the time it takes to pull the wikitext for the pages?

And methinks you might be trying too hard to parse the text; some fairly simple regex or such can extract the template invocation and the parameters. People use it in a pretty regular way.

Oh, and do remember to look for "Template:Infobox language" as well, depending on which way you find them.

Robert
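Reading the whole list means following the API's continuation markers from one batch to the next. Here is a minimal sketch of that loop, assuming the JSON output format and the `continue` block used by current MediaWiki versions (the API at the time of this thread returned `query-continue` instead). The names `fetch_json` and `iter_embedding_pages` and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Hypothetical helper: GET the API with `params`, return decoded JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "infobox-miner-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_embedding_pages(template, fetch=fetch_json, limit=100):
    """Yield titles of main-namespace pages that transclude `template`."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "eilimit": limit,
        "einamespace": 0,
        "format": "json",
    }
    while True:
        reply = fetch(params)
        for page in reply["query"]["embeddedin"]:
            yield page["title"]
        if "continue" not in reply:
            break  # last batch reached
        # Carry the continuation values (e.g. eicontinue) into the next call.
        params = {**params, **reply["continue"]}
```

Something like `for title in iter_embedding_pages("Template:Infobox language"): print(title)` would then walk every batch. The `fetch` argument is there so the paging logic can be exercised with canned replies, without touching the network.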
|
|
Re: Datamining infoboxes

> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

That is a brilliant idea...

...that somebody else already had and implemented: Templatetiger

http://toolserver.org/~kolossos/templatetiger/template-choice.php?lang=enwiki

Should have mentioned that earlier.
|
|
Re: Datamining infoboxes

2009/10/23 Robert Ullmann <rlullmann@...>:
> Hi Hippietrail!
>
> What do you mean by "intractably slow"? Just how fast must it be?
>
> If I do
> http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Infobox_Language&eilimit=100&einamespace=0
> it says (on one given try) that it was served in 0,047 seconds. How
> long can it take to read them all? A few minutes?

Yes, I found how to get it through the API now. It was actually just the Toolserver database that was intractably slow.

> Seems to me that time would be swamped by the time it takes to pull
> the wikitext for the pages?
>
> And methinks you might be trying too hard to parse the text, some
> fairly simple regex or such can extract the template invocation and
> the parameters; people use it in a pretty regular way.

I've been spending hours on the parsing now and don't find it simple at all due to the fact that templates can be nested. Just extracting the Infobox as one big lump is hard due to the need to match nested {{ and }}

Andrew Dunbar (hippietrail)

> Oh, and do remember to look for "Template:Infobox language" as well,
> depending on which way you find them.
>
> Robert
|
|
Re: Datamining infoboxes

Given the fairly obvious utility for data mining, it might make sense for someone to extend the Mediawiki API to generate a list of template calls and the parameters sent in each case.

-Robert Rohde
|
|
Re: Datamining infoboxes

I am so glad that someone re-re-resurrects this topic :-)

On Fri, Oct 23, 2009 at 1:27 PM, Andrew Dunbar <hippytrail@...> wrote:
> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}

Not perfect, but try
http://toolserver.org/~magnus/wiki2xml/w2x.php

1. Uncheck "Use API", choose "Do not use templates"
2. Enter article name(s)
3. Get XML
4. Parse XML, re-submit the wiki text in templates to process the next level of templates

I should really offer #4 in this...

Caveat: Will break on things like HTML attributes that are filled by templates etc.

Cheers,
Magnus
|
|
Re: Datamining infoboxes

> I've been spending hours on the parsing now and don't find it simple
> at all due to the fact that templates can be nested. Just extracting
> the Infobox as one big lump is hard due to the need to match nested {{
> and }}
>
> Andrew Dunbar (hippietrail)

Hi,

Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in the text, then count braces. Start at depth=2, count up and down 'till you reach 0, and you are at the end of the template. (You can be picky about only counting them if paired if you like ;-)

Then just regex match the lines/parameters you want.

However, if you are pulling the wikitext with the API, the XML parse tree option sounds good; then you can just use elementTree (or the like) and pull out the parameters directly.

Robert
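The brace-counting idea might look roughly like this in Python. This is a sketch of the "picky" variant that counts only paired {{ and }}, so depth starts at 1 per pair rather than 2 per brace. The function name `extract_infobox` is invented here, the start pattern is loosened to accept either case and an underscore, and it deliberately ignores {{{ }}} parameters, HTML comments and <nowiki> sections, any of which could throw the count off.

```python
import re

# Accepts "Infobox language", "infobox_Language" and similar spellings.
START = re.compile(r"\{\{\s*[Ii]nfobox[ _][Ll]anguage")


def extract_infobox(wikitext):
    """Return the whole infobox call as one string, or None if not found."""
    match = START.search(wikitext)
    if match is None:
        return None
    depth = 0
    i = match.start()
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}":
            depth -= 1          # leaving a template
            i += 2
            if depth == 0:
                return wikitext[match.start():i]
        else:
            i += 1
    return None                 # the braces never balanced


sample = (
    "Intro {{Infobox language|name=English|iso3=eng"
    "|map={{Location map|x=1}}}} tail {{other}}"
)
print(extract_infobox(sample))
```

The returned lump still contains any nested template calls, so a per-line regex for iso3 or fam1 should be applied with that in mind.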
|
|
Re: Datamining infoboxes

2009/10/23 Robert Ullmann <rlullmann@...>:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>
> Hi,
>
> Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
> the text, then count braces. Start at depth=2, count up and down 'till
> you reach 0, and you are at the end of the template. (you can be picky
> about only counting them if paired if you like ;-)

Actually you have to find "{{[Ii]nfobox[ _][Ll]anguage"

And I wanted to be robust. It's perfectly legal for single unmatched braces to appear anywhere and I didn't want them to break my code. As it happens there don't seem to currently be any in the language infoboxes. I couldn't be sure whether there would be any cases where a {{{ or }}} might show up either. And there are a few other edge cases such as HTML comments, <nowiki> and friends, template invocations in values, and even possibly template invocations in names?

> Then just regex match the lines/parameters you want.
>
> However, if you are pulling the wikitext with the API, the XML parse
> tree option sounds good; then you can just use elementTree (or the
> like) and pull out the parameters directly

I've got it extracting the name/value pairs from the XML finally, but parsing XML is always a pain. And it still misses Norwegian, Bokmål, and Nynorsk, which wrap the infobox in another template...

Andrew Dunbar (hippietrail)

> Robert
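Pulling name/value pairs out of the parse tree with ElementTree could be sketched as follows. The XML shape is an assumption based on the <part>, <name> and <value> elements mentioned in this thread, plus <template> and <title> elements as emitted by the MediaWiki preprocessor; the sample document is invented, and `template_params` is a hypothetical helper name. Walking every <template> element, not only the top-level ones, is one way to reach an infobox that is wrapped inside another template.

```python
import xml.etree.ElementTree as ET


def _text(element):
    """Concatenate all text inside `element`, flattening nested markup."""
    return "".join(element.itertext()).strip()


def template_params(parse_tree_xml, wanted_title):
    """Return {name: value} for the first template called `wanted_title`."""
    wanted = wanted_title.replace("_", " ").strip().lower()
    root = ET.fromstring(parse_tree_xml)
    # iter() visits nested templates too, so wrapper templates are no obstacle.
    for template in root.iter("template"):
        title = _text(template.find("title")).replace("_", " ").lower()
        if title != wanted:
            continue
        params = {}
        for part in template.findall("part"):
            name = part.find("name")
            # Positional parameters carry an index attribute, not a name.
            key = name.get("index") or _text(name)
            params[key] = _text(part.find("value"))
        return params
    return None


# Invented sample: an infobox wrapped in another template.
SAMPLE = (
    "<root><template><title>Wrapper</title>"
    '<part><name index="1"/><value>'
    "<template><title>Infobox Language\n</title>"
    "<part><name>iso3</name>=<value>nob</value></part>"
    "<part><name>fam1</name>=<value>Indo-European</value></part>"
    "</template></value></part></template></root>"
)

print(template_params(SAMPLE, "Infobox_language"))
```

Comparing titles in lowercase is looser than MediaWiki's own rule (only the first letter is case-insensitive), and values that themselves contain templates come back flattened to plain text, so both would need tightening for real use.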
|
|
Re: Datamining infoboxes

2009/10/23 Robert Rohde <rarohde@...>:
> Given the fairly obvious utility for data mining, it might make sense
> for someone to extend the Mediawiki API to generate a list of template
> calls and the parameters sent in each case.

We had a discussion about this Tuesday in the tech staff meeting, and decided that we want to put this data mining possibility in core at some point (using a table like pagelinks to store these key/value pairs and modifying the parser). As you may understand this is not a very high priority project, and I don't know if any of the paid developers are gonna do it any time soon.

Roan Kattouw (Catrope)
|
|
Re: Datamining infoboxes

Robert Ullmann wrote:
>> I've been spending hours on the parsing now and don't find it simple
>> at all due to the fact that templates can be nested. Just extracting
>> the Infobox as one big lump is hard due to the need to match nested {{
>> and }}
>>
>> Andrew Dunbar (hippietrail)
>
> Hi,
>
> Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in
> the text, then count braces. Start at depth=2, count up and down 'till
> you reach 0, and you are at the end of the template. (you can be picky
> about only counting them if paired if you like ;-)
>
> Then just regex match the lines/parameters you want.
>
> However, if you are pulling the wikitext with the API, the XML parse
> tree option sounds good; then you can just use elementTree (or the
> like) and pull out the parameters directly
>
> Robert

Or you could use the pyparsing Python library, with which you can implement the grammar of your choice, making matching nested template extraction trivial. Using the psyco package to accelerate it, you can parse a whole en: dump in a few hours.

See the code below for a sample grammar...

-- Neil

------------------------------------------------

# Use pyparsing, enablePackrat() _and_ psyco for a considerable speed-up
from pyparsing import *
import psyco
# These two must be in the correct order, or bad things will happen
ParserElement.enablePackrat()
psyco.full()

wikitemplate = Forward()

wikilink = Combine("[[" + SkipTo("]]") + "]]")

wikiargname = CharsNotIn("|{}=")
wikiargval = ZeroOrMore(
    wikilink
    | Group(wikitemplate)
    | CharsNotIn("[|{}")
    | "["
    | "{"
    | Regex("}[^}]"))
wikiarg = Group(Optional(wikiargname + Suppress("="), default="??") + wikiargval)

wikitemplate << (Suppress("{{") + wikiargname +
                 Optional(Suppress("|") + delimitedList(wikiarg, "|")) +
                 Suppress("}}"))

wikitext = ZeroOrMore(CharsNotIn("{") | Group(wikitemplate) | "{" )

def parse_page(text):
    return wikitext.parseString(text)
|
|
Re: Datamining infoboxes

On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <hippytrail@...> wrote:
> Yes I found how to get it through the API now. It was actually just
> the Toolserver database that was intractably slow.

There's nothing slow about the TS database here:

mysql> pager true
PAGER set to 'true'
mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND tl_title IN ('Infobox_Language', 'Infobox_language');
3144 rows in set (0.12 sec)

Your query might have been what was slow.
|
|
Re: Datamining infoboxes

Fascinating!

It seems to be a repeating pattern on these mailing lists that people ignore existing solutions and discuss re-inventing wheels (please correct me if I'm wrong here). While I agree this is fun, it rarely helps the OP...

[[User:Dschwen]]
|
|
Re: Datamining infoboxes

2009/10/23 William Pietri <william@...>:
> George Herbert wrote:
>> This discussion brings to mind several historical threads.
>> I wonder if a project to simply mine the whole article contents and
>> provide a DB of some sort with the articles and infobox contents would
>> be worthwhile. Develop a specific parser and generate and publish the
>> complete set of article-infobox-(key-value) sets...
>
> I don't know anybody on the data side at Metaweb anymore, but I know
> that they did something like that to import a lot of structured
> Wikipedia data into their Freebase project. They publish some sort of
> data dump here:
> http://download.freebase.com/wex/
> Perhaps they'd be willing to open-source their parser.

They're right into open source, I suspect they would.

- d.
|
|
Re: Datamining infoboxes

2009/10/23 Aryeh Gregor <Simetrical+wikilist@...>:
> On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar <hippytrail@...> wrote:
>> Yes I found how to get it through the API now. It was actually just
>> the Toolserver database that was intractably slow.
>
> There's nothing slow about the TS database here:
>
> mysql> pager true
> PAGER set to 'true'
> mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND
> tl_title IN ('Infobox_Language', 'Infobox_language');
> 3144 rows in set (0.12 sec)
>
> Your query might have been what was slow.

Yes, I didn't specify tl_namespace, and when I checked which columns have keys I could see none:

mysql> describe templatelinks;
+--------------+-----------------+------+-----+---------+-------+
| Field        | Type            | Null | Key | Default | Extra |
+--------------+-----------------+------+-----+---------+-------+
| tl_from      | int(8) unsigned | NO   |     | 0       |       |
| tl_namespace | int(11)         | NO   |     | 0       |       |
| tl_title     | varchar(255)    | NO   |     |         |       |
+--------------+-----------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

But I don't know much about databases and SQL...

By the way, I have reached an important milestone: extracting all the name/value pairs for the language infobox ISO 639 language codes and language family strings. But the values still need some work before I can try to match them against ISO 639-5 language family codes, which is my ultimate goal.

Thanks for all the tips.

Andrew Dunbar (hippietrail)