You may have seen:
A new Basketball season brings a new episode
in the personal information disaster
by connolly on Thu, 2006-11-16 12:39
tags: calendar | GRDDL | microformats | RDF | XHTML
http://dig.csail.mit.edu/breadcrumbs/node/172
A new schedule came this week, and it had an unexpected linebreak,
so I upgraded from tidy and regular expressions
to html5lib and ElementTree.
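As a rough illustration of why a stray linebreak can break a regex-based scraper while a tree-based one carries on, here is a minimal sketch. The sample markup, the pattern, and the function names are all made up for illustration; they are not taken from the real schedule or from the original script. To stay within the standard library it parses a well-formed fragment with ElementTree directly rather than going through html5lib.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample row (an assumption, not the real schedule):
# the second cell wraps onto a new line.
SAMPLE = """<table>
<tr><td>Nov 20</td><td>6:00
pm</td></tr>
</table>"""

# A line-oriented pattern: it only matches when both cells
# sit on the same line of text.
CELL = re.compile(r"<td>(.*)</td><td>(.*)</td>")


def cells_by_regex(text):
    """Return (date, time) pairs found by matching one line at a time."""
    found = []
    for line in text.splitlines():
        m = CELL.search(line)
        if m:
            found.append(m.groups())
    return found


def cells_by_tree(text):
    """Return (date, time) pairs found by walking the parsed tree."""
    root = ET.fromstring(text)
    rows = []
    for tr in root.iter("tr"):
        # Collapse runs of whitespace, including the linebreak, to one space.
        rows.append(tuple(" ".join((td.text or "").split())
                          for td in tr.findall("td")))
    return rows
```

With this sample, the regex version finds no rows at all, while the tree version still sees one row with two cells, because the linebreak is just whitespace inside a text node.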
This message has most of the raw materials for another
breadcrumbs episode...
import html5lib # http://code.google.com/p/html5lib/
from html5lib import HTMLParser, treebuilders
from xml.etree import cElementTree

def parseHTML(fn="bball-practice.html"):
    """
    >>> e = parseHTML()
    >>> e.tag
    'html'
    >>> rows = e.getiterator('tr')
    >>> len(list(rows))
    24
    """
    f = open(fn)
    parser = HTMLParser(tree=treebuilders.getTreeBuilder("etree",
                                                         cElementTree))
    return parser.parse(f)

...
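def eachEvent(...):
    ...
    # Find the first table whose leading cell has a bold 'First Name' header.
    for t in elt.getiterator('table'):
        cell = t.find('tbody/tr/td')
        if cell is None: continue
        hd = cell.findtext('b')
        if not hd: continue
        if 'First Name' in hd: break
    else:
        # The loop finished without a break: no such table was found.
        raise ValueError(elt)
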
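The loop above relies on Python's for/else: the else branch runs only when the loop finishes without hitting break. Here is a self-contained sketch of the same search. The function name find_roster_table, the marker parameter, and the sample markup are hypothetical, and it parses a well-formed fragment with ElementTree directly instead of html5lib so that it runs with the standard library alone. It tests `cell is None` because the truth value of an element depends on how many children it has, not on whether find() matched.

```python
import xml.etree.ElementTree as ET

# Made-up page (an assumption): one table without the header,
# one table with it.
SAMPLE = """<html><body>
<table><tbody><tr><td>Directions</td></tr></tbody></table>
<table><tbody><tr><td><b>First Name</b></td><td><b>Phone</b></td></tr></tbody></table>
</body></html>"""


def find_roster_table(root, marker="First Name"):
    """Return the first table whose leading cell has a bold header
    containing marker; raise ValueError if there is no such table."""
    for t in root.iter("table"):
        cell = t.find("tbody/tr/td")
        if cell is None:
            continue  # table has no tbody/tr/td at all
        hd = cell.findtext("b")
        if not hd:
            continue  # leading cell has no <b> child with text
        if marker in hd:
            break  # found it; skip the else branch below
    else:
        # Reached only when the loop ran to completion without break.
        raise ValueError("no table whose first cell contains %r" % marker)
    return t
```

Calling find_roster_table(ET.fromstring(SAMPLE)) returns the second table; passing a marker that appears nowhere raises ValueError.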
For my reference, some hg logs:
16:ae65b101cf4c 2007-11-18 got html5lib talking with etree
17:5f81574c79fb 2007-11-18 - use html5lib and ElementTree rather than
tidy and regular expressions
--
Dan Connolly, W3C
http://www.w3.org/People/Connolly/
gpg D3C2 887B 0F92 6005 C541 0875 0F91 96DE 6E52 C29E