Find a table cell containing an image tag.

Find a table cell containing an image tag.

by Ian Skinner-3

This is probably one of those look ahead|behind things that I still do
not grasp about regex.

I want to find <td...> tags in a string of HTML that contain an <img...>
tag.   But I need the entire <td..>...<img...>...</td> string to replace
it.  But I need to ignore all the other <td...>...</td> blocks that do
not contain image tags.

TIA
Ian



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Adobe® ColdFusion® 8 software 8 is the most important and dramatic release to date
Get the Free Trial
http://ad.doubleclick.net/clk;203748912;27390454;j

Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1169
Subscription: http://www.houseoffusion.com/groups/RegEx/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=17837.14401.21

Re: Find a table cell containing an image tag.

by Peter Boughton

> This is probably one of those look ahead|behind things that I still do
> not grasp about regex.
>
> I want to find <td...> tags in a string of HTML that contain an <img...>
> tag.   But I need the entire <td..>...<img...>...</td> string to replace
> it.  But I need to ignore all the other <td...>...</td> blocks that do
> not contain image tags.

Can you elaborate on the context of what you're trying to do here?

Because stuff like this is generally much much simpler with either
XPath or jQuery.



For example, here's how XPath matches any td containing an img:

//td[.//img]


And here's how jQuery does it:

$j( 'td:has(img)' )



To do it with RegEx, you want something like this:

(?ims)<td[^>]*>(?:[^/]|/(?!td>))*<img[^>]+>.*?</td>



So yeah, if you really have a specific need to do it with RegEx, you
can probably use that, but I'd generally go for either of the other
two for any tag-based issue like this.
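
A quick way to try the regex above is a few lines of Python. This is only a sketch: the sample HTML is made up, and Python's re module is assumed to treat this pattern the same way as the engine used in the thread.

```python
import re

# The pattern from the message above, unchanged.
TD_WITH_IMG = re.compile(r'(?ims)<td[^>]*>(?:[^/]|/(?!td>))*<img[^>]+>.*?</td>')

# Made-up sample row: only the middle cell contains an image.
html = (
    '<tr><td>plain text</td>'
    '<td align="center"><img src="a.gif"></td>'
    '<td>more text</td></tr>'
)

# No capturing groups in the pattern, so findall returns whole matches.
matches = TD_WITH_IMG.findall(html)
print(matches)

# Removing the matched cells leaves the other two untouched.
print(TD_WITH_IMG.sub('', html))
```

Only the cell holding the image is matched; the first and third cells are skipped because the pattern reaches /td> before it finds an <img.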

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1170

Re: Find a table cell containing an image tag.

by Ian Skinner-3

Peter Boughton wrote:
> Can you elaborate on the context of what you're trying to do here?
>
> Because stuff like this is generally much much simpler with either
> XPath or jQuery.
Yeah, XPath would be nice but the reason I'm looking for these <img...>
tags, is because they are not XHTML compliant and I needed to fix that
so that I could get this HTML fragment into an acceptable XML mode.

With a great deal of trial and error I finally came up with this.
<td align="center">.*?(?=<img).*?</td>

Luckily in this HTML string, the image <td..> blocks are the only ones
with the align="center" property.



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1171

Re: Find a table cell containing an image tag.

by Peter Boughton

> the reason I'm looking for these <img...>
> tags, is because they are not XHTML compliant and I needed to fix that
> so that I could get this HTML fragment into an acceptable XML mode.

So, why are you worrying about the tables if it's the image tags that
are the issue?

Just replace <img([^>]*[^/])> with <img\1/> and not worry about the tables?
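
As a rough illustration of that replacement, here it is in Python. The sample string is invented, and Python's re.sub stands in for whatever replace function is actually being used.

```python
import re

# Invented sample: one unclosed img tag and one already self-closed.
html = '<td><img src="a.gif" alt="A"><img src="b.gif"/></td>'

# The [^/] just before > means tags already ending in /> are left alone.
fixed = re.sub(r'<img([^>]*[^/])>', r'<img\1/>', html)
print(fixed)
```

Only the first tag gains a closing slash; the second cannot match because the character before its > is a /.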


I believe there are some XPath processors that can handle non-XML
HTML, but only because I've read a couple of references to such -
couldn't give any examples/recommendations.


> With a great deal of trial and error I finally came up with this.
> <td align="center">.*?(?=<img).*?</td>

Well, with a known set of source data that's fine, but potentially
that could match <td align="center">[stuff]</td><td><img/></td>

That's why the more general solution needs the negative lookahead for
/td> - to make sure that any img found is still within the opening
table cell.
The non-greedy matching (.*?) is not smart enough to do that for you.
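
The difference is easy to see by running both patterns against that exact string. A small Python sketch (the variable names are just for illustration):

```python
import re

# The problem case described above: the img is in the NEXT cell.
html = '<td align="center">[stuff]</td><td><img/></td>'

# The trial-and-error pattern, with "dot matches newline" switched on.
loose = re.compile(r'<td align="center">.*?(?=<img).*?</td>', re.S)

# The general pattern with the negative lookahead for /td>.
strict = re.compile(r'(?is)<td[^>]*>(?:[^/]|/(?!td>))*<img[^>]+>.*?</td>')

loose_match = loose.search(html).group(0)
strict_match = strict.search(html).group(0)

print(loose_match)   # spans both cells
print(strict_match)  # only the cell that really holds the img
```

The lazy .*? simply keeps expanding past the first </td> until an <img turns up, so the first pattern swallows both cells.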

Hope this all helps. :)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1172

Re: Find a table cell containing an image tag.

by Ian Skinner-3

Peter Boughton wrote:
> Just replace <img([^>]*[^/])> with <img\1/> and not worry about the tables?
>  
The main reason is that the fix I'm using is to remove the images, since
they are irrelevant and unavailable to this conversion exercise.  I
could make them compliant and ignore them.  But since this is more a
learning exercise than anything, I wanted to see if I could match the
entire <td...> block and remove it, since without the <img...> tag the
cells would be empty.

> Well, with a known set of source data that's fine, but potentially
> that could match <td align="center">[stuff]</td><td><img/></td>
>
> That's why the more general solution needs the negative lookahead for
> /td> - to make sure that any img found is still within the opening
> table cell.
> The non-greedy matching (.*?) is not smart enough to do that for you.
>
> Hope this all helps. :)
>  

It did, but could somebody parse the regex syntax for me in plain
English?  I could only understand about half of it on sight.
<td[^>]*>(?:[^/]|/(?!td>))*<img[^>]+>.*?</td>

The main features I do not grok are what role the [^>] plays and how
to interpret this part in front of the negative look behind; ?:[^/]/



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1173

Re: Find a table cell containing an image tag.

by Peter Boughton

> they are irrelevant and unavailable to this conversion exercise.  I
> could make them compliant and ignore them.  But since this is more a
> learning exercise than anything...

Fair enough on both counts. :)



> The main features I do not grok are what role the [^>] plays

That's for if you don't know (or don't want to limit) what attributes
a tag contains.

A tag should not normally contain a literal > character - any needed
inside attribute values would usually be escaped as &gt;

You could use a non-greedy wildcard like <tag.*?> but I use <tag[^>]*>
as it is more precise.
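
To show what "more precise" means in practice, here is a small Python comparison. The sample HTML and the trailing x in both patterns are made up purely to force the engine to look further along the string.

```python
import re

# Made-up sample: the cell whose text starts with x is the second one.
html = '<td>a</td><td class="pic">x</td>'

# Lazy wildcard: free to run past the first > when the rest fails to match.
lazy = re.search(r'<td.*?>x', html).group(0)

# Negated class: can never leave the opening tag it started in.
negated = re.search(r'<td[^>]*>x', html).group(0)

print(lazy)     # starts at the FIRST <td and spills across cells
print(negated)  # just the second opening tag plus the x
```

The lazy version is only "as short as possible" from where it started, so it can still cross tag boundaries; [^>]* cannot.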



> how to interpret this part in front of the negative look behind; ?:[^/]/

Ah, you're mis-reading that slightly. There are a few parts at play
here, I'll attempt to explain them individually in a simpler
context...

(note, I'm adding spaces purely for readability - pretend there are no
spaces in any of the following examples)


( x | y ) is the standard "x OR y"  - the parentheses are necessary to
prevent the OR from applying to the whole of the expression.

However, using parentheses means that regex will capture the contents
for a backreference. This is not necessary here, so to tell it to discard
the contents, we put ?: inside the parens, so we get (?: x | y )
p.s. this also works without the OR operator - just as (?: x )

The first part of the OR is [^/] which means simply "not /" - putting a
caret (^) as the first character inside brackets negates them.
e.g. [^abc] means "a single character that is not a nor b nor c"
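
The grouping and negation pieces above can be checked quickly in Python. This is a sketch with throwaway sample strings, assuming Python's re module behaves like the engine under discussion for these basics.

```python
import re

# ( x | y ) captures whatever the group matched...
capturing = re.search(r'(x|y)z', 'yz')

# ...while (?: x | y ) groups the OR without remembering anything.
non_capturing = re.search(r'(?:x|y)z', 'yz')

print(capturing.groups())      # one captured group
print(non_capturing.groups())  # no captured groups
print(non_capturing.group(0))  # the overall match is the same

# [^abc] matches a single character that is not a, b or c.
not_abc = re.findall(r'[^abc]', 'abcd')
print(not_abc)
```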

Then, there's a negative lookahead which is (?! x ) and is the inverse
of a regular lookahead - i.e. it makes sure the contents of the parens
are NOT there.
As with all lookarounds, it is zero-width - it matches only a position
not actual characters. That is perhaps the key to understanding how
they work - that no characters are ever consumed by a lookaround, but
they still must match against the characters that follow the current
position.

Since we're dealing with a position, we need a preceding character to
actually proceed with the match.
For example x (?! y ) will match any x that is not followed by a y (but
it will match only the x and will continue checking the rest of the
pattern from the next character).
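
Here is the x (?! y ) example as runnable Python, written without the readability spaces. The test strings are invented.

```python
import re

text = 'xy xz x'

# Each hit is (position, matched text); the lookahead adds no characters.
hits = [(m.start(), m.group(0)) for m in re.finditer(r'x(?!y)', text)]
print(hits)

# Zero-width in action: the z is checked by the lookahead but not consumed,
# so the following z in the pattern can still match it.
follow_on = re.search(r'x(?!y)z', 'xz').group(0)
print(follow_on)
```

The x at the very end also matches, since "nothing follows" counts as "not followed by y".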

Since I mentioned the non-capturing (?: x ) above, I'll point out that
this command is implicit in all lookarounds - they do not capture
their contents for backreferences.


So, to put all that together, what all this (?: [^/] |  / (?! td> ) )
is actually saying is:
Look for anything that is not a slash OR if you do find a slash only
accept it if it is not followed by the characters "td>", and when you
find either of these don't bother remembering it and just move on.

Or, put simpler, "if you find /td> in this section then stop trying to match"
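
Running that one group on its own shows where it stops. A Python sketch with a made-up string:

```python
import re

# Just the middle section of the full pattern, repeated with *.
inside_cell = re.compile(r'(?:[^/]|/(?!td>))*')

# Made-up cell content followed by the start of the next cell.
text = 'a/b <br/> c</td><td>next'

consumed = inside_cell.match(text).group(0)
print(repr(consumed))
```

Ordinary slashes (as in a/b and <br/>) are accepted, but the slash in </td> is refused, so the group halts just before it. Note the < in front of that slash is still consumed; in the full pattern the engine then backtracks from there to find the <img.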



Hopefully all of that makes sense? Feel free to ask if any part is unclear. :)




--
Peter Boughton
//hybridchill.com
//blog.bpsite.net

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1174