Removing nested span tags in text

Scott_Griffitts · September 26, 2014, 5:06pm

I’ve got several HTML files that can contain thousands of nested span tags. Some, but not all, of these spans need to be removed. Note that only the selected tags need to be removed, not the text inside them.

For example, this:

This is some <span class="this_tag_needs_to_be_removed">example <span class="this_should_stay">text</span> showing <span>what I'm</span> talking <span class="this_needs_to_go_too">about.</span></span>

Needs to become this:

This is some example <span class="this_should_stay">text</span> showing <span>what I'm</span> talking about.

I have a working process with AppleScript and BBEdit, but it’s clunky and slow. I was wondering if anyone has done anything similar and would have any pointers to share.

Kem_Tekinay · September 27, 2014, 12:22am

How do you decide between what stays and what goes?

Scott_Griffitts · September 27, 2014, 12:57am

I’m looking at these files and determining which span tags of a certain class (or just plain old tags) need to be deleted. The trick is that the corresponding closing tag needs to be removed as well. This would be a piece of cake if these things weren’t nested. I’m playing around with finding the first occurrence of the selected tag, then finding the next occurrence of “<span” and the next occurrence of “</span>”. I’ll compare those starting points to try to determine whether they’re nested and whether I need to dig further. It’s starting to get complicated to visualize, so it’s slow going. Even if I get that working, it may turn out to be unbearably slow.
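The position-comparison idea described above can be sketched roughly as follows. This is a minimal illustration in Python rather than Xojo, with `str.find` standing in for InStr; the function names `find_matching_close` and `remove_span` are hypothetical, and the sketch assumes the span tags are balanced and that the opening tag is matched as an exact string.

```python
def find_matching_close(html: str, open_start: int) -> int:
    """Return the index of the </span> that closes the <span at open_start."""
    depth = 0
    pos = open_start
    while True:
        next_open = html.find("<span", pos)
        next_close = html.find("</span>", pos)
        if next_close == -1:
            raise ValueError("unbalanced span tags")
        if next_open != -1 and next_open < next_close:
            # Another span opens before the next close: one level deeper.
            depth += 1
            pos = next_open + len("<span")
        else:
            # A close comes first: one level shallower.
            depth -= 1
            if depth == 0:
                return next_close
            pos = next_close + len("</span>")


def remove_span(html: str, opening_tag: str) -> str:
    """Drop every occurrence of opening_tag and its matching </span>, keeping the text."""
    while True:
        start = html.find(opening_tag)
        if start == -1:
            return html
        close = find_matching_close(html, start)
        inner = html[start + len(opening_tag):close]
        html = html[:start] + inner + html[close + len("</span>"):]


sample = ('This is some <span class="this_tag_needs_to_be_removed">example '
          '<span class="this_should_stay">text</span> showing <span>what I\'m</span> '
          'talking <span class="this_needs_to_go_too">about.</span></span>')
result = remove_span(sample, '<span class="this_tag_needs_to_be_removed">')
result = remove_span(result, '<span class="this_needs_to_go_too">')
print(result)
```

Because each removal rebuilds the string and restarts the search from the beginning, this version is likely to be slow on files with thousands of spans; it is meant to show the nesting logic, not to be fast.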

Kem_Tekinay · September 27, 2014, 1:09am

A regular expression can do the job for you.

Scott_Griffitts · September 27, 2014, 1:46am

Do you have an example? I could search for something like (?s)(.+?) but that wouldn’t handle something nested in the middle. I’m not sure how to construct a regex to keep looking for nested tags and their corresponding closing tags.

Kem_Tekinay · September 27, 2014, 2:16am

Yes, I’ll whip something up for you soon.

Scott_Griffitts · September 27, 2014, 3:07am

Well, for what it’s worth, I’ve come up with something using mainly InStr that takes about 45 seconds on a 1 MB file. Using the same file with BBEdit and AppleScript takes 15 minutes. I’m just going to compare the two files and make sure I got the same result.

Kem_Tekinay · September 27, 2014, 3:15am

OK, well, here’s the pattern anyway:

<span(?: class="(?'tag'[^"]*)")?>(?'text'(?:(?R)|(?!</span>).)*)</span>

That will stick the class name (if any) into SubExpressionString( 1 ) and the text between the span tags into SubExpressionString( 2 ).

Kem_Tekinay · September 27, 2014, 3:17am

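I should clarify. In your example, running it once would match:

“<span class="this_tag_needs_to_be_removed">example <span class="this_should_stay">text</span> showing <span>what I'm</span> talking <span class="this_needs_to_go_too">about.</span></span>”

In SubExpressionString( 1 ), you’d find “this_tag_needs_to_be_removed”, and in SubExpressionString( 2 ) you’d find “example <span class="this_should_stay">text</span> showing <span>what I'm</span> talking <span class="this_needs_to_go_too">about.</span>”. You can then run it again on the text in SubExpressionString( 2 ) to keep drilling down.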

Scott_Griffitts · September 27, 2014, 3:24am

Thank you very much. I’ll need to study this. I notice you use (?R) which is recursive, right? I’ve never messed with it and wasn’t sure Xojo offered it as I didn’t see it in the regex documentation.

Kem_Tekinay · September 27, 2014, 3:32am

Xojo uses a pretty recent version of PCRE, so yes, it’s there. Recursion in regular expressions can get pretty tricky, but I created and tested this pattern in RegExRX against your sample text and it works. Of course, you’ll have to tweak it if your actual text doesn’t match your sample text.

Beatrix_Willius · September 27, 2014, 8:26am

Don’t use Regex to parse html. Use Tidy from MBS instead.

Have a look at what happens here for parsing html with Regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags .
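To illustrate the parser-based route in general terms, here is a rough sketch that uses Python's standard `html.parser` module. It is not Tidy or the MBS plugin, whose APIs are not shown in this thread. The class name `SpanStripper` and the `classes_to_remove` parameter are made up for the example, and the class attribute is compared as a whole string. The sketch re-emits only the common node types, so it is not a faithful HTML serializer.

```python
from html.parser import HTMLParser


class SpanStripper(HTMLParser):
    """Re-emit the HTML, dropping span tags whose class is in classes_to_remove."""

    def __init__(self, classes_to_remove):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.classes_to_remove = set(classes_to_remove)
        self.out = []
        self.keep_stack = []  # one flag per open span: True if it is kept

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            keep = dict(attrs).get("class") not in self.classes_to_remove
            self.keep_stack.append(keep)
            if not keep:
                return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop the flag for the span being closed; skip the tag if it was dropped.
        if tag == "span" and self.keep_stack and not self.keep_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def strip_spans_with_parser(html, classes_to_remove):
    parser = SpanStripper(classes_to_remove)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_spans_with_parser(text, {"this_tag_needs_to_be_removed", "this_needs_to_go_too"})` on the sample from the first post should leave the other spans and all of the text in place.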

Kem_Tekinay · September 27, 2014, 11:58am

I’d agree if Scott were asking for a general HTML parser, but this is a limited-use task, so if the pattern is consistent with his sample text, I can’t see how a RegEx will fail.

Having said that, and while I haven’t used it myself, Tidy looks like it will help too.

Damien_Callaghan · September 28, 2014, 2:30pm

For what it’s worth, I have something I wrote in FoxPro that takes HTML files saved from MS Word (full of junk) and produces clean, ready-to-go HTML code that preserves the formatting I wish to preserve. A significant part of that is matching the closing </span> tag with the correct opening <span tag in a line of code. This is done simply with arrays. Certain of the string functions in FoxPro don’t have Xojo counterparts, but I think they could be reproduced. My code looks at the style/class details of the <span tag and inserts the appropriate HTML code or discards them if they are not required. Obviously, you would have to do your own thing there. If you are interested, I’ll convert it to Xojo (or say what is needed), add some comments to the code and post it. For me, regular expressions get confusing quickly, but there are folk on this forum who are super at them. Regular expressions may well do better than my reproduction of FoxPro functions in Xojo. I do like the Fox for text manipulation.
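The array-based matching described above is not shown in the thread, but one way it might look is sketched below in Python. The sketch splits the text on span tags and keeps an array, used as a stack, that records for each open span whether it was kept; each closing tag then takes its decision from the most recent entry. The name `strip_spans` and the rule of deciding by an exact class value are assumptions for the illustration.

```python
import re

# Capturing group so that re.split returns the tags as well as the text.
SPAN_TAG = re.compile(r'(<span\b[^>]*>|</span>)')
CLASS_ATTR = re.compile(r'class="([^"]*)"')


def strip_spans(html, classes_to_remove):
    out = []
    keep_stack = []  # one flag per currently open span: True if it is kept
    for i, token in enumerate(SPAN_TAG.split(html)):
        if i % 2 == 0:
            # Even slots hold the text between span tags.
            out.append(token)
        elif token == "</span>":
            if not keep_stack:
                raise ValueError("closing </span> without an opening tag")
            if keep_stack.pop():
                out.append(token)
        else:
            match = CLASS_ATTR.search(token)
            keep = match is None or match.group(1) not in classes_to_remove
            keep_stack.append(keep)
            if keep:
                out.append(token)
    if keep_stack:
        raise ValueError("unclosed <span> tag")
    return "".join(out)
```

This makes a single pass over the text, so its cost should grow roughly linearly with file size, regardless of how deeply the spans are nested.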

What I am handling are New Testament translation documents. It takes 5 seconds to do the lot with all processing.

Michel_Bujardet · September 28, 2014, 3:25pm

There are always two approaches when it comes to strings: conceptual (Regex) and sequential (str()).

What’s important is to get the job done.