Remove HTML tags

Constantine_Figueroa · October 1, 2018, 10:41pm

Can anyone refresh my memory on a regex to remove HTML tags from a string produced by another regex match?
Thanks in advance!!

Tim_Parnell · October 1, 2018, 10:49pm

What have you tried that hasn’t worked? RegExRx is a great way to work out your regular expressions before using them in Xojo.

Constantine_Figueroa · October 1, 2018, 11:05pm

I am cycling through the same regex search pattern and end up with a match that has HTML tags around the actual content I need. Those tags could be font color and/or font style, and I don't need them. If you have a reference I can use to add a regex that removes those HTML tags, please let me know. Thanks!

Constantine_Figueroa · October 2, 2018, 12:17am

papa = mtch0.SubExpressionString( 1 )
papa_final_lines = ReplaceAll(papa ,"/<cite\\ .*?<\\/.*?cite>/i","")
papa_final_lines = ReplaceAll(papa_final_lines ,"/<font\\ .*?<\\/.*?font>/i","")

papa is a string found via regex match:

<font color="blue"><cite>NeedToCleanThisTextString</cite></font>

Hi Tim, I think I've forgotten something after a few years of not touching RB/Xojo.
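
One likely reason the two ReplaceAll calls above leave the string unchanged: Xojo's ReplaceAll does plain substring replacement, so the slash-delimited pattern is searched for as literal text and never found. A minimal sketch using the RegEx class instead, assuming papa holds the sample string above (the hard-coded sample and the search pattern here are illustrative, not from the original post):

```xojo
// Sketch only: papa is hard-coded here to stand in for mtch0.SubExpressionString(1).
Dim papa As String = "<font color=""blue""><cite>NeedToCleanThisTextString</cite></font>"

Dim re As New RegEx
re.SearchPattern = "</?(?:font|cite)\b[^>]*>"  // opening or closing font/cite tags
re.ReplacementPattern = ""                      // replace each tag with nothing
re.Options.CaseSensitive = False                // also match <FONT>, <Cite>, ...
re.Options.ReplaceAllMatches = True             // replace every match, not just the first

Dim papa_final_lines As String = re.Replace(papa)
// papa_final_lines is now "NeedToCleanThisTextString"
```

Xojo string literals do not treat the backslash as an escape character, so the pattern is written with single backslashes and without the surrounding /.../i delimiters.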

Maximilian_Tyrtania · October 2, 2018, 5:23am

If using the MBS plugins is an option, check out the RemoveHTMLTagsMBS function at https://www.monkeybreadsoftware.net/string-string-method1.shtml.

Constantine_Figueroa · October 2, 2018, 11:10am

I found this snippet in the old RB forums, where I used to hang out every day. I tried it and it worked like a charm for me:

Function StripHTMLTags(InputStr As String) As String
  Dim R As New RegEx
  R.SearchPattern = "<[^<>]+>"
  R.ReplacementPattern = ""
  Dim S As String = InputStr
  Dim S2 As String = R.Replace(InputStr)
  While (StrComp(S, S2, 0) <> 0)
    S = S2
    S2 = R.Replace()
  Wend
  Return S2
End Function

Thanks, though, to Max for keeping up the MBS plugins. I don't remember exactly, but I think I even purchased your package in 2008 or 2009. Thanks for your enthusiasm and extra care for current and potential paying customers. You are the BEST!!

Maximilian_Tyrtania · October 2, 2018, 1:44pm

Thanks for your kind words, but they should be directed to the author of the plugin, who happens to be @Christian Schmitz, not me.

Oliver_Osswald · October 2, 2018, 3:34pm

This is how I do it:

[code]
Private Function StripHTML(html As String) As String
  Dim re As New RegEx
  re.SearchPattern = "(?:<style.+?>.+?|<script.+?>.+?|<(?:!|/?[a-zA-Z]+).*?/?>)"
  re.ReplacementPattern = ""
  re.Options.ReplaceAllMatches = True

  Dim plain As String = re.Replace(html)

  Return plain
End Function
[/code]
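
For the original question (cleaning a string that itself came out of a regex match), the function above could be wired in roughly as follows. This is only a sketch: the sample source text and the first search pattern are hypothetical, and it assumes StripHTML is reachable from the calling code.

```xojo
// Sketch only: the sample source and the first-pass pattern are hypothetical.
Dim source As String = "<td><font color=""blue""><cite>NeedToCleanThisTextString</cite></font></td>"

Dim finder As New RegEx
finder.SearchPattern = "<td>(.*?)</td>"  // first pass: capture the cell contents

Dim papa_final_lines As String
Dim mtch0 As RegExMatch = finder.Search(source)
If mtch0 <> Nil Then
  // second pass: strip the tags left inside the captured text
  papa_final_lines = StripHTML(mtch0.SubExpressionString(1))
End If
```

Search returns Nil when nothing matches, so the Nil check keeps the second pass from running on a missing match.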

Constantine_Figueroa · November 4, 2018, 6:18am

Thank you so much, Oliver - it works as well

Carlo_Rubini · November 5, 2018, 5:26am

Without using regex, you could convert your HTML file to a text file with textutil (a macOS command-line tool) from a shell. Something like this:

dim sh as new shell
sh.Execute "textutil -convert txt " + myFile.shellPath
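
By default textutil writes the converted file next to the source with a .txt extension. If the text is wanted back in a string instead, a variation along these lines may work. It is a sketch that assumes myFile is a FolderItem pointing at an existing HTML file on macOS, and it relies on textutil's -stdout option:

```xojo
// Sketch only: myFile is assumed to be a valid FolderItem for an HTML file (macOS).
Dim sh As New Shell
sh.Execute("textutil -convert txt -stdout " + myFile.ShellPath)

Dim plain As String
If sh.ErrorCode = 0 Then
  plain = sh.Result  // converted text, read from the command's output
Else
  // textutil reported a failure; show whatever output it produced
  MsgBox("textutil failed: " + sh.Result)
End If
```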