Parsing OpenGraph data in HTML

Jeremie_L · December 8, 2022, 2:12am

I am trying to parse OpenGraph data from multiple HTML pages.

The data in the HTML looks like this:

<html>
<head>
  <meta property="og:type"  content="product">
<meta  property="og:title" content="Product Title">
 <meta property="og:image" content="https://www.example.com" >
<meta property="og:description" content="product description" >
...
</head>

Instead of splitting the HTML and reading each line (the HTML could actually be a single line), is there a regex I can use to extract the “content” attribute for each property: “og:type”, “og:title”, “og:image” and “og:description”?

I was thinking of a pattern like this:

.+og:(.*)"\s+content="(.*)"

But I’m not sure if that is optimal.

Beatrix_Willius · December 8, 2022, 2:40am

Use TidyDocumentMBS instead; see the TidyDocumentMBS class in the Monkeybread Xojo plugin. Regex and HTML don’t mix very well.

Robert_Weaver · December 8, 2022, 8:55pm

From your example html, you’d like to return something like the following?

og:type, product
og:title, Product Title
og:image, https://www.example.com
og:description, product description

To do this, I used some text routines that I’ve created to help me rebuild my old website from the saved html. They don’t use regex, because I’ve never been good at regex. The above example took me a couple of minutes to code with my text routines. If you’re interested, I can post it. But be warned, it is not a true html parser. It works fine parsing machine generated html, but if the pages were human created, there are occasionally spaces stuck in inconvenient places that may cause problems. The tools address some of these issues, but may not catch all of them.

Jeremie_L · December 8, 2022, 9:07pm

Hi Robert,

Yes that’s exactly it.
I would be interested in seeing your parser and comparing its speed against the regex I’m using. On large pages it takes up to 200ms to get the OpenGraph tags.
While acceptable, I’m trying to optimize my web app as much as possible. There can be up to 300 concurrent sessions.

Robert_Weaver · December 8, 2022, 9:37pm

If you’re already using regex, then I’m sure it will be considerably faster than my code, but the code is below. I’ve also made a small project file that includes a couple of other text manipulation routines.

Public Function ExtractOG(rawHTML As String) as string
  dim cleanHTML As String = CompressWhitespace(rawHTML)
  dim head() As String = parseTxtToAry(cleanHTML,"<head>","</head>")
  dim output as String = ""
  if head.Ubound>=0 then
    dim meta() As String = parseTxtToAry(head(0),"<meta",">")
    if meta.Ubound>=0 then
      'At least one meta tag has been found
      for i as Integer = 0 to meta.Ubound
        output = output + parseTxt(meta(i),"property=""","""")+", "+parseTxt(meta(i),"content=""","""")+EndOfLine
      next
    end if
  else
    'This only happens if the head can't be found
  end if
  return output
End Function


Public Function CompressWhitespace(s1 As String) as String
  'Removes duplicate whitespace characters from text.
  'Also trims leading and trailing whitespace.
  'This is not very efficient code, and should be 
  'replaced with more efficient RegEx code.
  dim s As String =s1
  'Convert all whitespace characters into spaces
  s=ReplaceAll(s,chr(9)," ")
  s=ReplaceAll(s,chr(10)," ")
  s=ReplaceAll(s,chr(13)," ")
  dim L As Integer = len(s)+1
  'Replace duplicate spaces with single spaces repeatedly until length stops changing
  While len(s)<>L
    L=len(s)
    s=ReplaceAll(s,"  "," ")
    'These are specific to html
    s=ReplaceAll(s,"< ","<")
    s=ReplaceAll(s," >",">")
  wend
  return trim(s)
End Function


Public Function parseTxt(s As String, delimA As String, delimB As String) as string
  'Returns an EOL-delimited string containing every substring of s,
  'that falls between left delimiter text delimA and right delimiter text delimB.
  dim outList() As String = Split(s,delimA)
  dim hitCount As Integer = UBound(outList)
  if hitCount>-1 then outList.Remove(0)
  hitCount=hitCount-1
  for i as integer = 0 to hitCount
    dim ss() As String = split(outList(i),delimB)
    if UBound(ss)<0 then
      outList(i)=""
    else
      outList(i)=ss(0)
    end if
  next
  'Choose one of the following return options
  return Join(outList,EndOfLine)
  'return outList
End Function


Public Function parseTxtToAry(s As String, delimA As String, delimB As String) as string()
  'Returns a string array containing every substring of s,
  'that falls between left delimiter text delimA and right delimiter text delimB.
  dim outList() As String = Split(s,delimA)
  dim hitCount As Integer = UBound(outList)
  if hitCount>-1 then outList.Remove(0)
  hitCount=hitCount-1
  for i as integer = 0 to hitCount
    dim ss() As String = split(outList(i),delimB)
    if UBound(ss)<0 then
      outList(i)=""
    else
      outList(i)=ss(0)
    end if
  next
  'Choose one of the following return options
  'return Join(outList,EndOfLine)
  return outList
End Function

PhilippeP · December 9, 2022, 1:16pm

I tried your Xojo example with 3 links and got 3 different results: 57 ms, 126 ms and 277 ms.
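Timings like the ones quoted in this thread can be taken with System.Microseconds. The following is only a minimal sketch: it assumes page is a String that already holds the HTML source, and it times the ExtractOG function posted above; any other parser can be swapped in on the marked line.

```xojo
' Minimal timing sketch. "page" is assumed to hold the HTML source already.
Dim startTime As Double = System.Microseconds

Dim result As String = ExtractOG(page) ' the call being measured

' System.Microseconds counts microseconds, so divide by 1000 for milliseconds
Dim elapsedMs As Double = (System.Microseconds - startTime) / 1000
System.DebugLog("Parsed in " + Str(elapsedMs) + " ms")
```

A single run is noisy, so repeating the call in a loop and averaging gives a fairer comparison between parsers.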

PhilippeP · December 9, 2022, 1:27pm

I had never measured my meta parser, so I just did with the same links.
In the same order I got:
10 ms
12 ms
25 ms

I have no idea why, lol.
My parser simply scans the meta array sequentially; I don’t know regex.

PhilippeP · December 9, 2022, 1:32pm

Do you have an example link? That seems like a lot more than the examples above.

PhilippeP · December 9, 2022, 3:49pm

Well, your code seems to do more stuff, and mine is not bulletproof.
One question: why do you remove whitespace?
dim cleanHTML As String = CompressWhitespace(rawHTML)
If it’s useful, can I steal your code?

Jeremie_L · December 9, 2022, 4:06pm

I can’t remember exactly which Regex pattern I was using when it took 200ms.

Now using this pattern

pattern = "<meta.+(og:.*)""\s+content=""(.*)"""

I have the following results for those 3 links:
8ms
9ms
3ms

But that pattern will only work if the HTML uses double-quotes. If the developer goes for something like this, I won’t get any result.

<meta property='og:title'
		content='Title'>

Scott_Griffitts · December 9, 2022, 4:36pm

Try this:

pattern = "<meta\s+property=['""](og:.+?)['""](?s)\s+content=['""](.+?)[""']\s*>"
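One way to use a single pattern like this is to walk through every match and collect the pairs in a Dictionary. This is only a sketch: page is assumed to be a String holding the HTML, and the variable names are illustrative. Note that the pattern still expects property to come directly after <meta and before content.

```xojo
' Sketch: collect every og: property/content pair found in "page".
Dim tags As New Dictionary
Dim reg As New RegEx
reg.SearchPattern = "<meta\s+property=['""](og:.+?)['""](?s)\s+content=['""](.+?)[""']\s*>"

Dim m As RegExMatch = reg.Search(page)
While m <> Nil
  ' Subexpression 1 is the property name, subexpression 2 is the content value
  tags.Value(m.SubExpressionString(1)) = m.SubExpressionString(2).Trim
  ' Calling Search without a target continues after the previous match
  m = reg.Search
Wend

' tags.Lookup("og:title", "") now returns the title, or "" if none was found
```

Because one RegEx object scans the page once, this avoids building a separate pattern per tag.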

DerkJ · December 9, 2022, 5:18pm

If you need this data from a URL call, you could use a URLConnection to query opengraph.io.

The API call:
https://opengraph.io/api/1.1/site/**url_encoded_link**

  { 
    "hybridGraph": { 
      "title": "Google", 
      "description": "Search the world's information...", 
      "image": "http://google.com/images/srpr/logo9w.png", 
      "url": "http://google.com", 
      "type": "site", 
      "site_name": "Google" 
    } , 
    "openGraph": {..},
    "htmlInferred": {..} 
  }
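A call like that could be sketched with URLConnection roughly as follows. Treat the details as assumptions: YOUR_APP_ID is a placeholder, the app_id query parameter is not shown in this post and should be checked against the opengraph.io documentation, and the target URL is just an example.

```xojo
' Sketch only: YOUR_APP_ID is a placeholder and the app_id parameter is an assumption.
Dim target As String = "https://www.example.com"
Dim endpoint As String = "https://opengraph.io/api/1.1/site/" + _
  EncodeURLComponent(target) + "?app_id=YOUR_APP_ID"

Dim conn As New URLConnection
' SendSync blocks until the response arrives or the 30 second timeout expires
Dim response As String = conn.SendSync("GET", endpoint, 30)

' The top-level JSON object is returned as a Dictionary
Dim root As Dictionary = ParseJSON(response)
If root.HasKey("hybridGraph") Then
  Dim hybrid As Dictionary = root.Value("hybridGraph")
  Dim title As String = hybrid.Lookup("title", "")
  Dim imgURL As String = hybrid.Lookup("image", "")
End If
```

SendSync and ParseJSON both raise exceptions on failure, so real code would wrap this in Try/Catch. In a busy web app the asynchronous Send method is probably the better fit.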

Or try these solutions:

Jeremie_L · December 9, 2022, 5:31pm

Thanks @DerkJ, I was using OpenGraph but it was a bit too slow in my tests.
Though I still use it for Amazon URLs because Amazon does not support OpenGraph tags.

But the Regex pattern in the stackoverflow link seems to be perfect for my needs.

Jeremie_L · December 9, 2022, 5:47pm

I slightly updated the regex pattern from here https://stackoverflow.com/a/30778027/1240982

I am now using

//Test regex here: https://regex101.com/r/yEcR9E/1
Dim patterns() as String
patterns.Add "<meta\s[^>]*property=[\""'](og:title)[\""']\s[^>]*content=[\""']([^'^\""]+?)[\""'][^>]*>"
patterns.Add "<meta\s[^>]*property=[\""'](og:image)[\""']\s[^>]*content=[\""']([^'^\""]+?)[\""'][^>]*>"
patterns.Add "<meta\s[^>]*property=[\""'](og:image:url)[\""']\s[^>]*content=[\""']([^'^\""]+?)[\""'][^>]*>"

Timing results for the three links above are now:
1ms
2ms
3ms

And the result looks like this in my Xojo Web App (running in Xojo Cloud, not locally)
[Animation: og_tags_animation]

PhilippeP · December 10, 2022, 6:39am

Great, I guess I now know why I need to learn regex, thanks!
When I have some time, I will write a method and post it somewhere. There are more meta tags than that, so I’ll try to figure out how to add those tags. There is also a keyword list, and I don’t know if that is doable using this.
Would you mind sharing more explicit code? I’m sure there are many who would benefit from it. I wanted to do so, but my code is dirty.

Jeremie_L · December 12, 2022, 3:03pm

Sure, here is my code to extract the OpenGraph title and image url

Dim title as String
Dim imgURL as String

Dim reg As RegEx
Dim myMatch as RegExMatch

if page.Contains("og:") then //page is a String containing the HTML webpage

Dim patterns() as String
patterns.Add "<meta\s[^>]*property=[\""'](og:title)[\""']\s[^>]*content=[\""']([^'^\""]+?)[\""'][^>]*>"
patterns.Add "<meta\s[^>]*property=[\""'](og:image)[\""']\s[^>]*content=[\""']([^'^\""]+?)[\""'][^>]*>"
patterns.Add "<meta\s[^>]*property=[\""'](og:image:url)[\""']\s[^>]*content=[\""']([^'^\""]+?)[\""'][^>]*>"
//Feel free to add more patterns matching twitter:title, twitter:image if necessary

for each pattern in patterns
  
  reg = new RegEx
  reg.SearchPattern = pattern
  reg.Options.Greedy = False
  myMatch = reg.Search(page)
  
  if myMatch <> nil then
    
    if myMatch.SubExpressionCount > 2 then
      Dim og_type As String = myMatch.SubExpressionString(1)
      Dim og_content As String = myMatch.SubExpressionString(2)
      
      Select case og_type
      Case "og:title"
        title = og_content
      Case "og:image"
        imgURL = og_content
      Case "og:image:url"
        imgURL = og_content
      End Select
      
      
      if title.IsEmpty = False and imgURL.IsEmpty = False then
        Exit
      end if
      
    end if
    
  end if
next

End if

//Now do whatever with the title and imageURL

PhilippeP · December 13, 2022, 9:31am

So yesterday morning I gave myself an Advent challenge: make your 3 lines work and learn regex as fast as I can.

I read the Xojo docs and tried some things, but didn’t understand the RegExMatch documentation well,
so I switched to MBS and had the parsing working in no time!
Then I realised it’s not for web,
so I redid it using Xojo. There is no good search example in the docs, so I worked it out by trial and error.
I created a file to test with.
I would have added more tags other than og:, but I’m too busy with stuff unrelated to Xojo; I will update this file later.

You can paste an HTML source page into it.

https://ota.fyi/Xojo/regexHtml.xojo_binary_project.zip

I’ve just added your method as well.

Yesterday I wanted to point out that I can’t find a tag like og:image:url.

I could only find
og:image
og:url

Do you have cases of og:image:url?

(Your code is missing a Dim pattern As String.)

Jeremie_L · December 13, 2022, 11:39am

Yes, it is rare but still occurs in some webpages.

Sorry about that.

PhilippeP · December 14, 2022, 10:20am

I took a bit of time to make a new version with more tags.

Thanks @Jeremie_L @DerkJ @Robert_Weaver @Scott_Griffitts

You can download the source file:

https://ota.fyi/Xojo/regexHtml2.zip

Jeremie_L · December 15, 2022, 12:51pm

Lesson learned: never trust HTML to be well formatted.

The og:image tag on this page contains an end-of-line and some whitespace.
This was causing a JavaScript error in my web app.

I am now using Trim on the OpenGraph content.
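That cleanup step could look something like the sketch below. CleanOGContent is a hypothetical helper name, not something from the code above; it assumes the stray line endings may also sit in the middle of the value, which Trim alone would not touch.

```xojo
' Hypothetical helper: normalize an extracted content value before using it.
Public Function CleanOGContent(raw As String) As String
  ' Turn any line endings inside the value into spaces,
  ' then strip leading and trailing whitespace
  Return raw.ReplaceLineEndings(" ").Trim
End Function
```

If the whitespace only ever appears at the start or end of the value, calling Trim on its own is enough.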