Extract data from docx file

Windows 11 Xojo 2023r2
I have a large docx file with customer names. I need to extract just the customer names from the file. I am not sure where to start: RegEx, ParseJSON, or some other way. Any help is appreciated. Here is a snip from the docx file. Thanks
<option value="1729">1 Source Material Handling, Inc.</option><option value="2804">216 Power Systems Inc.</option><option value="1412">4K Lift</option>

Einhugur has a Microsoft Word plugin that you can use to read the document.

Once you’ve read the contents as plain text, you could use RegEx to find the customer names.

I can copy the info to a plain text file then read it. Just not sure what regex to use to get only the customer names from the long string. Any idea on the Regex part? Thanks

Names can be such a wide variety of things that personally I would look for consistent items around the names. You’re going to have to identify a pattern in the data to use an expression to extract what you’re looking for.

Since you’re working with people’s names, I do not suggest posting a sample set here.

It seems that in my original post, this forum extracted the customer names from the string that I added to the post. I have corrected my post to show the actual string that I am using. The text that I need is between the > and the </ of each option. See below. Thanks
<option value="1729">1 Source Material Handling, Inc.</option><option value="2804">216 Power Systems Inc.</option><option value="1412">4K Lift</option>

Is your source material an entire XML document that you’d be able to parse with the XMLDocument class?

Reference: XMLDocument Documentation

I added a constant named kXML and pasted the text below into the constant. In a button's Pressed event I added Var xml As New XMLDocument(kXML). When I press the button I get an XML error on that line. Is this not valid XML?

<?xml version="1.0" encoding="UTF-8"?>
<option value="1729">1 Source Material Handling Inc</option>
<option value="2804">216 Power Systems Inc</option>
<option value="1412">4K Lift</option>
<option value="2661">4M Iron</option>
<option value="2521">5280 Equipment LLC</option>
<option value="620">A  J Forklift</option>
<option value="864">A B Sheroki Lift Truck Service</option>

If the Xojo XMLDocument hits an error while loading, you don’t get the line where the error is…
You can open the file with another tool, such as Xmplify (the full demo works for 14 days) or BBEdit, to find out where the error is.

I don’t know about Einhugur or MBS XML handling because I don’t own them.

I just used Split and Replace to get the info that I needed from the text. All done now. Thanks

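// When kXML starts with the first <option ...> tag, splitting on ">"
// leaves each name (followed by "</option") at the odd indexes.
Var anArray() As String
Var i As Integer
Var s As String
anArray = kXML.Split(">")
For i = 1 To anArray.LastIndex Step 2
  // Strip the closing-tag fragment and put each name on its own line.
  s = s + anArray(i).Replace("</option", "") + EndOfLine
Next i
TextArea1.Text = s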

this regex would have done it too…

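dim rx as new RegEx
// "" is an escaped quote in a Xojo string literal.
// [^<]* stops at the next tag, so this also works when all options are on one line.
rx.SearchPattern = "(?mi-Us)"">([^<]*)</o"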
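
The lines above only set up the pattern. A fuller sketch of the regex route might look like the following. It assumes kXML holds the option markup posted earlier and that TextArea1 is the same control used in the Split example. It uses [^<]* rather than .* so that a match cannot run past the next tag when every option sits on a single line.

```xojo
// Sketch only: kXML and TextArea1 are taken from the earlier posts.
Var rx As New RegEx
// "" is an escaped quote inside a Xojo string literal, so the
// pattern seen by the engine is: ">([^<]*)</option>
rx.SearchPattern = """>([^<]*)</option>"

Var names() As String
Var match As RegExMatch = rx.Search(kXML)
While match <> Nil
  // Subexpression 1 is the text between > and </option>.
  names.Add(match.SubExpressionString(1))
  // Calling Search with no arguments continues after the previous match.
  match = rx.Search()
Wend

TextArea1.Text = String.FromArray(names, EndOfLine)
```

Because the text is never parsed as XML, any entities in the source (for example &amp;) would come through unchanged and would need replacing afterwards.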

Gary, your XML is not valid.
Add an opening root tag at the beginning (before the first <option …>) and the matching closing tag at the end.

In an XML file only one root node is valid, but the root node can have any number of child nodes.
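
As a rough illustration of the single-root rule, the option elements could be wrapped in one parent element before parsing. This is only a sketch: the root name "customers" is made up (any single wrapper element would do), and it assumes kXML contains just the option elements, without the <?xml ...?> declaration, since a declaration is only allowed at the very start of a document.

```xojo
// Sketch only: "customers" is a hypothetical root name.
// Assumes kXML holds only the <option> elements, with no XML declaration.
Var wrapped As String = "<customers>" + kXML + "</customers>"

Try
  Var xml As New XMLDocument(wrapped)
  Var root As XMLNode = xml.DocumentElement
  Var names() As String

  For i As Integer = 0 To root.ChildCount - 1
    Var node As XMLNode = root.Child(i)
    // Keep only <option> elements that actually contain text.
    If node.Name = "option" And node.FirstChild <> Nil Then
      names.Add(node.FirstChild.Value)
    End If
  Next

  TextArea1.Text = String.FromArray(names, EndOfLine)
Catch e As XMLException
  // Show the parser's message if the text is still rejected.
  MessageBox("XML error: " + e.Message)
End Try
```

A bare & inside a name would also make the text invalid XML; it would have to be written as &amp; before the parser accepts it.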

Thanks for the reply. Someone else sent me the file and that is what I had to work with. I have already extracted the info needed. I will keep your suggestion in mind if this is ever needed again.