Extract data from docx file

Gary_Smith · October 25, 2023, 3:23pm

Windows 11 Xojo 2023r2
I have a large docx file with customer names. I need to extract just the customer names from the file. I am not sure where to start, e.g. RegEx, ParseJSON, or some other way. Any help is appreciated. Here is a snippet from the docx file. Thanks
<option value="1729">1 Source Material Handling, Inc.</option><option value="2804">216 Power Systems Inc.</option><option value="1412">4K Lift</option>

Tim_Parnell · October 25, 2023, 3:26pm

Einhugur has a Microsoft Word plugin that you can use to read the document.

Once you’ve read the contents as plain text, you could use RegEx to find the contact information.

Gary_Smith · October 25, 2023, 3:30pm

I can copy the info to a plain text file then read it. Just not sure what regex to use to get only the customer names from the long string. Any idea on the Regex part? Thanks

Tim_Parnell · October 25, 2023, 3:35pm

Names can be such a wide variety of things that personally I would look for consistent items around the names. You’re going to have to identify a pattern in the data to use an expression to extract what you’re looking for.

Since you’re working with people’s names, I do not suggest posting a sample set here.

Gary_Smith · October 25, 2023, 3:43pm

It seems that in my original post, this forum extracted the customer names from the string that I added to the post. I have corrected my post to show the actual string that I am using. The text that I need is between the > and the </ of each option. See below. Thanks
<option value="1729">1 Source Material Handling, Inc.</option><option value="2804">216 Power Systems Inc.</option><option value="1412">4K Lift</option>

Tim_Parnell · October 25, 2023, 3:46pm

Is your source material an entire XML document that you’d be able to parse with the XMLDocument class?

Reference: XMLDocument Documentation

Gary_Smith · October 25, 2023, 4:53pm

I added a constant named kXML and pasted the text below into the constant. In a button's Pressed event I added: Var xml As New XMLDocument(kXML). When I press the button I get an XML error on that line. Is this not valid XML?

<?xml version="1.0" encoding="UTF-8"?>
<option value="1729">1 Source Material Handling Inc</option>
<option value="2804">216 Power Systems Inc</option>
<option value="1412">4K Lift</option>
<option value="2661">4M Iron</option>
<option value="2521">5280 Equipment LLC</option>
<option value="620">A  J Forklift</option>
<option value="864">A B Sheroki Lift Truck Service</option>

Jean-Yves_Pochez · October 25, 2023, 6:16pm

If the Xojo XMLDocument hits an error while loading, you don't get the line where the error is…
You can open the file with another tool, such as Xmplify (the full demo works for 14 days) or BBEdit, to find out where the error is.

I don't know about Einhugur or MBS XML handling because I don't own them.

Gary_Smith · October 25, 2023, 6:19pm

I just used Split and Replace to get the info that I needed from the text. All done now. Thanks

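var anarray() as string
var i as integer
var s as string
// Split on ">": when the text starts with the first <option ...> tag,
// every odd-numbered element looks like "Customer Name</option"
anarray = kXML.Split(">")
for i = 1 to anarray.LastIndex step 2
  // Remove the trailing "</option" and put each name on its own line
  s = s + anarray(i).Replace("</option", "") + endofline
next i
TextArea1.text = s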

Jean-Yves_Pochez · October 25, 2023, 6:32pm

this regex would have done it too…

dim rx as new RegEx
rx.SearchPattern = "(?mi-Us)"">(.*?)</o"
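
For anyone landing here later, here is a fuller sketch of the RegEx approach. It assumes the text is already in a String (the variable name source and its sample value are only illustrative) and that a TextArea1 exists, as in the Split code above. The pattern is a slightly different one from the post above: [^<]* keeps each match inside a single element, so it should behave the same whether the options are on one line or on separate lines.

```xojo
// Illustrative sample; in practice this would be the text read from the file
Var source As String = "<option value=""1729"">1 Source Material Handling, Inc.</option><option value=""2804"">216 Power Systems Inc.</option><option value=""1412"">4K Lift</option>"

Var rx As New RegEx
// Group 1 captures everything between <option ...> and </option>
rx.SearchPattern = "<option[^>]*>([^<]*)</option>"

Var names() As String
Var match As RegExMatch = rx.Search(source)
While match <> Nil
  names.Add(match.SubExpressionString(1))
  // Calling Search with no arguments continues after the previous match
  match = rx.Search()
Wend

// One customer name per line
TextArea1.Text = String.FromArray(names, EndOfLine)
```

If the option values are needed as well, a second capture group around the value attribute would be one way to get them.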

Antonio_Rinaldi · October 26, 2023, 1:32pm

Gary, your XML is not valid.
Add an opening root tag (for example <options>) at the beginning (before the first <option …>) and the matching closing tag (</options>) at the end.

In an XML file only one root node is valid, but the root node can have any number of child nodes.
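
A minimal sketch of that idea with XMLDocument, assuming the text contains only the <option> elements (an <?xml …?> declaration, if present, has to stay at the very start of the document, so it cannot simply be wrapped along with the rest). The variable name fragment, the sample value, and the wrapper name <options> are placeholders chosen for this sketch, not anything from the original file.

```xojo
// Illustrative sample: the <option> elements only, with no XML declaration
Var fragment As String = "<option value=""1729"">1 Source Material Handling Inc</option><option value=""2804"">216 Power Systems Inc</option><option value=""1412"">4K Lift</option>"

Var names() As String
Try
  // Wrap the fragment so the document has exactly one root node
  Var xml As New XMLDocument("<options>" + fragment + "</options>")
  Var root As XMLNode = xml.DocumentElement

  For i As Integer = 0 To root.ChildCount - 1
    Var node As XMLNode = root.Child(i)
    // Skip anything that is not an <option> element with text inside it
    If node.Name = "option" And node.FirstChild <> Nil Then
      names.Add(node.FirstChild.Value)
    End If
  Next
Catch e As XMLException
  // Raised when the text is still not well-formed XML
  MessageBox("XML error: " + e.Message)
End Try

// One customer name per line
TextArea1.Text = String.FromArray(names, EndOfLine)
```

Compared with Split or RegEx, this has the parser deal with the tags and attributes, but it only works when the whole text is well-formed XML.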

Gary_Smith · October 26, 2023, 6:15pm

Thanks for the reply. Someone else sent me the file and that is what I had to work with. I have already extracted the info needed. I will keep your suggestion in mind if this is ever needed again.