Read and analyze a Word document

Hi everyone,
I need help finding the best way to analyse a basic (letter + logo) Word document.
I have to convert it to very clean, basic HTML tags, meaning inline HTML formatting only.

e.g. when text is bold, wrap the words in a bold tag; the same goes for italic, underline
and the paragraph.

I don't want the whole HTML structure, only clean text returned.


This is an example

<p><b><i>This is an </i>example</b></p>

I thought of getting every character's style and storing the pairs in an array,
then re-assembling the text by analyzing where each style starts and ends,
so I'll get the opening and closing tags for any number of styled characters in a row.
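
For what it's worth, the run-merging idea above could be sketched roughly like this. It is written in Python only to illustrate the logic, not as Xojo code; the (character, styles) input format, the tag order and the name runs_to_html are all assumptions made for the example, and how the styles are read out of the Word file is left open.

```python
from html import escape
from itertools import groupby

# Assumed fixed nesting order, outermost first (hypothetical choice).
TAG_ORDER = ["b", "i", "u"]

def runs_to_html(chars):
    """chars: list of (character, frozenset of style tags) pairs (assumed format)."""
    out = []
    open_tags = []  # stack of currently open tags, outermost first
    # Consecutive characters with the same style set form one run.
    for styles, group in groupby(chars, key=lambda pair: pair[1]):
        text = "".join(ch for ch, _ in group)
        # Keep the outer tags that are still wanted; close everything above them.
        keep = 0
        while keep < len(open_tags) and open_tags[keep] in styles:
            keep += 1
        while len(open_tags) > keep:
            out.append(f"</{open_tags.pop()}>")
        # Open whatever this run needs that is not open yet.
        for tag in TAG_ORDER:
            if tag in styles and tag not in open_tags:
                open_tags.append(tag)
                out.append(f"<{tag}>")
        out.append(escape(text, quote=False))
    # Close anything still open at the end of the paragraph.
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "<p>" + "".join(out) + "</p>"

bold_italic = frozenset({"b", "i"})
bold = frozenset({"b"})
example = [(c, bold_italic) for c in "This is an "] + [(c, bold) for c in "example"]
print(runs_to_html(example))  # <p><b><i>This is an </i>example</b></p>
```

Keeping the open tags on a stack is what guarantees the tags come out properly nested, whatever order the styles start and stop in.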

Any better ideas for getting very clean text back?

Hi Mark,
I'm not sure I understand.
Can I read a Word document and get its text with HTML style tags?
What does it look like: messy or pretty clean?
Even after removing the extra HTML tags, is it going to look like my example?
Because I need these tags, but very lean and without CSS styles.

I've never used Xojo for this kind of task,
so consider me a total newbie.
I'm asking for help to find the best way to accomplish it.

Recently, I had to do something similar, but I was using OpenOffice files rather than MS Word files. After much experimentation, I found that the easiest way to get what I wanted was to have the word processor save the document as HTML, then my Xojo app opens it, scans the HTML and deletes the parts that it doesn’t need. In my case I only needed to save character level formatting, links, list formatting and a couple of other minor things. The rest of the tags were stripped.
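
A rough sketch of that "save as HTML, then strip" step, using a parser with a whitelist of tags instead of regular expressions. It is in Python for illustration only; the set of tags to keep, the mapping of strong/em to b/i and the names TagWhitelist and clean_html are assumptions for the example, not what my app actually does.

```python
from html import escape
from html.parser import HTMLParser

# Tags to keep, mapped to the tag that is written out (assumed whitelist).
KEEP = {"p": "p", "b": "b", "strong": "b", "i": "i", "em": "i", "u": "u"}
# Elements whose text content is dropped entirely.
SKIP_CONTENT = {"head", "style", "script", "title"}

class TagWhitelist(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside head/style/script/title

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            # Attributes (class, style, ...) are deliberately discarded.
            self.out.append(f"<{KEEP[tag]}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{KEEP[tag]}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

def clean_html(source):
    """Return the text of source with only the whitelisted tags left in."""
    parser = TagWhitelist()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)

noisy = ("<html><head><style>p{color:red}</style></head><body>"
         "<p class=MsoNormal style='margin:0'><b><span style='font-size:12pt'>"
         "<i>This is an </i>example</span></b></p></body></html>")
print(clean_html(noisy))  # <p><b><i>This is an </i>example</b></p>
```

Formatting that the word processor expresses only through CSS on a span (rather than a real b, i or u tag) would be lost by this sketch and would need extra handling.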

Thanks Robert.
I've tried saving it as HTML, but there's so much noise around the tags themselves that I can't find the right regex to apply.
That's why I'm looking for another way.
If Xojo is able to recognize the style of each letter, then I'll decide where I want to put my tags.
But I'm open to any suggestion; that's why I posted on this forum.
If everyone says the HTML way is the best,
then I'll go that way.

Did you consider calling pandoc (as an external tool) to do the initial Word-to-HTML conversion?
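
A minimal sketch of what that call could look like, shown in Python only to illustrate the command line; from Xojo the same command would presumably be launched through a Shell object. It assumes pandoc is installed and on the PATH, and the function names and the file name letter.docx are hypothetical.

```python
import subprocess

def build_pandoc_command(docx_path):
    """Build the pandoc command line for a docx to HTML conversion."""
    # Without an output option, pandoc writes the result to standard output.
    return ["pandoc", docx_path, "--from", "docx", "--to", "html"]

def docx_to_html(docx_path):
    """Run pandoc and return the HTML fragment as a string."""
    result = subprocess.run(
        build_pandoc_command(docx_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc reports an error
    )
    return result.stdout

# Usage (needs pandoc and a real file):
# html = docx_to_html("letter.docx")
```

By default pandoc writes an HTML fragment rather than a full page, and its HTML writer uses strong and em for bold and italic, so a small replace pass may still be needed if b and i are required.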

Thank you very much, José María Terry Jiménez.
That's exactly what I need.

You are welcome! Glad to help.