.docx to .html conversion

My app calls for uploading hundreds of small .doc or .docx files and converting them to .html for storage in the database. Does anyone have a solution for converting .doc or .docx to HTML automatically?

I’m sure I’ll also need to strip any stray JavaScript or SQL occurring in these .docs. Any thoughts on that would be much appreciated as well.
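For the stray JavaScript part, here is a rough sketch of what stripping it from converted HTML could look like. It is written in Python with only the standard library (an assumption, since nothing in this thread settles on a language), and the names `ScriptStripper`, `strip_scripts` and the `DROP_WITH_CONTENT` blocklist are made up for illustration. It drops comments and doctypes too, and it is not a complete sanitizer, so a maintained sanitizing library would be the safer choice in production.

```python
from html import escape
from html.parser import HTMLParser

# Elements dropped together with their content (assumed blocklist for this sketch).
DROP_WITH_CONTENT = {"script", "iframe", "object"}


class ScriptStripper(HTMLParser):
    """Rebuilds HTML while dropping script-like elements and inline handlers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def _clean_attrs(self, attrs):
        kept = []
        for name, value in attrs:
            if name.startswith("on"):
                continue  # inline event handler such as onclick
            if value is not None and value.strip().lower().startswith("javascript:"):
                continue  # javascript: URL
            if value is None:
                kept.append(name)
            else:
                kept.append(f'{name}="{escape(value, quote=True)}"')
        return "".join(" " + attr for attr in kept)

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}{self._clean_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_WITH_CONTENT and self.skip_depth == 0:
            self.out.append(f"<{tag}{self._clean_attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def strip_scripts(html_text):
    """Return html_text without script-like elements or inline handlers."""
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Calling `strip_scripts()` on each converted file before it goes into the database would keep the markup an in-app editor needs while removing the obvious script carriers.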

Thanks!
Vince

Why don’t you store the .doc directly in the database?
There must be a web service that does such a conversion, but then you depend on it completely…

I searched the internet and found the following interesting links:

http://www.wikihow.com/Convert-a-Word-Document-to-HTML

https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/

I hope this will help you in converting .doc and .docx files into HTML.

As Yves said, there is nothing special about HTML for database storage.
You can store the DOCX directly if you wish (it’s already a zipped package of XML files).
Any embedded SQL won’t be accidentally run by the database.
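To illustrate why embedded SQL stays inert, here is a minimal sketch in Python using the standard `sqlite3` module. The language, the `documents` table layout and the function names are assumptions for illustration only. The point is that the file content is passed as a bound parameter, so the database treats it purely as data.

```python
import sqlite3


def store_document(conn, name, data):
    """Insert raw file bytes using bound parameters, never string-built SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, body BLOB)"
    )
    conn.execute("INSERT INTO documents (name, body) VALUES (?, ?)", (name, data))
    conn.commit()


def load_document(conn, name):
    """Return the stored bytes for name, or None if there is no such row."""
    row = conn.execute(
        "SELECT body FROM documents WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]


# Content that looks like SQL is stored as-is and never executed.
conn = sqlite3.connect(":memory:")
payload = b"'); DROP TABLE documents; --"
store_document(conn, "suspicious.docx", payload)
```

The risk only appears if the SQL statement is assembled by concatenating the file content into the query string, which is the thing to avoid.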

If you want to obfuscate it before storage, then EncodeBase64 (which turns it into a large text blob) or zip will do that for you.
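As a rough sketch of both options (in Python rather than Xojo, and with hypothetical helper names), Base64 turns the bytes into plain ASCII text at the cost of roughly a third more space, while zlib compression shrinks compressible data. Neither is encryption, and since a .docx is already a zip archive, compressing it again will probably gain little.

```python
import base64
import zlib


def to_text_blob(data: bytes) -> str:
    """Encode raw bytes as ASCII text suitable for a text column."""
    return base64.b64encode(data).decode("ascii")


def from_text_blob(text: str) -> bytes:
    """Reverse to_text_blob."""
    return base64.b64decode(text)


def compress(data: bytes) -> bytes:
    """Compress bytes with zlib at the highest compression level."""
    return zlib.compress(data, 9)


def decompress(data: bytes) -> bytes:
    """Reverse compress."""
    return zlib.decompress(data)
```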

At first I also agreed with Yves, but after some research, the conversion seems very possible.

We have to assume that Vince has his reasons for doing the conversion in his own application. There is a lot of information available on the internet about this subject.

I would even suggest contacting CoffeeCup, which is a very small but very skilled and experienced team when it comes to HTML/HTML5/CSS. Maybe, if they are in an excellent mood, they will give Vince some very useful information.

Storing the whole .docx document in the database will also make it considerably larger after a while. Stripping unnecessary tags will save space and be much more profitable in the long run.
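To give an idea of how much can be stripped, here is a bare-bones sketch in Python (standard library only; the function name `docx_to_bare_html` is made up for illustration). It relies on the fact that a .docx is a zip archive whose main text lives in `word/document.xml`, and it keeps only paragraph text, discarding all formatting, images and tables structure. A real converter would need to handle far more than this.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_bare_html(docx_bytes: bytes) -> str:
    """Return one <p> per non-empty paragraph, with all formatting dropped."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # Uploads are untrusted; a hardened XML parser such as defusedxml
        # may be preferable to ElementTree here.
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in paragraph.iter(W + "t"))
        if text:
            paragraphs.append(f"<p>{escape(text)}</p>")
    return "\n".join(paragraphs)
```

Paragraphs nested inside other paragraphs (for example in text boxes) would be emitted more than once by this simple walk, so treat it as a starting point only.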

The easiest way is not always the best way!

Chris

Use Word for the HTML conversion and use the Office plugin in Xojo. I use Xojo in combination with Word to automagically create PDFs. Word’s built-in PDF converter renders pictures, for example, much better than many of the PDF printers available. I have built a simple queueing app that can handle thousands of PDF requests per hour. Conversion is simply done by calling the Save As routine in Word.

Is HTML needed for display or something? If not, don’t convert the files but store the .doc or .docx directly in the database. I use a STRING type column for that and convert the file to a string. This loads very fast. I have not yet come across any size limitations or performance problems.

If this is running on a Mac, you can use Cocoa class functions to convert .doc to other formats.
See the NSAttributedStringMBS class.

https://www.monkeybreadsoftware.net/class-nsattributedstringmbs.shtml

and I think I could just add methods there to make PDF.

As Alexander mentioned, Microsoft Word has a built in function for converting to htm. Just add the free Xojo plugin MSOfficeAutomation into your “C:\Program Files (x86)\Xojo\Xojo 2015r4\Plugins” directory and all the files can be converted. Here is code from Example 2-9 in the Word book to save a Word document file to htm.

[code] 'Make a new document
Dim word As New WordApplication
word.Documents.Add
word.Visible = True

'Add some example text
word.Selection.TypeText "This is saved as a HTM file from Word 2013. Lorem ipsum dolor sit amet."

'Save As an htm file (8 = wdFormatHTML)
word.ActiveDocument.SaveAs2("C:\test\HTMLDoc.htm", 8)

'Catch errors
Exception err As NilObjectException
MsgBox err.Message[/code]

As a web project, the conversion could only take place on the web server if it runs Windows, assuming you can use the COM control in a web app.
Otherwise it has to take place on a client running Windows: the user would have to save the .docx as HTML and then upload that.

There are lots of online services, free and paid.
The free ones tend to limit document size and the number you can process per session / day.
There is also a bunch of non-Xojo software for doing this (Java, PHP, C#).
Depending on what server you’re planning on running this on, one or more may be an option.
Some are open source and some are paid.
http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/
https://angelozerr.wordpress.com/2012/12/06/how-to-convert-docxodt-to-pdfhtml-with-java/
http://www.docx4java.org/downloads.html
https://code.google.com/p/jodconverter/
https://code.google.com/p/xdocreport/wiki/Converters
http://developers.itextpdf.com/itext-java

Thanks for all the replies!

I would like to store them as .html in the database so that their contents can be integrated into the app and the users can edit them using an in-app, WYSIWYG HTML editor.

My users are running PCs and Macs, and my hosting will most likely be on Linux, so that eliminates MSOfficeAutomation, right?

If that’s the case then I’ll look closely at the PHP/Java/third-party converters some of you suggested.

Yes