Faster XML DOM?

Curious. We’re running into some performance issues parsing and processing large XML data sets using the Xojo XML DOM.
Is there something out there that’s faster?

Thanks
Gary

Don’t use that - use the XMLReader, which doesn’t require loading the entire XML into memory and can be really fast
It’s event driven & I’ve used it to load the old iTunes XML file that was 180 MB in about 2 minutes

Thanks Norman

So I’m using NodeLists, is that still going to work?

No

The reader is entirely different and you get different events for different nodes.
It puts more work on you but because you don’t have to load the entire document you can load larger documents.
It DOES really force you to rethink HOW you process an XML document and YOU have to track states etc.
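Roughly, the shape is this (a sketch only; the event signatures are from memory, so check them against the XmlReader docs): you subclass XmlReader, add your state as properties, and implement the events.

[code]// Sketch: event handlers on a hypothetical XmlReader subclass
// (added in the IDE). mCurrentName and mBuffer are String
// properties you add yourself to track parser state.

Sub StartElement(name As String, attributeList As XmlAttributeList)
  mCurrentName = name  // remember where we are; the reader won't
  mBuffer = ""         // text for this element accumulates here
End Sub

Sub Characters(s As String)
  // Text can arrive in several chunks per element, so append.
  mBuffer = mBuffer + s
End Sub

Sub EndElement(name As String)
  // The element is complete; mBuffer holds its text content.
  // Do your processing here, then reset whatever state you keep.
  mBuffer = ""
End Sub[/code]

You then feed the XML to the subclass’s Parse method and the events fire as the parser streams through the document.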

The XML Reader is a completely different beast. I found the learning curve to be somewhat painful. BUT, once you figure it out it is fast.

Thanks guys. I’ve never used this, so I’ll need some examples. Are there any out there?

How big is the document?

The XMLReader will be faster for just about any size document BUT at the expense of being able to walk the DOM

This is based on experience with a 150 MB iTunes library file
The XML Document simply barfed trying to load it
The XML Reader sped through it

You are right Norman.
But I was asking about Gary’s document size.

With big documents it’s a matter of finding the balance between loading speed and the kind of work to be done.
XML DOM is more flexible than XML Reader and has more features (XQL, for instance), but it becomes very slow as the document gets larger.
XML Reader is faster but also more rigid; it’s the only choice with very big documents.

With medium-large documents it’s a matter of the work to be done. If you have to run many “queries” against the document, DOM is better (but you have to wait for the load time); if you only need to read the data and transfer it into your software, Reader is much better.

So below is my XML “walk” routine that runs on an adoXML string that could be HUGE. Sometimes upwards of 80 MB.

Yeah, I know that’s a lot… but the reality is, I don’t have much control over how many “customers” are going to come back from the query to the database. I’ve thought about breaking this operation up and splitting the adoXML into single-customer XML results – but that would require multiple hits to the database to get the result, or taking the XML result and parsing it up into individual XML result sets… and both operations are costly.

So, a faster method would be a better way to go, as I think the DOM is just slowing me down. XMLReader sounds interesting, but how the hell do I do something like this?

xmldoc.LoadXml (adoXML)
Dim nodelist As XMLNodeList
Dim count,i as Integer
nodelist = xmldoc.Xql("//z:row/@custnum")
SendToLog ("Start XML DOM Customers")
if nodelist <> nil then
  count = nodelist.Length
  for i = 0 to count - 1
    try
      Dim node as XmlNode
      Dim Customer as New Dictionary
      CustNum =Val(xmldoc.xql("//z:row/@custnum").Item(i).Value)
      Customer.Value ("alt_customer_id") = CustNum
      Customer.Value ("lastname") = xmldoc.xql("//z:row/@lastname").Item(i).Value
      Customer.Value ("firstname") = xmldoc.xql("//z:row/@firstname").Item(i).Value
      Customer.Value ("email") = Trim (xmldoc.xql("//z:row/@email").Item(i).Value)
             
      node = xmldoc.Xql ("//z:row").Item (i)
      xml = node.ToString
      Dim xmlDocCust as New XmlDocument
      xmlDocCust.LoadXml (xml)
      xslt = Me.GetProductImportTemplate (Me.CustomerAdapterName)
      // transform customer xml.
       xml = xmlDocCust.Transform(xslt)
      
      Customer.Value ("xml")  = xml

      // add the customer's record to the SQLite database.

      if Me.AddCustomerToCache (Customer) then
        // increase total customer count.
        nTotalCustomers = nTotalCustomers + 1
      end if
    
    catch err as XmlException
      OutputToUser ("Error: " + err.Message)
      return false
    end try
  next i
  nRecord  = App.Controller.Config.SyncCustomersMax
end if
// end.

Maybe it would be faster and easier to extract the data into its own mini-SQLite database rather than using XML?

Since the result is consistently formed you can

  1. when you get the node that is a Customer you could create a new customer instance (if you have such a thing) & then store the various tags into it as you receive them. You’d then have to write code to mimic the xql queries

  2. you could instead create an in memory sqlite db (or on disk) & do similar to 1 and put the data in the db tables as you get it
    Then everything is just db queries locally

I did something like this for huge iTunes xml files and loaded it into a sqlite db then used the DB
Way faster
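As a sketch of option 2, using the classic Xojo database classes (the table and column names are made up for illustration, and OutputToUser is borrowed from the code above):

[code]// Hypothetical sketch: cache parsed rows in an in-memory SQLite db.
Dim db As New SQLiteDatabase  // no DatabaseFile set = in-memory

If Not db.Connect Then
  OutputToUser("Error: " + db.ErrorMessage)
  Return
End If

db.SQLExecute("CREATE TABLE customers (custnum INTEGER, lastname TEXT, firstname TEXT, email TEXT)")

// As each customer arrives from the reader, insert it:
Dim rec As New DatabaseRecord
rec.IntegerColumn("custnum") = 12345
rec.Column("lastname") = "Smith"
rec.Column("firstname") = "Jane"
rec.Column("email") = "jane@example.com"
db.InsertRecord("customers", rec)
db.Commit

// After the parse, everything is a fast local query:
Dim rs As RecordSet = db.SQLSelect("SELECT * FROM customers ORDER BY lastname")[/code]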

Thanks for the reply.

The issue is this. The adoXML is just that: XML from an ADO recordset off a SQL Server.

So I’m doing the COM access through ADO via Xojo. The recordset returned COULD be jammed into a SQLite database,
but that’s still going to have to loop around 77,000 rows and create it. I’m not sure that’s going to be a performance benefit
over ADO persisting the result set to ADO XML format and then grabbing that and running through it – as you see from the above code.

Would be awesome if ADO allowed a persist to SQLite. Then my job would be easy. Nah, that’s too easy.

So let me try these suggestions.

But you don’t think XMLReader would help me here? I’m also struggling to find an example code base to pull from.
Anybody have a nice example of XMLReader?

Thanks again
Gary
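One possible shape for the adoXML case: ADO persists each record’s fields as attributes on a z:row element, so a single event sees a whole row. A sketch only (event and method names from memory, so check the docs):

[code]// Hypothetical StartElement handler on an XmlReader subclass for
// ADO-persisted XML, where each record is a <z:row .../> element
// whose fields are attributes.

Sub StartElement(name As String, attributeList As XmlAttributeList)
  If name = "z:row" Then
    Dim Customer As New Dictionary
    Customer.Value("alt_customer_id") = Val(attributeList.Value("custnum"))
    Customer.Value("lastname") = attributeList.Value("lastname")
    Customer.Value("firstname") = attributeList.Value("firstname")
    Customer.Value("email") = Trim(attributeList.Value("email"))
    // Hand the finished row off, e.g. to AddCustomerToCache, and
    // run the per-customer XSLT here if you still need it.
  End If
End Sub[/code]

No state tracking is needed here because everything lives in the attributes of a single element.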

loading the XML into an XMLDocument via

xmldoc.LoadXml (adoXML)

is slow esp with very large documents

77,000 rows isn’t that big - the XML file I did this with started out as 150 MB and had 200,000+ “rows”
But you do have to parse the data in a way you didn’t have to using the XMLDocument

Well, it’s 77,000 rows, but about 80 MB of data… and I would agree, 77K is not that much.

So the LoadXml isn’t that bad; it’s the XQL and node walking that’s killing me.

I don’t mind implementing an XMLReader version of this, but I’m sort of looking at the class and scratching my head.

Another thing to keep in mind about XMLReader is that it does not include an XMLWriter. If you need to create or modify XML, the DOM-based classes or direct string operations are needed. As for examples, they tend to be written in Java, but they should be out there on the 'net–noting that the technology is known generically as SAX.

[quote=44274:@Gary MacDougall]Well, it’s 77,000 rows, but about 80 MB of data… and I would agree, 77K is not that much.

So the LoadXml isn’t that bad; it’s the XQL and node walking that’s killing me.

I don’t mind implementing an XMLReader version of this, but I’m sort of looking at the class and scratching my head.[/quote]

well the queries in a db would be a lot quicker

77K rows is not too much.
Probably it’s the XQL that’s killing performance.
You could try to log execution time to understand where you lose performance.

I don’t know the structure of your document (so I can’t suggest better navigation), however reading your code I think you’re running the same XQL too many times.

Here you are requesting all the z:row nodes that have a custnum attribute.
Then you repeat the same XQL over and over to get the value and the other info.

Maybe (but I repeat I don’t know the structure of your document)
you could do:

[code]nodelist = xmldoc.Xql("//z:row")
count = nodelist.Length - 1
for i = 0 to count
  try
    Dim node as XmlNode
    Dim Customer as New Dictionary
    node = nodelist.Item(i)
    CustNum = Val(node.GetAttribute("custnum"))
    Customer.Value ("alt_customer_id") = CustNum
    Customer.Value ("lastname") = node.GetAttribute("lastname")
    Customer.Value ("firstname") = node.GetAttribute("firstname")
    Customer.Value ("email") = Trim (node.GetAttribute("email"))

    xml = node.ToString
    Dim xmlDocCust as New XmlDocument
    xmlDocCust.LoadXml (xml) // Maybe importing a Clone(true) of the node is faster...
    xslt = Me.GetProductImportTemplate (Me.CustomerAdapterName)

…[/code]