Best approach for handling collections of strings

I have an application that opens a large spreadsheet, then reads in a row with 20 cells, and does a number of data transformations on each cell, converting them into RDF triples (basically, formatted strings) on the way. A row of 20 cells ends up as around 200 separate triples. Finally the collection of triples for that row gets written out to a text file, with each line having one triple. Here is a typical example of a triple (in N-triples format, FWIW).

<https://vocabulary.xyz.com/XYZCommonStructure/c621d8e7-629c-d93d-5dc0-eddfa74a1731> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .

I’m OK with all of the steps necessary to do the transformations and file writing. Where I could do with some advice is the most effective way to store the growing set of triples for a row before I write them all out to the file. I’m reading in each spreadsheet row as an array of strings. While I’ve been developing the conversion code, I’ve been opening a text file, writing out an individual triple and then closing it again. This is obviously clunky code, but I’m also concerned that when I run this on a large scale (5000 rows of 20 columns, producing probably 500000 triples), opening, writing and closing a text file 500000 times is not going to perform well at all.

Would it be best to simply write out an array for each row as well? That way I’d open and close the file once for each row, but that’s still 5000 open, write and close steps.

Is it worth considering a 2-D array, and if so, can an application cope with creating a 20 x 5000 array of RDF triples?

Or is there another way to do this?

Any suggestions welcome.

“I’ve been opening a text file, writing out an individual triple and then closing it again”

But why do you close the file every time (which is very time-consuming)? You can open the output file once, process your data in different methods or even threads, and close the file when you are done.
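
To make the suggestion concrete, here is a minimal sketch in Python (the thread does not say which language the application uses, so treat the language choice as an assumption). It opens the output once, builds a small list of triples per row, writes that list in one call, and closes the file when all rows are done. The names `row_to_triples` and `write_triples`, the `example.org` IRIs and the demo data are all hypothetical placeholders, not the real conversion code.

```python
import os
import tempfile
from typing import Iterable, List


def row_to_triples(row: List[str]) -> List[str]:
    """Hypothetical stand-in for the real conversion: one triple per cell.

    The IRIs are placeholders, and literal escaping is omitted for brevity.
    """
    subject = "<https://example.org/row>"
    predicate = "<https://example.org/hasCell>"
    return [f'{subject} {predicate} "{cell}" .' for cell in row]


def write_triples(rows: Iterable[List[str]], path: str) -> int:
    """Open the output once, write each row's triples as a batch, close once."""
    written = 0
    # The file stays open for the whole run; 'with' closes it at the end.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for row in rows:
            triples = row_to_triples(row)  # small per-row list, not a 2-D array
            out.writelines(t + "\n" for t in triples)
            written += len(triples)
    return written


if __name__ == "__main__":
    demo_rows = [["a", "b"], ["c", "d"]]  # stands in for spreadsheet rows
    target = os.path.join(tempfile.gettempdir(), "triples_demo.nt")
    print(write_triples(demo_rows, target), "triples written to", target)
```

With this shape only one row's worth of triples is held in memory at a time, so there is no need for a 2-D array of the whole spreadsheet, and the file object's own buffering takes care of batching the actual disk writes.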
