I have an application that opens a large spreadsheet, reads in a row of 20 cells, and performs a number of data transformations on each cell, converting it into RDF triples (basically, formatted strings) along the way. A row of 20 cells ends up as around 200 separate triples. Finally, the collection of triples for that row gets written out to a text file, one triple per line. Here is a typical example of a triple (in N-triples format, FWIW).
I’m OK with all of the steps needed for the transformations and the file writing. Where I could do with some advice is the most effective way to store the growing set of triples for a row before I write them all out to the file. I’m reading each spreadsheet row in as an array of strings; while I’ve been developing the conversion code I’ve been opening a text file, writing out a single triple, and then closing it again. This is obviously clunky code, but I’m also concerned that when I run it at full scale (5000 rows of 20 columns, at ~200 triples per row, so around a million triples) opening, writing and closing a text file a million times is not going to be at all performant.
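To make the question concrete, here’s a stripped-down sketch in Python (my app isn’t necessarily Python, it’s just for illustration) of the per-triple open/write/close I’m doing now, next to the obvious alternative of keeping one file handle open for the whole run. `cell_to_triples` and the URIs are placeholders, not my real conversion code:

```python
def cell_to_triples(cell, row_id, col_id):
    # Stand-in for the real conversion: each cell becomes ~10 triples.
    subject = f"<urn:example:row{row_id}>"
    return [f'{subject} <urn:example:col{col_id}_p{i}> "{cell}" .'
            for i in range(10)]

def write_per_triple(rows, path):
    """What I'm doing now: reopen the file for every single triple."""
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            for triple in cell_to_triples(cell, r, c):
                with open(path, "a", encoding="utf-8") as f:  # open/close per triple
                    f.write(triple + "\n")

def write_one_handle(rows, path):
    """Alternative: open once and let buffered I/O batch the writes."""
    with open(path, "w", encoding="utf-8") as f:  # one open/close total
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                for triple in cell_to_triples(cell, r, c):
                    f.write(triple + "\n")
```

Both produce identical output; the second does a single open and close for the entire run instead of one per triple.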
Would it be best to simply write out an array for each row instead? That way I’d open and close the file once per row, but that’s still 5000 open/write/close cycles.
Is it worth considering a 2-D array, and if so, can an application cope with creating a 20 x 5000 array of RDF triples?
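On the 2-D array question: since each triple ends up as a short string anyway, I suspect a flat list per row (or one big list) may be simpler than a true 2-D array. 5000 rows x ~200 triples is on the order of a million strings; at roughly 100 bytes each that’s in the region of 100 MB, which most runtimes should cope with. A sketch of buffering a whole row and writing it in one call (`cell_to_triples` is again a placeholder for my real conversion):

```python
def cell_to_triples(cell, row_id, col_id):
    # Stand-in for the real conversion: each cell becomes ~10 triples.
    subject = f"<urn:example:row{row_id}>"
    return [f'{subject} <urn:example:col{col_id}_p{i}> "{cell}" .'
            for i in range(10)]

def write_row_buffered(rows, path):
    """Collect each row's triples in a list, then write the row in one call."""
    with open(path, "w", encoding="utf-8") as f:
        for r, row in enumerate(rows):
            row_triples = []
            for c, cell in enumerate(row):
                row_triples.extend(cell_to_triples(cell, r, c))
            # One writelines call per row instead of one write per triple.
            f.writelines(t + "\n" for t in row_triples)
```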
Or is there another way to do this?
Any suggestions welcome.