Manipulating a pipe-delimited string when the last field may contain a pipe character

Tim_Jones · June 10, 2016, 8:41pm

Hi Folks,

Does anyone have a shareable clipping that would look at a fixed field count string and take into account that the fields after a set number (6 in this case) should be processed as one field even if it contains the separator?

For example:

VL:c|1805918208|2|6433044|147428|/Volumes/Xsan/NEGATIVE/01_TIMELAPES/01_SKI11_TLapse/01_JPG/SKI11t|WhistlerRICK042_11_04_17/April17-9246.jpg

In this case, the 6th field causes CountFields and NthFieldB to see 7 fields because of the | character in the segment path name.

I guess that I could SplitB the line into an array and then join any array elements from index 5 onward, restoring the “|” character between the joined members.

Any other ideas?

Norman_P · June 10, 2016, 8:46pm

Don’t use | as the separator if it can exist in a field, as you don’t have a reliable way of knowing which usage you’re looking at.

Or is this ALWAYS 6, and the only place it can exist is in the last field?
If not, I fear you’re SOL.

IF it could only exist in the last one, get fields 1 - 5, then grab everything past the end of field 5 into field 6.
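
The "grab everything past the end of field 5" idea could be sketched roughly as follows. This is an untested illustration in classic Xojo syntax, not anyone's posted code: `SplitFixed` and its parameter names are hypothetical. It walks the first five separators with InStrB and leaves the rest of the line, bars included, as the last field.

```xojo
// Hypothetical helper: returns at most fieldCount fields. Separators beyond
// the first (fieldCount - 1) are left untouched inside the last field.
Function SplitFixed(theLine As String, sep As String, fieldCount As Integer) As String()
  Dim fields() As String
  Dim startPos As Integer = 1 // byte offsets are 1-based

  For i As Integer = 1 To fieldCount - 1
    Dim sepPos As Integer = InStrB(startPos, theLine, sep)
    If sepPos = 0 Then Exit For // fewer separators than expected
    fields.Append(MidB(theLine, startPos, sepPos - startPos))
    startPos = sepPos + LenB(sep)
  Next

  // Everything after the last consumed separator is the final field
  fields.Append(MidB(theLine, startPos))
  Return fields
End Function
```

Called as `SplitFixed(theLine, "|", 6)` on the sample line above, this should leave the whole path, including its embedded bar, in element 5.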

Tim_Jones · June 10, 2016, 8:53pm

It’s always 6 fields and we use | because the remainder of the potential separators are all valid characters in file names.

We’re still trying to determine how the users are able to add the | character as part of a path element, but that’s Apple …

One other thought would be a MidB InStrB replace of each of the first 5 ‘|’ characters with Chr(0) and then splitB on Chr(0).

Testing ideas right now.

Michel_Bujardet · June 10, 2016, 8:59pm

You can have separators with more than one character. Two semicolons in succession usually never appear in normal text, for instance.

Tim_Jones · June 10, 2016, 9:01pm

@Michel Bujardet - true, but we’re not always in control of what gets input here, so those types of modifications at the generation point aren’t always in our control.

Norman_P · June 10, 2016, 9:04pm

[quote=271337:@Tim Jones]It’s always 6 fields and we use | because the remainder of the potential separators are all valid characters in file names.
[/quote]
I tend to use low control characters, which aren’t legal in most file names or simple for users to get in there, but can be read from a text or binary file.

[quote=271337:@Tim Jones]We’re still trying to determine how the users are able to add the | character as part of a path element, but that’s Apple …

One other thought would be a MidB InStrB replace of each of the first 5 ‘|’ characters with Chr(0) and then splitB on Chr(0).

Testing ideas right now.[/quote]
Something like that.
We do that sort of thing for the IDE reading a VCP manifest.

Tim_Jones · June 10, 2016, 9:09pm

Okay, so using SplitB and then joining each member from theArray(5) onward with the “|” character seems fastest on large sets (1,000,000+ lines).

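theArray = SplitB(theLine, "|")
If theArray.Ubound >= 5 Then // indexes 0 - 5 = 6 fields
  thePath = theArray(5) // start with the 6th field so there is no leading "|"
  For x = 6 To theArray.Ubound
    // put back each "|" that SplitB removed from the path
    thePath = thePath + "|" + theArray(x)
  Next
End If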

Tested on both 64bit and 32bit runs, this is faster than replacing the first 5 “|” characters with Chr(0).

Kem_Tekinay · June 11, 2016, 2:40am

I can give you a regular expression for this too, if you’d like.

Tim_Jones · June 11, 2016, 6:33pm

Thanks, @Kem Tekinay - here’s the one that I was using that was sometimes dropping the characters after the 7th “|” character -

^VL:.\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|" + theEscapedPath + "[^/\r\n]+(?:/?)$

I have verified that the “theEscapedPath” value is properly shell-escaped in each case.

If I run that with a sample of the expanded string against 1000 or so lines from a log in RegExRX, it works, but for some reason we are witnessing isolated instances where it’s not getting the right answers based on a manual parse of the log. Is it possible that the RegEx engine in Xojo is barfing on tens of millions of lines?

Kem_Tekinay · June 12, 2016, 1:52am

I’d use the RegEx as a substitute for Split and apply it one line at a time.

rx.SearchPattern = "^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|(.*)"

That will split each line by the bar into six parts. It won’t matter if the last part is all bars. The code would be something like this:

while not t.EOF
  dim oneLine as string = t.ReadLine
  dim match as RegExMatch = rx.Search( oneLine )
  if match is nil then
    exit while // stop at the first line that does not match
  end if

  // Copy groups 1-6 into a zero-based array (group 0 is the whole match)
  dim parts() as string
  redim parts( 5 )
  for i as integer = 1 to match.SubExpressionCount - 1
    parts( i - 1 ) = match.SubExpressionString( i )
  next

  // Do something with parts
wend

(Untested pseudo-code.)

I have no idea if this is faster or better than what you’re doing, I merely offer it as an alternative.
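
To illustrate what the six groups capture, here is a small sketch that applies the same pattern to the sample line from the first post. It is untested and assumes the classic Xojo RegEx and RegExMatch classes; the variable names `sample` and `recordType` are made up for the illustration.

```xojo
Dim rx As New RegEx
rx.SearchPattern = "^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|(.*)"

Dim sample As String = "VL:c|1805918208|2|6433044|147428|/Volumes/Xsan/NEGATIVE/01_TIMELAPES/01_SKI11_TLapse/01_JPG/SKI11t|WhistlerRICK042_11_04_17/April17-9246.jpg"
Dim match As RegExMatch = rx.Search(sample)

If match <> Nil Then
  // Group 0 is the whole match; groups 1-5 are the bar-free fields
  Dim recordType As String = match.SubExpressionString(1) // "VL:c"
  // Group 6 takes the rest of the line, including the bar inside the path
  Dim thePath As String = match.SubExpressionString(6)
End If
```

Because the first five groups use `[^|]*`, they cannot run past a bar, so only the final `(.*)` group can absorb one.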

Tim_Jones · June 12, 2016, 4:29pm

Hi Kem,

A curiosity between RegExRX and Xojo

For example - this

^VL:.\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|/Volumes/Xsan/DIGITAL NEGATIVE/01_TIMELAPES/01_SKI11_TLapse/01_JPG/SKI11tl|WhistlerRICK042_11_04_17/(.*)

returns the matched line as field $0:

VL:c|1805924352|2|6403952|147431|/Volumes/Xsan/DIGITAL NEGATIVE/01_TIMELAPES/01_SKI11_TLapse/01_JPG/SKI11tl|WhistlerRICK042_11_04_17/April17-9247.jpg

and the filename as field $1

April17-9247.jpg

in RegExRX, but in the returned data from the Xojo RegEx, SubExpressionString(1) is always empty.

Also, it doesn’t appear that Xojo’s debugger allows us to examine the resulting RegEx results… (old version of Xojo - the project opened in 13r3.3 this morning)

Kem_Tekinay · June 12, 2016, 5:33pm

You forgot to put the backslash before the bar in your pattern, so it’s acting as an alternation. In the first match, there is no group because it never executes the alternate pattern. RegExRX is showing the same thing. The pattern you intended is:

^VL:.\|\d+\|[[:xdigit:]]+\|\d+\|-?\d+\|/Volumes/Xsan/DIGITAL NEGATIVE/01_TIMELAPES/01_SKI11_TLapse/01_JPG/SKI11tl\|WhistlerRICK042_11_04_17/(.*)
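
Since an unescaped bar inside the literal path changes the meaning of the whole pattern, one option is to escape the regex metacharacters in the path before concatenating it into the pattern. The thread does not show how theEscapedPath is built, so the helper below is only a hypothetical, untested sketch (`EscapeForRegEx` is an invented name). It prefixes each common PCRE metacharacter with a backslash.

```xojo
// Hypothetical helper: makes a literal string safe to embed in a RegEx pattern.
Function EscapeForRegEx(literal As String) As String
  // The backslash must come first so the backslashes added for the
  // other characters are not escaped a second time.
  Dim metaChars() As String = Array("\", "^", "$", ".", "|", "?", "*", "+", "(", ")", "[", "]", "{", "}")

  Dim result As String = literal
  For Each ch As String In metaChars
    result = ReplaceAll(result, ch, "\" + ch)
  Next
  Return result
End Function
```

With something like this, a path containing `SKI11tl|Whistler` would be embedded as `SKI11tl\|Whistler`, so the bar is matched literally instead of splitting the pattern into two alternatives.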