Splitting a string by a couple of different delimiters

Perry_Paolantonio · January 29, 2019, 11:26pm

I need to extract information from the file names of large image sequences. Ultimately, what I want to get from the file name is this:

Original File Name: “FIFTH_AVENUE_NEW_YORK_30FPS_15489_FLAT_SCAN_00064803.dpx”

seqFirstFile as text = “FIFTH_AVENUE_NEW_YORK_30FPS_15489_FLAT_SCAN_00064803”
seqBase as text = “FIFTH_AVENUE_NEW_YORK_30FPS_15489_FLAT_SCAN_”
seqFrameNum as text = “00064803”

(The last set of numbers is the file’s position in the sequence. I have it as text and not an int to preserve the padding zeros at the beginning of the number.)

I’d like to minimize the number of times I need to iterate over this string, but more importantly, there’s some variability in how the string is constructed. As a general rule, it’s a safe assumption that the extension can be removed by looking for the “.” – though there are some older systems that used dots in the file name, they’re not very common any more. The number of the file in the sequence can be an arbitrary number of digits. Here it’s 8 but it could be 6, it could be 12, it could be variable, in that the number might have been assigned without any zero padding at the beginning. The file could use dashes instead of underscores, or in some cases it might not have a delimiter between the last bit of text and the sequence number.

What’s the best way to quickly look at this and populate the three variables listed above?

Tim_Jones · January 30, 2019, 12:21am

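[code]Dim theSplits(-1) As String

// Everything before the first dot is the name without its extension.
seqFirstFile = NthFieldB(theFileName, ".", 1)
// Split on underscores; the last field is the frame number.
theSplits = Split(seqFirstFile, "_")
seqFrameNum = theSplits.Pop
// Join leaves off the trailing delimiter, so add it back to match the requested seqBase.
seqBase = Join(theSplits, "_") + "_"
[/code]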

Daniel_Taylor · January 30, 2019, 1:00am

If the file name will always end with FrameNumber.Type then I would scan right to left using a loop and getting one character at a time with Mid. At the first dot you have your type. At the first digit you have the end of the frame number (remember right to left), and at the last digit you have the beginning.

With those indexes you can then use Mid to split the string up.
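A minimal sketch of that right-to-left scan, assuming the classic string functions (Len, Mid, Left, Asc). The function name ParseSequenceName and its ByRef parameters are hypothetical, chosen only for illustration, and only ASCII digits are treated as part of the frame number.

```xojo
' Hypothetical helper: the name and ByRef parameters are illustrative only.
' Returns False when the name (minus extension) does not end in a digit.
Function ParseSequenceName(fileName As String, ByRef seqFirstFile As String, ByRef seqBase As String, ByRef seqFrameNum As String) As Boolean
  ' Scan right to left for the last dot. With no dot, the whole name is the stem.
  Dim stemLen As Integer = Len(fileName)
  For i As Integer = Len(fileName) DownTo 1
    If Mid(fileName, i, 1) = "." Then
      stemLen = i - 1
      Exit For
    End If
  Next

  ' Keep walking left while the characters are ASCII digits (codes 48 to 57).
  Dim firstDigit As Integer = stemLen + 1
  While firstDigit > 1
    Dim code As Integer = Asc(Mid(fileName, firstDigit - 1, 1))
    If code < 48 Or code > 57 Then Exit While
    firstDigit = firstDigit - 1
  Wend

  ' No digits were found at the end of the stem.
  If firstDigit > stemLen Then Return False

  seqFirstFile = Left(fileName, stemLen)
  seqBase = Left(fileName, firstDigit - 1)
  seqFrameNum = Mid(fileName, firstDigit, stemLen - firstDigit + 1)
  Return True
End Function
```

Because the scan never looks for a delimiter, it should behave the same whether the name uses underscores, dashes, or nothing at all before the frame number.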

DaveS · January 30, 2019, 1:09am

dim v() as string = split(replaceAll(theString,sep1,sep2),sep2)

where

theString is what you want to split
sep1 is one of the delimiters
sep2 is the other
v() is the results
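As a small illustration of that one-liner, assuming the two delimiters are a dash and an underscore (the sample name here is made up for the example):

```xojo
' Sample name invented for illustration; "-" and "_" are the assumed delimiters.
Dim theString As String = "FIFTH-AVENUE_NEW-YORK_30FPS_00064803"

' Normalize every "-" to "_", then split on "_".
Dim v() As String = Split(ReplaceAll(theString, "-", "_"), "_")

' The last field is the frame number.
Dim seqFrameNum As String = v(v.Ubound)  ' "00064803"
```

Note that once the delimiters are normalized the original separators are gone, so the base name cannot be rebuilt exactly from v(), and a name with no delimiter before the frame number is not split at all.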

Kem_Tekinay · January 30, 2019, 5:23am

Should I bring up that this is what regular expressions are designed for, or nah?

Kem_Tekinay · January 30, 2019, 5:29am

Assuming there will always be an extension at the end, this will do what you want:

dim rx as new RegEx
rx.SearchPattern = "(?U)^((.*)(\d+))\.[^.]+$"

dim match as RegExMatch = rx.Search( filename )
if match isa object then
  dim seqFirstFile as string = match.SubExpressionString( 1 )
  dim seqBase as string = match.SubExpressionString( 2 )
  dim seqFrameNum as string = match.SubExpressionString( 3 )
end if

Perry_Paolantonio · January 30, 2019, 1:30pm

I just hate them. I know I should love them, but I hate the syntax and I find them endlessly confusing. I was hoping to avoid regex for something this simple.

I’ll give all these suggestions a try and see what’s fastest on a set of 10,000 files. (probably regex, knowing my luck)

Thanks everyone!

Kem_Tekinay · January 30, 2019, 2:00pm

My guess is it won’t be the fastest, but it will be the most flexible with the least code. Let us know.

BTW, that pattern says:

Set the mode to Un-greedy.
Look for the start of the line.
Start the first subgroup of the complete name without extension.
Start the second subgroup of the main part of the name (any characters of any length).
Start the third subgroup of 1 or more digits.
Match the dot and the extension of one or more characters that are not dots.
Look for the end of the line.
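To see how that breakdown plays out, here is a rough sketch that runs the pattern over a few name variations. The sample names are illustrative (the last one is invented), and the pattern is written with single backslashes because Xojo string literals do not process backslash escapes.

```xojo
Dim rx As New RegEx
rx.SearchPattern = "(?U)^((.*)(\d+))\.[^.]+$"

' Illustrative inputs: underscores only, mixed separators, and no separator at all.
Dim samples() As String = Array( _
  "FIFTH_AVENUE_NEW_YORK_30FPS_15489_FLAT_SCAN_00064803.dpx", _
  "FIFTH-AVENUE_NEW-YORK_30FPS.15489-FLAT-SCAN.00064803.dpx", _
  "SCAN42.dpx")

For Each sampleName As String In samples
  Dim match As RegExMatch = rx.Search(sampleName)
  If match <> Nil Then
    ' Subgroup 2 is the base name, subgroup 3 is the frame number.
    System.DebugLog(match.SubExpressionString(2) + " | " + match.SubExpressionString(3))
  End If
Next
```

Because the trailing digits must be followed directly by the final dot and extension, earlier digit runs such as 30 or 15489 should end up in the base name rather than in the frame number.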

Daniel_Taylor · January 30, 2019, 6:32pm

RegEx may end up being the fastest because it’s a call out to a highly optimized library written in C or C++. Xojo is no slouch, especially now that we have LLVM with 64-bit builds. But you end up copying substrings when you use Mid which will slow down any algorithm where you’re scanning characters.

The reason I suggested Mid and scanning right-to-left is that you bypass everything that might confuse other methods. Since there’s no reliable delimiter, there may or may not be a dot, there may be many dots, and we must presume there may be other digits in the name, start at the end with the stuff you want. Once you have what you want, everything else is the first part.

That said, I can’t seem to stump Kem’s pattern with the variations you might run into. Unless performance is such a huge concern that you feel compelled to test other methods…and again, RegEx may end up being fastest anyway…I would just go with the code he posted.

Tim_Jones · January 30, 2019, 7:56pm

My simple code above does the same thing without the use of Mid. I also tested Kem’s RegEx against my Split/NthField on a 5,000-line sample, and at that sample size the two were within a sub-millisecond range of each other.

When I changed my Split to SplitB, it was almost identical after 10 runs.

Tim_Parnell · January 30, 2019, 8:17pm

Your simple code fails to satisfy the requirements.

This is exactly what RegEx is for.

Tim_Jones · January 30, 2019, 8:37pm

Ah, it fails to satisfy ALL the requirements. My code handles what will most likely meet a 90% scenario.

We work with a number of Resolve colorists that match clips to Premiere Pro and Media Composer users that utilize similar naming conventions for DPX files. As with what happens when the old ISIS and EditShare systems are brought back online with such naming mistakes, production management pitches a fit, hires a script-kiddie to perform a batch rename, and THEN processes them to match the remainder of the assets in the archive. I just went through this with NBCUniversal. It was easier to rename the original frame files than try to work around the variations that they were seeing from cel scans dating back to 2002/3.

Tim_Jones · January 30, 2019, 8:41pm

To take it one step further, one subset had names that contained both hyphens and underscores and dots for separators. Something like this:

“FIFTH-AVENUE_NEW-YORK_30FPS.15489-FLAT-SCAN.00064803.dpx”

Tim_Parnell · January 30, 2019, 8:46pm

Which Kem’s RegEx was able to handle.

Tim_Jones · January 30, 2019, 9:43pm

Except the renaming of the original files to match the standard that had been adopted :).