I need to extract information from the file names of large image sequences. Ultimately, what I want to get from the file name is this:
Original File Name: “FIFTH_AVENUE_NEW_YORK_30FPS_15489_FLAT_SCAN_00064803.dpx”
seqFirstFile as text = “FIFTH_AVENUE_NEW_YORK_30FPS_15489_FLAT_SCAN_00064803”
seqBase as text = “FIFTH_AVENUE_NEW_YORK_30FPS_15489_FLAT_SCAN_”
seqFrameNum as text = “00064803”
(The last set of numbers is the file’s position in the sequence. I keep it as text rather than an Integer to preserve the leading zeros.)
I’d like to minimize the number of times I need to iterate over this string, but more importantly, there’s some variability in how the string is constructed. As a general rule, it’s a safe assumption that the extension can be removed by looking for the “.” – though there are some older systems that used dots in the file name, they’re not very common any more. The number of the file in the sequence can be an arbitrary number of digits. Here it’s 8 but it could be 6, it could be 12, it could be variable, in that the number might have been assigned without any zero padding at the beginning. The file could use dashes instead of underscores, or in some cases it might not have a delimiter between the last bit of text and the sequence number.
What’s the best way to quickly look at this and populate the three variables listed above?
If the file name will always end with FrameNumber.Type then I would scan right to left using a loop and getting one character at a time with Mid. At the first dot you have your type. At the first digit you have the end of the frame number (remember right to left), and at the last digit you have the beginning.
With those indexes you can then use Mid to split the string up.
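For what it’s worth, here is a rough sketch of that right-to-left scan. ParseSeqName and its ByRef parameters are names I made up for illustration, and it assumes the frame number is the run of ASCII digits sitting immediately before the last dot (or at the very end of the name if there is no dot). Treat it as a starting point rather than tested production code.

```xojo
// Hypothetical helper: splits a sequence file name by scanning right to left.
// Returns False when the name has no trailing frame number.
Function ParseSeqName(filename As String, ByRef seqFirstFile As String, ByRef seqBase As String, ByRef seqFrameNum As String) As Boolean
  Dim n As Integer = Len(filename)

  // 1. Find the last dot; everything after it is the extension.
  Dim dotPos As Integer = 0
  For i As Integer = n DownTo 1
    If Mid(filename, i, 1) = "." Then
      dotPos = i
      Exit For
    End If
  Next

  // Assumption: with no dot at all, the whole name is treated as extensionless.
  Dim lastChar As Integer = n
  If dotPos > 0 Then lastChar = dotPos - 1

  // 2. Walk left over the trailing run of digits (ASCII 48 to 57).
  Dim firstDigit As Integer = lastChar + 1
  For i As Integer = lastChar DownTo 1
    Dim code As Integer = Asc(Mid(filename, i, 1))
    If code < 48 Or code > 57 Then Exit For
    firstDigit = i
  Next

  // No digits before the extension means no frame number.
  If firstDigit > lastChar Then Return False

  // 3. Slice the three pieces out using the indexes found above.
  seqFirstFile = Left(filename, lastChar)
  seqBase = Left(filename, firstDigit - 1)
  seqFrameNum = Mid(filename, firstDigit, lastChar - firstDigit + 1)
  Return True
End Function
```

Called on the example name, ParseSeqName should return True with seqFrameNum = "00064803" and seqBase ending in "FLAT_SCAN_". Because it only looks at the tail of the string, other digits earlier in the name (30FPS, 15489) and the choice of delimiter never come into play.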
Assuming there will always be an extension at the end, this will do what you want:
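dim rx as new RegEx
// (?U) makes the quantifiers ungreedy, so group 2 stops where the trailing digits begin
rx.SearchPattern = "(?U)^((.*)(\d+))\.[^.]+$"
dim match as RegExMatch = rx.Search( filename )
if match isa object then
  dim seqFirstFile as string = match.SubExpressionString( 1 ) // name without the extension
  dim seqBase as string = match.SubExpressionString( 2 ) // everything before the frame number
  dim seqFrameNum as string = match.SubExpressionString( 3 ) // frame number, padding preserved
end if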
RegEx may end up being the fastest because it’s a call out to a highly optimized library written in C or C++. Xojo is no slouch, especially now that we have LLVM with 64-bit builds. But you end up copying substrings when you use Mid which will slow down any algorithm where you’re scanning characters.
The reason I suggested Mid and scanning right-to-left is that you bypass everything that might confuse other methods. Since there’s no reliable delimiter, there may or may not be a dot, there may be many dots, and we must presume there may be other digits in the name, start at the end with the stuff you want. Once you have what you want, everything else is the first part.
That said, I can’t seem to stump Kem’s pattern with the variations you might run into. Unless performance is such a huge concern that you feel compelled to test other methods (and again, RegEx may end up being fastest anyway), I would just go with the code he posted.
My simple code above does the same thing without the use of Mid. I also tested Kem’s RegEx against my Split/NthField on a 5000-line sample, and at that size the two were within a millisecond of each other.
When I changed my Split to SplitB, it was almost identical after 10 runs.
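If anyone wants to repeat that kind of comparison, a minimal timing sketch might look like the following. It is not the harness I actually ran; the names() array and the TimeRegEx method name are assumptions for illustration, and you would write a matching loop around whichever other method you want to compare.

```xojo
// Hypothetical timing harness: runs the RegEx over every sample name
// and logs the elapsed time in milliseconds.
// Assumption: names() already holds the sample file names.
Sub TimeRegEx(names() As String)
  Dim rx As New RegEx
  rx.SearchPattern = "(?U)^((.*)(\d+))\.[^.]+$"

  Dim startTime As Double = Microseconds
  For Each f As String In names
    Dim match As RegExMatch = rx.Search(f)
  Next
  Dim elapsedMs As Double = (Microseconds - startTime) / 1000

  System.DebugLog("RegEx: " + Str(elapsedMs) + " ms")
End Sub
```

Run it several times and compare averages, since a single pass over a few thousand names is short enough for normal system noise to swamp the difference.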
Ah, it fails to satisfy ALL the requirements. My code handles what will most likely meet a 90% scenario.
We work with a number of Resolve colorists that match clips to Premiere Pro and Media Composer users that utilize similar naming conventions for DPX files. As with what happens when the old ISIS and EditShare systems are brought back online with such naming mistakes, production management pitches a fit, hires a script-kiddie to perform a batch rename, and THEN processes them to match the remainder of the assets in the archive. I just went through this with NBCUniversal. It was easier to rename the original frame files than try to work around the variations that they were seeing from cel scans dating back to 2002/3.