Faster way to search a folder of 30k files for partial name match?

Problem statement: I have a database of shipped jobs that gets updated every day. When a job ships, I need to delete any files associated with it. The folder that holds the production files for this type of job may contain 20,000 to 30,000 files at any given moment. The job number is a component of the file name. For example, for job X10529714 the matching file name is X10529714-S01-V01.sRGB.PDF. The code I’m using to search the folder for files that match the job number is below. It takes hours to iterate through all of the files and search for a job number match. There is usually only one match, but occasionally there is more than one and sometimes none. What I’m looking for is a faster way to find the matching files. If I use the built-in Windows search in a folder window, it finds matching files in seconds, so there has to be a better way. Does anyone have any suggestions?

Public Function FindFileMatches() As FolderItem()
  Var matchstring As String
  // Create the folder path object
  Var folderpath As New FolderItem(SearchFilePath,FolderItem.PathModes.Native)
  // Create an array to hold the file list
  Var filelist(-1),matchlist(-1) As FolderItem
  // Check to make sure the SearchFilePath path exists
  If Not (folderpath = Nil) And folderpath.Exists Then
    // Get the SearchFilePath folder contents and store in the filelist array of folderitems
    filelist = FolderContents(folderpath,False)
    // remove any subfolders from the list
    Var idx As Integer
    For idx = filelist.LastIndex DownTo 0
      If filelist(idx).IsFolder Then
        filelist.RemoveAt(idx)
      End If
    Next
    // Match the JobID to files in the list and add to a match list array
    Var rg As New RegEx
    Var rgMatch As RegExMatch
    rg.SearchPattern = "^(\w{9})-.*$"
    For idx = 0 To filelist.LastIndex
      rgMatch = rg.Search(filelist(idx).Name)
      If Not (rgMatch = Nil) Then
        matchlist.Add(filelist(idx))
      End If
    Next
  End If
  Return matchlist
End Function
Public Function FolderContents(dir as FolderItem, includeInvisibles as Boolean) As FolderItem()
  // returns an array of all items inside a folder
  Var items() as FolderItem
  Var n as Integer = dir.Count
  For i as Integer = 1 to n
    Var f as FolderItem
    f = dir.TrueItem(i)
    If f <> nil and f.Exists then
      If includeInvisibles or f.Visible then
        items.Append f
      End
    End
  Next
  Return items
End Function

Could you run a shell command like “dir *X10529714* > listoffiles.txt” and then read “listoffiles.txt” line by line to get all the names you need? (Note: I didn’t actually go through your code to see if this will work exactly right.)
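
A minimal sketch of that idea from inside Xojo, reading the `dir` output through the `Shell` class instead of a temporary text file. The helper name `ListJobFiles` and its parameters are made up for illustration. It assumes a Windows target where `Shell.Execute` hands the command to the command interpreter, and a folder path without a trailing backslash. `/b` asks `dir` for bare names and `/a-d` leaves directories out.

```xojo
// Hypothetical helper: returns the names in folderPath that contain jobNumber.
Function ListJobFiles(folderPath As String, jobNumber As String) As String()
  Var names() As String
  Var sh As New Shell
  // Assumption: -1 lets a slow listing of a large folder run to completion
  sh.TimeOut = -1
  // Quote the whole pattern so folder names with spaces survive
  sh.Execute("dir /b /a-d """ + folderPath + "\*" + jobNumber + "*""")
  If sh.ExitCode <> 0 Then
    // Expected when nothing matches or the path is wrong; return an empty list
    Return names
  End If
  // Normalize line endings, then keep every non-empty line
  For Each entry As String In sh.Result.ReplaceLineEndings(EndOfLine.LF).Split(EndOfLine.LF)
    If entry.Trim <> "" Then names.Add(entry.Trim)
  Next
  Return names
End Function
```

Called as `ListJobFiles(SearchFilePath, "X10529714")`, this would return only the names for that job, so the filtering is done by Windows rather than in a Xojo loop.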

Bill, that sounds like an interesting approach. The app I’m running is a console app. I may experiment with your suggestion in the morning. I’ll test it in a cmd window first to see how fast that is.

If you have an MBS license, consider using their FileListMBS class. I think you may even be able to exploit the optional filter in the constructor, though I have not needed that filter myself.

It is orders of magnitude faster than folder item traversal.

Thanks Douglas! Yes I do have an MBS license. I’ll check the sample projects and test that.

If you control the file structure, create a folder for each job.

Well, there are many things I don’t understand in your code:

-Why waste memory and CPU creating a FolderItem for each file?
-Why waste memory and CPU using RegEx to match all the files when a simple InStr on the job number would do?
-Why call the SLOW TrueItem method for items whose TrueItem you don’t really need? (That matters for only one or two items.)
-Why use Var? 🤢 🤪

Isn’t it easier and faster to have only one loop and do a string comparison on the name in there?

You can optimize this into one loop:

If filelist(idx).IsFolder Then
  // ignore folders
Else
  // test the file name for a match here
End If

I may be missing something 🙂
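
One way to read the single-loop suggestion, as a sketch rather than a drop-in replacement. `FindJobFiles` is an illustrative name, and the code assumes the job number sits at the very start of the file name, as in X10529714-S01-V01.sRGB.PDF. It still creates a FolderItem per entry through `FolderItem.Children`, so it only removes the second pass and the RegEx, not the per-item overhead.

```xojo
// Hypothetical helper: one pass over the folder with a plain string comparison.
Function FindJobFiles(folder As FolderItem, jobNumber As String) As FolderItem()
  Var matches() As FolderItem
  If folder = Nil Or Not folder.Exists Then Return matches
  For Each f As FolderItem In folder.Children
    // Skip subfolders in the same loop instead of pruning them afterwards
    If f.IsFolder Then Continue
    // A prefix test stands in for the RegEx; the trailing "-" avoids partial hits
    If f.Name.BeginsWith(jobNumber + "-") Then
      matches.Add(f)
    End If
  Next
  Return matches
End Function
```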

Why not shell a DOS command?

DEL /F /Q /S X10529714*.*

/F - allows read-only files to be deleted
/Q - quiet mode
/S - subdirectories too
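
A hedged sketch of issuing that command from Xojo with the `Shell` class. `DeleteJobFiles` and its parameters are hypothetical names, and a Windows target is assumed. `/S` is left out here so that only the one folder is touched. Since it is unclear how far DEL's exit code can be trusted to signal "nothing matched", the sketch logs whatever text DEL prints instead of branching on the exit code.

```xojo
// Hypothetical helper: deletes jobNumber*.* in folderPath by shelling out to DEL.
Sub DeleteJobFiles(folderPath As String, jobNumber As String)
  Var sh As New Shell
  // Assumption: -1 lets the command run without being cut off
  sh.TimeOut = -1
  // /S is omitted on purpose so subfolders are never touched
  sh.Execute("DEL /F /Q """ + folderPath + "\" + jobNumber + "*.*""")
  // Keep any output so failures show up in the log
  Var output As String = sh.Result.Trim
  If output <> "" Then
    System.DebugLog("DEL output for " + jobNumber + ": " + output)
  End If
End Sub
```

Unlike deleting through a FolderItem, this gives no per-file result, which matters if each deletion has to be recorded in the database.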

Thanks for all the input, folks. I ended up using the FileListMBS class that Douglas Handy recommended. With it I could put the file names into a String array, quickly search that array for job number matches with regex, and delete each file as it was found. It was way faster, and FileListMBS also filtered out folders as well as invisibles. This was a more appealing option than using a shell to issue commands.

So do you have a benchmark you can share? If it used to take “hours” with the original approach, what does it take now?

Douglas, there was another error in my code that contributed to it taking “hours”. I discovered and fixed it when I switched to FileListMBS. Running my regex search against an array of 30K file names took maybe a second, so it was orders of magnitude faster than my first attempt. I didn’t run an actual benchmark since I was satisfied with the speed. Below is a code snippet from the console Run event:

// Use FileListMBS (filelist object) to list file names in a MatchList Array
Var SkipMode As Integer
SkipMode = BitwiseOr(SkipMode, FileListMBS.SkipFolders)
SkipMode = BitwiseOr(SkipMode, FileListMBS.SkipHidden)
filelist = New FileListMBS(SearchFilePath, "", SkipMode)
Var filecount As Integer = filelist.Count -1
Var MatchList(-1) As String
For idx = 0 To filecount
  MatchList.Add(filelist.Name(idx))
Next

LogString = "Searching " + SearchFilePath + " for matching files" 
StdOut.WriteLine(LogString)
LogTransactions(LogString)

Var f As FolderItem
Var rg As New RegEx
Var rgMatch As RegExMatch

For row = 0 To DBRecords.LastIndex(1)
  b = False
  
  Job_ID = DBRecords(row,1)
  
  rg.SearchPattern = "^.*-(\w{9})-.*$"
  rgMatch = rg.Search(DBRecords(row,1))
  If Not (rgMatch = Nil) Then
    // We found the Job Number portion of the Job_ID
    JobID = rgMatch.SubExpressionString(1)
    LogString = "Processing Job ID " + Job_ID + " using Job Number " + JobID
    StdOut.WriteLine(LogString)
    LogTransactions(LogString)
    
    // Find the Job Number portion of the file name in the list, (there may be more than 1 file)
    rg.SearchPattern = "^(\w{9})-.*$"
    For idx = MatchList.LastIndex DownTo 0
      rgMatch = rg.Search(MatchList(idx))
      If Not (rgMatch = Nil) Then
        // If the DB Job Number matches the File Job Number Delete it
        If JobID = rgMatch.SubExpressionString(1) Then
          foundfile = MatchList(idx)
          LogString = "Found Matching file " + foundfile
          StdOut.WriteLine(LogString)
          LogTransactions(LogString)
          f = folderpath.child(foundfile)
          If Not (f = Nil) And f.Exists Then
            Try
              f.Remove
              LogString = foundfile + " successfully deleted"
              StdOut.WriteLine(LogString)
              LogTransactions(LogString)
              // Update the database status to '1' using UpdateStatus method, (processed)
              UpdateStatus
            Catch error As IOException
              LogString = error.Message + EndOfLine
              StdOut.WriteLine(LogString)
              LogExceptions(LogString)
            End Try
          Else
            LogString = "Failed to remove the file " + foundfile
            StdOut.WriteLine(LogString)
            LogExceptions(LogString)
            // database status will not be updated so that it can be tried again during the next run.
          End If
          MatchList.RemoveRowAt(idx)
          b = True
        End If
      End If
    Next
    
    If b = False Then
      // No matching files were found
      LogString = "No files found that match " + JobID
      StdOut.WriteLine(LogString)
      LogTransactions(LogString)
      // Update the database status to '1' using UpdateStatus method, (processed)
      UpdateStatus
    End If
  Else 
    // Handle Job ID Error
    LogString = "An error occurred when attempting to isolate the Job Number from the Job ID " + Job_ID + EndOfLine
    StdOut.WriteLine(LogString)
    LogExceptions(LogString)
  End If
Next
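
The snippet above rescans the whole name list with a regex for every database record. A possible refinement, sketched here under the assumption that every relevant file name starts with a 9-character job number followed by a hyphen, is to group the names in a `Dictionary` once and then look each job number up directly. `BuildJobIndex` is a hypothetical helper, and its prefix check is looser than the `\w{9}` pattern in the original.

```xojo
// Hypothetical helper: groups file names by their 9-character job-number prefix.
Function BuildJobIndex(names() As String) As Dictionary
  Var index As New Dictionary
  For Each fileName As String In names
    // Only index names shaped like X10529714-S01-V01.sRGB.PDF
    If fileName.Length > 9 And fileName.Middle(9, 1) = "-" Then
      Var key As String = fileName.Left(9)
      If index.HasKey(key) Then
        // "|" cannot occur in a Windows file name, so it is a safe separator
        index.Value(key) = index.Value(key).StringValue + "|" + fileName
      Else
        index.Value(key) = fileName
      End If
    End If
  Next
  Return index
End Function

// Possible usage, replacing the inner For loop over MatchList:
// Var index As Dictionary = BuildJobIndex(MatchList)   // build once, before the record loop
// If index.HasKey(JobID) Then
//   Var joined As String = index.Value(JobID)
//   For Each foundfile As String In joined.Split("|")
//     // delete and log as before
//   Next
// End If
```

With roughly a second for the full scan, this may not be worth the change; it mainly helps if the number of records per run grows.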

Did you also try using the WinFilter option in the constructor to reduce the size of the list it returns? The speed may be acceptable now, but I suspect you could let FileListMBS also perform that portion for you very efficiently – or at least reduce the size of what it returns.
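
A sketch of what that could look like, reusing `SearchFilePath` and `JobID` from the snippet above. The wildcard syntax of the filter argument is an assumption here and should be checked against the MBS documentation; the constructor call itself mirrors the one already used in the thread, with the empty filter string replaced.

```xojo
// Sketch: let FileListMBS filter by job number instead of listing everything.
Var SkipMode As Integer
SkipMode = BitwiseOr(SkipMode, FileListMBS.SkipFolders)
SkipMode = BitwiseOr(SkipMode, FileListMBS.SkipHidden)
// Assumption: the second argument accepts a Windows-style wildcard pattern
Var jobFiles As New FileListMBS(SearchFilePath, JobID + "-*", SkipMode)
For i As Integer = 0 To jobFiles.Count - 1
  StdOut.WriteLine("Found matching file " + jobFiles.Name(i))
Next
```

The trade-off is one directory query per job instead of one listing for the whole run, so this suits a handful of jobs per day better than thousands.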

No, I did not. All the files in the folder being searched share the same PDF file type and naming structure. A tiny number of files have a slightly different prefix that could possibly be filtered out, but I doubt that would make much difference in speed. In any case, I appreciate the recommendation and I’ll look at incorporating a filter for that prefix.

Now that I’m keeping the folder cleaned of shipped jobs daily, the number of files to search has gone down to 10K to 12K.