What's the best way to get a massive directory into an array?

What’s the best way to generate an array containing the structure of a very large directory tree? We’re talking 30,000-40,000 files.

I’m looking for something that’s fast. Initially, the application will be on Linux, but I’d also like to make it work on Windows and Mac at some point. If there are platform-specific tricks that speed this up, that’s fine. But if there’s something generic that will work in all cases, even better.

Thanks!

FileListMBS class is made for speed
http://www.monkeybreadsoftware.net/class-filelistmbs.shtml

Thanks, but not Linux, at least according to the documentation.

On Windows/Mac, how does it handle really big directories?

Do you need FolderItems or just the names of the files within those directories?

Documentation is wrong.
Linux was added last year.
Thanks for the notice.

Fixed for next update.

https://forum.xojo.com/13879-fastest-way-to-get-folderitem-names-into-array/0#p111408

Not entirely sure yet. The application will be copying each file in the folder (most often to LTO tape via an LTFS mount) and making a checksum on each one. Optionally it will read each file off the LTO tape and do a checksum there, to verify the copy was successful.

So it does kind of depend on how I do the checksums. If I run them from the command line, all I really need is an array with the paths to all the files in the source folder, and that should be recursive. If I do it internally with Xojo’s MD5 tool, then I would think I’d need the FolderItem, since I have to load the files up before writing them. This may not be a practical option, because in some cases the files could be 1TB QuickTime movies. In most cases, they are 12-50MB image files.
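For the command-line route, here is a rough sketch of how the checksum pass and the verify pass could look. It assumes GNU md5sum is available (standard on Linux; macOS ships a differently named `md5` tool), and the function names make_manifest and verify_copy are made up for this example. md5sum reads each file as a stream, so very large files are not loaded into memory all at once.

```shell
# Hypothetical helper: print an MD5 manifest for every regular file under
# directory $1. Paths are recorded relative to $1 so the same manifest can
# be checked against a copy rooted somewhere else (e.g. an LTFS mount).
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + )
}

# Hypothetical helper: verify the copy under directory $2 against the
# manifest file $1. Returns non-zero if any file is missing or differs.
verify_copy() {
    # Resolve the manifest to an absolute path before changing directory.
    manifest=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
    ( cd "$2" && md5sum -c "$manifest" )
}
```

Usage would be something like `make_manifest /path/to/source > manifest.md5` followed by `verify_copy manifest.md5 /path/to/copy`. From Xojo, either command could be run through a Shell and only the result inspected.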

The fastest way is probably the Linux ls command: https://linux.die.net/man/1/ls

You have many options available that should fit whatever format you want. For instance, -R lists subdirectories recursively.

A simple Shell call should suffice to get the directory structure.

Yeah, ls is probably the best way to go for Mac and Linux. I was hoping to not have to do it two different ways though, for different platforms. I’m going to play with FileListMBS a bit to see how that does on really big directories first.

The key is to avoid FolderItems as much as possible…
They are not very fast.
So if you ask FileListMBS for a FolderItem on each item, it would be slow again.

Thanks. My current thinking is to make an array that contains the full path to each file, then iterate through those paths. That should be all the information I need, and I should be able to avoid folderitems.

Not sure about ls, but you can do that through find on Mac/Linux. For example, on the Mac:

find -x /path/to/folder

will produce output like:

/path/to/folder/item1
/path/to/folder/dir1/subitem1

For about 2M files in my home directory, that takes ~24s in the Terminal.
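As a sketch of how that listing could be narrowed to files only and then walked one path at a time: the function names list_files and for_each_file are invented for this example, and the printf line stands in for whatever real per-file work (copy, checksum) is needed. It assumes no file name contains a newline, since the list is one path per line.

```shell
# Hypothetical helper: print the full path of every regular file under $1,
# one per line. -type f leaves directories out of the list. To stay on one
# volume, BSD/macOS find takes -x before the path and GNU find takes -xdev
# after it.
list_files() {
    find "$1" -type f
}

# Hypothetical helper: walk the list one path at a time.
# IFS= and -r keep leading spaces and backslashes in names intact.
for_each_file() {
    list_files "$1" | while IFS= read -r path; do
        printf 'would process: %s\n' "$path"   # placeholder for real work
    done
}
```

In Xojo, the same output could be captured from a Shell and split on the line ending to get the array of paths.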

Actually, Mac and Linux share the shell language, so it should be just between Mac/Linux and Windows if you need it as well.

@Perry Paolantonio,

Here is a Windows-specific method that we use, which is certainly quicker than using FolderItems. I haven’t tested it with the number of files you are talking about, but it is worth a try. You can change the DIR command to suit:

    ' Create a DIR command to list all the files
    ' /b means output just the filename b=bare
    ' /t:c means use the Created Date to sort
    ' /o:d means order by date (oldest first)
    ' The path is wrapped in quotes so folders with spaces in their names work
    Dim Dir As String
    Dir = "DIR """ + ThisFolder.NativePath + "*." + FileMask + """ /b /t:c /o:d"
    
    ' Do the Dir
    Dim ThisShell As New Shell
    ThisShell.Execute(Dir)
    
    ' Check for Errors from the shell command
    ' Shell returns 1 - File Not Found if there are no files matching the mask
    If ThisShell.ErrorCode > 1 Then
      Raise New RuntimeExceptionEx (ThisShell.ErrorCode, ThisShell.Result)
    End If
        
    ' Don't process if you have File Not Found
    If ThisShell.ErrorCode = 0 Then
      
      ' Get a list of files
      FileNames = Split(ThisShell.Result, Chr(13)+Chr(10))
      
      
      ' Now Publish the files
      ' The last item in the array is empty because of the CRLF after the last filename in the list
      For n = 0 to FileNames.Ubound - 1
        
        ' Publish the document
        PublishDocument FileNames(n)
        
      Next
      
    End If

I just tried this on my system, where I have a lot of files synced using Box Sync. The DIR command doesn’t seem to recognize the Box Sync folder: it shows up in its parent directory, but you cannot execute a DIR command on the Box Sync folder itself.

Can anyone explain that to me?

Maybe try changing into that path first. I had some problems with that DIR command too, so I always use the CD command and then DIR, and that works.

In my last project I used this:

    Dim tocSearch As New Shell
    #If TargetWindows Then
      tocSearch.Execute drive(0) + " & cd " + addonPath.NativePath + " & dir /b /s *.toc"
    #ElseIf TargetMacOS
      tocSearch.Execute "find """ + addonPath.NativePath + """ -type f -name ""*.toc"""
    #Else
      ' no need atm
    #EndIf

If you come to the London Xojo Conference at Wimbledon this Friday, I can show you my methods, where you pass a FolderItem and it builds an SQLite database of all the enclosed folders with all Xojo, MBS, EXIF, hash, audio and video metadata. I use FileListMBS for speed.

I use a SQLite database since I can reexamine different aspects of the content without having to re-search the folders, it requires minimal RAM and it survives an application relaunch.

I will demonstrate how I use it in my FileName Extreme Xojo app, amongst others.