Hello guys,
I have a scenario where a web app needs to process a large number of files. The idea is to do all the processing in the background, using as much of the available resources (RAM and cores) as possible, and to display the results in the web app.
These are my current constraints:
- OS: Debian 12.x
- Database storage: SQLite, encrypted
- App type: web app
- Multi-user management
My current test load is around 500 GB of files, and among those I need one PDF file parsed and its data extracted and put into the DB.
I managed to do the PDF parsing in Python; it returns JSON with the needed data, which I handle easily with a shell call. The rest is quite slow, though, and I use only a limited amount of the server's resources. What would be the best way to handle this?
My idea was to run multiple threads, each doing specific tasks, but then I saw a YouTube podcast where they said it is problematic to have multiple threads iterating over files and folders at the same time.
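To make the thread idea concrete, here is a minimal sketch, not a tested solution. It assumes that a single thread walks the tree once and that only the PDF parsing is spread over workers, so no two threads ever iterate over the same folders. Each worker mostly waits on the external parser process, which is what lets several cores be busy at once. The names `iter_pdfs`, `parse_pdf` and `parse_tree` are made up for illustration, and `command` stands for whatever shell call starts the parser.

```python
import json
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def iter_pdfs(root):
    """Walk the tree once, in the calling thread, yielding every PDF path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def parse_pdf(path, command):
    """Run the external parser on one file and decode the JSON it prints."""
    done = subprocess.run(
        [*command, path], capture_output=True, text=True, check=True
    )
    return path, json.loads(done.stdout)


def parse_tree(root, command, workers=None):
    """Parse all PDFs under root, running up to `workers` parsers at a time."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results come back in the same order as the paths were found.
        return list(pool.map(lambda p: parse_pdf(p, command), iter_pdfs(root)))
```

It could be called as `parse_tree("/data", ["python3", "pdf_to_json.py"])`, where both the folder and the script name are placeholders for the real ones.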
What I need is to scan all the folders and files, identify the files and their types, extract the needed data, process the file names (handle weird characters and normalise them), then parse the needed PDF files for each folder. Once this is complete, I process the data in SQLite, filter it and prepare the final result. All of that should be driven from the web app interface.
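As an example of the file name step, here is a minimal sketch of one possible normalisation. The rules are my assumption (strip accents, replace every other run of characters with an underscore, lower-case everything), and `normalise_name` is a made-up helper name.

```python
import os
import re
import unicodedata


def normalise_name(filename):
    """Return a lower-case, ASCII-only version of a file name."""
    stem, ext = os.path.splitext(filename)
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with "ignore" then drops the marks.
    stem = unicodedata.normalize("NFKD", stem)
    stem = stem.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_").lower()
    return stem + ext.lower()


print(normalise_name("Résumé  (final).PDF"))  # resume_final.pdf
```

Two different originals can end up with the same normalised name, and names written entirely in a non-Latin script lose their stem, so a real version would need a rule for both cases.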
Here the idea was to use the web app as a queue system: it allocates tasks, then the threads pick up tasks, process them and report back to the interface. But I guess the many write calls to SQLite would make it even slower. I cannot use an in-memory DB because of the preemptive part, and because the interface and the threads would need to communicate all the time.
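A minimal sketch of such a queue, kept in SQLite itself, could look like this. It uses the plain `sqlite3` module and leaves encryption out; the `tasks` table, its columns and the function names are all assumptions for illustration. SQLite allows only one writer at a time, so `BEGIN IMMEDIATE` takes the write lock before a worker looks for a task, which keeps two workers from claiming the same row, and WAL mode lets the interface keep reading while a worker writes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id      INTEGER PRIMARY KEY,
    owner   TEXT NOT NULL,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending',  -- pending / running / done
    result  TEXT
)
"""


def connect(db_path):
    # isolation_level=None: autocommit, transactions are started explicitly.
    # timeout=30: wait up to 30 s for another writer instead of failing.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")  # readers do not block the writer
    conn.execute(SCHEMA)
    return conn


def add_task(conn, owner, payload):
    """Queue one task for a user and return its id."""
    cur = conn.execute(
        "INSERT INTO tasks (owner, payload) VALUES (?, ?)", (owner, payload)
    )
    return cur.lastrowid


def claim_task(conn):
    """Atomically take the oldest pending task, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def finish_task(conn, task_id, result):
    """Store the result (for example the parser's JSON) and mark the task done."""
    conn.execute(
        "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
        (result, task_id),
    )
```

Each worker would open its own connection with `connect()`, loop on `claim_task()`, and write one small row per finished task, so the number of writes grows with the number of tasks rather than with the number of files touched.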
I would also like to keep this approach so that multiple users can add multiple tasks, and the processing threads pick them up as soon as they finish their current ones. That way I prepare my daily tasks, the app works through them, and the interface is updated when needed. Or rather, I guess, the DB is updated, and the interface is refreshed when the user requests it.
Any ideas here? Thanks.