I had this crazy idea…
…to figure out a smarter way to search. Not just matching exact text. Sometimes I use ChatGPT (or some other LLM) to find information in large documents or websites: simply by uploading a document or entering the URL of the website I want to search through.
Sometimes I need to find information in my own databases or documents.
But I might search for the wrong words, or there might be typos in the original documents. Yeah, typos happen to all of us, right?
Encoding data: Symbols
So, instead of just storing the “raw” data in a database or document, I could store certain word-concepts. I mean, “forest” and “woods” are two different words with the same meaning. What if I could just generate one “symbol” for both of those words?
I could then encode a search query into a bunch of those symbols. That would give me results for all the synonyms.
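To make the idea concrete, here is a minimal sketch in Python. The symbol table is hand-made for illustration; in the real idea it would be generated by an LLM, and the words and symbol IDs here are made-up examples.

```python
# A tiny hand-made symbol table: synonyms map to the same symbol.
SYMBOLS = {
    "forest": "SYM_woodland",
    "woods": "SYM_woodland",
    "car": "SYM_vehicle",
    "automobile": "SYM_vehicle",
}

def encode(text: str) -> list[str]:
    """Turn raw text into symbols, falling back to the word itself."""
    return [SYMBOLS.get(word, word) for word in text.lower().split()]

# A query for "forest" now matches a document that says "woods",
# because both encode to SYM_woodland:
doc_symbols = set(encode("a walk in the woods"))
query_symbols = set(encode("forest walk"))
print(query_symbols & doc_symbols)
```

The overlap between the query's symbols and the document's symbols is what drives the match, even though the literal words differ.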
LLM and Data Caching
I could run a small local LLM to save on expensive API calls to an online platform like OpenAI.
I could use that LLM to build a dataset of the symbols for words in documents.
That dataset can then be used to encode (or decode) my query and the data I want to digest.
The reason for using a dataset is to build some sort of cache, instead of running the LLM all the time, which can be very power- (and CPU-) hungry.
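The caching part could look something like this sketch: check the dataset first, and only ask the LLM on a miss. The `ask_llm_for_symbol` function is a placeholder I made up; it stands in for a real call to a local model.

```python
import json
from pathlib import Path

CACHE_FILE = Path("symbol_cache.json")

def ask_llm_for_symbol(word: str) -> str:
    # Placeholder for a real call to a local LLM.
    # Here it just fabricates a symbol so the sketch is runnable.
    return f"SYM_{word}"

def load_cache() -> dict:
    """Load the symbol dataset from disk, or start fresh."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())
    return {}

def symbol_for(word: str, cache: dict) -> str:
    """Return the cached symbol, asking the LLM only on a cache miss."""
    if word not in cache:
        cache[word] = ask_llm_for_symbol(word)
        CACHE_FILE.write_text(json.dumps(cache))  # persist the dataset
    return cache[word]

cache = load_cache()
print(symbol_for("forest", cache))  # first call hits the LLM; later calls hit the cache
```

Every word only costs one LLM call ever; after that it comes straight from the dataset on disk.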
Am I re-inventing the wheel here?
I am aware that this approach is not suitable for most cases. But it can be useful in cases where I store lots of documents, like agreements, contracts, user manuals, etc…