Setting up an LLM using OpenAI or similar on my own datasets

Hi team

I’ve put this question in general as I’m not sure it fits the profile of the AI topic.

I guess I'm wondering if anyone has set up an end-to-end solution for a user-facing LLM.

I’d love to set up a web app that gives my users responses to questions based on a dataset I’ve created, i.e. a whole heap of PDF files will be searched and a response will be given to the user based on the available knowledge and the user’s question.

OpenAI suggests that I need to set up a vector database or similar to hold my PDFs etc.
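For anyone wondering what a “vector database” amounts to at its simplest: it’s a store of (chunk, embedding) pairs that you search by similarity. Here’s a minimal Python sketch with hand-made toy vectors — in a real app the embeddings would come from an embedding model (e.g. OpenAI’s embedding endpoint), and the function names here are just illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec, store, top_k=3):
    """Return the top_k chunk texts whose embeddings are closest to query_vec.
    `store` is a list of (chunk_text, embedding) pairs."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Toy 3-d "embeddings" purely for illustration; real embeddings have
# hundreds or thousands of dimensions.
store = [
    ("PDF chunk about nutrition", [1.0, 0.1, 0.0]),
    ("PDF chunk about exercise",  [0.0, 1.0, 0.1]),
    ("PDF chunk about sleep",     [0.1, 0.0, 1.0]),
]
print(nearest_chunks([0.9, 0.2, 0.0], store, top_k=1))
# → ['PDF chunk about nutrition']
```

The retrieved chunks are then pasted into the prompt so the model answers only from your content — that’s the whole “RAG” trick. Dedicated vector databases just do this lookup faster at scale.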

Has anyone done something similar?

I just completed such a project. I’m traveling this week but will try to make up a mock example project for you later in the week. Ping me Wednesday/Thursday if I don’t respond by then

Instead of PDF files, I had all my content in a database table. The first part breaks each topic into smaller chunks. The second part embeds these chunks into a vector format. The third part is sending the query to OpenAI for the response using only my content
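The three parts described above can be sketched roughly like this in Python (hypothetical function names, not the actual Xojo code from the project). Step 1, the chunking, needs no API at all; overlapping the chunks helps retrieval when an answer straddles a boundary:

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Step 1: split a long document into overlapping word chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Steps 2 and 3 would then call the OpenAI API, roughly along the lines of:
#   embedding = client.embeddings.create(model="text-embedding-3-small",
#                                        input=chunk)
#   answer = client.chat.completions.create(model="gpt-4", messages=[
#       {"role": "system", "content": "Answer only from: " + retrieved_text},
#       {"role": "user", "content": question}])

doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(doc, chunk_size=200, overlap=40)
print(len(chunks))  # → 3
```

Chunk sizes are a tuning knob: too small and you lose context, too large and you waste prompt tokens on irrelevant text.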

Also, my project is a desktop project, but I’m sure it could work for web with minimal changes

2 Likes

I have the sample project attached, which should hopefully guide you in the right direction in using Chat referencing only your content. The winMainInfo screen has several instructions. Of most importance is to add your own ChatGPT API Key in the placeholder constant (AIAPIKey) in Module1

I included several comments in the code for you to change to make this fit your own needs. Do a global search for “XXXX” ← I did not swear here. That’s the placeholder I use in code so I can find things I need to work on :slight_smile:

This is a full working sample app. It will create a database file on your desktop with two tables, populate these tables, parse them into smaller chunks, embed into vector format, and allow you to chat with the data

Chatbot for forum.xojo_binary_project.zip (24.7 KB)

And if you’d like to see how I am using the chatbot (named Cara) in my desktop app, here is a short video on YouTube. The attached sample app just uses a TextArea for the chat, but in my final project, I made chat bubbles for the user experience
https://youtu.be/bvtoMYwLEH8

One final note. Creating something like this was well beyond my coding expertise. I got help from Chat in building this. “Machines making machines” ← bonus points for who said that and from what movie haha :grin:

7 Likes

Ryan, this example is absolutely amazing. One of the cleanest examples I’ve seen on this forum. Thank you so much!
I’d love to figure out how to upload PDF files instead of plain text, but for those playing at home, please have a look at this example!

3 Likes

Congrats, yes, it looks great seeing the video.

When I click on the YouTube video link, it says “Your access to this site has been limited by the site owner” — not public, I guess?

What do you mean by “Step 1: Add your content to the content table. Your content will need a title and the full text”?

What is content? A topic of research? Thank you.

Video works fine here in Australia. I’m not logged in to YouTube or anything special.

Meet Cara: Your RD Exam Chatbot Powered by Study Suite® - Visual Veggies

Ah sorry. I misread. I thought you meant the video didn’t work. I haven’t tried going to the live site, just used the code linked in the post.

And to answer your question, the content could be any text, i.e. a whole textbook, for example.

I think my web team for my website blocks users from outside the US, or at least some of the more hacking-fanatic countries. The link is just an info page (announcement) about the chatbot

The content can be anything you want. In my example, and in my full project, every piece of content is linked to one of my articles. Say one of my articles is “nutrition basics”. That’s the title of my article. I have a wealth of text talking about this subject that would go into the full-text spot. When the article is parsed (say there are 3 chunks), the chunk IDs need to be unique, so it’ll be like:

  • nutrition basics#0
  • nutrition basics#1
  • nutrition basics#2

You can also think of the title as a chapter. Chapter 1 would be a title with all the full text of ch 1. Chapter 2 would be the next title and so on
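The chunk-ID scheme described above is trivial to sketch (Python, a hypothetical helper rather than the project’s actual Xojo code):

```python
def chunk_ids(title, chunks):
    """Give each chunk of an article a unique ID of the form 'title#index'."""
    return [f"{title}#{i}" for i in range(len(chunks))]

print(chunk_ids("nutrition basics", ["chunk a", "chunk b", "chunk c"]))
# → ['nutrition basics#0', 'nutrition basics#1', 'nutrition basics#2']
```

The title-plus-index convention also lets you trace any retrieved chunk back to its source article, which is handy for debugging why the bot gave a particular answer.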

Make it work your own way. You may not want titles. You may not want to pull data from tables. This is the way I needed mine to work, and it should hopefully act as a starting point for anyone who wants to make a bot sourcing only their own content

2 Likes

Hi Ryan, I’m poking around and am having some issues with context, i.e. I inputted a bunch of data the first time I ran the program and it worked well. I’ve since gone back and tried to add more data, and it gets added to my local database, but the chat doesn’t seem to know about it. How do I define what the model has access to? I.e., do I need to upload the database at each run, or is it stored persistently by OpenAI? I can’t get it to work either way but would love your thoughts. Has the change to GPT-5 affected this demo, Ryan?

Hi James. After you inputted your second round of data, did you go in and Parse the data? This is step 2 of the process. If you take a look at your local db table, the last column in the tblEmbed table, the Embedding column, should be populated with data, and NeedsEmbeding should be set to zero (meaning it has parsed the data into a format suitable for Chat)

You don’t need to do the Parse step after every article is added. It will store this data in the tblContent table (a temporary table that needs to be parsed). I will add a bunch of content at a time, and then when I am ready to test it, I will then do step 2 to parse the materials I recently added
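The “which rows still need parsing?” check described above is easy to reproduce. Here’s a Python/SQLite sketch — the table and column names mirror the ones mentioned in this thread (tblEmbed, Embedding, NeedsEmbeding), but the schema itself is an assumption, not the sample project’s exact one:

```python
import sqlite3

# In-memory stand-in for the sample project's local database.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tblEmbed (
    ChunkID TEXT PRIMARY KEY,
    ChunkText TEXT,
    Embedding BLOB,
    NeedsEmbeding INTEGER DEFAULT 1)""")

# One chunk added after the last Parse (still flagged), one already embedded.
db.execute("INSERT INTO tblEmbed (ChunkID, ChunkText) "
           "VALUES ('nutrition basics#0', 'some article text')")
db.execute("INSERT INTO tblEmbed (ChunkID, ChunkText, Embedding, NeedsEmbeding) "
           "VALUES ('nutrition basics#1', 'more text', x'00', 0)")

# Rows still waiting for step 2 (the Parse/embed pass):
pending = db.execute(
    "SELECT ChunkID FROM tblEmbed WHERE NeedsEmbeding = 1").fetchall()
print(pending)  # → [('nutrition basics#0',)]
```

If newly added articles never show up in chat answers, a query like this against your local db is a quick way to confirm whether they were actually embedded or are still sitting in the queue.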

Nothing here is being saved to OpenAI’s server. It is all housed locally on the machine

I haven’t tested yet with GPT-5, but I don’t see why it wouldn’t work. You would need to change the model in the winChat window, in thrSendRequest, if desired. It defaults in my sample project to gpt-4

Hi Ryan.

Thanks for the reply. Yes, I can confirm the database is as above. Really strange, as it’s only aware of the first articles. Now I’m starting to wonder if there are any special characters etc. in the data that are blocking it somehow.