Wanting to split a large text into an array of smaller phrases

Hello. I am trying to implement a “predictive search” feature. I want to search large blocks of text in a database table and also suggest search terms to the user as they type. I am using a predictive search feature from an XDC 2018 presentation. It works nicely for searching and predicting the titles of my articles in the db table, but I was hoping to search and suggest within the main text as well.

I was thinking of using .Split, but it looks like it can only split at a known character, such as a space. So I could split the whole text by word, search that way, and suggest single words, but I was hoping to show the search word plus the 3 or 4 words that follow it in the same line of text. Does this make sense?

For example, in the following text, if I type in “carbohydrates”, I would like the predictive search to display “carbohydrates yields approximately 4 kcals”, “carbohydrates (glucose) in the body”, “Carbohydrates are very readily available”, and so on for every other occurrence in the text below.

  • Essential macronutrient that is made up of carbon, hydrogen, and oxygen molecules
  • One gram of carbohydrates yields approximately 4 kcals
  • Preferred fuel source by the body. With a lack of available carbohydrates (glucose) in the body, fat and protein will be broken down. The by-product of fat metabolism is known as ketones, or ketone bodies
  • The brain utilizes glucose exclusively as the form of energy. During periods of starvation, or when there is simply not adequate glucose available, the brain will utilize ketone bodies as the energy source
  • Glycogen, a branched polysaccharide, is the stored form of carbohydrates in mammals, and this is stored in the liver and muscles. The average human can store approximately 1,500 to 2,000 calories from glycogen
  • Plants store carbohydrates in the form of starches
  • Carbohydrates are very readily available in foods, such as fruits, vegetables, grains, cereals, dairy, and sugars

There will be better ways, but maybe do this…
Assuming sentences end with periods…

Take the full text and replace all commas, semicolons, and periods with spaces.
Split the text on spaces.
Assemble a dictionary of words by working through the array, omitting ‘useless’ words such as the, a, and, but, if, … and anything shorter than 5 (?) letters.

Then using the dictionary, find all phrases in the original text starting with the keyword, and ending at the first comma or period.

Create an in-memory database with a table phrasetable containing 3 columns:
[keyword] / [start of sentence] / [keyword and remainder of the phrase]

Populate it.

Finally, when the user types Carbohydrate, you select from phrasetable where keyword is like what they typed, and you have a set of possibles.
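To make the steps above concrete, here is a minimal sketch in Python using the standard sqlite3 module (the same idea should carry over to Xojo's SQLite support). The names build_phrase_table and suggest, the sentence_start and phrase column names, and the STOP_WORDS list are my own placeholders, not anything from the original suggestion. It assumes sentences end with periods, so numbers like 1,500 or abbreviations will trip it up.

```python
import re
import sqlite3

# Illustrative stop-word list (an assumption); extend it to suit your text.
STOP_WORDS = {"which", "there", "their", "these", "those"}


def build_phrase_table(text, min_len=5):
    """Fill an in-memory table with keyword / start of sentence / phrase rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE phrasetable (keyword TEXT, sentence_start TEXT, phrase TEXT)"
    )
    # Assumes every sentence ends with a period.
    for sentence in re.split(r"\.\s*", text):
        for match in re.finditer(r"[A-Za-z]+", sentence):
            word = match.group(0)
            # Skip short words and stop words, as suggested above.
            if len(word) < min_len or word.lower() in STOP_WORDS:
                continue
            # The phrase runs from the keyword to the first comma or semicolon,
            # or to the end of the sentence if there is none.
            rest = sentence[match.start():]
            phrase = re.split(r"[,;]", rest, maxsplit=1)[0].strip()
            conn.execute(
                "INSERT INTO phrasetable VALUES (?, ?, ?)",
                (word.lower(), sentence[:match.start()].strip(), phrase),
            )
    return conn


def suggest(conn, typed):
    """Return the phrases whose keyword starts with what the user has typed."""
    rows = conn.execute(
        "SELECT DISTINCT phrase FROM phrasetable WHERE keyword LIKE ? ORDER BY phrase",
        (typed.lower() + "%",),
    )
    return [row[0] for row in rows]


SAMPLE = (
    "One gram of carbohydrates yields approximately 4 kcals. "
    "With a lack of available carbohydrates (glucose) in the body, "
    "fat and protein will be broken down. "
    "Carbohydrates are very readily available in foods, such as fruits."
)

conn = build_phrase_table(SAMPLE)
for phrase in suggest(conn, "carbo"):
    print(phrase)
```

The phrases here run to the first comma or period rather than a fixed 3 or 4 words. If you want a fixed length, truncate phrase to the first few words before inserting it.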

Perhaps use an in-memory SQLite database. Ingest the text for full-text search; this will handle all of the parsing, stemming and other issues that will make using Split or similar approaches fragile.

Full-text search documentation: SQLite FTS5 Extension

You can use the MATCH query and return words around the found term with the SNIPPET function.
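As a rough illustration of the MATCH plus snippet approach, here is a small Python sketch using the standard sqlite3 module. It assumes your SQLite build has FTS5 compiled in (most recent builds do, but check yours). The table name articles, the column name body, and the helper names build_index and suggest are placeholders of mine. The trailing * in the query makes it a prefix search, so partial input such as "carbo" still matches.

```python
import sqlite3


def build_index(passages):
    """Load passages into an in-memory FTS5 table, one row per passage."""
    conn = sqlite3.connect(":memory:")
    # Raises sqlite3.OperationalError if this SQLite build lacks FTS5.
    conn.execute("CREATE VIRTUAL TABLE articles USING fts5(body)")
    conn.executemany(
        "INSERT INTO articles (body) VALUES (?)", [(p,) for p in passages]
    )
    return conn


def suggest(conn, typed, max_tokens=8):
    """Return a short snippet around each match of the typed prefix."""
    typed = typed.strip()
    if not typed:
        return []
    # Quote the input so FTS5 reads it as a plain string; the trailing *
    # turns it into a prefix query.
    query = '"' + typed.replace('"', '""') + '"*'
    # snippet(table, column index, text before match, text after match,
    #         ellipsis text, maximum number of tokens to return)
    sql = (
        "SELECT snippet(articles, 0, '', '', '...', ?) "
        "FROM articles WHERE articles MATCH ?"
    )
    return [row[0] for row in conn.execute(sql, (max_tokens, query))]


passages = [
    "One gram of carbohydrates yields approximately 4 kcals",
    "The brain utilizes glucose exclusively as the form of energy",
    "Plants store carbohydrates in the form of starches",
    "Carbohydrates are very readily available in foods, such as fruits",
]

conn = build_index(passages)
for text in suggest(conn, "carbo"):
    print(text)
```

Exactly which words snippet returns around the match is decided by FTS5, so treat max_tokens as an upper bound on the suggestion length rather than an exact word count.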

Thanks for the help, guys. I will try some of these suggestions out and will post back if I need more assistance.