Search string and RTFData

JensK · December 7, 2017, 9:40pm

I have written a note taking app, that among other things shows notes in a TextArea and saves them to an SQLite database as RTFData (or more specifically I use some declares to get the RTF data). I implemented a search method that reads text from a TextField, modifies a prepared statement and loads only data containing the search string from an SQLite database to a record set.
That works very well when I search for German or English words, but I also have some ancient greek text passages among my notes, and there the search doesn’t work. The reason seems to be that the RTF data are converted to unicode; e.g. when I type the word “???” (idleness) in the TextArea, it is automatically saved as "\u963 \u967 \u959 \u955 \u942 " in the database, so that the lemma “???” in the search field doesn’t get any results.

Can you think of a proper way to convert my characters to unicode in order to compare them to the stored data? I am completely at a loss here

Norman_P · December 7, 2017, 9:54pm

maybe use a full text index on the NON-RTFData so you can tell which rows you are interested in (we do this for the built in language reference)
the fun part is that it doesn’t have to contain the data - you can just build an index from the non-rtfdata

but you’d need to read the sqlite docs on full text search indexes etc

JensK · December 7, 2017, 10:04pm

I’m not sure I can follow you. Are you suggesting to save the plain text in addition to the rtf data (which would be some data overhead)?

DaveS · December 7, 2017, 10:15pm

http://www.sqlitetutorial.net/sqlite-full-text-search/

Jean-Yves_Pochez · December 7, 2017, 10:43pm

yes. that’s the only way to search for non english strings in sql databases with xojo.

Norman_P · December 7, 2017, 10:45pm

No
just indexing the plain text for full text searches
those indexes for full text search can be searched and then tell you which rftdata contained the plain text searched for

but you really do have to read the sqlite docs about this to get the hang of it

Norman_P · December 7, 2017, 10:45pm

no its not … with FTS you can search full text without having to store the data twice

JensK · December 8, 2017, 6:41am

There’s a lot I still don’t know…
I’ll have a look at it, thanks for your suggestions!

JensK · December 8, 2017, 3:04pm

I have created a virtual table containing the plain text of my notes. Searching it works very well and fast. Thanks again for the suggestion!
And now there remains some nasty work preparing different statements for several search scenarios…

Jean-Yves_Pochez · December 8, 2017, 3:31pm

I’m really not sure FTS is unicode characters aware… had a lot of pain to search for french characters with sqlite. ended up with storing french text, and ascii text for the search.

JensK · December 9, 2017, 6:30pm

Full text search works fine now, but I have a subsequent problem with Ancient Greek text: Ancient Greek has six diacritics that can occur in different combinations; for example the letter “?” can appear in 24 variations: “???”. That leads to the problem that for example a search for “???” (meta) shows all results containing the lemma “???”, but not Aristotle’s Metaphysics, because in the original title “??? ?? ???” there is a different diacritic/unicode.
So my idea was to save the plain text string without diacritics; e.g. “???” and “???” are both saved as “???”, so that the search string “???” would find both variants.

And here is the problem: How can I replace the characters in the fastest way? ConvertEncoding won’t work as the characters a not ASCII symbols. Looping through the string and checking/replacing with arrays of Greek letters character for character is too slow (I have to save documents with 10.000 words and more); a MemoryBlock to speed things up doesn’t seem to work either (but I am not really familiar with that), since the Greek characters are not bytes but unicodes of different size.
I found out, that AscB returns 205 or 206 for “normal” Greek characters and 225 for any Greek character with any diacritics, so at least checking if a letter needs replacing is done reasonably easy; but the replacing routine still gives me headaches.

So what would be the fastest way to replace the characters before saving the plain text to the database?

Carlo_Rubini · December 10, 2017, 6:59am

I temporarily strip the text and the search-word of accents, sub-iotas etc.
For my very simple purposes, it works. Something like this:

myFullText = resolveGreek(myFullText)
mySearchWord = resolveGreek(mySearchWord)
//now search for mySearchWord in myFullText

Public Const kGreekVowelSource as String = ??,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?, etc
Public Const kGreekVowelStrip as String = ??,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?, etc

Public Function resolveGreek(s as String) as string
s = lowercase(s)
dim orig() as String = Split(kGreekVowelSource, “,”)
dim destn() as String = Split(kGreekVowelStrip, “,”)
for i as Integer = 0 to orig.Ubound
s = s.ReplaceAll(orig(i), destn(i))
next
Return s
End Function

JensK · December 10, 2017, 8:13am

I tried a similar approach with ReplaceAll at first, but only replacing the A vowels in a Greek text with 6500 words took almost a second, which is way to long for my needs.

Just now I tried a different approach: I made a Dictionary with all A vowels; then I looped through the text and replaced the A characters that needed replacing:

Function ErsetzeZeichen(ParamString As String) As String Dim Arr() As String = ParamString.Split("") Dim Ubd As Integer = Arr.Ubound For i As Integer = 0 To Ubd If Ersetzungstabelle.HasKey(Arr(i)) Then Arr(i) = Ersetzungstabelle.Value(Arr(i)) End If Next Return Join(arr, "") End Function
Now it took only 22 milliseconds to replace all A vowels in the above mentionend text. I think this might be a suitable way

JensK · December 10, 2017, 7:50pm

Supplement: The above written method really seems to be the way to go for me. After completing the dictionary for all (167) Greek characters, I did some test runs with two different strings (pure Greek):
6695 words; 4172 characters replaced in 15.6 milliseconds
10050 words; 9434 characters replaced in 31.7 milliseconds
There may be more to get, but for now this is fast enough for me