String.IndexOf wildcard

Alexander_van_der_Linden · May 17, 2022, 8:19am

I have some scanned texts where the OCR has gone wrong and left me with words that have a space inserted before the last character, so house = ‘hous e’ and computer = ‘compute r’, but this is not always the case. To fix this I want to use String.IndexOf and search for Chr(32) + any single character. I noted that wildcards are not supported. Any idea how to do this?

ChristopheDV · May 17, 2022, 8:28am

You could use RegEx instead.

Jeff_Tullin · May 17, 2022, 9:53am

A loop from A to Z?

// Loop over the character codes for A (65) to Z (90) and drop the space in front of each letter
For x As Integer = 65 To 90
  TheString = TheString.ReplaceAll(" " + Chr(x), Chr(x))
Next

Emile_Schwarz · May 17, 2022, 10:55am

Nice, very nice.

I never would have thought of that…

Carsten_Belling · May 17, 2022, 11:01am

I would search for a space in every word. If a word contains one, the elimination is special(?).

Jeff_Tullin · May 17, 2022, 11:01am

Revision: ‘car s’ becomes ‘carS’ using the first code above.
So it actually needs to be changed to lower case letters:

For x As Integer = 97 To 122
  TheString = TheString.ReplaceAll(" " + Chr(x), Chr(x))
Next

Emile_Schwarz · May 17, 2022, 11:14am

How do you define a word ?

I think it is when it has a space on its left and on its right, so…

Thom_McGrath · May 17, 2022, 11:20am

This would replace the first character of every word, essentially removing all spaces.

I would only approach this with regex. Even then, I’d have to be extra careful not to replace lone I’s or A’s. Something like (\s\D+)\s([b-hj-z]\s) perhaps.

Jeff_Tullin · May 17, 2022, 1:04pm

Good point, which I had missed mentally.

Alexander_van_der_Linden · May 17, 2022, 1:22pm

Well, my approach would be: load the text into the TextArea, start by looking for the first space+character pair (put the cursor there) and halt. The user (me) decides whether it should be corrected by deleting the space. If so, delete the space and continue to the next space+char pair. This combined action is to be put in a method and bound to a hotkey.
A fully automatic replacement is not possible; that generates too many errors. A text sample:
Mr . Jackson , bein g clos e t o th e hom e o f W . E . Wilsie , wok e hi m u p i n tim e t o se e th e light s o f the machin e befor e i t disappeare d Th e sam e night , H . E . Allatt , postmaste r a t Imperial , wa s awakene d fro m slee p b y a brigh t light shinin g int o his room . Ther e wa s n o moon , th e light wa s though t t o b e a fire, an d Mr . Allat t ros e t o investigate , bu t n o fire wa s found . Lookin g a t his watch , th e tim e wa s discovere d t o b e 1:30 o’clock , an d i t i s believe d tha t th e brilliant ligh t wa s cause d b y the searchligh t fro m this mysteriou s airship .

[Edit] Now, with a fresh look: the string to search for should not be space+char, but space+char+space, and that should be replaced with char+space.
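
The find-and-halt step described above could be sketched roughly like this. It is only a sketch under assumptions: `NextSuspect` is a made-up helper name, `TextArea1` is an assumed control name, and it assumes that positions in the String line up with positions in the TextArea (line endings may break that on some platforms). It uses plain String.IndexOf to find each space and then checks the two characters after it, which stands in for the wildcard.

```xojo
// Hypothetical helper (not a Xojo built-in): returns the zero-based position
// of the next space that is followed by one lowercase letter and another
// space, searching from startPos. Returns -1 when nothing is found.
Function NextSuspect(source As String, startPos As Integer) As Integer
  If startPos < 0 Or startPos >= source.Length Then Return -1

  Var pos As Integer = source.IndexOf(startPos, " ")
  While pos >= 0 And pos + 2 < source.Length
    // Asc gives the character code, so the test is case-sensitive
    Var code As Integer = source.Middle(pos + 1, 1).Asc
    If code >= 97 And code <= 122 And source.Middle(pos + 2, 1) = " " Then
      Return pos
    End If
    pos = source.IndexOf(pos + 1, " ")
  Wend

  Return -1
End Function
```

In the hotkey handler, something along these lines would select the suspicious space so the user can decide:

```xojo
// Continue after the current selection (assumed control name: TextArea1)
Var from As Integer = TextArea1.SelectionStart + TextArea1.SelectionLength
Var pos As Integer = NextSuspect(TextArea1.Text, from)
If pos >= 0 Then
  TextArea1.SelectionStart = pos
  TextArea1.SelectionLength = 1 // select just the space
End If
// To accept the correction: TextArea1.SelectedText = ""
```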

Jeff_Tullin · May 17, 2022, 2:40pm

or space+char+period

hous e wa s on fir e.

Although even that can’t save ‘hi m u p i n tim e’ from becoming ‘himupin time’.
And that has become a bit trickier than ‘words that have a space inserted before the last character’.
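
For reference, the space+char+space and space+char+period rules from the posts above might be sketched with the RegEx class as follows. This is an assumption-laden sketch, not a tested solution: the variable names are made up, the character class leaves out a and i (following the caution about lone I’s and A’s), and the lookahead `(?=[ .,])` is my own choice so that the following space or punctuation mark is checked but not consumed. As far as I know, Xojo’s RegEx is case-insensitive unless told otherwise, hence the CaseSensitive line.

```xojo
// Sketch only: joins a lone lowercase letter to the word in front of it
Var re As New RegEx
re.SearchPattern = " ([b-hj-z])(?=[ .,])" // space, one letter, then space/period/comma
re.ReplacementPattern = "\1"              // keep the letter, drop the space before it
re.Options.CaseSensitive = True           // leave initials such as "W ." alone
re.Options.ReplaceAllMatches = True

Var fixed As String = re.Replace("hous e wa s on fir e.")
```

Runs of single letters, as in the ‘hi m u p i n tim e’ example, will still be glued together wrongly, so this does not remove the need for a manual check.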

AlbertoD · May 17, 2022, 3:15pm

and what will happen with:

Is this a one time ‘problem’?

If possible try to fix the source (scan again with other OCR software/settings).

Alexander_van_der_Linden · May 17, 2022, 4:28pm

These are scans and OCR of out-of-print books, so I have the PDFs. There are many pages with these strange errors.

AlbertoD · May 17, 2022, 4:39pm

Do you have the scans or just the OCR?
If you have the scans, then you may try different OCR software/settings and see if you can get better OCR results.

Alexander_van_der_Linden · May 17, 2022, 6:31pm

Just the text unfortunately…