I have some scanned texts where the OCR has gone wrong and left me with words that have a space inserted before the last character, so house = ‘hous e’ and computer = ‘compute r’, but this is not always the case. To fix this I want to use string.indexof and search for chr(32) + any single character. I noticed that wildcards are not supported. Any idea how to do this?
You could use RegEx instead.
A loop from A to Z?
// 65 to 90 are the ASCII codes for A to Z
for x as integer = 65 to 90
  TheString = TheString.replaceall(" " + chr(x), chr(x))
next
Nice, very nice.
I never would have thought of that…
I would search for a space in every word. If a word contains one, eliminating it is a special case(?).
Revision: ‘car s’ becomes ‘carS’ using the first code above.
So it actually needs to be changed to lower case letters:
// 97 to 122 are the ASCII codes for a to z
for x as integer = 97 to 122
  TheString = TheString.replaceall(" " + chr(x), chr(x))
next
How do you define a word?
I think it is when it has a space on the left and on the right, so…
This would replace the first character of every word, essentially removing all spaces.
I would only approach this with regex. Even then, I’d have to be extra careful not to replace lone I’s or A’s. Something like
Good point, which I had missed mentally.
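A minimal sketch of the regex idea with the lone-letter caveat, written in Python rather than the language used in this thread. The pattern and the helper name `join_stray_letters` are assumptions for illustration, not the pattern any poster proposed. It joins a single lowercase letter to the preceding word only when that letter stands alone, and it skips the article "a"; uppercase letters, including the pronoun "I", are never matched.

```python
import re

# Hypothetical pattern: a space between a letter and a lone lowercase letter.
# [b-z] leaves the article "a" alone; the lookahead requires that the letter
# is not the start of a longer word.
STRAY_LETTER = re.compile(r"(?<=[A-Za-z]) ([b-z])(?![A-Za-z])")

def join_stray_letters(text: str) -> str:
    """Remove the space in front of every stray single lowercase letter."""
    return STRAY_LETTER.sub(r"\1", text)

print(join_stray_letters("compute r and hous e"))
print(join_stray_letters("I saw a car s"))
```

Because "a" is excluded, a word that really lost a trailing "a" stays broken, and any genuine one-letter token other than "a" or "I" would be merged by mistake, so the output still needs proofreading.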
Well, my approach would be: load the text into the TextArea, look for the first space+character pair, put the cursor there, and halt. The user (me) decides whether it should be corrected by deleting the space. If so, delete the space and continue to the next space+char pair. This combined action would go in a method bound to a hotkey.
A fully automatic replacement is not possible; that generates too many errors. A text sample:
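The review-then-delete workflow described above could be sketched roughly as follows, in Python rather than with a TextArea and hotkey. The names `find_candidates` and `remove_spaces` and the candidate pattern are assumptions for illustration; the user interface part (moving the cursor, waiting for a decision) is left out.

```python
import re

# Hypothetical candidate rule: a space that sits between a letter and a
# lone lowercase letter. Each match is only a suggestion for the reviewer.
CANDIDATE = re.compile(r"(?<=[A-Za-z]) (?=[a-z](?![A-Za-z]))")

def find_candidates(text: str) -> list[int]:
    """Return the index of every space that precedes a lone lowercase letter."""
    return [m.start() for m in CANDIDATE.finditer(text)]

def remove_spaces(text: str, approved: list[int]) -> str:
    """Delete only the spaces at the approved indices."""
    drop = set(approved)
    return "".join(ch for i, ch in enumerate(text) if i not in drop)

sample = "wok e hi m u p"
positions = find_candidates(sample)
# Pretend the reviewer approved only the first two suggestions.
print(remove_spaces(sample, positions[:2]))
```

Collecting the indices first and deleting afterwards keeps the positions valid while the reviewer works through them, which a delete-as-you-go loop would have to compensate for.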
Mr . Jackson , bein g clos e t o th e hom e o f W . E . Wilsie , wok e hi m u p i n tim e t o se e th e light s o f the machin e befor e i t disappeare d Th e sam e night , H . E . Allatt , postmaste r a t Imperial , wa s awakene d fro m slee p b y a brigh t light shinin g int o his room . Ther e wa s n o moon , th e light wa s though t t o b e a fire, an d Mr . Allat t ros e t o investigate , bu t n o fire wa s found . Lookin g a t his watch , th e tim e wa s discovere d t o b e 1:30 o’clock , an d i t i s believe d tha t th e brilliant ligh t wa s cause d b y the searchligh t fro m this mysteriou s airship .
[Edit] Now with a fresh look: the string to search for should not be space+char, but space+char+space, and that should be replaced with char+space.
hous e wa s on fir e.
although even that can’t save ‘hi m u p i n tim e’ from becoming ‘himupin time’
although that has become a bit trickier than ‘words that have a space inserted before the last character’
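To make the limits of the space+char+space replacement concrete, here is a small Python sketch of it (not the language used in this thread; `join_spaced_letters` is a hypothetical name). Note that Python's `str.replace` is case-sensitive, so only lowercase letters are touched.

```python
import string

def join_spaced_letters(text: str) -> str:
    """Replace space+letter+space with letter+space for every lowercase letter."""
    for ch in string.ascii_lowercase:
        text = text.replace(" " + ch + " ", ch + " ")
    return text

print(join_spaced_letters("hous e wa s on fir e."))
```

On the sentence quoted above this repairs "hous e" and "wa s" but leaves "fir e." untouched, because the stray letter is followed by a full stop rather than a space.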
and what will happen with:
Is this a one time ‘problem’?
If possible try to fix the source (scan again with other OCR software/settings).
These are scans and OCR of out-of-print books, so I have the PDFs. There are many pages with these strange errors.
Do you have the scans or just the OCR?
If you have the scans, then you may try different OCR software/settings and see if you can get better OCR results.
Just the text unfortunately…