Detecting Non-Roman Characters in Strings

Charles_Kelley · June 23, 2020, 8:10am

I am cataloging a number of multilingual books. Some are in the Roman alphabet, but quite a few are in Chinese, Japanese, Korean, Malay, or Russian, that is, in a non-Roman writing system.

Under the old standard, a title might look like this.

	Record 1
		100  a ??, ??.
		242  a Handbook of modern Japanese grammar : including lists of words and expressions with English equivalents for reading aid / Shusei Sato. eng
		245  a ???????? : ???????????????????? / ????.
		260  a ?? : ???, 1999. 
	
		700  a Sato, Shusei.
		740  a Kogo Nihon bumpo beran : Nihongo tokushu hyogen to sono Eigo soto yaku oyobi reidai tsuki / Sato Shusei.
	
About fifteen years or so ago, a new standard emerged. The cataloging record above is supposed to be composed as below.

	Record 2
		100  6 880-01 a Sato, Shusei.
		242  a Handbook of modern Japanese grammar : including lists of words and expressions with English equivalents for reading aid / Shusei Sato.eng
		245  6 880-02 a Kogo Nihon bumpo beran : Nihongo tokushu hyogen to sono Eigo soto yaku oyobi reidai tsuki / Sato Shusei.
		264  6 880-03 a Tokyo : Hokuseido [Press], 1999.
	
		880  6 100-01/$1 a ??, ??.
		880  6 245-02/$1 a ???????? : ???????????????????? / ????.
		880  6 264-03/$1 a ?? : ???, 1999.

Only the Roman alphabet is supposed to appear in the 1XX through 7XX fields and in 9XX, that is, in the fields tagged with a number between 000 and 999 other than 880, and only non-Roman writing systems are supposed to appear in 880. In fact, 880 is the only field in which non-Roman writing systems are supposed to appear.

It's a somewhat mechanical process to move and copy the fields and subfields, and to add the 6 subfields so that the Roman fields end up corresponding with their non-Roman counterparts. That is not my question.

My question is this: Is it possible in Xojo to tell a non-Roman character (or string) from a Roman character (or string)? I am familiar with the ASC function, but I don't know whether ASC fails on a character whose code point falls outside 0-127 (or maybe 0-255, to allow for letters with diacritical marks). If ASC handles characters outside the Roman alphabet, that greatly reduces the complexity of the larger task.

Thanks for your help in this matter.

Footnote on MARC (MAchine Readable Cataloging) cataloging

100 signals the main author.
242 signals the title in translation.
245 signals the full title as it appears on the title page.
260 and 264 signal the publisher, distributor, etc.
700 signals an additional contributor (author, compiler, editor, etc.) or a variation in the spelling of the contributor's name.
740 signals an additional title or the title of a part within the work. (246 fields do this, too, but the criteria for a 740 field vs. 246 field are different.)
880 signals that one of the above fields, or another field, appears in a non-Roman writing system. 880 is the only field in which non-Roman characters are supposed to appear. The 6 subfield points to the corresponding 0XX - 9XX field in the Roman alphabet.
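
The linkage carried in the 6 subfield can be taken apart mechanically. This is a minimal sketch, assuming the subfield value has the shape shown in Record 2 (for example "245-02/$1": linked tag, occurrence number, then a script code after the slash). ParseLinkage and its parameter names are hypothetical, and NthField is Xojo's classic 1-based field-splitting function.

```xojo
// Hypothetical helper: split a subfield 6 value such as "245-02/$1" into its parts.
Sub ParseLinkage(linkage As String, ByRef tag As String, ByRef occurrence As String, ByRef script As String)
  Dim head As String = NthField(linkage, "/", 1) // e.g. "245-02"
  script = NthField(linkage, "/", 2)             // e.g. "$1"; empty when no script code is present
  tag = NthField(head, "-", 1)                   // e.g. "245"
  occurrence = NthField(head, "-", 2)            // e.g. "02"
End Sub
```

In the 880 fields of Record 2 the script code is $1, which MARC 21 uses for Chinese, Japanese, and Korean.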

I recognize MARC cataloging is cumbersome and it's even been called outdated, but it's what I have to work with at present.

P. S. The encoding is not a problem.

P. P. S. Sorry for the long-windedness.

Thanks again.

Beatrix_Willius · June 23, 2020, 8:46am

From a forum member I got this lovely code:

[code]
dim myDataLen As Integer = myData.Length
dim p As Integer = InStr(myData,&uFFFD)
if p>0 then return "Illegal character at position "+str(p)

dim e,e2,stDev,skippedCharRatio As Double = 0
dim codePoint,nChars,mean,nonNumeric As Integer = 0
dim language As String = ""

'calculate sum and sum of squares of codePoints
for i As Integer = 1 to myDataLen
  codePoint=asc(mid(myData,i,1))
  'skip numerals, punctuation and chars above &h700
  if codePoint > 64 then
    nonNumeric=nonNumeric+1
    if codePoint< &h700 and not(codePoint>&h7F and codePoint<&hC0) then
      e=e+codePoint
      e2=e2+codePoint*codePoint
      nChars=nChars+1
    end if
  end if
next

'Nothing was counted (empty string, or only skipped characters),
'so return early rather than divide by zero below
if nChars = 0 then return "Unknown"

'Now calculate mean and standard deviation
e=e/nChars
e2=e2/nChars
stDev = sqrt(e2-e^2)
mean = e+0.5 'round to integer
skippedCharRatio=nonNumeric/nChars

if skippedCharRatio>2 then
  'This happens if there is a large percentage of skipped characters
  language = "Unknown"
ElseIf mean >= &h0041 and mean <= &h007A then
  language = "Latin"
ElseIf mean >= &h007B and mean <= &h00AF then
  'This occurs if a large percentage of characters are in the &h0080..&h00FF range
  'If the mean value is close to the low end of the 7A..FF range then it may just
  'be highly accented Latin
  language = "Accented Latin"
ElseIf mean >= &h00B0 and mean <= &h00FF and stDev <40 then
  'This occurs if a large percentage of characters are in the &h0080..&h00FF range
  'If the standard deviation is small and the mean is in the middle or
  'upper end of this range, then mis-encoding is more likely.
  language = "Mis-encoded"
ElseIf mean >= &h007B and mean <= &h00FF then
  'If the mean is in the range &h0080..&h00FF, and neither of the two preceding
  'conditions apply, then there's not enough info to make a guess.
  language = "Unknown"
ElseIf mean >= &h0391 and mean <= &h03c9 then
  language = "Greek"
ElseIf mean >= &h0410 and mean <= &h052F then
  language = "Cyrillic"
ElseIf mean >= &h0530 and mean <= &h058F then
  language = "Armenian"
ElseIf mean >= &h0590 and mean <= &h05FF then
  language = "Hebrew"
ElseIf mean >= &h0600 and mean <= &h06FF then
  language = "Arabic"
'More ElseIf cases can be included here for other languages
else
  language = "Unknown"
end if

'language=language +", Mean = &h"+hex(mean)+", StDev = "+str(stDev) +", SkpRatio = "+str(skippedCharRatio)+EndOfLine
return language
[/code]
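
The routine above skips every code point at or above &h700, so it has no branch for Chinese, Japanese, or Korean text. For the narrower question of telling Roman from non-Roman, a per-character range test may be enough. This is a minimal sketch, assuming Asc returns the Unicode code point for a UTF-8 string (as the code above relies on). The names IsRomanCodePoint and HasNonRoman are hypothetical, and the ranges treated as Roman are an assumption to adjust for your records.

```xojo
// Hypothetical helper: True if the code point is in a range treated here as Roman.
Function IsRomanCodePoint(cp As Integer) As Boolean
  If cp <= &h024F Then Return True                  // Basic Latin through Latin Extended-B
  If cp >= &h0300 And cp <= &h036F Then Return True // combining diacritical marks
  If cp >= &h1E00 And cp <= &h1EFF Then Return True // Latin Extended Additional
  If cp >= &h2000 And cp <= &h206F Then Return True // general punctuation (dashes, curly quotes)
  Return False
End Function

// Hypothetical helper: True if any character of s falls outside those ranges.
Function HasNonRoman(s As String) As Boolean
  For i As Integer = 1 To Len(s)
    If Not IsRomanCodePoint(Asc(Mid(s, i, 1))) Then Return True
  Next
  Return False
End Function
```

A field that makes HasNonRoman return True would be a candidate for an 880. The combining-marks range is included because romanized text can store a diacritic as a separate character following its base letter.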

Charles_Kelley · June 23, 2020, 9:45am

Wow! I did not expect such a tidy answer so soon!

It addresses the writing systems "recognized" by MARC catalog: Roman, Greek, Cyrillic, Arabic, Chinese (and Japanese and Korean), and Hebrew. I've seen references to a few others, but I don't quite remember where off the top of my head.

Many thanks!

Carlo_Rubini · June 23, 2020, 10:09am

Years ago Kem (thank you, Kem) offered this code, which I successfully use in my apps.

The two parameters are the string whose language should be detected and the name of the language you want it compared to.

[code]if returnLanguage(mString, "Hebrew") or returnLanguage(mString, "Syriac") then
  //detected
else
  //not detected
end if

Public Function returnLanguage(txt as string,mLang as String) as Boolean
  //if txt = chr(0) then Return false
  dim rx as new RegEx
  //true only if the whole string consists of letters in the mLang script
  //plus non-letters (\PL: digits, punctuation, spaces)
  rx.SearchPattern = "\A[\p{" + mLang + "}\PL]+\z"

  if rx.Search(txt) is nil then
    Return false
  end if
  Return true
End Function
[/code]
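
Applied to the original question, the same pattern can test whether every letter in a field is Roman. This is a sketch under two assumptions: that Xojo's RegEx accepts the script names Latin and Common the same way it accepts Hebrew above, and that lettersInScripts and fieldText are hypothetical names introduced here for illustration.

```xojo
// Hypothetical helper: True if every letter in txt belongs to one of the named scripts.
Function lettersInScripts(txt As String, scripts() As String) As Boolean
  Dim classBody As String = "\PL" // non-letters (digits, punctuation, spaces) are always allowed
  For Each scriptName As String In scripts
    classBody = classBody + "\p{" + scriptName + "}"
  Next
  Dim rx As New RegEx
  rx.SearchPattern = "\A[" + classBody + "]+\z"
  Return rx.Search(txt) <> Nil
End Function

// Usage: "Common" admits script-neutral letters such as the
// modifier apostrophes that can appear in romanized text.
Dim allRoman As Boolean = lettersInScripts(fieldText, Array("Latin", "Common"))
```

Because the pattern requires at least one character, an empty string returns False. Japanese titles usually mix several scripts, so a single-script call such as returnLanguage(txt, "Han") would reject them; passing several script names at once is one way around that.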

Charles_Kelley · June 25, 2020, 2:34am

I’ll have to try this, too.

Many thanks!