Detecting Non-Roman Characters in Strings

  1. 2 weeks ago

    Charles K

    Jun 23 Testers, Xojo Pro Japan, Kanagawa-ken, Yamato-sh...

    I am cataloging a number of multilingual books. Some books are in the Roman alphabet, but quite a few are in Chinese, Japanese, Korean, Malay, or Russian, that is, in non-Roman writing systems.

    Under the old standard, a title might look like this.

    Record 1
    100 ‡a 佐藤, 秀才.
    242 ‡a Handbook of modern Japanese grammar : including lists of words and expressions with English equivalents for reading aid / Shusei Sato. eng
    245 ‡a 口語日本文法便覧 : 日本語特殊表現とその英語相当訳及び例題付 / 佐藤秀才.
    260 ‡a 東京 : 北星堂, 1999.

    700 ‡a Sato, Shusei.
    740 ‡a Kogo Nihon bumpo beran : Nihongo tokushu hyogen to sono Eigo soto yaku oyobi reidai tsuki / Sato Shusei.

    About fifteen years or so ago, a new standard emerged. The cataloging record above is supposed to be composed as below.

    Record 2
    100 ‡6 880-01 ‡a Sato, Shusei.
    242 ‡a Handbook of modern Japanese grammar : including lists of words and expressions with English equivalents for reading aid / Shusei Sato. eng
    245 ‡6 880-02 ‡a Kogo Nihon bumpo beran : Nihongo tokushu hyogen to sono Eigo soto yaku oyobi reidai tsuki / Sato Shusei.
    264 ‡6 880-03 ‡a Tokyo : Hokuseido [Press], 1999.

    880 ‡6 100-01/$1 ‡a 佐藤, 秀才.
    880 ‡6 245-02/$1 ‡a 口語日本文法便覧 : 日本語特殊表現とその英語相当訳及び例題付 / 佐藤秀才.
    880 ‡6 264-03/$1 ‡a 東京 : 北星堂, 1999.

    Only the Roman alphabet is supposed to appear in the 1XX through 7XX fields and the 9XX fields, that is, the numbered fields other than 880, and only non-Roman writing systems are supposed to appear in 880. In fact, 880 is the only field in which non-Roman writing systems are supposed to appear.

    It's a somewhat mechanical process to move and copy the fields and subfields, and to add the ‡6 subfields so that the Roman fields end up linked to their non-Roman counterparts. That is not my question.

    My question is this: Is it possible in Xojo to tell a non-Roman character (or string) from a Roman character (or string)? I am familiar with the ASC function, but I don't know whether ASC fails when it is given a character whose code point falls outside 0-127, or outside 0-255 if letters with diacritical marks are allowed for. If ASC handles characters outside the Roman alphabet, that greatly reduces the complexity of the larger task.

    Thanks for your help in this matter.

    Footnote on MARC (MAchine Readable Cataloging) cataloging

    100 signals the main author.
    242 signals the title in translation.
    245 signals the full title as it appears on the title page.
    260 and 264 signal the publisher, distributor, etc.
    700 signals an additional contributor (author, compiler, editor, etc.) or a variation in the spelling of the contributor's name.
    740 signals an additional title or the title of a part within the work. (246 fields do this, too, but the criteria for a 740 field vs. 246 field are different.)
    880 signals that one of the above fields, or another field, appears in a non-Roman writing system. 880 is the only field in which non-Roman characters are supposed to appear. The ‡6 points to the corresponding 0XX - 9XX field in the Roman alphabet.

    I recognize MARC cataloging is cumbersome and it's even been called outdated, but it's what I have to work with at present.

    P. S. The encoding is not a problem.

    P. P. S. Sorry for the long-windedness.

    Thanks again.

  2. Beatrix W

    Jun 23 Testers, Third Party Store Europe (Germany)

    From a forum member I got this lovely code:

    dim myDataLen As Integer = myData.Length
    dim p As Integer = InStr(myData,&uFFFD)
    if p>0 then return "Illegal character at position "+str(p)
    dim e,e2,stDev,skippedCharRatio As Double = 0
    dim codePoint,nChars,mean,nonNumeric As Integer = 0
    dim language As String = ""
    'calculate sum and sum of squares of codePoints
    for i As Integer = 1 to myDataLen
      codePoint=asc(mid(myData,i,1))
      'skip numerals, punctuation and chars above &h700
      if codePoint > 64 then
        nonNumeric=nonNumeric+1
        if codePoint< &h700 and not(codePoint>&h7F and codePoint<&hC0) then 
          e=e+codePoint
          e2=e2+codePoint*codePoint
          nChars=nChars+1
        end if
      end if
    next
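    'Now calculate mean and standard deviation
    'Guard: if no characters were counted (e.g. text entirely above &h700), avoid dividing by zero
    if nChars = 0 then return "Unknown"
    e=e/nChars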
    e2=e2/nChars
    stDev = sqrt(e2-e^2)
    mean = e+0.5 'round to integer
    skippedCharRatio=nonNumeric/nChars
    if skippedCharRatio>2 then
      'This happens if there is a large percentage of skipped characters
      language = "Unknown"
    ElseIf mean >= &h0041 and mean <= &h007A then
      language = "Latin"
    ElseIf mean >= &h007B and mean <= &h00AF then
      'This occurs if a large percentage of characters are in the &h0080..&h00FF range
      'If the mean value is close to the low end of the 7A..FF range then it may just
      'be highly accented Latin
      language = "Accented Latin"
    ElseIf mean >= &h00B0 and mean <= &h00FF and stDev <40 then
      'This occurs if a large percentage of characters are in the &h0080..&h00FF range
      'If the standard deviation is small and the mean is in the middle or
      'upper end of this range, then mis-encoding is more likely.
      language = "Mis-encoded"
    ElseIf mean >= &h007B and mean <= &h00FF then
      'If the mean is in the range &h0080..&h00FF, and neither of the two preceding
      'conditions apply, then there's not enough info to make a guess.
      language = "Unknown"
    ElseIf mean >= &h0391 and mean <= &h03c9 then
      language = "Greek"
    ElseIf mean >= &h0410 and mean <= &h052F then
      language = "Cyrillic"
    ElseIf mean >= &h0530 and mean <= &h058F then
      language = "Armenian"
    ElseIf mean >= &h0590 and mean <= &h05FF then
      language = "Hebrew"
    ElseIf mean >= &h0600 and mean <= &h06FF then
      language = "Arabic"
      'More ElseIf cases can be included here for other languages
    else
      language = "Unknown"
    end if
    'language=language +", Mean = &h"+hex(mean)+", StDev = "+str(stDev) +", SkpRatio = "+str(skippedCharRatio)+EndOfLine
    return language
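
    On the narrower question of whether Asc copes with characters outside 0-127: the loop above already depends on it, since it compares the result of asc() against values such as &h0391. A minimal sketch of a yes/no check built on the same idea follows. The function name ContainsNonRoman and the default ceiling of &h24F (the end of the Latin Extended-B block) are assumptions for illustration, not part of the code above, and the sketch assumes the string's encoding is defined (for example UTF-8) so that Asc returns a code point rather than a byte value.

    ```xojo
    // Hypothetical helper: returns True if any character's code point
    // is above the given ceiling (default &h24F, end of Latin Extended-B).
    Function ContainsNonRoman(s As String, ceiling As Integer = &h24F) As Boolean
      For i As Integer = 1 To Len(s)
        // Asc gives the code point of the single character at position i
        If Asc(Mid(s, i, 1)) > ceiling Then Return True
      Next
      Return False
    End Function
    ```

    With this ceiling, a symbol such as ‡ (U+2021) or a Vietnamese letter from Latin Extended Additional (U+1E00 to U+1EFF) would also count as non-Roman, so the threshold may need adjusting for real records.
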
  3. Charles K

    Jun 23 Testers, Xojo Pro Japan, Kanagawa-ken, Yamato-sh...

    Wow! I did not expect such a tidy answer so soon!

    It addresses the writing systems "recognized" by MARC cataloging: Roman, Greek, Cyrillic, Arabic, Chinese (and Japanese and Korean), and Hebrew. I've seen references to a few others, but I don't remember where off the top of my head.

    Many thanks!

  4. Edited 2 weeks ago

    Years ago Kem (thank you, Kem) offered this code, which I use successfully in my apps.

    The two parameters are the string whose language should be detected and the name of the language you want it compared to.

    if returnLanguage(mString, "Hebrew") or returnLanguage(mString, "Syriac") then
    //detected
    else
    //not detected
    end if
    
    Public Function returnLanguage(txt as string,mLang as String) as Boolean
      //if txt = chr(0) then Return false
      dim rx as new RegEx
      rx.SearchPattern = "\A[\p{" + mLang + "}\PL]+\z"
      
      if rx.Search(txt) is nil then
        Return false
      end if
      Return true
    End Function
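
    For the cataloging task, the same regular-expression approach can be turned around to ask whether a string contains any letter that is not in the Latin script. This is only a sketch: the function name HasNonLatinLetter and the variable fieldText are hypothetical, and it assumes Xojo's RegEx accepts \p{Latin} and \p{L} the same way it accepts the \p{Hebrew} property used above.

    ```xojo
    // Hypothetical helper: True if txt contains at least one letter
    // that does not belong to the Latin script.
    Function HasNonLatinLetter(txt As String) As Boolean
      Dim rx As New RegEx
      // (?!\p{Latin}) rejects Latin letters; \p{L} then matches any other letter
      rx.SearchPattern = "(?!\p{Latin})\p{L}"
      Return rx.Search(txt) <> Nil
    End Function

    // Possible use (fieldText is a placeholder for one subfield's text):
    // If HasNonLatinLetter(fieldText) Then
    //   ' candidate for an 880 field
    // End If
    ```

    Digits, punctuation, and spaces are not letters, so they never trigger a match on their own.
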
  5. Charles K

    Jun 24 Testers, Xojo Pro Japan, Kanagawa-ken, Yamato-sh...

    I'll have to try this, too.

    Many thanks!
