Finding bi-directional text

Hi…

I have an XML file that is over 10 MB. If I open it with Oxygen XML Editor, it tells me that it includes some bi-directional text (Hebrew, Arabic, etc.) but does not identify it, and I need to find it.

Any ideas on how I can do that?

Keep in mind that I do not know which language or if the characters are composed or decomposed.

Try this RegEx pattern:

\P{Latin}

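That means “any character that is not in the Latin script”. Note that this also matches spaces, digits, and punctuation, since none of those belong to the Latin script.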

(No idea if that will do it for you, just throwing out an idea.)

Let me revise that:

(?=\pL)\P{Latin}+

That means, “the upcoming character is a letter, but it is not Latin”, and it will match the entire run of non-Latin characters that starts there.

The Graphics class has a TextDirection function. You could try passing substrings to that to see what their direction is.
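
Along the same lines as checking the direction of substrings, here is a rough sketch in Python (not the TextDirection function itself) that asks the Unicode database for the bidi class of every character and reports the right-to-left ones. The name find_rtl and the set of classes treated as right-to-left are my own choices for illustration, so adjust them as needed. Because it looks at code points one at a time, it should not matter whether the text is composed or decomposed: the base letters keep their right-to-left class either way.

```python
import unicodedata

# Bidi classes treated here as "right-to-left": strong RTL letters (R, AL),
# Arabic numbers (AN), and the explicit RTL embedding/override/isolate controls.
# This selection is an assumption; widen or narrow it as needed.
RTL_CLASSES = {"R", "AL", "AN", "RLE", "RLO", "RLI"}

def find_rtl(text):
    """Yield (line_no, col_no, char, bidi_class) for each right-to-left character."""
    # Split on "\n" only, so line numbers line up with what an editor shows.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            bidi = unicodedata.bidirectional(ch)
            if bidi in RTL_CLASSES:
                yield line_no, col_no, ch, bidi
```

To use it on a file, read the file as text (for example with open(path, encoding="utf-8").read(), assuming the XML is UTF-8) and loop over find_rtl(...), printing the line and column of each hit.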

Thanks for the suggestions

Kem: (?=\pL)\P{Latin}+

… did find some instances of the “micro sign” character (U+00B5, UTF-8: C2 B5)… but nothing else.
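
For what it is worth, the micro sign matching makes sense: it is a letter whose script is Common rather than Latin, so it satisfies the pattern, but it is not right-to-left. A quick Python check of its properties (just a sketch to confirm what the character is, nothing specific to the file in question):

```python
import unicodedata

ch = "\u00b5"  # the character the pattern matched

info = {
    "name": unicodedata.name(ch),            # official Unicode name
    "category": unicodedata.category(ch),    # general category ("Ll" = lowercase letter)
    "bidi": unicodedata.bidirectional(ch),   # bidi class ("L" = left-to-right)
    "utf8": ch.encode("utf-8").hex().upper() # UTF-8 bytes as hex
}
print(info)
```

So the matches on this character are not the bi-directional text the editor is warning about.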

Kevin: I will give that a try.