Is there a built-in feature to remove unprintable chars?

Tim_Seyfarth · October 29, 2024, 7:51pm

Thank you all!

This did the trick! Since I am unfamiliar (at this point in time) with RegEx, I used the looping method discussed above. This is also for an API1 2019 R1.1 project. Hopefully soon it will be an API2 project!

Tim

Tim_Seyfarth · October 30, 2024, 12:48am

In both 2019 R1.1 and 2024 R3.1 I get a RegexSearchPatternException. The message is

unknown property name after \P or \p

Any idea where the problem lies?
Tim

Robert_Livingston · October 30, 2024, 6:55am

Var re As New RegEx
re.SearchPattern = "[\x00-\x1F\x7F-\x9F\xA0\xAD\x{200B}-\x{200D}\x{2028}\x{2029}\x{2060}]"
re.ReplacementPattern = ""
re.Options.ReplaceAllMatches = True
Var rawText As String = "naïve" + Chr(9) + "fiancé" + Chr(13) + "Rabbit jump" + Chr(11) + "ABC" + Chr(10) + "DEF" + Chr(127) + "GHI"
Var cleanText As String = re.Replace(rawText)

This is for UTF-8 encoded text, which is the most common encoding.

Explanation of the search pattern. Inside a character class each \xNN stands for one character, not one byte, so characters above U+00FF are written as code points with \x{...} rather than as UTF-8 byte sequences:

\x00-\x1F and \x7F-\x9F: Control characters in the ASCII and Latin-1 ranges.
\xA0: Non-breaking Space (U+00A0)
\xAD: Soft Hyphen (U+00AD)
\x{200B}: Zero Width Space (U+200B)
\x{200C}: Zero Width Non-Joiner (U+200C)
\x{200D}: Zero Width Joiner (U+200D)
\x{2028}: Line Separator (U+2028)
\x{2029}: Paragraph Separator (U+2029)
\x{2060}: Word Joiner (U+2060)

Limited testing on my part, but this code at least runs and the example in the code for rawText returns

naïvefiancéRabbit jumpABCDEFGHI

Robert_Livingston · October 30, 2024, 7:02am

If that is so you can also stop at 1

I agree that you could stop at 1 (If you accepted that Trim worked in the context of the problem)

the contents of the If does not replicate Trim

This I do not quite understand. Under what circumstances (for which characters) does the behavior of Trim differ from that of the loop being examined?

Ian_Kennedy · October 30, 2024, 9:41am

Trim would have to be forced to work the same way as the loop, which means the same decision would be coded in two places. If the rule changes in the future, both places have to change, all to avoid two iterations of a loop. Doesn’t seem worth it.

Tim_Hare · October 30, 2024, 6:00pm

Two differences between Trim and the loop approach:

If you test for <32 then you leave leading and trailing spaces that Trim would remove.

If you test for <33 then you remove spaces from the body of the string, which is probably not desirable.

Use Trim in combination with either the loop or regex. In either case, consider if you want to remove line endings, which so far, both do.
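
A minimal sketch of the loop-plus-Trim combination, assuming API 1 functions (Len, Mid, Asc, Join, Trim) so that it also compiles in 2019 R1.1. The name StripControlChars is hypothetical and not from any post in this thread. Note that the <32 test keeps DEL (127) and removes tabs and line endings from the body of the string.

```xojo
Function StripControlChars(source As String) As String
  // Hypothetical helper: keep every character whose code point is 32 or
  // above, then let Trim remove leading and trailing whitespace.
  Dim kept() As String
  Dim charCount As Integer = Len(source)
  For i As Integer = 1 To charCount
    Dim ch As String = Mid(source, i, 1)
    If Asc(ch) >= 32 Then
      kept.Append(ch) // spaces inside the text are kept (code point 32)
    End If
  Next
  Return Trim(Join(kept, ""))
End Function
```

Collecting the kept characters in an array and joining once avoids repeated string concatenation inside the loop.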

Oliver_Osswald · October 31, 2024, 1:06am

My first thought as well: define unprintable!

Just started to wrap my brain around this:

Control characters that do not produce visible symbols are special ASCII or Unicode characters that convey control information but have no graphic representation. Here are some of the most common ones:

1. ASCII Control Characters

Null (NUL) - \x00: Acts as a null terminator in languages like C.

Bell (BEL) - \x07: Produces a sound (bell) on older systems.

Backspace (BS) - \x08: Moves the cursor one position back.

Horizontal Tab (HT) - \x09: Inserts a horizontal tab space.

Line Feed (LF) - \x0A: Denotes a line break, commonly used on Unix.

Vertical Tab (VT) - \x0B: Inserts a vertical space, rarely used.

Form Feed (FF) - \x0C: Advances to the next page in printing.

Carriage Return (CR) - \x0D: Returns the cursor to the start of the line.

Escape (ESC) - \x1B: Used to initiate control sequences like terminal commands.

Space (SP) - \x20: Produces an invisible space.

2. Unicode Control Characters (Invisible Control Characters)

Zero Width Space (ZWSP) - \u200B: An invisible space, often used for word wrapping.

Zero Width Non-Joiner (ZWNJ) - \u200C: Prevents the joining of characters into ligatures.

Zero Width Joiner (ZWJ) - \u200D: Encourages ligature formation between characters.

Non-breaking Space (NBSP) - \u00A0: An invisible space that prevents a line break at its position.

Left-to-Right Mark (LRM) - \u200E: Forces left-to-right direction in bidirectional text.

Right-to-Left Mark (RLM) - \u200F: Forces right-to-left direction, as used in languages like Arabic.

Word Joiner (WJ) - \u2060: Prevents a line break between characters.

Soft Hyphen (SHY) - \u00AD: An optional hyphen that only appears if a word breaks.
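
Before deciding which of these to strip, it can help to see which ones a given string actually contains. The following is only a sketch, assuming API 1 functions (Len, Mid, Asc, Hex); DescribeCodePoints is a hypothetical name, not a built-in.

```xojo
Function DescribeCodePoints(source As String) As String
  // Hypothetical debugging helper: returns the code point of each
  // character in hexadecimal, separated by spaces.
  Dim parts() As String
  Dim charCount As Integer = Len(source)
  For i As Integer = 1 To charCount
    parts.Append("U+" + Hex(Asc(Mid(source, i, 1))))
  Next
  Return Join(parts, " ")
End Function
```

Running this on a sample of the real input shows whether the problem characters are plain ASCII controls or invisible Unicode characters such as U+200B, which in turn determines how wide the search pattern has to be.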

Arnaud_N · October 31, 2024, 7:51am

And, in your list, some are really useful in texts (like tab and end of line). Are they “unprintable”?
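
If tab and line endings should survive, the character class can simply skip them. A minimal sketch under that assumption; RemoveControlCharsKeepLayout is a hypothetical name, and which characters count as "useful" is a choice to adapt per project.

```xojo
Function RemoveControlCharsKeepLayout(source As String) As String
  // Hypothetical helper: removes ASCII control characters and DEL (\x7F)
  // but keeps tab (\x09), line feed (\x0A) and carriage return (\x0D).
  Dim rx As New RegEx
  rx.SearchPattern = "[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]"
  rx.ReplacementPattern = ""
  rx.Options.ReplaceAllMatches = True
  Return rx.Replace(source)
End Function
```

Because the class lists only control characters, accented letters and other non-ASCII text pass through untouched.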

Sveinn_Runar_Sigurdsson · October 31, 2024, 8:36am

Yo, use RegEx to do this. It is much faster, especially when working with larger strings; looping through the string is a slow approach. The RegEx solutions above can be simplified by using “-” inside a character class to define the whole printable ASCII range in a single go and negating it, without having to spell out a long regex pattern. Keep in mind that this also removes every non-ASCII character, accented letters included.

As a method

Public Function RemoveUnprintableChars(inputString As String) As String
  Var rx As New RegEx
  rx.SearchPattern = "[^\x20-\x7E]"  // Match characters outside the printable ASCII range (space to ~)
  rx.ReplacementPattern = ""         // Replace matches with an empty string
  rx.Options.ReplaceAllMatches = True
  
  Return rx.Replace(inputString)
End Function

Extending String (a method that uses Extends has to live in a module)

Public Function RemoveUnprintableChars(Extends inputString As String) As String
  Var rx As New RegEx
  rx.SearchPattern = "[^\x20-\x7E]"  // Match characters outside the printable ASCII range (space to ~)
  rx.ReplacementPattern = ""         // Replace matches with an empty string
  rx.Options.ReplaceAllMatches = True
  
  Return rx.Replace(inputString)
End Function

so you can use it like this.

NewString = OldString.RemoveUnprintableChars

TimStreater · October 31, 2024, 8:38am

Why?