RegEx: Parse formatted TextArea

Martin_T · April 16, 2016, 8:55pm

Hi, I am trying to build a report engine myself.

I have a TextArea with Text and Keywords like this:

Hello World! {paragraphstyle:Normal}This is my {characterstyle:T1}se{characterstyle:T2}c{/characterstyle}ond{/characterstyle} paragraph.{/paragraphstyle}
Now I need all of the content:

"Hello World!"
"This is my "
"se"
"c"
"ond"
" paragraph."

and the Keywords:

```
{paragraphstyle:...}{/paragraphstyle}
```
```
{characterstyle:...}{/characterstyle}
```

I also need the paragraph and character styles (paragraph style "Normal", character styles "T1" and "T2"). Each paragraph starts on a new line, so a first step could be to split the text into an array with Split(MyString, EndOfLine).

Structure of my Classes:

Paragraphs (optionally with a style), each with a content array of runs (optionally with a style).

It's like HTML syntax. What is a good strategy to get all the information I want?
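One possible strategy, sketched here in Python rather than Xojo for brevity: scan for the keywords with a single pattern and emit the text between them as separate tokens. The names `TAG`, `tokenize`, and the token tuples are hypothetical, and the sketch assumes the keywords are always written exactly as `{paragraphstyle:Name}`, `{characterstyle:Name}` and their `{/...}` closers.

```python
import re

# Hypothetical pattern for the two keywords from the post: an optional "/",
# the keyword name, and an optional ":Style" part.
TAG = re.compile(r"\{(/?)(paragraphstyle|characterstyle)(?::([^}]*))?\}")

def tokenize(text):
    """Yield ("open", kind, style), ("close", kind, None) or ("text", None, content)."""
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Plain text sitting between the previous keyword and this one
            yield ("text", None, text[pos:m.start()])
        closing, kind, style = m.groups()
        if closing:
            yield ("close", kind, None)
        else:
            yield ("open", kind, style)
        pos = m.end()
    if pos < len(text):
        yield ("text", None, text[pos:])

sample = ("Hello World! {paragraphstyle:Normal}This is my "
          "{characterstyle:T1}se{characterstyle:T2}c{/characterstyle}ond"
          "{/characterstyle} paragraph.{/paragraphstyle}")

tokens = list(tokenize(sample))
texts = [t[2] for t in tokens if t[0] == "text"]
styles = [t[2] for t in tokens if t[0] == "open"]

print(texts)   # ['Hello World! ', 'This is my ', 'se', 'c', 'ond', ' paragraph.']
print(styles)  # ['Normal', 'T1', 'T2']
```

The token stream keeps the order of text and keywords, so a later pass can assign each text piece to a paragraph or run.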

Loannis_Kolliageorgas · April 16, 2016, 9:26pm

I see two ways: NthField and RegEx.
Martin, can you be a little more specific?
Do you need the words between the braces, like }se{ ?

Martin_T · April 16, 2016, 9:32pm

Yes Loannis, I need the text between the tags, e.g. "se".
The same sample in HTML (maybe it is easier to understand):

Hello World!

This is my second paragraph.

Loannis_Kolliageorgas · April 16, 2016, 9:33pm

I will give it a try with RegEx, please be patient.
It would be much easier if you posted the HTML code you get.

Loannis_Kolliageorgas · April 16, 2016, 10:48pm

The code below gets everything between braces:

dim rx as new RegEx
rx.SearchPattern = "(?Umi-s)(?:})\w.*(?:{)"
dim rxOptions as RegExOptions = rx.Options
rxOptions.LineEndType = 4
dim match as RegExMatch = rx.Search( textField )

And this gets everything inside braces:

dim rx as new RegEx
rx.SearchPattern = "(?Umi-s)\{[^}]*\}"
dim rxOptions as RegExOptions = rx.Options
rxOptions.LineEndType = 4
dim match as RegExMatch = rx.Search( TextField )

Is that what you need?

Martin_T · April 16, 2016, 11:05pm

Thank you Loannis for your input. It's a good start, but I think we need a recursive method:

Split the text into paragraphs: ^.*$
Check whether each paragraph has a style (get the string and the optional style)
Use a recursive method to read the runs (plus the optional style of each run) within each paragraph

[h]OOP Structure[/h]
Document
Properties

Paragraphs(-1) As Paragraph
…

Paragraph
Properties

Style As ParagraphStyle
Content(-1) As Run

Run
Properties

Style As RunStyle
Content As String

You see, it's like an Office document.
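A minimal Python sketch of this structure and of the first two steps (split into paragraphs, read the optional paragraph style). It makes several assumptions that are not in the post: styles are plain strings standing in for ParagraphStyle and RunStyle, a styled paragraph is wrapped completely by `{paragraphstyle:Name}...{/paragraphstyle}` on one line, and the names `PARA` and `read_paragraphs` are hypothetical. Each paragraph body is stored as one unstyled run; splitting it into styled runs is a separate step.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Run:
    content: str
    style: Optional[str] = None      # stands in for RunStyle

@dataclass
class Paragraph:
    style: Optional[str] = None      # stands in for ParagraphStyle
    content: List[Run] = field(default_factory=list)

@dataclass
class Document:
    paragraphs: List[Paragraph] = field(default_factory=list)

# Assumption: the paragraph keyword wraps the whole line.
PARA = re.compile(r"^\{paragraphstyle:([^}]*)\}(.*)\{/paragraphstyle\}$")

def read_paragraphs(text: str) -> Document:
    doc = Document()
    for line in text.splitlines():
        if not line.strip():
            continue                 # skip blank lines between paragraphs
        m = PARA.match(line)
        if m:
            style, body = m.group(1), m.group(2)
        else:
            style, body = None, line
        doc.paragraphs.append(Paragraph(style, [Run(body)]))
    return doc

doc = read_paragraphs(
    "Hello World!\n"
    "{paragraphstyle:Normal}This is my second paragraph.{/paragraphstyle}"
)
for p in doc.paragraphs:
    print(p.style, [r.content for r in p.content])
```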

Martin_T · April 20, 2016, 9:37pm

I have taken a few more thoughts:

[h]Sample Formatted Text[/h]
# Heading 1

[PS:Name]At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

## Heading 2
Lorem ipsum dolor sit amet, [TS:My Style]consetetur [TS:Cite]sadipscing[/TS] elitr,[/TS] sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.

[PS:Name]Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.

You see, users can define headings (levels 1-6) the way Markdown does: "# Heading 1".
Paragraphs are created after a blank line and a line break, and they can have an optional style definition at the beginning: "[PS:Name]".
Within paragraphs it's possible to use optional text styles: "[TS:Name]consetetur [/TS]".

[h]Parsing[/h]
1. Step: Split into Heading and Paragraph-Objects (RegEx)

(?:\R*)?(?:(#{1,6})\s*|\[PS:(.*)\])?(.*)(?:\R)?

Using this, I get 6 matches. Why? The last match is empty (and not needed). How do I change the RegEx so that I get only the 5 matches?
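A likely explanation for the sixth match: every part of the pattern is optional, so it also matches the empty string once the text is used up. The Python sketch below reproduces this with a shortened, made-up sample. Python's `re` module has no `\R`, so the hypothetical helper `NL` stands in for it; the names `PATTERN`, `sample`, and `rows` are assumptions as well.

```python
import re

NL = r"(?:\r\n|\n|\r)"   # stand-in for \R, which Python's re does not support

# Same shape as the pattern above: optional line breaks, an optional heading
# or [PS:...] prefix, the rest of the line, and an optional line break.
PATTERN = re.compile(
    r"(?:" + NL + r"*)?"
    + r"(?:(#{1,6})\s*|\[PS:(.*)\])?"
    + r"(.*)"
    + r"(?:" + NL + r")?"
)

sample = (
    "# Heading 1\n\n"
    "[PS:Name]First paragraph.\n\n"
    "## Heading 2\n"
    "Second paragraph.\n\n"
    "[PS:Name]Third paragraph."
)

matches = list(PATTERN.finditer(sample))
print(len(matches))                 # 6
print(repr(matches[-1].group(0)))   # '' (zero-length match at the end of the text)

# Dropping zero-length matches leaves the five wanted entries.
rows = [(m.group(1), m.group(2), m.group(3)) for m in matches if m.group(0)]
for row in rows:
    print(row)
```

Filtering out zero-length matches, or telling the engine not to report them, removes the extra entry without changing the pattern itself.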

I added a TextArea with the sample content from above and a ListBox to a window. The TextChange event of TextArea1 looks like this:

[code]Listbox1.DeleteAllRows

Dim headingLevel, paragraphStyle, content As String
Dim s As String = Me.Text
Dim rx As New RegEx
rx.SearchPattern = "(?mi-Us)(?:\R*)?(?:(#{1,6})\s*|\[PS:(.*)\])?(.*)(?:\R)?"

Dim rxOptions As RegExOptions = rx.Options
rxOptions.LineEndType = 4

Dim match As RegExMatch = rx.Search(s)

Do
  If match <> Nil Then
    // Sub-expressions: 1 = heading marks, 2 = paragraph style, 3 = text
    headingLevel = match.SubExpressionString(1)
    paragraphStyle = match.SubExpressionString(2)
    content = match.SubExpressionString(3)

    If headingLevel <> "" And content <> "" Then
      Listbox1.AddRow(content + " (Heading)")
    ElseIf content <> "" Then
      Listbox1.AddRow(content)
    End If
  End If

  // Continue searching after the previous match
  match = rx.Search
Loop Until match Is Nil[/code]
This code freezes the program. Why? Is the loop wrong?
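One plausible cause, hedged because it depends on how the engine continues a search: a pattern that can match nothing may keep returning a zero-length match at the same position, so the loop never sees Nil. The Python sketch below shows a manual search loop with a guard against that. The simplified pattern `line` and the helper `iter_matches` are hypothetical and only illustrate the idea.

```python
import re

def iter_matches(pattern, text):
    """Manual search loop that cannot stall on a zero-length match."""
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        if m.end() == m.start():
            # Zero-length match: nothing was consumed, so step forward by hand.
            # Without this, searching again from m.end() would repeat forever.
            pos = m.end() + 1
            continue
        yield m
        pos = m.end()

# Simplified stand-in for the pattern in the post; every part is optional.
line = re.compile(r"(#{1,6})?\s*(.*)\n?")

found = [m.group(2) for m in iter_matches(line, "# Title\nBody text")]
print(found)   # ['Title', 'Body text']
```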

2. Step (ToDo): Parse the content of each paragraph to get the text styles
I don't know how to do this yet.
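A possible approach for this step, sketched in Python under assumptions: scan for `[TS:Name]` and `[/TS]`, keep the open styles on a stack, and give each text piece the innermost open style. A stack is swapped in here for the recursive method mentioned earlier; it handles the same nesting. The names `TS_TAG` and `parse_runs` are hypothetical, and unmatched `[/TS]` tags are simply ignored.

```python
import re
from typing import List, Optional, Tuple

# Either an opening [TS:Name] (style name in group 1) or a closing [/TS].
TS_TAG = re.compile(r"\[TS:([^\]]*)\]|\[/TS\]")

def parse_runs(paragraph: str) -> List[Tuple[Optional[str], str]]:
    """Return (style, text) pairs; style is None outside any [TS:...] block."""
    runs = []
    stack = []   # open style names, innermost last
    pos = 0
    for m in TS_TAG.finditer(paragraph):
        if m.start() > pos:
            runs.append((stack[-1] if stack else None, paragraph[pos:m.start()]))
        if m.group(0) == "[/TS]":
            if stack:
                stack.pop()          # close the innermost style
        else:
            stack.append(m.group(1))
        pos = m.end()
    if pos < len(paragraph):
        runs.append((stack[-1] if stack else None, paragraph[pos:]))
    return runs

runs = parse_runs(
    "Lorem ipsum, [TS:My Style]consetetur [TS:Cite]sadipscing[/TS] elitr,[/TS] sed diam."
)
for style, text in runs:
    print(repr(style), repr(text))
```

Text after an inner `[/TS]` falls back to the enclosing style, which is why " elitr," is reported with "My Style" again.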

Beatrix_Willius · April 21, 2016, 4:13am

Don’t use regex for parsing. Really don’t. Use a parser.

Looking at your Textarea content I’m not sure what you want to do. This looks similar to html. What is the benefit of using your sort-of-html? Wouldn’t html be easier? Because then you have a nice ready-made html parser called Tidy in the MBS plugins. Or use XML because then you can use an XML parser to get your data out.

Michel_Bujardet · April 21, 2016, 9:04am

You seem to love Regex, don’t you. Nothing wrong with that as an exercise, but it may not be the best choice for the case.

Regex is not Unicode aware. For a TextArea, this is kind of gauche.

Also, for styles, better use StyleRuns.

Martin_T · April 21, 2016, 1:19pm

Heho,

I wrote classes to export text to DOCX, ODT and some other formats (RTF, HTML, MD, TXT). That's why I wrote my own "language" (a mix of Markdown and my own keywords). It should just be a simple syntax for my users, which is why I don't want to use HTML or XML at the front end. The keywords are parsed in the background into my structures (TextRuns, Tables, Hyperlinks etc.).
I only asked why I get 6 matches when there are only 5 possible matches (the last match is empty), and why the program freezes.
My classes have many more properties for styling texts and paragraphs than StyledText. Users can predefine text styles (TS) and paragraph styles (PS).

Martin_T · April 21, 2016, 5:01pm

Found the error: I forgot to set rxOptions.MatchEmpty = False.