Problems with RegEx to parse text

Hi Everybody,

I used the RegExRX app to work out a search pattern for a large text file with this schema:

[prot_AC] [description]
[sequence]

[quote]Example:

Q28133 Bovin protein
MKAVFLTLLFGLVCTAQETPAEIDPSKIPGEWRIIYAAADNKDKIVEGGPLRNYYRRIECINDC
ESLSITFYLKDQGTCLLLTEVAKRQEGYVYVLEFYGTNTLEVIHVSENMLVTYVENYDGERITK
MTEGLAKGTSFTPEELEKYQQLNSERGVPNENIENLIKTDNCPP

P00257-2 Bovin protein
MAARLLRVASAALGDTAGRWRLLLKSSQFIKVSCSGSWISAAQRAFICYSKSGNITCFLRSED
KITVHFINRDGETLTTKGKIGDSLLDVVVQNNLDIDGFGACEGTLACSTCHLIFEQHIFEKLEA
ITDEENDMLDLAYGLTDRSRLGCQICLTKAMDNMTVRVPDAVSDARESIDMGM
NSSKIE[/quote]

In a text editor such as RegExRX (Mac) or Notepad++ (Windows), I can capture all of the information with this RegEx pattern: “\>(\S*)\s(.)\n((\w\n\n?\S)*)” and then the replacements “$1”, “$2”, or “$3”. I am writing an app in Xojo that parses each protein entry from the text file into a Listbox with 3 columns: Accession Code [prot_AC], Protein Name [description], and Protein Sequence [sequence]. I tried to write a method that parses the data from a TextArea using the RegEx pattern, but I am not able to get the information out.

Method: ParseData

[code]
Dim rg as New RegEx
Dim myMatch as RegExMatch
rg.SearchPattern = "\>(\S*)\s(.)\n((\w\n\n?\S)*)"
myMatch = rg.Search(TextArea1.Text)

// I don’t know if I have to use “SubExpressionString” or $1, $2, etc.

If myMatch <> Nil Then
  ProtAC = myMatch.SubExpressionString(1)
  Description = myMatch.SubExpressionString(2)
  Sequence = myMatch.SubExpressionString(3)
Else
  MsgBox "Text not found!"
  Return
End If

Exception err as RegExException
  MsgBox err.Message
[/code]

Could anybody show me how to parse it?

Thank you very much,
Wardiam

Try replacing “\n” in your pattern with “\R”.

Also, you don’t have to escape characters that do not have special meaning like “>”.
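
As a rough sketch of what those two changes might look like in Xojo: the pattern below is only an illustration built from the example entries, not the original pattern, and it assumes every header line starts with “>” and that the sequence lines contain only word characters. The sample string and the variable names are made up for the example.

[code]
Dim rg As New RegEx
// Hypothetical pattern: ">" is left unescaped, and \R stands for any
// line ending (CR, LF, or CRLF) instead of a literal line break.
rg.SearchPattern = ">(\S+) ([^\r\n]*)\R((?:\w+\R?)+)"

// Two-line sequence, shortened from the first example entry
Dim sample As String = ">Q28133 Bovin protein" + EndOfLine + _
  "MKAVFLTLLFGLVCTAQ" + EndOfLine + "ESLSITFYLKDQGTCLL"

Dim m As RegExMatch = rg.Search(sample)
If m <> Nil Then
  // Subexpression 0 is the whole match; 1 to 3 are the capture groups
  System.DebugLog m.SubExpressionString(1) // accession code
  System.DebugLog m.SubExpressionString(2) // description
  // The sequence still contains its line endings, so strip them
  System.DebugLog ReplaceLineEndings(m.SubExpressionString(3), "")
End If
[/code]

The non-capturing group (?:…) keeps the sequence as subexpression 3, so the numbering matches the three columns.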

Thanks Kem, I will change this in my pattern. The problem is that I don’t know how to parse the 3 parts of the text from 1000 occurrences (protein entries). Maybe with a loop, but where: in the method or in the PushButton action code?

Thanks again.
Wardiam

Kem, I only want to capture the text between the parentheses in the search pattern:

[quote]rg.SearchPattern="\>(\S*)\s(.)\n((\w\n\n?\S)*)"[/quote]

I may not be understanding the problem, but your code is right for one occurrence. To find them all, do it this way:

myMatch = rg.Search( theText )
while myMatch IsA RegExMatch
    ProtAC = myMatch.SubExpressionString(1) 
    Description = myMatch.SubExpressionString(2)
    Sequence = myMatch.SubExpressionString(3)

    myMatch = rg.Search // Get the next match
wend

Sorry Kem, I only explained part of the problem. I want to add the subexpression strings (ProtAC, Description, and Sequence) from each occurrence to a Listbox, and I don’t know how to write the loop that captures the parsed information. I have attached the example file here:

https://www.dropbox.com/s/rsts08yhdg5ejpn/FastaDB.xojo_binary_project?dl=0

Thanks.

I’m afraid I have limited time so I can’t look at the project, but there are a number of ways to do what you want. I recommend that you create a new class with three properties, ProtAC, Description, and Sequence. For every match, create a new instance of that class and add it to an array. Return that array and add each one to your Listbox as desired.

OK Kem, I will try to do it following your comments.

Thanks,
Wardiam

Hi Kem,

I have used your recommendations and I have updated my example in Dropbox. Basically I have created a new class called “FASTAParse” with the three properties. Then I have created a method “ParseData” with this code:

[code] Dim Fparse As New FastaParse
Dim rg as New RegEx
Dim myMatch as RegExMatch
rg.SearchPattern = TextField1.Text
myMatch = rg.search(TextArea1.Text)

while myMatch IsA RegExMatch
Fparse.ProtAC = myMatch.SubExpressionString(1)
Fparse.Description = myMatch.SubExpressionString(2)
Fparse.Sequence = ReplaceLineEndings(myMatch.SubExpressionString(3), "")

Listbox1.AddRow
Listbox1.Cell(Listbox1.LastIndex, 0) = Fparse.ProtAC
Listbox1.Cell(Listbox1.LastIndex, 1) = Fparse.Description
Listbox1.Cell(Listbox1.LastIndex, 2) = Fparse.Sequence

myMatch = rg.Search // Get the next match

wend[/code]

With these changes the app works, but I don’t know how to store the data in an array. Could you help me modify the code to build the array and fill the Listbox from it?

Thanks a lot.
Wardiam

You can dim a variable as an array with parens in the dim statement. For example, if you want an array of Integer, you would do this:

dim arr() as integer

In your case, you would dim an array of your class like this:

dim Fparse() as FastaParse

As you loop through your text, you can append to the array:

  Dim FparseArr() As FastaParse

  Dim rg as New RegEx
  Dim myMatch as RegExMatch
  rg.SearchPattern = TextField1.Text
  myMatch = rg.search(TextArea1.Text)
  
  while myMatch IsA RegExMatch
    dim Fparse as new FastaParse
    Fparse.ProtAC = myMatch.SubExpressionString(1)
    Fparse.Description = myMatch.SubExpressionString(2)
    Fparse.Sequence = ReplaceLineEndings(myMatch.SubExpressionString(3), "")
    
    FparseArr.Append Fparse

    Listbox1.AddRow
    Listbox1.Cell(Listbox1.LastIndex, 0) = Fparse.ProtAC
    Listbox1.Cell(Listbox1.LastIndex, 1) = Fparse.Description
    Listbox1.Cell(Listbox1.LastIndex, 2) = Fparse.Sequence
    
    myMatch = rg.Search // Get the next match
  wend
  
  return FparseArr

On the other end, you can loop through the FparseArr items to add rows to your Listbox.

Hi Kem,

I have updated my method with your code, but when I run the project I get this error:

Then I set the return type of my method to “String”, but I get a new error message:

What’s wrong now?

Thanks.
Wardiam

You are returning an array of FastaParse objects, so the return type has to match: FastaParse().
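
Written out as text, the method might look roughly like this. It is a sketch that reuses the names from the code above (ParseData, TextField1, TextArea1, FastaParse). In the IDE you would not type the Function line itself; you would enter FastaParse() in the method’s Return Type field.

[code]
Function ParseData() As FastaParse()
  // The declared return type, FastaParse(), matches the array type below
  Dim FparseArr() As FastaParse

  Dim rg As New RegEx
  rg.SearchPattern = TextField1.Text

  Dim myMatch As RegExMatch = rg.Search(TextArea1.Text)
  While myMatch IsA RegExMatch
    Dim Fparse As New FastaParse // a fresh object for every match
    Fparse.ProtAC = myMatch.SubExpressionString(1)
    Fparse.Description = myMatch.SubExpressionString(2)
    Fparse.Sequence = ReplaceLineEndings(myMatch.SubExpressionString(3), "")
    FparseArr.Append Fparse

    myMatch = rg.Search // Get the next match
  Wend

  Return FparseArr
End Function
[/code]

A return type of String fails because the value after Return is an array of objects, not a single string.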

Thanks Kem, now it works perfectly. One last question (sorry): can the Listbox be filled from the class, like this:

Listbox1.AddRow
Listbox1.Cell(Listbox1.LastIndex, 0) = Fparse.ProtAC
Listbox1.Cell(Listbox1.LastIndex, 1) = Fparse.Description
Listbox1.Cell(Listbox1.LastIndex, 2) = Fparse.Sequence

or from the array (could you show me an example)? Is there any advantage to using the class or the array? I suppose that for a large text file the array would be the better choice, wouldn’t it?

Thanks for everything. I’m sorry to bother you so much.
Wardiam

You can loop through an array like this:

for i as integer = 0 to fastaArr.Ubound
    dim Fparse as FastaParse = fastaArr( i )
    // Fill the row using Fparse
next i

Hi Kem,

I followed your instructions, but I can’t get the app to run with this option. I know that you are very busy, but could you take a look at the attached file?


Anyway this is the incomplete code that I have included in it:

//Method with Array

[code]Dim FparseArr() As FastaParse

Dim rg as New RegEx
Dim myMatch as RegExMatch
rg.SearchPattern = TextField1.Text
myMatch = rg.search(TextArea1.Text)

while myMatch IsA RegExMatch
dim Fparse as new FastaParse
Fparse.ProtAC = myMatch.SubExpressionString(1)
Fparse.Description = myMatch.SubExpressionString(2)
Fparse.Sequence = ReplaceLineEndings(myMatch.SubExpressionString(3), "")

FparseArr.Append Fparse

'Listbox1.AddRow
'Listbox1.Cell(Listbox1.LastIndex, 0) = Fparse.ProtAC
'Listbox1.Cell(Listbox1.LastIndex, 1) = Fparse.Description
'Listbox1.Cell(Listbox1.LastIndex, 2) = Fparse.Sequence

//Now I want to include the array information

For i as integer = 0 to FparseArr.Ubound
'dim Fparser as FastaParse = FparseArr( i )
Listbox1.AddRow
Listbox1.Cell(Listbox1.LastIndex, i) = FparseArr(i) // Fill the row using Fparse
next i

myMatch = rg.Search // Get the next match

wend

return FparseArr[/code]

Using this code I get an error message:

Could you help me, please?

I want to use this example to process a text file with hundreds of thousands of sequences. Would you recommend the array option, or would a Dictionary or even a SQLite database be better for storing the whole dataset?

Thank you very much.
Wardiam

Each element of the array is a member of your FastaParse class, so you have to assign the properties of each one to the rows of the Listbox, not the object itself.

  For i as integer = 0 to FparseArr.Ubound
    dim Fparse as FastaParse = FparseArr( i )
    Listbox1.AddRow Fparse.ProtAC
    Listbox1.Cell(Listbox1.LastIndex, 1) = Fparse.Description
    Listbox1.Cell(Listbox1.LastIndex, 2) = Fparse.Sequence
  next i

Hi Kem,

This is exactly what I want, but there is a small error in the line: dim Fparse as FastaParse = FparseArr( i )

I have updated my code with your comment:

[code] Dim FparseArr() As FastaParse
Dim rg as New RegEx
Dim myMatch as RegExMatch
rg.SearchPattern = TextField1.Text
myMatch = rg.search(TextArea1.Text)

while myMatch IsA RegExMatch
dim Fparse as new FastaParse
Fparse.ProtAC = myMatch.SubExpressionString(1)
Fparse.Description = myMatch.SubExpressionString(2)
Fparse.Sequence = ReplaceLineEndings(myMatch.SubExpressionString(3), "")

FparseArr.Append Fparse

For i as integer = 0 to FparseArr.Ubound
  Dim Fparse as FastaParse = FparseArr( i )
  Listbox1.AddRow 
  Listbox1.Cell(Listbox1.LastIndex, 0) = Fparse.ProtAC
  Listbox1.Cell(Listbox1.LastIndex, 1) = Fparse.Description
  Listbox1.Cell(Listbox1.LastIndex, 2) = Fparse.Sequence
next i

myMatch = rg.Search // Get the next match

wend

return FparseArr[/code]

When I run the app, I get this error message:

If I omit this line, the app runs, but the Listbox shows each protein entry repeatedly: the first entry appears once, the second twice, and the third three times, even though I have checked the array and it is correct (only 3 entries).

How should I modify this line to make it correct?

Thanks.

I thought you were segregating the code that parses the text from the code that fills in the Listbox. I see now that you have combined it into one method, so try this:

  Dim FparseArr() As FastaParse
  Dim rg as New RegEx
  Dim myMatch as RegExMatch
  rg.SearchPattern = TextField1.Text
  myMatch = rg.search(TextArea1.Text)
  
  while myMatch IsA RegExMatch
    dim Fparse as new FastaParse
    Fparse.ProtAC = myMatch.SubExpressionString(1)
    Fparse.Description = myMatch.SubExpressionString(2)
    Fparse.Sequence = ReplaceLineEndings(myMatch.SubExpressionString(3), "")
    
    FparseArr.Append Fparse
    
    Listbox1.AddRow 
    Listbox1.Cell(Listbox1.LastIndex, 0) = Fparse.ProtAC
    Listbox1.Cell(Listbox1.LastIndex, 1) = Fparse.Description
    Listbox1.Cell(Listbox1.LastIndex, 2) = Fparse.Sequence
    
    myMatch = rg.Search // Get the next match
  wend
  
  return FparseArr

Thanks Kem,

This code is similar to the code you gave me in a previous answer. If you remember, I wanted to store the text data in an array and then load the array’s contents into the Listbox. Maybe your code is the easiest and most effective way to do what I wanted, but I was hoping to use the array. Of course, I appreciate all your help.

Thanks a lot for your patience.
Sergio
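
For anyone following along, here is one possible way to fill the Listbox from the array. It is only a sketch and rests on two assumptions. First, ParseData has the return type FastaParse() and only parses, with the Listbox lines and the For loop removed from inside its while loop. Second, the code below goes in the Action event of the button, with the button itself and the variable names made up for the example.

[code]
// PushButton Action event: parse first, then fill the Listbox once
Dim entries() As FastaParse = ParseData()

Listbox1.DeleteAllRows // start from an empty list

For i As Integer = 0 To entries.Ubound
  Dim entry As FastaParse = entries(i)
  Listbox1.AddRow entry.ProtAC // column 0
  Listbox1.Cell(Listbox1.LastIndex, 1) = entry.Description
  Listbox1.Cell(Listbox1.LastIndex, 2) = entry.Sequence
Next i
[/code]

Running the For loop after the parsing has finished, rather than inside the while loop, walks the array only once, which should avoid the repeated rows described above. Using a different variable name (entry) also avoids declaring Fparse twice in the same method.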