Help with Regex Pattern?

Hi everybody,

I have these protein sequences (without spaces between the sequences) in a text file:

[quote]>Q28133 Bovin protein
MKAVFLTLLFGLVCTAQETPAEIDPSKIPGEWRIIYAAADNKDKIVEGGPLRNYYRRIEC
INDCESLSITFYLKDQGTCLLLTEVAKRQEGYVYVLEFYGTNTLEVIHVSENMLVTYVEN
YDGERITKMTEGLAKGTSFTPEELEKYQQLNSERGVPNENIENLIKTDNCPP

>P00257-2 Bovin protein
MAARLLRVASAALGDTAGRWRLLLKSSQFIKVSCSGSWISAAQRAFICYSKSGNITCFLR
SEDKITVHFINRDGETLTTKGKIGDSLLDVVVQNNLDIDGFGACEGTLACSTCHLIFEQH
IFEKLEAITDEENDMLDLAYGLTDRSRLGCQICLTKAMDNMTVRVPDAVSDARESIDMGM
NSSKIE
>C1_11500C_B Bovin protein
MDFMKPETVLDLANIRQALVRMEDTIVFDLIERSQFFSSPSVYEKNKYNIPNFDGTFLEW
ALLQLEVAHSQIRRYEAPDETPFFPDQLKTPILPPINYPKILAKYSDEINVNSEIMKFYV
DEIVPQVSCGQGDQKENLGSASTCDIECLQAISRRIHFGKFVAEAKYQSDKPLYIKLILD
KDVKGIENSITNSAVEQKILERLIVKAESYGVDPSLKFGQNVQSKVKPEVIAKLYKDWII
PLTKKVEIDYLLRRLEDEDVELVEKYKK[/quote]

I tried to build the regex pattern with Kem’s app (RegExRX), and I am using this pattern:

RegEx Pattern: ^>([^ ]*)\s(.*)[\r\n]((([a-zA-Z])+[\r\n])*)

This pattern works perfectly with the first 2 sequences, and I obtain these values:

$1: Q28133
$2: Bovin\s protein
$3: MKAVFLTLLFGLVCTAQETPAEIDPSKIPGEWRIIYAAADNKDKIVEGGPLRNYYRRIEC
INDCESLSITFYLKDQGTCLLLTEVAKRQEGYVYVLEFYGTNTLEVIHVSENMLVTYVEN
YDGERITKMTEGLAKGTSFTPEELEKYQQLNSERGVPNENIENLIKTDNCPP

But I always have a problem with the last sequence: I cannot match its last line. If I add a ‘return’ after this line, my pattern recognizes it perfectly, but I would like to get the same result without modifying the text file. Is it possible?

Could anyone help me, please?

Thank you very much.
Sergio

^>([^ ]*)\s([^\r\n]*)[\r\n]+([A-Za-z\r\n]+)(?=[\r\n]|$)
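
To see the difference outside RegExRX, here is a minimal Python sketch. It assumes Python's `re` module with the multiline flag and a shortened, made-up sample that has no line break after the final line; the names `original`, `lookahead` and `sequences` are only for illustration. In the original pattern every sequence line has to be followed by a line break, so the final line is dropped. The lookahead version also accepts the end of the text.

```python
import re

# Shortened, made-up sample in the same layout; note there is NO line break
# after the last sequence line.
text = (
    ">Q28133 Bovin protein\n"
    "MKAVFL\n"
    "INDCES\n"
    ">C1_11500C_B Bovin protein\n"
    "MDFMKP\n"
    "PLTKKV"
)

# Original pattern: each sequence line must be followed by \r or \n.
original = re.compile(
    r"^>([^ ]*)\s(.*)[\r\n]((([a-zA-Z])+[\r\n])*)", re.MULTILINE
)

# Lookahead pattern: the sequence block may stop before a line break
# or at the very end of the text.
lookahead = re.compile(
    r"^>([^ ]*)\s([^\r\n]*)[\r\n]+([A-Za-z\r\n]+)(?=[\r\n]|$)", re.MULTILINE
)

def sequences(pattern, data):
    """Return (id, sequence) pairs with line breaks removed from the sequence."""
    return [
        (m.group(1), re.sub(r"[\r\n]", "", m.group(3)))
        for m in pattern.finditer(data)
    ]

original_result = sequences(original, text)
lookahead_result = sequences(lookahead, text)
print(original_result)
print(lookahead_result)
```

With this sample, the original pattern returns only `MDFMKP` for the last entry, while the lookahead version returns `MDFMKPPLTKKV`.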

I might have oversimplified it, but try this:

^>(\S*)\s(.*)\R([a-z\r\n]+)
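
If you want to try the same idea in a script, here is a rough Python sketch. Two assumptions: Python's `re` module does not support `\R`, so the line break is spelled out as `(?:\r\n|\r|\n)`, and case-insensitive matching is switched on, since `[a-z]` would not match the uppercase letters otherwise. The helper name `parse_records` and the sample text are made up for illustration.

```python
import re

# \R is not available in Python's re module, so the line break is written out.
# re.IGNORECASE lets [a-z] match the uppercase sequence letters.
RECORD = re.compile(
    r"^>(\S*)\s(.*)(?:\r\n|\r|\n)([a-z\r\n]+)",
    re.MULTILINE | re.IGNORECASE,
)

def parse_records(data):
    """Yield (identifier, description, sequence) for each '>' entry."""
    for m in RECORD.finditer(data):
        # The third group still contains the line breaks; remove them.
        sequence = re.sub(r"[\r\n]+", "", m.group(3))
        yield m.group(1), m.group(2).strip(), sequence

# Made-up, shortened sample with no line break after the last line.
sample = (
    ">P00257-2 Bovin protein\n"
    "MAARLL\n"
    "NSSKIE\n"
    ">C1_11500C_B Bovin protein\n"
    "MDFMKP\n"
    "PLTKKV"
)

records = list(parse_records(sample))
for identifier, description, sequence in records:
    print(identifier, description, sequence)
```

The character class in the third group stops at the next `>`, so each entry ends where the following header begins, and the last entry simply runs to the end of the text.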

Thank you very much, Martin and “Super” Kem. Both patterns work with my text file and extract all the information that I need.

I need to learn more about RegEx patterns. :)

Thank you so much again.
Sergio