Problems building a regular expression

Hi everybody,

I’m having trouble extracting several protein accession codes from a text file. I usually use the Regex tool from Kem Tekinay, but this case is complicated because the codes come in different formats. This is an example:

[quote]EGR47954.1 citrate-synthase [Trichoderma reesei QM6a] (Confident); XP_013950202.1 hypothetical protein TRIVIDRAFT_74911 [Trichoderma virens Gv29-8]
ABN04079.1 SprT [Trichoderma koningii], AFJ97206.1 serine protease [Trichoderma orientale] (Confident); AFJ97206.1 serine protease [Trichoderma orientale], EGR46243.1 proteinase T-like protein [Trichoderma reesei QM6a]
P13645 Keratin, type I cytoskeletal 10 (Contact-Cont) OS=Homo sapiens GN=KRT10 PE=1 SV=6 (Confident)
Q9NSB2 Keratin, type II cuticular Hb4 (Contact-Cont) OS=Homo sapiens GN=KRT84 PE=2 SV=2 (Confident)
XP_013959236.1 hypothetical protein TRIVIDRAFT_79190 [Trichoderma virens Gv29-8] (Confident)
3OGV_A Chain A, Complex Structure Of Beta-Galactosidase From Trichoderma Reesei With Petg, EGR46683.1 glycoside hydrolase family 35 [Trichoderma reesei QM6a][/quote]

I don’t know if it’s possible to get all the codes with a single regular expression; maybe it would take several.

Could anybody help me extract only these protein codes, please?

Thank you very much.

Is it these specific codes? If so, you could create a list of alternates:

EGR47954.1|XP_013950202.1|ABN04079.1|...
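
If the codes really are known ahead of time, the list of alternates can be built programmatically. The following is a minimal sketch in Python, not part of the original suggestion; `known_codes` and `find_known_codes` are hypothetical names, and the codes are taken from the example above. Escaping each code matters because an unescaped "." in a pattern matches any character.

```python
import re

# Hypothetical list of codes known in advance (taken from the example above).
known_codes = ["EGR47954.1", "XP_013950202.1", "ABN04079.1"]

# re.escape turns the "." in each code into a literal dot.
# Longest codes go first so a shorter code never shadows a longer one.
alternates = re.compile(
    "|".join(re.escape(code) for code in sorted(known_codes, key=len, reverse=True))
)

def find_known_codes(text):
    """Return every known code found in text, in order of appearance."""
    return alternates.findall(text)
```

This only finds codes that are already in the list, so it does not help when the codes are unknown in advance.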

Are you searching for the bold items?
I notice they are at the beginning of a line or after a “;”.
If that is always the case, it should be doable with a regex.

Sorry, I didn’t explain myself well in my previous post. In each line I would like to get only the bold items. Following my example:

ABN04079.1|AFJ97206.1|EGR46243.1

The main problem is that the items have different formats within each line. Any suggestions, Kem or Jean-Yves?

Thank you very much.

If you don’t know ahead of time what the codes will be and they don’t have a consistent format, I don’t know if regular expressions will work for you. RegEx lets you pick out patterns, but you have to be able to define that pattern.

Thanks Kem. I think I could reduce the complexity of the file (though not the variety of codes). Every code could be preceded by the characters “]; ” (closing bracket, semicolon, space; without the quotes) and followed by a space. This pattern would repeat n times within the same line. Could you help me with that?


EGR47954.1 protein name1 [specie]; P13645 protein name2 [specie]; XP_013959236.1 protein name3 [specie]


Not all of them, since the first one isn’t preceded by “]; ”, but let’s try this:


This uses a positive lookbehind to check for the start of the document or the "]; " sequence. It then identifies word characters (letters, numbers, underscore) followed, optionally, by a dot and digit, then a word break anchor (space, end of file, etc.).
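
A pattern along the lines described above can be sketched in Python as follows. This is an assumption reconstructed from the description, not necessarily the exact expression that was posted; `CODE_PATTERN`, `extract_codes` and `extract_all` are hypothetical names. Python's `re` module requires lookbehinds to have a fixed width, so the start-of-text check is written as a separate alternative rather than inside the lookbehind.

```python
import re

# Assumed reconstruction from the description; not the original expression.
#   (?:\A|(?<=\]; ))  start of the text, or immediately preceded by "]; "
#   \w+               letters, digits and underscores
#   (?:\.\d+)?        optional dot followed by digits (e.g. ".1")
#   \b                word boundary (space, end of text, etc.)
CODE_PATTERN = re.compile(r"(?:\A|(?<=\]; ))\w+(?:\.\d+)?\b")

def extract_codes(line):
    """Return the accession codes found in a single line."""
    return CODE_PATTERN.findall(line)

def extract_all(text):
    """Process the text line by line so the start-of-text anchor applies to every line."""
    return ["|".join(extract_codes(line)) for line in text.splitlines() if line.strip()]
```

Applied to the simplified example line, `extract_codes` returns `EGR47954.1`, `P13645` and `XP_013959236.1`. Codes that follow a comma, as in the original unsimplified file, are not matched by this sketch.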

Hi Kem,

Thank you so much for your regex, it works perfectly. It is quite complicated for my current knowledge of regular expressions, so I’m going to study it.

Thanks again,