Problems getting the right regular expression

Hi everybody,

I’m having trouble extracting several protein accession codes from a text file. I usually use the RegEx tool from Kem Tekinay, but this case is complicated because the codes come in different formats. Here is an example:

[quote]EGR47954.1 citrate-synthase [Trichoderma reesei QM6a] (Confident); XP_013950202.1 hypothetical protein TRIVIDRAFT_74911 [Trichoderma virens Gv29-8]
ABN04079.1 SprT [Trichoderma koningii], AFJ97206.1 serine protease [Trichoderma orientale] (Confident); AFJ97206.1 serine protease [Trichoderma orientale], EGR46243.1 proteinase T-like protein [Trichoderma reesei QM6a]
P13645 Keratin, type I cytoskeletal 10 (Contact-Cont) OS=Homo sapiens GN=KRT10 PE=1 SV=6 (Confident)
Q9NSB2 Keratin, type II cuticular Hb4 (Contact-Cont) OS=Homo sapiens GN=KRT84 PE=2 SV=2 (Confident)
XP_013959236.1 hypothetical protein TRIVIDRAFT_79190 [Trichoderma virens Gv29-8] (Confident)
3OGV_A Chain A, Complex Structure Of Beta-Galactosidase From Trichoderma Reesei With Petg, EGR46683.1 glycoside hydrolase family 35 [Trichoderma reesei QM6a][/quote]

I don’t know whether a single regular expression can capture all the codes; perhaps it would take several.

Could anybody help me to extract only these protein codes, please?

Thank you very much.
Sergio

Is it just these specific codes? If so, you could create a list of alternates:

EGR47954.1|XP_013950202.1|ABN04079.1|...
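A minimal Python sketch of the list-of-alternates idea, assuming the codes are known in advance. The `known_codes` list and the variable names are illustrative only; the codes and the sample line are copied from the example above.

```python
import re

# Hypothetical list of known codes, copied from the example in this thread.
known_codes = ["EGR47954.1", "XP_013950202.1", "ABN04079.1"]

# re.escape keeps the "." in each code literal instead of "any character".
alternates = re.compile("|".join(re.escape(code) for code in known_codes))

line = ("EGR47954.1 citrate-synthase [Trichoderma reesei QM6a] (Confident); "
        "XP_013950202.1 hypothetical protein TRIVIDRAFT_74911 "
        "[Trichoderma virens Gv29-8]")

# findall returns every known code that occurs in the line, in order.
found = alternates.findall(line)
print("|".join(found))  # EGR47954.1|XP_013950202.1
```

This only works when every code can be listed ahead of time; it finds nothing that is missing from the list.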

Are you searching for the bold items?
I notice they appear at the beginning of a line or after a “;”.
If that is the case, it should be doable with a regex.

Sorry, I didn’t explain this properly in my previous post. From each line I would like to get only the bold items. For my example, that would be:

[quote]EGR47954.1|XP_013950202.1
ABN04079.1|AFJ97206.1|EGR46243.1
P13645
Q9NSB2
XP_013959236.1
3OGV_A|EGR46683.1[/quote]

The main problem is that the items have different formats within each line. Any suggestions, Kem or Jean-Yves?

Thank you very much.
Sergio

If you don’t know ahead of time what the codes will be and they don’t have a consistent format, I don’t know if regular expressions will work for you. RegEx lets you pick out patterns, but you have to be able to define that pattern.
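To illustrate what "defining the pattern" could look like, here is a rough Python sketch that writes one alternative per code format. The four formats are assumptions inferred only from the sample lines in this thread, not official definitions, and the names `ACCESSION` and `extract_codes` are made up for the example. Real data may contain codes that these alternatives miss or match by mistake.

```python
import re

# Assumed formats, inferred only from the sample lines in this thread.
ACCESSION = re.compile(
    r"\b(?:"
    r"[A-Z]{3}\d{5}\.\d+"              # three letters, five digits, version: EGR47954.1
    r"|[A-Z]{2}_\d+\.\d+"              # two letters, underscore, digits, version: XP_013950202.1
    r"|[OPQ]\d[A-Z0-9]{3}\d"           # six-character code: P13645, Q9NSB2
    r"|\d[A-Za-z0-9]{3}_[A-Za-z0-9]"   # four characters plus chain letter: 3OGV_A
    r")\b"
)

def extract_codes(line):
    """Return the codes found in one line, in order, without repeats."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(ACCESSION.findall(line)))

line2 = ("ABN04079.1 SprT [Trichoderma koningii], AFJ97206.1 serine protease "
         "[Trichoderma orientale] (Confident); AFJ97206.1 serine protease "
         "[Trichoderma orientale], EGR46243.1 proteinase T-like protein "
         "[Trichoderma reesei QM6a]")
line6 = ("3OGV_A Chain A, Complex Structure Of Beta-Galactosidase From "
         "Trichoderma Reesei With Petg, EGR46683.1 glycoside hydrolase "
         "family 35 [Trichoderma reesei QM6a]")

print("|".join(extract_codes(line2)))  # ABN04079.1|AFJ97206.1|EGR46243.1
print("|".join(extract_codes(line6)))  # 3OGV_A|EGR46683.1
```

The word-boundary anchors keep fragments such as TRIVIDRAFT_74911 or QM6a from being picked up, but this approach is only as good as the list of formats it is given.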

Thanks, Kem. I think… I could reduce the complexity of the file (though not the variety of codes). Every code would be preceded by the characters “];space” (without the quotes), followed by the code itself and then a “space” character. This “pattern” would repeat n times on the same line. Could you help me with that?

Example:

EGR47954.1 protein name1 [specie]; P13645 protein name2 [specie]; XP_013959236.1 protein name3 [specie]

Please!!

Not all of them, since the first one has nothing before it, but let’s try this:

(?<=\A|\];\x20)\w+(?:\.\d)?\b

This uses a positive lookbehind to check for the start of the document or the "]; " sequence. It then identifies word characters (letters, numbers, underscore) followed, optionally, by a dot and digit, then a word break anchor (space, end of file, etc.).
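For anyone trying this outside a PCRE-based tool, here is a minimal Python sketch of the same idea. Python's built-in `re` module only accepts fixed-width lookbehinds, so it rejects a lookbehind whose alternatives have different lengths. The sketch therefore moves `\A` out of the lookbehind, which is an adaptation and not the pattern exactly as posted. The name `CODE` is made up for the example, and the sample line is the simplified example from this thread.

```python
import re

# Start of the text, OR a position preceded by "]; ".
# Then word characters, an optional ".digit", and a word boundary.
CODE = re.compile(r"(?:\A|(?<=\];\x20))\w+(?:\.\d)?\b")

line = ("EGR47954.1 protein name1 [specie]; P13645 protein name2 [specie]; "
        "XP_013959236.1 protein name3 [specie]")

codes = CODE.findall(line)
print("|".join(codes))  # EGR47954.1|P13645|XP_013959236.1
```

Because `\A` matches only at the very start of the text, the sketch assumes it is applied to one line at a time. For a whole file in one string, replacing `\A` with `^` and compiling with `re.MULTILINE` should pick up the first code of every line.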

Hi Kem,

Thank you so much for your regex; it works perfectly. It is quite advanced for my current knowledge of regular expressions, so I’m going to study it.

Thanks again,
Sergio