finding string matches

What do the ‘n’s mean?

DNA sequences are made up of A, C, G, and T. If the character at a given position is not known, it is written as an ‘n’. When searching, the n’s do not count as matches with anything (even with another ‘n’). However, there should not be any n’s in the query sequences.
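For anyone following along, here is a minimal sketch of that rule. CountMatches is a made-up helper name, not something from the project, and it assumes the two strings are already aligned position by position:

```xojo
// Hypothetical helper: counts the positions where query and target agree.
// An "n" (either case) never counts as a match, not even against another "n".
Function CountMatches(query As String, target As String) As Integer
  Dim matches As Integer
  Dim n As Integer = Min(Len(query), Len(target))
  For i As Integer = 1 To n
    Dim q As String = Mid(query, i, 1)
    Dim t As String = Mid(target, i, 1)
    // The = and <> operators on Strings ignore case in Xojo,
    // so "n" and "N" are treated alike here.
    If q <> "N" And q = t Then matches = matches + 1
  Next
  Return matches
End Function
```

Mid is fine for short strings like a 15-mer. For the full chromosome something byte-based would be the better fit.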

I meant how the search set of 15-mers are all overlapped.

For the full DNA I’d actually do it as one sequence. I think the reason you saw a speed up is because InStr goes slower and slower the deeper it gets. Tim’s suggestion of InStrB is pretty much unaffected by length and I think faster besides. That should be easy to test: in your working InStr code, add a B and see how it runs.

Also, you’re searching for just 15-mers, but ultimately do you want the longest possible match? I mean, are these 15-mer matches coalesced into 37-mers or whatever, or are 15-mer matches the end result?

Extending the matches to longer strings should not be hard. I would look for the 15-mer matches and then just extend one character at a time upstream and one character at a time downstream. Each time I would need to check to see if my match is above a certain threshold (e.g. 90%) but I can do that after all of the 15-mer matches were found.
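A rough sketch of the downstream half of that extension, under some assumptions: ExtendDownstream and all of its parameter names are invented for illustration, positions are 1-based byte positions, and both strings are plain single-byte text. The upstream direction would be the mirror image, stepping backwards from the start of the seed.

```xojo
// Hypothetical helper: grows an exact seed match one base at a time and
// returns the longest length whose identity is still >= minIdentity (e.g. 0.9).
Function ExtendDownstream(dna As String, query As String, dnaPos As Integer, queryPos As Integer, seedLen As Integer, minIdentity As Double) As Integer
  Dim dnaLen As Integer = LenB(dna)
  Dim queryLen As Integer = LenB(query)
  Dim matched As Integer = seedLen // the seed itself matches exactly
  Dim total As Integer = seedLen
  Dim bestLen As Integer = seedLen
  Dim d As Integer = dnaPos + seedLen
  Dim q As Integer = queryPos + seedLen

  While d <= dnaLen And q <= queryLen
    Dim dnaBase As String = MidB(dna, d, 1)
    total = total + 1
    // An "n" never counts as a match; String comparison ignores case.
    If dnaBase <> "N" And dnaBase = MidB(query, q, 1) Then matched = matched + 1
    // The / operator returns a Double, so this is the fraction of matching bases.
    If matched / total < minIdentity Then Exit While
    bestLen = total
    d = d + 1
    q = q + 1
  Wend

  Return bestLen
End Function
```

This stops the first time the identity drops below the threshold, so the returned stretch can still end on a mismatch. Trimming those would be a separate step.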

I will try the InStrB.

I am currently running a test using RegExMBS, and it seems pretty fast. I’ll know when it’s done, but it’s only been a few minutes and it’s already about halfway done. Will post more as I know…

That’s in the IDE, btw, and it’s updating a progress bar. I’d expect the compiled version to be faster still, especially since I’m playing Plants vs. Zombies while waiting. :)

Will… InStrB seems to be case sensitive so I would need to update all of the sequence files to be the same case. Easy enough to do, but time consuming. When folks entered sequences into GenBank (US national DNA sequence repository), they have been entered in a huge mixture of upper and lower case. Sometimes the case says something about the DNA, perhaps there is less confidence in the DNA sequence at that point, or maybe it is part of a repeated sequence. However, in many cases, the case is not meaningful.
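One way around editing the files might be to uppercase copies in memory right before searching, leaving the files on disk untouched. A minimal sketch, assuming plain ASCII text; LogAllHits and its parameter names are made up:

```xojo
// Hypothetical helper: logs every position where query occurs in dna,
// ignoring case by uppercasing copies of both strings first.
Sub LogAllHits(dna As String, query As String)
  Dim dnaUpper As String = Uppercase(dna)
  Dim queryUpper As String = Uppercase(query)

  // InStrB compares raw bytes, so both sides must share one case.
  Dim hit As Integer = InStrB(1, dnaUpper, queryUpper)
  While hit > 0
    // hit is a 1-based byte position in the DNA
    System.DebugLog(queryUpper + " @ " + Str(hit))
    // Restart one byte later so overlapping hits are also found
    hit = InStrB(hit + 1, dnaUpper, queryUpper)
  Wend
End Sub
```

The original mixed-case text is still in dna if the case information is needed later. Note that a byte search like this would let an N match another N, which is only harmless because the queries should not contain any.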

Kem… sounds encouraging. I am anxious to see your code and learn something new.

OK, here are the results (in two posts since it’s too long by itself):

CCGAGAGACACAGGG @ 80610926
GACACAGGGCTTTTT @ 8601204
ACACAGGGCTTTTTA @ 8601205
TTCACTGTATTTTCA @ 135777680
TCACTGTATTTTCAC @ 78991512
ACTGTATTTTCACTT @ 76833525
CCAATGTGATAATTT @ 112630622
GATAATTTATGTGAA @ 25381591
GATAATTTATGTGAA @ 44533106
ATAATTTATGTGAAA @ 25381592
ATAATTTATGTGAAA @ 27685550
ATAATTTATGTGAAA @ 44533107
ATAATTTATGTGAAA @ 80239204
ATAATTTATGTGAAA @ 120403009
ATAATTTATGTGAAA @ 128966308
TAATTTATGTGAAAT @ 25381593
TAATTTATGTGAAAT @ 44533108
TAATTTATGTGAAAT @ 108618106
TAATTTATGTGAAAT @ 120403010
AATTTATGTGAAATC @ 25381594
ATTTATGTGAAATCT @ 23952925
ATTTATGTGAAATCT @ 64815035
TTTATGTGAAATCTT @ 93195925
TTATGTGAAATCTTT @ 53733134
TGAAATCTTTCTGCA @ 118869623
AAATCTTTCTGCAAA @ 55288338
AAATCTTTCTGCAAA @ 93015752
AAATCTTTCTGCAAA @ 93178829
AATCTTTCTGCAAAC @ 93178830
ATCTTTCTGCAAACT @ 63131180
ATCTTTCTGCAAACT @ 93178831
TCTTTCTGCAAACTA @ 63131181
TCTTTCTGCAAACTA @ 93178832
TCTGCAAACTATACA @ 85071603
CTGCAAACTATACAG @ 40387094
TATACAGTATGATAA @ 118177010
ATACAGTATGATAAA @ 118177011
TACAGTATGATAAAA @ 95452123
TACAGTATGATAAAA @ 104235501
ACAGTATGATAAAAA @ 86859117
CAGTATGATAAAAAT @ 20811104
AGTATGATAAAAATA @ 20311758
AGTATGATAAAAATA @ 33051999
AGTATGATAAAAATA @ 59624992
GTATGATAAAAATAT @ 82632728
TATGATAAAAATATA @ 38137369
TATGATAAAAATATA @ 86788452
ATGATAAAAATATAA @ 26518324
TGATAAAAATATAAG @ 63029044
TGATAAAAATATAAG @ 126278590
ATAAAAATATAAGGT @ 13519049
ATAAAAATATAAGGT @ 65288918
TAAAAATATAAGGTA @ 19495007
TAAAAATATAAGGTA @ 113881219
AAAAATATAAGGTAG @ 19495008
AAGGTAGTTTCACTG @ 73813098
TAGTTTCACTGGAAA @ 15054598
AGTTTCACTGGAAAC @ 15054599
GTTTCACTGGAAACA @ 15054600
TTTCACTGGAAACAA @ 15054601
ACTGGAAACAACAGA @ 25279820
CTGGAAACAACAGAA @ 25279821
TGGAAACAACAGAAG @ 25279822
AAACAACAGAAGACT @ 53018646
AACAGAAGACTAGAA @ 120669781
ACAGAAGACTAGAAG @ 102597548
ACTAGAAGCTGATGT @ 116672806
TAGAAGCTGATGTGA @ 2584821
AGAAGCTGATGTGAC @ 2584822
CAGGAATACCCATCT @ 56958510
AGGAATACCCATCTC @ 56958511
TTCTGGATATGTGCT @ 10519187
GATATGTGCTCTGAG @ 89153886
ATATGTGCTCTGAGA @ 89153887
TATGTGCTCTGAGAA @ 94055522
CTCTGAGAAGGTGCC @ 124903330
TCTGAGAAGGTGCCC @ 125471841
CTGAGAAGGTGCCCA @ 15313781
CTGAGAAGGTGCCCA @ 125471842
TGAGAAGGTGCCCAT @ 15313782
TGAGAAGGTGCCCAT @ 125471843
GGTGCCCATGTCACC @ 126385513
GACCACACTGGAGGC @ 26672579
CCACACTGGAGGCCA @ 35957585
CCACACTGGAGGCCA @ 36564901
CCACACTGGAGGCCA @ 127904454
GGAGGCCAATGCAGA @ 73024762
CAATGCAGATACTGG @ 3770152
AATGCAGATACTGGG @ 3770153
ATGCAGATACTGGGG @ 89546525
TGCAGATACTGGGGG @ 29600573
TGCAGATACTGGGGG @ 65778906
CTGGGGGAAGGTTCC @ 85290423
TGGGGGAAGGTTCCA @ 85290424
GGGGGAAGGTTCCAT @ 85290425
AATCACTGAAGTTCC @ 86012441
ACTGAAGTTCCCTGA @ 128190379
CTGAAGTTCCCTGAT @ 51838685
TAATCTCTCTAGAGT @ 88527546
TAATCTCTCTAGAGT @ 101506298
AATCTCTCTAGAGTT @ 88527547
ATCTCTCTAGAGTTG @ 88527548
CTCTCTAGAGTTGGA @ 122173329
TAGAGTTGGATGAAA @ 47141871
AGAGTTGGATGAAAG @ 117047541
ACTGTGCTGCCTTGA @ 69174751
CTGTGCTGCCTTGAA @ 136878894
TGTGCTGCCTTGAAG @ 48475733
TGCCTTGAAGCTCTG @ 34847748
TGCCTTGAAGCTCTG @ 74430314
GCCTTGAAGCTCTGA @ 34847749
CCTTGAAGCTCTGAG @ 34847750
GAAGCTCTGAGAGAT @ 46153933
GCCATGCCAATTCAA @ 114921368
TTTATTGTTGAAACT @ 19214708
TGTTGAAACTCTTGC @ 134005682
GTAATGACATCTTTA @ 36417689
TAATGACATCTTTAT @ 27851027
TAATGACATCTTTAT @ 36417690
TAATGACATCTTTAT @ 51471636
TAATGACATCTTTAT @ 80283907
TAATGACATCTTTAT @ 117270950
AATGACATCTTTATT @ 10768003
AATGACATCTTTATT @ 36417691
AATGACATCTTTATT @ 51471637
TATTCAGGTGAAAAT @ 31961245
TATTCAGGTGAAAAT @ 133448136
ATTCAGGTGAAAATA @ 83256203
ATTCAGGTGAAAATA @ 88883619
ATTCAGGTGAAAATA @ 91677368
ATTCAGGTGAAAATA @ 133448137
TTCAGGTGAAAATAC @ 83256204
CAGGTGAAAATACAG @ 31645955
AGGTGAAAATACAGG @ 31645956
GGTGAAAATACAGGA @ 60897493
TGAAAATACAGGATG @ 4998301
TGAAAATACAGGATG @ 66178956
AAAATACAGGATGAA @ 22546921
ATACAGGATGAATTT @ 67062129
TACAGGATGAATTTC @ 67062130
ATGAATTTCAACTAT @ 68531041
ATTTCAACTATATGA @ 104085174
TTTCAACTATATGAT @ 104085175
TTCAACTATATGATA @ 104085176
TCAACTATATGATAT @ 44955623
TCAACTATATGATAT @ 104085177
CAACTATATGATATT @ 53282289
CAACTATATGATATT @ 104085178
CAACTATATGATATT @ 104387532
CTATATGATATTGTT @ 80641048
TATGATATTGTTTAT @ 83745968
ATGATATTGTTTATG @ 81901878
ATGATATTGTTTATG @ 104125719
TGATATTGTTTATGT @ 104125720
TTGTTTATGTTCCTC @ 74985617
TTTATGTTCCTCAGA @ 33427998
TTATGTTCCTCAGAC @ 33427999
TATGTTCCTCAGACA @ 12858479
TGTTCCTCAGACATG @ 119506266
TTCCTCAGACATGTT @ 14200645
ATGTTATTTGTCTTT @ 3588978
ATGTTATTTGTCTTT @ 82670293
TGTTATTTGTCTTTA @ 3588979
TGTTATTTGTCTTTA @ 89151448
TGTTATTTGTCTTTA @ 91101596
TTATTTGTCTTTACA @ 9751827
TTATTTGTCTTTACA @ 130247098
TATTTGTCTTTACAA @ 8457583

Continued:

ATTTGTCTTTACAAA @ 8457584
GTCTTTACAAAGATT @ 99578785
TCTTTACAAAGATTG @ 33451314
TCTTTACAAAGATTG @ 60811562
TCTTTACAAAGATTG @ 99578786
CTTTACAAAGATTGG @ 40145683
TTTACAAAGATTGGT @ 71655169
ATTGGTTTCAATAAA @ 38606963
TGGTTTCAATAAACT @ 65774494
TCAATAAACTGTGTG @ 11815124
AATAAACTGTGTGAC @ 50128826
GACTGATATTAAATA @ 64880697
ACTGATATTAAATAA @ 64880698
ACTGATATTAAATAA @ 77706013
CTGATATTAAATAAA @ 10794500
GATATTAAATAAACA @ 115548734
ATATTAAATAAACAT @ 4409619
ATATTAAATAAACAT @ 28916100
TATTAAATAAACATG @ 16222110
ATTAAATAAACATGG @ 130190937
TTAAATAAACATGGA @ 86934860
TAAATAAACATGGAA @ 86934861
AAATAAACATGGAAT @ 10289866
AAATAAACATGGAAT @ 59348359
ATAAACATGGAATTT @ 112855926
TAAACATGGAATTTT @ 131710344
AAACATGGAATTTTA @ 131710345
TTTTACACATTCATA @ 41852167
TACACATTCATAATA @ 1901678
ACACATTCATAATAT @ 1901679
ACACATTCATAATAT @ 71784828
ATTCATAATATTTGG @ 22154246
ATTCATAATATTTGG @ 62843887
ATAATATTTGGTTTC @ 28535147
TAATATTTGGTTTCT @ 28535148
AATATTTGGTTTCTT @ 28535149
AATATTTGGTTTCTT @ 96954044
ATTTGGTTTCTTGTG @ 75990388
TTTGGTTTCTTGTGA @ 75702626
TGGTTTCTTGTGATT @ 126894584
GGTTTCTTGTGATTT @ 126894585
GTTTCTTGTGATTTT @ 30986968
GTTTCTTGTGATTTT @ 126894586
TTTCTTGTGATTTTG @ 30986969
TTTCTTGTGATTTTG @ 78652143
TTCTTGTGATTTTGA @ 78652144
TTCTTGTGATTTTGA @ 85357431
TCTTGTGATTTTGAT @ 78652145
TCTTGTGATTTTGAT @ 84620962
TCTTGTGATTTTGAT @ 85357432
TTGTGATTTTGATGA @ 36183337
TTGTGATTTTGATGA @ 57786264
TGTGATTTTGATGAA @ 83996956
TGATTTTGATGAACT @ 17387086
GATGAACTATACTTT @ 29320135
AACTATACTTTCAAG @ 94551800
CTATACTTTCAAGCA @ 8196090
TTCAAGCATTGTCAG @ 91923122
TCAAGCATTGTCAGT @ 91923123
CATTGTCAGTTCATC @ 73717720
TCATCCTCATTTACA @ 20065403
TCATCCTCATTTACA @ 127478076
CATCCTCATTTACAA @ 27616514
ATCCTCATTTACAAG @ 27616515
ATCCTCATTTACAAG @ 72562026
TCCTCATTTACAAGG @ 27616516
TCATTTACAAGGAAA @ 31135295
TTTGCTGTAGCAGAG @ 114172430
TTGCTGTAGCAGAGT @ 114172431
TGCTGTAGCAGAGTG @ 104066066
TAGCAGAGTGATGTA @ 93583751
ATGTAGTGACTTGCT @ 107938271
AGTGACTTGCTTTAC @ 24197857
GTGACTTGCTTTACT @ 24197858
GTGACTTGCTTTACT @ 64231044
TGACTTGCTTTACTT @ 110442270
GACTTGCTTTACTTA @ 88127227
ACTTGCTTTACTTAA @ 29960804
ACTTGCTTTACTTAA @ 106411515
CTTGCTTTACTTAAG @ 106411516
TTGCTTTACTTAAGA @ 96340249
TGCTTTACTTAAGAT @ 125739858
GCTTTACTTAAGATA @ 117261295
TTAAGATAAGTCAAC @ 8982769
AAGATAAGTCAACTT @ 101831570
AAGTCAACTTTTCTA @ 25208731
CAACTTTTCTATTCA @ 39283842
TTCAATGTTTAATGT @ 63988461
AATGTTTAATGTGAA @ 31365584
TGTTTAATGTGAAAC @ 110533215
GTTTAATGTGAAACT @ 82834131
TTTAATGTGAAACTG @ 136363339
TTAATGTGAAACTGA @ 127537128
TGTGAAACTGACTTA @ 79353759
TGTGAAACTGACTTA @ 132956746
GTGAAACTGACTTAA @ 132956747
TGAAACTGACTTAAA @ 132956748
AAACTGACTTAAAAA @ 12837948
AAACTGACTTAAAAA @ 61097185
AACTGACTTAAAAAT @ 61097186
ACTGACTTAAAAATT @ 123787112
TGACTTAAAAATTTT @ 48451028
TGACTTAAAAATTTT @ 65958239
GACTTAAAAATTTTC @ 14186884
ACTTAAAAATTTTCT @ 23313030
CTTAAAAATTTTCTT @ 23313031
CTTAAAAATTTTCTT @ 73951041
TTAAAAATTTTCTTT @ 8615394
TTAAAAATTTTCTTT @ 14017777
TTAAAAATTTTCTTT @ 19779589
TTAAAAATTTTCTTT @ 21143741
TTAAAAATTTTCTTT @ 23313032
TTAAAAATTTTCTTT @ 40158728
TTAAAAATTTTCTTT @ 41199787
TTAAAAATTTTCTTT @ 63610294
TTAAAAATTTTCTTT @ 65399779
TTAAAAATTTTCTTT @ 90954965
TTAAAAATTTTCTTT @ 106558578
TTAAAAATTTTCTTT @ 119069602
TAAAAATTTTCTTTT @ 8615395
TAAAAATTTTCTTTT @ 14017778
TAAAAATTTTCTTTT @ 24799924
TAAAAATTTTCTTTT @ 26825426
TAAAAATTTTCTTTT @ 83574192
AAAAATTTTCTTTTA @ 14017779
AAAAATTTTCTTTTA @ 55803696
AAAAATTTTCTTTTA @ 66006181
AAATTTTCTTTTAGG @ 10733784
AAATTTTCTTTTAGG @ 27785646
AAATTTTCTTTTAGG @ 65454678
AAATTTTCTTTTAGG @ 116994246
AATTTTCTTTTAGGC @ 44825784
AATTTTCTTTTAGGC @ 92370563
TTTTCTTTTAGGCCA @ 127509295
TTTTCTTTTAGGCCA @ 133590616
TTAGGCCACATTTAC @ 23472302
TAGGCCACATTTACA @ 23472303
AGGCCACATTTACAT @ 23126242
GCCACATTTACATTT @ 3729475
ACATTTACATTTCTG @ 118505993
CATTTACATTTCTGT @ 118505994
CATTTACATTTCTGT @ 125718853
TTTACATTTCTGTTT @ 55784832
TTTACATTTCTGTTT @ 77359250
TACATTTCTGTTTGT @ 40619224
TACATTTCTGTTTGT @ 86828922
ACATTTCTGTTTGTA @ 59584800
CATTTCTGTTTGTAT @ 25389019
CATTTCTGTTTGTAT @ 59584801
ATTTCTGTTTGTATA @ 81302122
TTTCTGTTTGTATAA @ 76464583
TATAACATGCATTAT @ 122195209
AACATGCATTATACT @ 78037829
ATGCATTATACTATT @ 53081634
TGCATTATACTATTT @ 53081635
GCATTATACTATTTT @ 53081636
CATTATACTATTTTA @ 85489214
CATTATACTATTTTA @ 105572606
TTATACTATTTTAAC @ 22104743
TTATACTATTTTAAC @ 128195324
CTATTTTAACTTAAT @ 77160086
TATTTTAACTTAATA @ 109317771
TTTAACTTAATACTA @ 85334088
CTTAATACTATGGCT @ 135115458
CTATGGCTATATAAA @ 96135523
TGGCTATATAAACCA @ 70002259
CTATATAAACCAAAG @ 54664899
TATAAACCAAAGACA @ 52656737
ATAAACCAAAGACAT @ 12321653
ATAAACCAAAGACAT @ 136695977
TAAACCAAAGACATC @ 136695978
AAACCAAAGACATCC @ 100134059
CAAAGACATCCTAAT @ 71229643
TCCTAATTCTGTTTT @ 39835153
CCTAATTCTGTTTTT @ 53454346
CTAATTCTGTTTTTA @ 86512458
TAATTCTGTTTTTAA @ 31658507
TAATTCTGTTTTTAA @ 60207366
TAATTCTGTTTTTAA @ 86512459
TAATTCTGTTTTTAA @ 100981873
TAATTCTGTTTTTAA @ 130229627
TAATTCTGTTTTTAA @ 137651802
AATTCTGTTTTTAAT @ 31658508
AATTCTGTTTTTAAT @ 60207367
AATTCTGTTTTTAAT @ 77640801
AATTCTGTTTTTAAT @ 89147495
AATTCTGTTTTTAAT @ 130229628
ATTCTGTTTTTAATT @ 7772213
ATTCTGTTTTTAATT @ 31658509
ATTCTGTTTTTAATT @ 60031795
ATTCTGTTTTTAATT @ 77640802
ATTCTGTTTTTAATT @ 110363277
TTCTGTTTTTAATTT @ 7772214
TTCTGTTTTTAATTT @ 8383518
TTCTGTTTTTAATTT @ 11907306
TTCTGTTTTTAATTT @ 18754199
TTCTGTTTTTAATTT @ 26016344
TTCTGTTTTTAATTT @ 34292703
TTCTGTTTTTAATTT @ 59506406
TTCTGTTTTTAATTT @ 60031796
TTCTGTTTTTAATTT @ 68647557
TTCTGTTTTTAATTT @ 71995203
TTCTGTTTTTAATTT @ 77640803
TTCTGTTTTTAATTT @ 82155805
TTCTGTTTTTAATTT @ 88526894
TTCTGTTTTTAATTT @ 90961398
TTCTGTTTTTAATTT @ 110363278
TTCTGTTTTTAATTT @ 111381711
TTCTGTTTTTAATTT @ 121413806
TTCTGTTTTTAATTT @ 137283747
TCTGTTTTTAATTTA @ 3971568
TCTGTTTTTAATTTA @ 68647558
TCTGTTTTTAATTTA @ 102862219
TCTGTTTTTAATTTA @ 121413807
CTGTTTTTAATTTAT @ 68647559
CTGTTTTTAATTTAT @ 77850953
CTGTTTTTAATTTAT @ 121413808
TGTTTTTAATTTATT @ 20200093
TGTTTTTAATTTATT @ 24803739
TGTTTTTAATTTATT @ 30166053
TGTTTTTAATTTATT @ 58888906
TGTTTTTAATTTATT @ 64508244
TGTTTTTAATTTATT @ 68647560
TGTTTTTAATTTATT @ 77850954
TGTTTTTAATTTATT @ 87255313
TGTTTTTAATTTATT @ 121296317
TGTTTTTAATTTATT @ 121413809
TTTTTAATTTATTGT @ 2431680
TTTTTAATTTATTGT @ 42972540
TTTTTAATTTATTGT @ 56464010
TTTTTAATTTATTGT @ 87311030
TTTTTAATTTATTGT @ 105614509
TTTTTAATTTATTGT @ 111368143
TTTTAATTTATTGTT @ 2431681
TTTTAATTTATTGTT @ 6738992
TTTTAATTTATTGTT @ 42972541
TTTTAATTTATTGTT @ 56464011
TTTTAATTTATTGTT @ 60750818
TTTTAATTTATTGTT @ 87311031
TTTTAATTTATTGTT @ 89579501
TTTTAATTTATTGTT @ 91940094
TTTTAATTTATTGTT @ 119125405
TTTAATTTATTGTTT @ 8357767
TTTAATTTATTGTTT @ 18169483
TTTAATTTATTGTTT @ 56464012
TTTAATTTATTGTTT @ 57573177
TTTAATTTATTGTTT @ 87311032
TTTAATTTATTGTTT @ 91940095
TTTAATTTATTGTTT @ 94409037
TTTAATTTATTGTTT @ 97042972
TTTAATTTATTGTTT @ 125297378
TTAATTTATTGTTTT @ 8357768
TTAATTTATTGTTTT @ 56464013
TTAATTTATTGTTTT @ 57573178
TTAATTTATTGTTTT @ 87311033
TTAATTTATTGTTTT @ 91940096
TTAATTTATTGTTTT @ 94409038
TTAATTTATTGTTTT @ 113397909
TTAATTTATTGTTTT @ 125297379
TAATTTATTGTTTTG @ 56464014
TAATTTATTGTTTTG @ 94409039
TAATTTATTGTTTTG @ 98897904
AATTTATTGTTTTGC @ 85311418
AATTTATTGTTTTGC @ 87012671
AATTTATTGTTTTGC @ 89746733
ATTTATTGTTTTGCA @ 69110761
ATTTATTGTTTTGCA @ 89746734
TTTATTGTTTTGCAA @ 55713091
TTTATTGTTTTGCAA @ 61178342
TTTATTGTTTTGCAA @ 78758779
TATTGTTTTGCAAGA @ 26203828
TTGTTTTGCAAGATA @ 67846648
GTTTTGCAAGATAAA @ 103518771
GTTTTGCAAGATAAA @ 119973395
TTTTGCAAGATAAAT @ 30630630
TTTTGCAAGATAAAT @ 138191283
TTTGCAAGATAAATA @ 30630631
TTGCAAGATAAATAA @ 30630632
TGCAAGATAAATAAT @ 30630633
CAAGATAAATAATTT @ 60932353
CAAGATAAATAATTT @ 98524516
AAGATAAATAATTTT @ 28761128
AAGATAAATAATTTT @ 51074242
AAGATAAATAATTTT @ 60932354
AAGATAAATAATTTT @ 84176033
AGATAAATAATTTTT @ 51486240
AGATAAATAATTTTT @ 72431608
AGATAAATAATTTTT @ 84176034
AGATAAATAATTTTT @ 118783696
GATAAATAATTTTTA @ 53966027
GATAAATAATTTTTA @ 82827010
GATAAATAATTTTTA @ 84437761
ATAAATAATTTTTAA @ 260550
ATAAATAATTTTTAA @ 21366953
ATAAATAATTTTTAA @ 24926494
ATAAATAATTTTTAA @ 31700475
ATAAATAATTTTTAA @ 45250166
ATAAATAATTTTTAA @ 53966028
ATAAATAATTTTTAA @ 62445286
ATAAATAATTTTTAA @ 77381237
ATAAATAATTTTTAA @ 82827011
ATAAATAATTTTTAA @ 84437762
ATAAATAATTTTTAA @ 90015487
ATAAATAATTTTTAA @ 90117165
ATAAATAATTTTTAA @ 106393222
ATAAATAATTTTTAA @ 108581976
ATAAATAATTTTTAA @ 109731147
TAAATAATTTTTAAT @ 8630799
TAAATAATTTTTAAT @ 53966029
TAAATAATTTTTAAT @ 70093203
TAAATAATTTTTAAT @ 87579746
TAAATAATTTTTAAT @ 103789158
AAATAATTTTTAATG @ 21020967
AAATAATTTTTAATG @ 54772661
AAATAATTTTTAATG @ 77140447
AAATAATTTTTAATG @ 98548754
AAATAATTTTTAATG @ 125680121
ATAATTTTTAATGCA @ 111840180
AATTTTTAATGCACT @ 120689362
TTTTTAATGCACTAA @ 62816899
AATGCACTAATGAAG @ 69947146
CACTAATGAAGCTAA @ 108678870
AGTTCATATTAAAAG @ 46997883
TCATATTAAAAGACA @ 63681510
CATATTAAAAGACAT @ 63681511
AAAAGACATGTTATA @ 3265579
AAAGACATGTTATAT @ 3265580
AAGACATGTTATATG @ 48740551
TGACCTACTGAAGTT @ 104546580
TACTGAAGTTGAGCT @ 49414778
CTGAAGTTGAGCTAT @ 71129718
2,040,294,164 microsecs (around 34 minutes)

Here is the code I used, again, in the IDE:

  dim msg as string
  dim sw as new Stopwatch_MTC
  sw.Start
  
  dim rx as new RegExMBS
  rx.CompileOptionCaseLess = True
  rx.CompileOptionDotAll = False
  rx.CompileOptionUngreedy = False
  rx.CompileOptionNewLineAnyCRLF = True
  rx.ExecuteOptionNotEmpty = False
  rx.CompileOptionMultiline = true
  rx.CompileOptionNoUTF8Check = true
  rx.CompileOptionUTF8 = true
  
  dim mbQuery as MemoryBlock = QuerySource
  dim lastIndex as integer = mbQuery.Size - 15
  
  ProgressBar1.Visible = true
  ProgressBar1.Maximum = lastIndex
  ProgressBar1.Value = -1
  
  for i as integer = 0 to lastIndex
    dim thisQuery as string = mbQuery.StringValue( i, 15 )
    if rx.Compile( thisQuery ) and rx.Study then
      dim r as integer = rx.Execute( chromosomeDNA, 0 )
      while r > 0
        AddToResult thisQuery + " @ " + str( rx.OffsetCharacters( 0 ) + 1 )
        r = rx.Execute( rx.Offset( 0 ) + 1 )
      wend
    end if
    ProgressBar1.Value = i
    if i mod 5 = 0 then ProgressBar1.Refresh
    if UserCancelled then exit
  next
  
  sw.Stop
  msg = format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
  AddToResult msg

Thanks Kem. That’s interesting! Do you use a MemoryBlock for speed? I’ve never used one. Also, what was your DNA? Did you concatenate all of the strings in the file, or were you using just a single block of 100,020?

I did not concatenate the strings, I just used the file you gave me. If I need to, I can remove all the white space from that file pretty quickly before running the test, but while that may impact the result, I doubt it will make much difference to the speed.

Yes, I used the MemoryBlock for speed. It might have been just as fast to use MidB, but I didn’t try.

To be perfectly clear, I was running each set of 15 in QuerySource against the text contained in the chromosomeDNA.txt file you provided.

I will incorporate your approach into my code and see what happens (tomorrow). I will let you know how things work out. I should be able to plug your code in with only minor changes. I will run a new set of analyses tonight with the current (old) method and then will use your approach tomorrow. I’ll post results.

FYI, I am running another test now with RegExMBS.ExecuteMT, the multi-thread version of the Execute command. I’ll let you know if it makes a difference.

No, ExecuteMT didn’t make much of a difference.

BTW, reading each line would be much slower than reading the entire file in one go.

Reading the file does not take very long compared to all the other steps. It’s easy to put each line into an array element as it is read in.

I actually posted a simplified version of the file. In real life the file also contains information about the location of each gene along the DNA sequence and whether the gene is in the “forward” or “backward” direction, so this information also has to be read in.

Here’s a routine that runs in a bit over 2.5 minutes.
http://home.comcast.net/~trochoid/code/DNASearcher.zip

Press the Open button and choose file chromosomeDNA first then QuerySource. The dna is read into MemoryBlock dnaMem and the query read into mersInput. Then mersInput is scanned to build several arrays for each 15-mer, 1 as the full String, 1 for the first 8 chars as UInt64, another for the next 4 chars as UInt32, next 2 as UInt16 and last as UInt8. Then a map has to be built because the UInt64s may collide/be-the-same. So 2 more arrays are built, one for unique UInt64 values and the other for an array of indices with that value. This loading all happens pretty quick and I haven’t timed it.

Then click the search button. It scans the dna data byte position by byte position, searching the unique UInt64s with IndexOf. If a match is found, the corresponding indices are looped over and the rest of each 15-mer is checked for inclusion in results.
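To make the 8 + 4 + 2 + 1 byte idea concrete, here is a small sketch of just the comparison step. It is not the code from the zip: MatchesAt, MakeMerBlock and their parameters are made-up names, and the unique-UInt64 lookup with IndexOf is left out.

```xojo
// Hypothetical helper: copies one 15-mer into its own 15-byte MemoryBlock.
Function MakeMerBlock(mer As String) As MemoryBlock
  Dim mb As New MemoryBlock(15)
  mb.LittleEndian = True // assumption: use the same setting as the dna block
  mb.StringValue(0, 15) = Uppercase(mer)
  Return mb
End Function

// Hypothetical helper: True if the 15 bytes at dnaMem offset pos equal the 15-mer.
// pos is a 0-based byte offset; the caller must ensure pos + 15 <= dnaMem.Size.
Function MatchesAt(dnaMem As MemoryBlock, pos As Integer, merMem As MemoryBlock) As Boolean
  // Compare the first 8 bytes as one UInt64; most positions fail right here.
  If dnaMem.UInt64Value(pos) <> merMem.UInt64Value(0) Then Return False
  // Then the next 4 bytes, the next 2, and the final byte.
  If dnaMem.UInt32Value(pos + 8) <> merMem.UInt32Value(8) Then Return False
  If dnaMem.UInt16Value(pos + 12) <> merMem.UInt16Value(12) Then Return False
  Return dnaMem.UInt8Value(pos + 14) = merMem.UInt8Value(14)
End Function
```

In this sketch the result does not depend on which byte order is chosen, as long as dnaMem and merMem use the same LittleEndian setting, since the same bytes are read the same way on both sides.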

There’s an issue with endianness I’m not sure about. You might need to change the line…
buildMatch.LittleEndian = TRUE
. . .to FALSE.

Also, I call Uppercase on the dna data to homogenize it, runs amazingly quick. Should probably do this for the query too. And I’m not sure about the effect of Ns, do they appear both in the query and dna?