Test if a string is a valid URL

I’m struggling to write a working method that takes a string and returns True if it’s a valid http URL or False if it’s not.

By valid I simply mean ‘grammatically’ correct, not necessarily available.

It would also need to match ports, e.g.:

http://localhost:8888/index.html

Does anybody happen to have a method that might do this? Am struggling with the RegEx class.

I’d guess the best way is to do what you’re already doing and use regular expressions. However, I wouldn’t try to write my own: a pattern for valid URLs can get complicated, and others have already done the work, for example: http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url

I’m hoping I’m just misunderstanding the RegEx class because I can’t figure out why this isn’t working.

Here’s a method: isValidURL( url as String ):

[code]// Test for a valid URL

dim reg as new RegEx
dim myMatch as RegExMatch

reg.SearchPattern = "/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-])\/?$/"

myMatch = reg.Search(url)

if myMatch <> Nil then
  msgBox "Valid!"
else
  msgBox "Not valid"
end if[/code]

The string I’m passing to the method is:

http://www.google.com

The method uses the RegEx pattern from here. Every time I try to use one of the RegEx patterns from StackOverflow (like the one you posted) I get a RegExSearchPatternException saying:

character value in \x{...} sequence is too large

Thoughts?

:stuck_out_tongue:
If this has to do with the wiki to dash project … fun stuff, ain’t it?
Contact me offlist - I might save yer sanity :stuck_out_tongue:

Caught me Norm! Staring at this code at midnight is probably not the most sensible thing in the world to be doing :slight_smile:

Here is what I use… without REGEX

FUNCTION isValidURL(URL as string) as boolean
  Dim err_flag As Boolean
  Dim i As Integer
  Dim s As String
  err_flag=False
  If Left(url,7)="HTTP://" Then
    url=Mid(url,8)
  Elseif Left(url,8)="HTTPS://" Then
    url=Mid(url,9)
  End If
  url=ReplaceAll(url,"\",".")
  url=ReplaceAll(url,"/",".")
  err_flag=(url.Len=0)
  If Not err_flag Then
    For i=1 To url.Len
      Select Case Mid(url,i,1)
      Case "A" To "Z","a" To "z","0" To "9","_","-",".","~"
      Case "%" ' hex char
        s="&H"+Mid(url,i+1,2)
        If (Val(s)=0 And s<>"&H00") Then err_flag=True
      Case Else
        err_flag=True
      End Select
      If err_flag=True Then Exit For
    Next i
  End If
  Return Not err_flag
END FUNCTION

It returns TRUE if the URL is VALID (not if the URL exists)… only that it meets the pattern requirement.

Here ya go:

[code]Function IsValidURL(url As String) As Boolean
  Dim r As New RegEx

  r.SearchPattern = "((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-]*)?\??(?:[-\+=&;%@.\w])#?(?:[\w]))?)"

  Return (r.Search(url) <> Nil)
End Function
[/code]

For the fun of it, profiling the two solutions,

IsValidURL (w/o RegEx) = 0.0198ms
IsValidURL (w/RegEx) = 0.0186ms

Average across 100,000 iterations. So they are equal, well within the margin of error for benchmarking. The regex version, though, will test all sorts of URLs, such as ftp://, mailto:, etc… It will also handle URLs with ports, users and passwords, for example: https://john:doe@google.com:394/jack … So I believe it to be a bit more robust.

[quote=16226:@Jeremy Cowgar]For the fun of it, profiling the two solutions, … So I believe it to be a bit more robust.[/quote]

You know, if your app does a lot of IsURL checking, the regexp version can be run in 1/2 the time. Simply make the RegEx itself a property of your application, window, class, or module. That saves creating the object and assigning the pattern on every call. Then change the function to read:

Function IsValidURL(url As String) As Boolean
  Return (IsValidURL_RegEx.Search(url) <> Nil)
End Function

This reduces the time per iteration to 0.0096ms vs. 0.0198ms.
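For anyone who wants to try the same idea outside Xojo, here is a minimal Python sketch of the "build the regex once, reuse it on every call" approach. It assumes the pattern exactly as posted above; `_URL_RE` and `is_valid_url` are illustrative names only, and no timings are claimed for Python.

```python
import re

# Compiled once at import time, so each call only pays for the search.
# The pattern is the one posted earlier in this thread, copied as-is.
_URL_RE = re.compile(
    r"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+"
    r"|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)"
    r"((?:\/[\+~%\/.\w-]*)?\??(?:[-\+=&;%@.\w])#?(?:[\w]))?)"
)


def is_valid_url(url: str) -> bool:
    """Return True if the shared, precompiled pattern matches anywhere in url."""
    return _URL_RE.search(url) is not None


if __name__ == "__main__":
    for candidate in ("http://www.google.com", "www.googlecom", ""):
        print(repr(candidate), is_valid_url(candidate))
```

Python’s re module also keeps its own cache of recently compiled patterns, so the saving there may well be smaller than the Xojo numbers above.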

Thanks for the replies Jeremy - really appreciate it but the code doesn’t work.

Your method thinks the following are valid:

www.googlecom (no period before com)
http://localhost:8888. (note the period at the end)
http://...dkdjkdjkd.com (clearly wrong)

It’s annoying that you can’t just cut and paste one of the StackOverflow search patterns, but they seem to be too big for the RegEx class to handle?

[quote=16224:@Dave S]Here is what I use… without REGEX[/quote]

Again thanks for this Dave but there are flaws with this method too. It incorrectly thinks the following are valid URLs:

www.googlecom
.com
kjskjkdjkdjd

Whereas the following are rejected:

localhost:8888
http://localhost:8888

Clearly this is a very tricky problem!

Sometimes finding the right regex is the hard part :-/

I did find what appears to be the perfect one, but I receive the same error you do: character value in \x{…} sequence is too large… Not sure where that is coming from.

Take a peek at http://mathiasbynens.be/demo/url-regex. @diegoperini’s seems to be the best (given the test cases), but on a first copy/paste, no luck.

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

Dang, that is a thorough blog post!

Do you think it’s a bug in the RegEx class? Perhaps we should file a Feedback case?

Does anyone have the MBS plugin installed? I think Christian makes a RegEx class. If that handles the search pattern correctly it would confirm a Xojo bug I guess?

I’m not certain. Someone who knows the internals of Xojo’s regexp implementation should chime in on Unicode characters in the actual expression. Stripping the Unicode ranges from diegoperini’s expression makes it work in Xojo:

"^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9]+-?)*[a-z0-9]+)(?:\.(?:[a-z0-9]+-?)*[a-z0-9]+)*(?:\.(?:[a-z]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$"

However, it does not do TLD validation. For example, www.googlecom is a grammatically valid URL, the same as john.com: a host name plus a TLD. Now, is com valid vs. googlecom? Or com vs. ca, or ca vs. biz, etc… What about us.edu vs. .kao.edu, etc…

Oh, you could give the regexes a go in another language, like Ruby, Python or Perl. That would be a good test also.
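As a rough sketch of that kind of cross-check, here is the stripped-down expression tried in Python. The pattern is copied from this post; the names `STRIPPED_URL_RE` and `is_valid_url` are made up for the example, and the case-insensitive flag stands in for the `i` modifier on the original.

```python
import re

# diegoperini's expression with the Unicode ranges removed, as posted above.
STRIPPED_URL_RE = re.compile(
    r"^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?"
    r"(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})"
    r"(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|(?:(?:[a-z0-9]+-?)*[a-z0-9]+)"
    r"(?:\.(?:[a-z0-9]+-?)*[a-z0-9]+)*"
    r"(?:\.(?:[a-z]{2,})))"
    r"(?::\d{2,5})?(?:/[^\s]*)?$",
    re.IGNORECASE,
)


def is_valid_url(url: str) -> bool:
    """Return True if the whole string matches the stripped pattern."""
    return STRIPPED_URL_RE.match(url) is not None


if __name__ == "__main__":
    samples = [
        "http://www.google.com",
        "http://www.googlecom",
        "http://...dkdjkdjkd.com",
        "www.google.com",
    ]
    for sample in samples:
        print(sample, is_valid_url(sample))
```

Note that this expression requires a scheme and a dotted host name, so strings without http://, https:// or ftp:// in front are rejected by design.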

The RegEx class uses a rather old version of PCRE. There’s probably a Feedback case about upgrading PCRE that you can sign on to, but I don’t know it offhand.

feedback://showreport?report_id=23227, 23227 - Please update (RegEx) PCRE library, 7.7 to latest version 8.31

Favourited.
What a pain.

Please see if this is of any help.

If you allow the ‘http://’ or ‘https://’ to be omitted from a URL, I can’t see why this one is not valid. I can obviously see that there is a mistake, but it can be perfectly valid. What you’re expecting from your code is more than telling whether a URL is valid; you’re asking it to determine whether the URL looks valid… You won’t get that with a simple REGEX pattern. You may need a bit of A.I. to do this…
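Short of A.I., one cheap step beyond a pure pattern match is to check the last label of the host against a list of known TLDs. A minimal Python sketch of that idea follows. Everything in it is an assumption for illustration: `KNOWN_TLDS` is a tiny made-up sample (a real check would load the full IANA list), `host_looks_plausible` is a hypothetical name, and treating localhost as acceptable is a choice made here only because the original question wants it to pass.

```python
from urllib.parse import urlsplit

# Hypothetical sample only; a real check would load the full IANA TLD list.
KNOWN_TLDS = {"com", "org", "net", "edu", "ca", "biz"}


def host_looks_plausible(url: str) -> bool:
    """Return True if the host is 'localhost' or ends in a TLD from KNOWN_TLDS."""
    # urlsplit only fills in the host when a scheme is present,
    # so assume http:// for bare strings such as 'www.google.com'.
    if "://" not in url:
        url = "http://" + url
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        # Raised for malformed hosts such as unbalanced IPv6 brackets.
        return False
    if host == "localhost":
        return True
    labels = host.split(".")
    # Require at least 'name.tld', no empty labels, and a known TLD.
    return len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_TLDS


if __name__ == "__main__":
    for sample in ("http://www.google.com", "www.googlecom",
                   "http://localhost:8888/index.html", "http://...dkdjkdjkd.com"):
        print(sample, host_looks_plausible(sample))
```

This still only says the host looks plausible, not that the URL exists, and it would sit alongside a grammar check rather than replace it.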