A project I’m working on needs to validate URLs that users enter. Thinking that this would be just a straightforward exercise in regular expressions, I hit Google to find out who’d already done the hard work for me. The true spirit of reuse.
A couple of false starts later and I’d found url_validation_improved. This seemed to be just the ticket: it has a regex for checking the URL format and even tests the connection.
To get started with my own validation I just wanted the regular expression part, as my own project currently only needs to validate the format of the URL. Here’s the regex from url_validation_improved:
/^(http|https)://[a-z0-9]+([-.]{1}[a-z0-9]+)*
.[a-z]{2,5}(([0-9]{1,5})?/.*)?$/ix
Looking good. I’m not overly familiar with regular expressions, so I plugged it into my model with validates_format_of and whipped up a unit test to throw a bunch of URLs at it. Everything was going fine until I added the URL of a test server the application will be interfacing with. As it’s a locally hosted Rails app, the base URL is http://127.0.0.1:3000. Suddenly, my tests imploded. It turns out that this regex doesn’t allow IP-specified URLs or port numbers. Back to the drawing board (well, Google) I went.
Not finding much with Google, I started to wonder whether any of the built-in Ruby classes or libraries could help. It wasn’t long before URI caught my eye, flirting with me and giggling as it showed off its parse method, which takes a URI string and returns an appropriate URI subclass representing the URI. The hussy. Not only does it do that, but it raises a URI::InvalidURIError if the URI given is, well, invalid.
I ripped the disappointing validates_format_of out of my model, and in went a shiny new validate method. All it has to do is check whether a URI::InvalidURIError has been raised, and also ensure that the returned URI subclass is for a protocol that’s acceptable. Here’s the whole thing:
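def validate
  begin
    # URI.parse raises URI::InvalidURIError for malformed input
    uri = URI.parse(url)
    # Strict class check: anything other than plain URI::HTTP is rejected
    if uri.class != URI::HTTP
      errors.add(:url, 'Only HTTP protocol addresses can be used')
    end
  rescue URI::InvalidURIError
    errors.add(:url, 'The format of the url is not valid.')
  end
end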
As there was now a possibility of two different error messages appearing on my model, I had to update my unit test. Once that was done, everything passed. The balance of the universe is restored.
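For anyone who wants to poke at the same logic outside a Rails model, here is a minimal sketch of it as a plain method. The name url_errors and the array return value are my own assumptions for illustration; the messages are the ones used above. One thing worth knowing: URI::HTTPS is a separate subclass of URI::HTTP, so the strict class comparison turns away https addresses as well.

```ruby
require 'uri'

# Hypothetical helper (not part of Rails): returns a list of error
# messages for the given string, or an empty list if it is acceptable.
def url_errors(url)
  uri = URI.parse(url)
  # Same strict class comparison as the validate method, so URI::HTTPS
  # and URI::Generic results are both rejected here.
  return ['Only HTTP protocol addresses can be used'] if uri.class != URI::HTTP

  []
rescue URI::InvalidURIError
  ['The format of the url is not valid.']
end

p url_errors('http://127.0.0.1:3000')
p url_errors('https://example.com')
p url_errors('http://exa mple.com')
```

If https should be accepted too, comparing with uri.is_a?(URI::HTTP) instead of the class equality check would be one way to do it.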
There are just a couple of caveats. Sometimes URI.parse returns a URI::Generic, which is the parent of the other URI types. I’ve not looked deeply into why this is, but it seems to happen when URI is sure enough that the string you’ve passed really does represent a URI, but can’t actually identify a protocol. Since I know I only want to deal with valid HTTP addresses, I restrict my code to accept only those as valid.
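To see which class comes back for a given string, a small sketch like this can help. The helper name parsed_class is hypothetical, and the sample strings are just examples I picked; a string with no scheme at all is the easiest way to get a URI::Generic.

```ruby
require 'uri'

# Hypothetical helper: the class URI.parse returns for a string,
# or nil when the string cannot be parsed at all.
def parsed_class(str)
  URI.parse(str).class
rescue URI::InvalidURIError
  nil
end

samples = [
  'http://127.0.0.1:3000',
  'https://example.com',
  'ftp://example.com',
  'www.example.com' # no scheme, so no specific subclass applies
]

samples.each do |s|
  puts "#{s} => #{parsed_class(s).inspect}"
end
```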
It should also be noted that there are some subtle differences between URIs and URLs (URLs are a subset of URIs), but finding out what that means in practical terms seems to be tricky. I’m making the assumption that if a string passes this validation, I can use it as what I would think of as a URL.
Interestingly, the url_validation_improved code calls URI.parse about 5 lines after it checks URLs against the regex. I wonder why it doesn’t use that as the test of URL validity…?
October 27, 2006 at 9:19 pm
very helpful, thanks!
December 20, 2006 at 6:04 am
Yeah, it’s really helpful, thanks.
January 18, 2007 at 10:02 am
I’ve just used this and it’s mostly great. However, I’ve found that it does accept URLs without a domain at the end (I tried my domain, http://www.draigwen, and it passes).
March 2, 2007 at 9:06 pm
I think some of the backslashes went missing when you tried to put it online.
In my humble opinion, the “.” character at the beginning of the second line needs to be escaped with a backslash. It represents the dot in “.com” (no pun intended).
The slashes also need to be escaped.
Dunno if this will show properly on the webpage, but here’s the escaped string:
/^(https?:\/\/)?[a-z0-9]+([-.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix
March 22, 2007 at 12:19 am
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/lib/uri/common.rb?view=markup
July 20, 2007 at 4:11 am
I think any expression like this is potentially flawed. A case of too much software.
That’s basically because it’s impossible to anticipate all the possible exceptions, such as international domain names or addresses entered without http://, for example www.bbc.co.uk. They may not be correct URLs, but from a usability standpoint the code should be able to accommodate lazy users.
The same goes for strange subdomains like wwwwwwww.domain.com, .info domains, or links to files that non-tech-savvy users have uploaded, e.g. http://www.domain.com/My%20Document.pdf
July 24, 2007 at 7:37 pm
That’s good
November 14, 2007 at 5:31 am
Hi …Thanks,
I have decided to go with URI
shyl -
January 16, 2008 at 11:55 pm
I’m not sure what this catches… it allows: http://blah
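A single-label host like that is a legal URI, which is why URI.parse lets it through. If that is not wanted, one option is an extra check on the parsed host. This is only a sketch: the name dotted_host? is hypothetical, and requiring a dot is my own assumption about what "looks like a real domain" means. It would also turn away localhost, and it does not check that the ending is a real top-level domain.

```ruby
require 'uri'

# Hypothetical extra check: the host must contain at least one dot,
# which rules out single-label hosts such as "blah".
def dotted_host?(url)
  host = URI.parse(url).host
  !host.nil? && host.include?('.')
rescue URI::InvalidURIError
  false
end

puts dotted_host?('http://blah')
puts dotted_host?('http://www.example.com')
puts dotted_host?('http://127.0.0.1:3000')
```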
January 23, 2008 at 5:03 pm
Ah Google. It led me here, but I have found that URI is very, very picky about URLs.
For example, this one from target.com cannot be parsed:
http://www.target.com/gp/detail.html/602-4045909-4263801?ASIN=B000NPCK3W&AFID=Froogle&LNM=B000NPCK3W|Lexmark_AllInOne_Printer_with_Scanner_and_Copier__X1240&ci_src=14110944&ci_sku=B000NPCK3W&ref=tgt_adv_XSG10001
I think it is the vertical bar in the URL, but we have found numerous other characters (e.g. caret) that URI won’t accept.
Seeking alternatives…
January 23, 2008 at 5:38 pm
Didn’t find any alternatives: the problem in the URL above is the vertical bar. If I escape it with %7C, URI is happy as a clam.
So reading the URI specification, it does appear that this character falls into a group of characters that should not be used in URLs (or should be escaped if they are). So I would be fine with Ruby’s picky URI class, except that in Rails, URLs are sometimes generated with ids in [123] square brackets, which are also not allowed by the spec, but which URI seems fine with.
So anyway, if anyone runs into this, just gsub out any characters like caret, backtick, tilde and possibly others, replacing them with their CGI-encoded variants.
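A minimal sketch of that gsub approach might look like the following. The constant and method names are hypothetical, and the character list is an assumption for illustration (vertical bar, caret, backtick, braces and backslash); tilde is left alone here because URI.parse accepts it unescaped.

```ruby
require 'uri'

# Characters to escape before parsing. This particular list is an
# assumption for illustration; extend it to whatever your URLs contain.
UNWISE_CHARS = /[|^`{}\\]/

# Hypothetical helper: percent-encode each listed character,
# e.g. the vertical bar becomes %7C.
def escape_unwise(url)
  url.gsub(UNWISE_CHARS) { |ch| format('%%%02X', ch.ord) }
end

raw  = 'http://www.example.com/detail.html?name=a|b^c'
safe = escape_unwise(raw)

puts safe
puts URI.parse(safe).class
```

Escaping before parsing means the stored value differs from what the user typed, so it is worth deciding whether to keep the raw string, the escaped one, or both.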
March 7, 2008 at 2:31 am
This is the best regex I’ve found so far:
/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/ix
It handles http and https, domain names, port numbers, domain names up to 5 characters, and even URLs like the one Tom Harrison said could not be parsed are matched correctly by the above regex. (Bare IP addresses such as 127.0.0.1 still fail, because the last part of the host has to be 2 to 5 letters.)
If you want to try it yourself
go into the console mode:
ruby ./script/console
url = "http://www.target.com/gp/detail.html/602-4045909-4263801?ASIN=B000NPCK3W&AFID=Froogle&LNM=B000NPCK3W|Lexmark_AllInOne_Printer_with_Scanner_and_Copier__X1240&ci_src=14110944&ci_sku=B000NPCK3W&ref=tgt_adv_XSG10001"
reg = /^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/ix
reg.match(url) ? true : false
you’ll see it’ll return true
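To get a feel for what this pattern does and does not accept, it can be run against a handful of samples in one go. This is just a sketch: the constant name URL_REGEX and the sample list are my own choices, and the regex itself is copied unchanged from the comment. Going by the pattern, the host has to end in a dot followed by 2 to 5 letters, so I would expect the dotted-quad address from the article and single-label hosts not to match.

```ruby
# The regex from the comment above, copied as-is.
URL_REGEX = /^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/ix

samples = [
  'http://www.example.com',
  'https://www.example.com:8080/some/path?x=a|b',
  'http://127.0.0.1:3000', # host ends in a digit, not letters
  'http://blah',           # no dot in the host at all
  'ftp://www.example.com'  # scheme is neither http nor https
]

samples.each do |url|
  puts "#{url} => #{URL_REGEX.match?(url)}"
end
```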
March 7, 2008 at 2:32 am
When I said domain names up to 5 characters I meant domain extensions, like .info, .com, .org, .tv etc
March 11, 2008 at 8:58 pm
In a description text I have to find and replace URLs…
Hi, how can I search for the regex in a string? What should the pattern look like?
Thank you
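One possible answer, as a rough sketch: drop the ^ and $ anchors so the pattern can match in the middle of a string, then use scan to find URLs and gsub to replace them. The pattern below is a deliberately loose one of my own (scheme, then anything up to whitespace, angle bracket or double quote), and the names URL_IN_TEXT, extract_urls and link_urls are hypothetical.

```ruby
# Loose, unanchored pattern for spotting URLs inside free text.
# This is an assumption for illustration, not a validating regex.
URL_IN_TEXT = %r{https?://[^\s<>"]+}

# Hypothetical helper: every URL-looking substring, in order.
def extract_urls(text)
  text.scan(URL_IN_TEXT)
end

# Hypothetical helper: wrap each URL-looking substring in a link.
def link_urls(text)
  text.gsub(URL_IN_TEXT) { |url| %(<a href="#{url}">#{url}</a>) }
end

text = 'Visit http://example.com/page or https://foo.org/x today'
p extract_urls(text)
puts link_urls(text)
```

A pattern this loose will swallow trailing punctuation such as a full stop at the end of a sentence, and link_urls does no HTML escaping, so both would need attention before using it on real user input.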
March 27, 2008 at 11:25 pm
Check this model, towards the bottom there is a pretty good regex. It only needs to recognise urls with http|https, but that is very easy to do.
http://sample.caboo.se/weed2/app/models/domain.rb
March 27, 2008 at 11:26 pm
In case it’s hard to see…
PORT = /(([:]\d+)?)/
DOMAIN = /([a-z0-9\-]+\.?)*([a-z0-9]{2,})\.[a-z]{2,}/
NUMERIC_IP = /(?>(?:1?\d?\d|2[0-4]\d|25[0-5])\.){3}(?:1?\d?\d|2[0-4]\d|25[0-5])(?:\/(?:[12]?\d|3[012])|-(?>(?:1?\d?\d|2[0-4]\d|25[0-5])\.){3}(?:1?\d?\d|2[0-4]\d|25[0-5]))?/
validates_format_of :name, :with => /^((localhost)|#{DOMAIN}|#{NUMERIC_IP})#{PORT}$/
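Since that pattern validates a bare host name rather than a whole URL, here is a sketch of how the pieces might be combined with an http|https prefix. PORT, DOMAIN and NUMERIC_IP are copied from the comment; FULL_URL, including the optional path part and the \A and \z anchors, is my own hypothetical addition.

```ruby
# Building blocks copied from the comment above.
PORT = /(([:]\d+)?)/
DOMAIN = /([a-z0-9\-]+\.?)*([a-z0-9]{2,})\.[a-z]{2,}/
NUMERIC_IP = /(?>(?:1?\d?\d|2[0-4]\d|25[0-5])\.){3}(?:1?\d?\d|2[0-4]\d|25[0-5])(?:\/(?:[12]?\d|3[012])|-(?>(?:1?\d?\d|2[0-4]\d|25[0-5])\.){3}(?:1?\d?\d|2[0-4]\d|25[0-5]))?/

# Hypothetical combined pattern: scheme, host, optional port, optional path.
# \A and \z anchor the whole string rather than a single line.
FULL_URL = %r{\Ahttps?://(localhost|#{DOMAIN}|#{NUMERIC_IP})#{PORT}(/.*)?\z}

[
  'http://127.0.0.1:3000',
  'http://localhost:3000',
  'https://www.example.com/path',
  'http://blah',
  'ftp://www.example.com'
].each do |url|
  puts "#{url} => #{FULL_URL.match?(url)}"
end
```

The copied building blocks are case-sensitive as written, so upper-case host names would need an i flag added to DOMAIN.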
April 10, 2008 at 4:13 am
Very informative.
July 16, 2008 at 1:32 am
Excellent. Thanks for the info!