A project I’m working on needs to validate URLs that users enter. Thinking this would be a straightforward exercise in regular expressions, I hit Google to find out who’d already done the hard work for me. The true spirit of reuse.
A couple of false starts later I found url_validation_improved. This seemed to be just the ticket: it has a regex for checking the URL format and even tests the connection.
To get started I just wanted the regular expression part, as my project currently only needs to validate the format of the URL. Here’s the regex from url_validation_improved:
/^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]+)*
\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix
Looking good. I’m not overly familiar with regular expressions, so I plugged it into my model with validates_format_of and whipped up a unit test to throw a bunch of URLs at it. Everything was going fine until I added the URL of a test server the application will be interfacing with. As it’s a locally hosted Rails app, the base URL is http://127.0.0.1:3000. Suddenly, my tests imploded. It turns out that this regex doesn’t allow URLs specified by IP address, or port numbers. Back to the drawing board (well, Google) I went.
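To see the failure in isolation, here is a small sketch that runs the pattern against a plain hostname and against the test server address. The regex is written with the backslash escapes a Ruby regex literal requires, which is my assumption about how the original reads, and matches_format? is a made-up helper name, not something from the plugin.

```ruby
# The format check, with "/" and the literal "." escaped (assumed escaping).
# The x flag lets the pattern span two lines; whitespace inside it is ignored.
URL_FORMAT = /^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]+)*
              \.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix

# Hypothetical helper: true when the whole string fits the pattern.
def matches_format?(url)
  !(URL_FORMAT =~ url).nil?
end

matches_format?('http://www.example.com')  # => true
matches_format?('http://127.0.0.1:3000')   # => false
```

The second call fails because the pattern insists on a dot followed by two to five letters (the top-level domain), and an IP address never supplies one.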
Not finding much on Google, I started to wonder whether any of the built-in Ruby classes or libraries could help. It wasn’t long before URI caught my eye, flirting with me and giggling as it showed off its parse method, which takes a URI string and returns an appropriate URI subclass representing the URI. The hussy. Not only does it do that, but it raises a URI::InvalidURIError if the URI given is, well, invalid.
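A quick way to get a feel for that behaviour is to ask URI.parse which class it picks for a few strings. This is only a sketch: describe_uri is a name I made up for the illustration, and the sample URLs are arbitrary.

```ruby
require 'uri'

# Hypothetical helper: the name of the class URI.parse returns,
# or 'invalid' when it raises URI::InvalidURIError.
def describe_uri(string)
  URI.parse(string).class.name
rescue URI::InvalidURIError
  'invalid'
end

describe_uri('http://127.0.0.1:3000')       # => "URI::HTTP"
describe_uri('https://example.com/path')    # => "URI::HTTPS"
describe_uri('ftp://example.com/file.txt')  # => "URI::FTP"
describe_uri('http://exa mple.com')         # => "invalid"
```

Note that the IP address and port that tripped up the regex are no trouble at all here.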
I ripped the disappointing validates_format_of out of my model and in went a shiny new validate method. All it has to do is check whether a URI::InvalidURIError has been raised, and also ensure that the returned URI subclass is for a protocol that’s acceptable. Here’s the whole thing:
def validate
  begin
    uri = URI.parse(url)
    if uri.class != URI::HTTP
      errors.add(:url, 'Only HTTP protocol addresses can be used')
    end
  rescue URI::InvalidURIError
    errors.add(:url, 'The format of the url is not valid.')
  end
end
As two different error messages could now appear on my model, I had to update my unit test. Once that was done, everything passed. The balance of the universe was restored.
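For anyone who wants to poke at the two messages without a Rails model to hand, here is a rough standalone sketch of the same logic. url_errors is a hypothetical stand-in that collects the messages in a plain array instead of calling errors.add; it is not the code from my model.

```ruby
require 'uri'

# Hypothetical stand-in for the model's validate method:
# returns the messages that would have been added to errors.
def url_errors(url)
  messages = []
  begin
    uri = URI.parse(url)
    # Same exact class comparison as in validate.
    messages << 'Only HTTP protocol addresses can be used' if uri.class != URI::HTTP
  rescue URI::InvalidURIError
    messages << 'The format of the url is not valid.'
  end
  messages
end

url_errors('http://127.0.0.1:3000')       # => []
url_errors('ftp://example.com/file.txt')  # => ["Only HTTP protocol addresses can be used"]
url_errors('http://exa mple.com')         # => ["The format of the url is not valid."]
```

One thing this makes visible: because the comparison is on the exact class, an https address is turned away too, since URI.parse hands back a URI::HTTPS for those. That suits me for now, but it is worth knowing if you borrow the idea.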
There are just a couple of caveats. Sometimes URI.parse returns a URI::Generic, which is the parent of the other URI types. I’ve not looked deeply into why this is, but it seems to happen when URI is sure enough that the string you’ve passed really does represent a URI, but can’t actually identify a protocol. Since I know I only want to deal with valid HTTP addresses, I restrict my code to accept only those.
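As far as I can tell, two easy ways to end up with a URI::Generic are leaving the scheme off altogether, or using a scheme that URI has no dedicated class for. A small sketch, with sample strings chosen purely for illustration:

```ruby
require 'uri'

# No scheme at all: the whole string is treated as a relative path.
no_scheme = URI.parse('www.example.com')
no_scheme.class   # => URI::Generic
no_scheme.scheme  # => nil
no_scheme.path    # => "www.example.com"

# A scheme URI has no specific subclass for (gopher, in this example).
unknown = URI.parse('gopher://example.com')
unknown.class     # => URI::Generic
unknown.scheme    # => "gopher"
```

Neither of these raises URI::InvalidURIError, which is why the class check is needed on top of the rescue.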
It should also be noted that there are some subtle differences between URIs and URLs (URLs are a subset of URIs), but finding out what that means in practical terms seems to be tricky. I’m making the assumption that if a string passes this validation, I can use it as what I would think of as a URL.
Interestingly, the url_validation_improved code calls URI.parse about 5 lines after it checks URLs against the regex. I wonder why it doesn’t use that as the test of URL validity…?
Posted by Jonathan
