Kirsle.net

Query String Delimiter?

April 22, 2009 by Noah
Some time ago, I got the idea to start using the semicolon (;) instead of the ampersand (&) as the delimiting character in query strings. For instance:
<a href="/index.cgi?p=blog;tag=General">
as opposed to
<a href="/index.cgi?p=blog&tag=General">
The reasoning behind it was something along these lines:
  • The W3C HTML Validator refused to validate my pages as HTML 4.01 Strict, citing that all those "&" characters should be fully typed out as "&amp;", because in HTML 4.01 a bare & inside an attribute value is read as the start of a character reference and so has to be escaped.
  • YaBB Forum, my favorite forum software for being Perl and not dependent on SQL, was using semicolons instead of ampersands.
  • The Perl CGI module supports both forms of delimiters, so what's it matter?
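To make the "supports both forms" point concrete, here is a minimal sketch in Python (not the Perl the site actually runs, and not CGI.pm's real implementation). The function name parse_query is made up for illustration; it only shows the general idea of treating ";" and "&" as interchangeable delimiters and decoding each piece after splitting.

```python
import re
from urllib.parse import unquote_plus

def parse_query(query):
    """Split a query string on either '&' or ';', then decode each pair.

    Illustrative only: this mimics the behavior described in the post,
    it is not a port of CGI.pm.
    """
    params = {}
    for pair in re.split(r"[&;]", query):
        if not pair:
            continue  # skip empty pieces such as a trailing delimiter
        key, _, value = pair.partition("=")
        params.setdefault(unquote_plus(key), []).append(unquote_plus(value))
    return params

# Both spellings of the link from the post parse to the same parameters.
print(parse_query("p=blog;tag=General"))  # {'p': ['blog'], 'tag': ['General']}
print(parse_query("p=blog&tag=General"))  # {'p': ['blog'], 'tag': ['General']}
```

Splitting happens before percent-decoding, so only literal delimiter characters separate parameters.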
Then I was poking through some of my incoming referring URLs, many of which were Google search results that matched some of my blog entries. And this is where the semicolon-as-delimiter idea falls apart: apparently, in Firefox at least, those semicolons have a tendency to be percent-encoded as %3B.

So, http://www.cuvou.com/?p=blog;id=36 looks right in the Google search results, but after it gets chewed up with Google's outgoing statistic gathering and finally accessed by the browser, the latter part of that request comes to my site looking more like this: /?p=blog%3Bid=36. CGI.pm has no idea what to make of this and it can't be blamed. I've tried substituting it in $ENV{QUERY_STRING} before CGI.pm can get its hands on it, but it doesn't help.

So effectively the user is greeted with a "Forbidden" page of mine, which was fired because the value of "p=" contains some invalid character (notably, that % symbol there).
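A rough Python sketch of that failure, under assumptions: the post does not show the actual validation rule, so the allow-list pattern SAFE_PAGE and the function page_status below are hypothetical stand-ins. The point is only that an escaped semicolon is no longer seen as a delimiter, so everything after "p=" ends up in one value that contains a "%".

```python
import re

# Hypothetical allow-list for page names; the real rule is not shown in the post.
SAFE_PAGE = re.compile(r"[A-Za-z0-9_-]+")

def page_status(raw_query):
    """Return 200 if the raw 'p' value passes the allow-list, else 403."""
    for pair in re.split(r"[&;]", raw_query):
        key, _, value = pair.partition("=")
        if key == "p":
            return 200 if SAFE_PAGE.fullmatch(value) else 403
    return 200  # no 'p' parameter at all: nothing to reject in this sketch

print(page_status("p=blog;id=36"))    # 200: 'p' is just "blog"
print(page_status("p=blog%3Bid=36"))  # 403: 'p' is "blog%3Bid=36"
```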

So there's a conundrum here: semicolons as delimiters work as far as CGI is concerned, the markup validates perfectly as HTML 4.01 Strict, and you don't need to write "&amp;" all the time inside your internal site links. I mean seriously, how ugly is this HTML code?

<a href="/index.cgi?p=blog&amp;id=36">
The semicolon form validates and works as expected provided you're using it "properly"; however, it breaks your links in Google and possibly other search engines, at least in Firefox.

My temporary hack of a solution:

For my CMS, none of my links are "properly" written to begin with. They're like <a href="$link:blog;id=36"> which is translated on-the-fly, so it was fairly trivial to change the code to fix these things on the way out the door.

For the W3C's HTML validator, my links are translated to include the full and proper &amp; text. It's ugly and I'm only glad I don't have to write the links like that directly; my Perl code does it for me.
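The translation step might look something like the following Python sketch. The placeholder grammar is an assumption inferred from the single $link:blog;id=36 example (first piece is the page name, the rest are key=value pairs), and expand_links and its script parameter are hypothetical names, not the CMS's real code.

```python
import html
import re

# Assumed placeholder syntax, based on the one example in the post:
#   $link:PAGE;key=value;...
LINK = re.compile(r"\$link:([^\"'\s>]+)")

def expand_links(markup, script="/index.cgi"):
    """Rewrite $link: placeholders into validator-friendly hrefs."""
    def repl(match):
        parts = match.group(1).split(";")
        # First piece becomes p=PAGE; remaining pieces are passed through.
        query = "&".join(["p=" + parts[0]] + parts[1:])
        # html.escape turns the raw '&' into '&amp;' for the attribute value.
        return html.escape(script + "?" + query, quote=True)
    return LINK.sub(repl, markup)

print(expand_links('<a href="$link:blog;id=36">'))
# <a href="/index.cgi?p=blog&amp;id=36">
```

Keeping the escaping in one place means the templates never have to spell out "&amp;" by hand.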

The other half of the dirty hack is to detect when a troublesome URL has been linked to: particularly if %3B is found. If so, the CGI fixes the query string and sends an HTTP 301 redirect to the proper version of the URL, using the real semicolons (I could replace them with &'s here, but, why? The CGI module takes care of it anyway ;-) ).
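A minimal Python sketch of that redirect check, assuming the only repair needed is turning %3B back into a literal semicolon in the query string. The names redirect_target and respond are invented for illustration, and the real CGI's response handling is not shown in the post.

```python
import re
from urllib.parse import urlsplit, urlunsplit

ESCAPED_SEMICOLON = re.compile(r"%3B", re.IGNORECASE)

def redirect_target(url):
    """Return the repaired URL if the query holds an escaped semicolon, else None."""
    parts = urlsplit(url)
    if not ESCAPED_SEMICOLON.search(parts.query):
        return None
    fixed = ESCAPED_SEMICOLON.sub(";", parts.query)
    return urlunsplit(parts._replace(query=fixed))

def respond(url):
    """Return (status, headers): a 301 to the repaired URL, or 200 if none is needed."""
    target = redirect_target(url)
    if target is None:
        return 200, {}
    return 301, {"Location": target}

print(respond("/?p=blog%3Bid=36"))  # (301, {'Location': '/?p=blog;id=36'})
print(respond("/?p=blog;id=36"))    # (200, {})
```

One caveat with this approach: a semicolon that was legitimately percent-encoded inside a parameter value would also be turned back into a delimiter, so it only suits sites whose values never contain semicolons.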

I'll have to investigate what other web developers do with their query string delimiters...

Comments

There is 1 comment on this page.

Sean Brunnock posted on May 2, 2009 @ 14:53 UTC

Using a semicolon as a delimiter has been recommended by the W3C since 1994 (see http://www.ietf.org/rfc/rfc1738.txt ). It's insane for Google to escape it.

There's an open trouble ticket at Google at http://www.google.com/support/forum/p/Webmasters/thread?tid=661e8286964e2195&hl=en .
