Oct 7, 2013

Should I URL Encode or Convert to HTML Entities? Find Out in SECONDS!

When you are developing a website, you may sometimes be unsure whether to URL encode a piece of text or convert it to HTML entities. This article will help you make the right decision every time!

If you URL encode an ASCII character, you convert it to % followed by its two-digit hexadecimal code. For example, the URL encoded result of the character # is %23.

If you URL encode a non-ASCII character, you take its UTF-8 bytes and write each byte as % followed by that byte's two-digit hexadecimal value. For example, the character 聖 is three bytes long in UTF-8, so it is URL encoded to %E8%81%96.
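
As a quick check of both examples, here is a minimal Python sketch using quote from the standard library's urllib.parse module. Python is used here purely for illustration; the variable names are made up, and safe="" simply asks quote to encode every reserved character.

```python
from urllib.parse import quote, unquote

# An ASCII character becomes % followed by its two-digit hex code.
hash_encoded = quote("#", safe="")

# A non-ASCII character is first turned into UTF-8 bytes,
# then every byte is written as %XX.
kanji_encoded = quote("聖", safe="")

print(hash_encoded)            # %23
print(kanji_encoded)           # %E8%81%96
print(unquote(kanji_encoded))  # 聖
```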

By the way, if you use PHP you may be unsure whether to use urlencode() or htmlentities() in a given situation. Knowing when to URL encode and when to use HTML entities answers that question too.

1. If it's meant to be a URL, URL encode it.

Basically, if the text you are using is meant to be a URL, you should always URL encode it. Which characters exactly should you URL encode? The answer is easy: think about which characters have a special meaning in a URL, and which characters would cause confusion if left unencoded inside the href of an anchor tag.

Read on to see a comprehensive list of ASCII characters that should be URL encoded.

1A. Here's a list of ASCII characters that should be URL encoded.

Here is a list of ASCII characters that should be URL encoded, and why. The 1st column is the ASCII character. The 2nd column is the URL encoded result. The 3rd column is the reason.
space   %20   can easily be confused with a delimiter
#       %23   used as anchor to navigate to a bookmark in the webpage
%       %25   reserved as the first character of the URL encoded result
&       %26   used to chain multiple URL parameters
+       %2B   reserved to indicate a space in a URL query string
"       %22   may be used to enclose a value in an HTML tag
'       %27   may be used to enclose a value in an HTML tag
<       %3C   may confuse browser as to whether you are starting an HTML tag
>       %3E   may confuse browser as to whether you are ending an HTML tag
?       %3F   used to indicate the start of URL query string
Here's an example of a URL that follows this rule.
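
As a rough illustration, the Python sketch below URL encodes a query value containing several characters from the table. The host example.com, the parameter name q, and the helper name build_search_url are all made up for this example; only quote comes from the standard library.

```python
from urllib.parse import quote

def build_search_url(term: str) -> str:
    """Return a URL whose query value is URL encoded (hypothetical helper)."""
    # safe="" makes quote() encode everything except letters, digits,
    # and the characters _ . - ~ so #, &, +, ?, and space are all encoded.
    return "http://example.com/search?q=" + quote(term, safe="")

# Every character from the table above, encoded one by one.
table = {ch: quote(ch, safe="") for ch in " #%&+\"'<>?"}

url = build_search_url("C# & C++ tips?")
print(url)  # http://example.com/search?q=C%23%20%26%20C%2B%2B%20tips%3F
```

Without the encoding, the # would cut the URL off at "C" and the & would start a second parameter.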


1B. Non-ASCII characters should always be URL encoded.

If you handle non-ASCII characters as often as I do, you know they can be a pain to deal with. The problem is that not every browser handles them correctly in a URL. For example, although my up-to-date Chrome browser handles URLs containing raw UTF-8 characters perfectly, my friend's Android smartphone fails to do so.

The underlying problem is that not every browser interprets a UTF-8 webpage correctly. So your best bet is to always URL encode non-ASCII characters when they are meant to be part of a URL on a webpage.

For example, the href attribute of an <a> tag should hold a URL, and therefore it should be URL encoded. Instead of

<a href='http://www.chtoen.com/低頭族的英文怎麼說'>低頭族的英文怎麼說</a>

You should have

<a href='http://www.chtoen.com/%E4%BD%8E%E9%A0%AD%E6%97%8F%E7%9A%84%E8%8B%B1%E6%96%87%E6%80%8E%E9%BA%BC%E8%AA%AA'>低頭族的英文怎麼說</a>
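
One way to produce the second form is to encode only the path of the URL and leave the scheme and host alone. The following is a sketch in Python, assuming the standard urllib.parse module; encode_url_path is a hypothetical helper name, not something from a library.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def encode_url_path(url: str) -> str:
    """URL encode the path of a URL, keeping scheme, host, and slashes."""
    parts = urlsplit(url)
    # quote() keeps "/" by default, so path separators survive.
    encoded_path = quote(parts.path)
    return urlunsplit(parts._replace(path=encoded_path))

raw = "http://www.chtoen.com/低頭族的英文怎麼說"
encoded = encode_url_path(raw)
print(encoded)
# Decoding gives back the original URL.
print(unquote(encoded) == raw)  # True
```

Note that this helper assumes its input is not encoded yet. Running it on an already encoded URL would encode the % signs a second time.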

1C. Caveat

Don't underestimate the reach of this rule. The key phrase is "meant to be a URL". For example, the content of a meta tag that is meant to be used as a URL should be URL encoded too. You should have

<meta property="og:image" content="http://www.chtoen.com/image/%E8%A3%9D%E7%BD%AE%E7%9A%84%E6%8F%92%E5%AD%94.jpg"/>

Instead of

<meta property="og:image" content="http://www.chtoen.com/image/裝置的插孔.jpg"/>

Note that if the non-ASCII character is meant to be displayed to the visitors of your website, you should not URL encode it. More about it in the next section.

2. If it's meant to be readable text, use HTML entities.

If the text you are using is meant to be read by the visitors of your website, you should use HTML entities. Examples include the meta description, the title tag, and the h1 tag. So instead of

<h1>Food & Drink</h1>

You should have

<h1>Food &amp; Drink</h1>

If your webpage has UTF-8 characters, you should leave them as is. Make sure you have the proper meta tag, such as <meta charset="utf-8">, to inform the browser that the webpage is UTF-8 encoded.
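
To see what this looks like in code, here is a small Python sketch using escape from the standard library's html module. Python stands in for illustration only; in PHP, htmlspecialchars() plays a similar role. Note that escape only converts &, <, >, and the two quote characters, and it leaves non-ASCII characters untouched, which matches the advice above.

```python
from html import escape

# Special characters become entities; everything else is left alone.
heading = escape("Food & Drink")
title = escape('"插孔" 的英文怎麼說')

print(f"<h1>{heading}</h1>")      # <h1>Food &amp; Drink</h1>
print(f"<title>{title}</title>")  # <title>&quot;插孔&quot; 的英文怎麼說</title>
```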

Putting all this together: in an img tag, the src attribute should be URL encoded, and the alt attribute should use HTML entities.
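
A minimal sketch of that rule in Python follows. The helper name img_tag and the alt text are invented for the example; quote and escape come from the standard library.

```python
from html import escape
from urllib.parse import quote

def img_tag(base_url: str, filename: str, alt_text: str) -> str:
    """Build an <img> tag (hypothetical helper): URL encode src, escape alt."""
    # The file name is part of a URL, so it is URL encoded.
    src = base_url + quote(filename, safe="")
    # The alt text is meant to be read, so it gets HTML entities instead.
    alt = escape(alt_text)
    return f'<img src="{src}" alt="{alt}">'

tag = img_tag("http://www.chtoen.com/image/", "裝置的插孔.jpg", 'A "jack" & a plug')
print(tag)
```

The same string would be treated differently depending on where it ends up: URL encoded in src, converted to entities in alt.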

Now it's not so confusing anymore, is it?

A big example

Let's see a bigger example below to make sure you've got everything.

...
<meta name="Description" content="想知道 &quot;插孔&quot; 的英文怎麼說嗎?" />
<meta property="og:image" content="http://www.chtoen.com/image/%E8%A3%9D%E7%BD%AE%E7%9A%84%E6%8F%92%E5%AD%94.jpg"/>
<title>&quot;插孔&quot; 的英文怎麼說 - 中英物語</title>
插孔的英文怎麼說? 插孔的英文是 jack, plug, connector.
...

If you have any questions let me know and I will do my best to help you!
One Minute Information - by Michael Wen