Cycymomo
4/30/2014 - 10:22 PM

Character encodings on the Web

Introduction

Every component of the Web handles strings, and Unicode character encoding in particular, differently.

This document aims to gather all the important information about character encodings on the Web.

File charsets

  • HTML, CSS and JS files can theoretically be encoded with any charset registered with the IANA. (source)
  • UTF-8 and ISO-8859-1 are generally used for their large browser support.
  • It is recommended to use UTF-8 without BOM.

File charsets, as seen by the browsers

  • Browsers use complex rules to guess the encoding of Web pages (HTML, XML, ...), unless a charset is defined manually. It is recommended to do so via an HTTP header (Content-Type: text/html; charset=UTF-8) and/or by declaring it explicitly in the file (<meta charset="utf-8"> for HTML, <?xml version="1.0" encoding="UTF-8"?> for XML). (source)
  • Browsers parse JavaScript and CSS files using the same encoding as the document that includes them, unless a different charset is defined manually. It is recommended to do so via an HTTP header and/or a charset attribute on link / script tags (<script charset="utf-8" src="...">, <link rel="stylesheet" charset="utf-8" href="...">).
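To see why declaring the charset matters, here is a minimal sketch of what happens when the same bytes are read under two different charsets. The helper name decodeAs is hypothetical, and the sketch assumes a runtime where TextDecoder supports the windows-1252 label (the encoding browsers actually use when a page declares ISO-8859-1).

```javascript
// The two bytes of "é" as they are stored in a UTF-8 encoded file.
const utf8Bytes = new Uint8Array([0xc3, 0xa9]);

// Hypothetical helper: decode the same bytes under a given charset label.
function decodeAs(label, bytes) {
  return new TextDecoder(label).decode(bytes);
}

console.log(decodeAs("utf-8", utf8Bytes));        // é
console.log(decodeAs("windows-1252", utf8Bytes)); // Ã©
```

The second line is the familiar garbled output produced when a UTF-8 file is served or parsed as ISO-8859-1.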

Internal charsets

  • JavaScript uses a mix of two encodings (UTF-16 in the JS engine, UCS-2 in the language itself). (source)
  • URLs can only use a subset of the ASCII characters: 0-9a-zA-Z$_.+!*'(),-. (source)
  • HTTP headers can only contain ASCII characters. (source)
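The UTF-16 / UCS-2 mix in JavaScript can be observed directly on a character outside the Basic Multilingual Plane. The sketch below assumes an ES6+ engine; the helper name inspectString is made up for this illustration.

```javascript
// U+1F4A9 lies outside the Basic Multilingual Plane,
// so it takes two UTF-16 code units (a surrogate pair).
const astral = "\u{1F4A9}";

// Hypothetical helper: compare the "UCS-2 view" (code units)
// with the "Unicode view" (code points) of a string.
function inspectString(str) {
  return {
    codeUnits: str.length,                 // counts 16-bit units
    codePoints: Array.from(str).length,    // iterates by code point
    firstCodeUnit: str.charCodeAt(0).toString(16),
    firstCodePoint: str.codePointAt(0).toString(16),
  };
}

console.log(inspectString(astral));
// { codeUnits: 2, codePoints: 1, firstCodeUnit: 'd83d', firstCodePoint: '1f4a9' }
```

Properties such as length and charCodeAt() work on 16-bit code units, while Array.from() and codePointAt() are aware of surrogate pairs.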

Unicode support

  • HTML and XML support almost all Unicode characters (source), except some that are illegal: C0 and C1 control characters, DEL character, UTF-16 surrogate halves and BOM-related characters. (source)
  • CSS and JS files can use all Unicode characters.

Displaying Unicode

By default, each browser, on each OS, is able to display some Unicode characters, but not all of them. Custom fonts can be used to fill the gaps. (source)

Escaping

  • In HTML and XML files, all the legal Unicode characters that are not part of the markup may be escaped. Some of them can be written as HTML entities (source) and all of them can use the forms &#DDDD; or &#xXXXX;. The only characters that officially need to be escaped, in theory, are <, > and &, as well as quotes (" and ') inside attributes surrounded by the same quotes. In practice, escaping > is never necessary.
  • In CSS selectors, all the characters may be escaped using the form \xxxx. In practice, only special characters (!"#$%&'()*+,-./:;<=>?@[\]^`{|}~) and leading digits / hyphens / underscores actually need to be escaped. (source)
  • In CSS @font-face rules, unicode ranges are marked like this: unicode-range: U+E000-U+E005. (source)
  • In JS strings, any character that takes two bytes in UTF-16 (a single code unit) may be escaped as \uXXXX; characters up to U+00FF can also use the shorter forms \xXX or octal \OOO. Four-byte characters are written as a surrogate pair \uXXXX\uXXXX in ES5 and earlier (source), and can use the code point form \u{XXXXX} in ES6+. (source)
  • In JavaScript, escaping is only required in the selectors passed to querySelector() and querySelectorAll(). The rules are the same as for CSS selectors, except that the backslashes need to be escaped too inside the string literal: "\\xxxx " (source)
  • In URLs, all the illegal characters are converted into byte sequences (usually UTF-8), and each byte of these sequences is escaped with a percent sign (%XX).
  • In HTTP headers, other encodings can be used as byte-sequences, with quoted-printable escaping: =?UTF-8?Q?=XX=XX=?= or base64 =?UTF-8?B?.......?= (source)
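The URL rule above (convert to bytes, then escape each byte as %XX) can be sketched by hand as follows. The helper percentEncode and the SAFE pattern are hypothetical names; the safe set is the one quoted earlier in this document, and UTF-8 is assumed as the byte encoding.

```javascript
// Safe characters, as listed in the "Internal charsets" section.
const SAFE = /^[0-9a-zA-Z$_.+!*'(),-]$/;

// Hypothetical helper: percent-encode every character outside the safe set.
function percentEncode(str) {
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  let out = "";
  for (const ch of str) {            // for...of iterates by code point
    if (SAFE.test(ch)) {
      out += ch;
      continue;
    }
    // Escape each byte of the character's UTF-8 sequence as %XX.
    for (const byte of encoder.encode(ch)) {
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncode("caf\u00e9"));      // caf%C3%A9
console.log(percentEncode("\u{1F4A9}"));      // %F0%9F%92%A9
console.log(encodeURIComponent("caf\u00e9")); // caf%C3%A9
```

The built-in encodeURIComponent() follows the same byte-by-byte principle, but uses a slightly different set of unescaped characters, so the two can disagree on ASCII punctuation such as $ or +.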

Escaping with JavaScript

  • For HTML, we can use a shadow Option element: link
  • For URLs/URIs, we can use the functions escape() / encodeURI() / encodeURIComponent(), and reuse the result for other byte-sequence escapings. link
  • For JavaScript (converting surrogate pairs charcodes into Unicode codepoints and vice-versa), here is the formula: link
  • Converting UTF-8 code points to ISO-8859-1 code points, and vice versa, can be done with the following table: link
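For the surrogate pair conversion mentioned above, the standard UTF-16 arithmetic can be sketched as below. The function names toSurrogatePair and fromSurrogatePair are made up for this illustration, and the sketch assumes the input is a valid code point between U+10000 and U+10FFFF (no range checking is done).

```javascript
// Code point (U+10000..U+10FFFF) -> [high surrogate, low surrogate].
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

// [high surrogate, low surrogate] -> code point.
function fromSurrogatePair(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1f4a9);
console.log(high.toString(16), low.toString(16));    // d83d dca9
console.log(fromSurrogatePair(high, low).toString(16)); // 1f4a9
```

On ES6+ engines, String.fromCodePoint() and String.prototype.codePointAt() perform the same conversions natively.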

Existing escaping tools in JS