Character encodings on the Web
Every component of the Web handles strings, and Unicode character encoding in particular, differently.
This document aims to gather all the important information about character encodings on the Web.

A file's character encoding can be declared with the HTTP header `Content-Type: text/html; charset=UTF-8` and/or by mentioning it explicitly in the file (`<meta charset="utf-8">` for HTML, `<?xml version="1.0" encoding="UTF-8"?>` for XML). (source)

The encoding of an external file can also be declared with the `charset` attribute (`<script charset="utf-8" src="...">`, `<link rel="stylesheet" charset="utf-8" href="...">`).

In URLs, the only characters that can be used unescaped are `0-9a-zA-Z$_.+!*'(),-`. (source)

By default, each browser, on each OS, is able to display some Unicode characters, but not all of them. Custom fonts can be used to fill the gaps. (source)
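
As a rough illustration of the `Content-Type` declaration and the character set `0-9a-zA-Z$_.+!*'(),-` quoted above, the sketch below reads the `charset` parameter from a header value and checks a string against that set. The helper names `charsetFromContentType` and `needsEscaping` are hypothetical, and the check assumes the set lists the characters a URL may contain without escaping.

```javascript
// Hypothetical helper: read the charset parameter from a Content-Type value.
// Returns the charset in lower case, or null when none is declared.
function charsetFromContentType(headerValue) {
  const match = /;\s*charset=("?)([^";\s]+)\1/i.exec(headerValue);
  return match ? match[2].toLowerCase() : null;
}

// The character set quoted above, as a regular expression.
// Assumption: it lists the characters a URL may contain without escaping.
const UNESCAPED_URL_CHARS = /^[0-9a-zA-Z$_.+!*'(),-]*$/;

// Hypothetical helper: true when `text` has a character outside that set.
function needsEscaping(text) {
  return !UNESCAPED_URL_CHARS.test(text);
}

console.log(charsetFromContentType("text/html; charset=UTF-8")); // utf-8
console.log(charsetFromContentType("text/html"));                // null
console.log(needsEscaping("release-notes_v2.0"));                // false
console.log(needsEscaping("café au lait"));                      // true
```

The header parser is deliberately lenient (case-insensitive, optional quotes); a production parser would follow the full media type grammar.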
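
In HTML, any character can be escaped as `&#DDDD;` (decimal) or `&#xXXXX;` (hexadecimal). The only characters that officially need to be escaped are `<`, `>` and `&`, as well as quotes (`"` and `'`) inside attributes surrounded by the same quotes. In practice, escaping `>` is never necessary.

In CSS, any character can be escaped as `\xxxx`. In practice, only special characters (`` !"#$%&'()*+,-./:;<=>?@[\]^`{|}~ ``) and leading digits / hyphens / underscores actually need to be escaped. (source)

In CSS `@font-face` rules, the characters a font applies to can be restricted with `unicode-range: U+E000-E005`. (source)

In JavaScript strings, characters can be escaped as `\uXXXX`, `\xXX` or `\OOO`. Four-byte characters (those outside the Basic Multilingual Plane) can be encoded using a surrogate pair `\uXXXX\uXXXX` (ES5 and earlier) (source) or using the newer format `\u{XXXXX}` (ES6+). (source)

A CSS escape written inside a JavaScript string needs its backslash doubled: `"\\xxxx "` (the trailing space ends the CSS escape). (source)

In URLs, characters that cannot appear literally are percent-encoded (`%XX`).

In email headers, non-ASCII text is encoded as quoted-printable `=?UTF-8?Q?=XX=XX?=` or base64 `=?UTF-8?B?.......?=`. (source)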
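
To make the escape notations listed above concrete, here is a minimal sketch that prints several of them for a single character. The names `toSurrogatePair`, `hex` and `escapeForms` are hypothetical helpers written for this illustration, not part of any library; only `codePointAt`, `String.fromCodePoint` and `encodeURIComponent` are built in.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

// Upper-case hexadecimal, zero-padded to at least `width` digits.
function hex(value, width = 0) {
  return value.toString(16).toUpperCase().padStart(width, "0");
}

// Hypothetical helper: escaped forms of the first character of `text`.
function escapeForms(text) {
  const codePoint = text.codePointAt(0);
  const units = codePoint > 0xffff ? toSurrogatePair(codePoint) : [codePoint];
  return {
    htmlDecimal: `&#${codePoint};`,
    htmlHex: `&#x${hex(codePoint)};`,
    // The trailing space ends the CSS escape.
    css: `\\${hex(codePoint)} `,
    // One \uXXXX per UTF-16 code unit (a surrogate pair above U+FFFF).
    jsEs5: units.map((unit) => `\\u${hex(unit, 4)}`).join(""),
    jsEs6: `\\u{${hex(codePoint)}}`,
    // Percent-encoding of the character's UTF-8 bytes.
    url: encodeURIComponent(String.fromCodePoint(codePoint)),
  };
}

console.log(escapeForms("\u{1F600}"));
// Both notations denote the same string:
console.log("\u{1F600}" === "\uD83D\uDE00"); // true
```

For U+1F600 the surrogate pair is `\uD83D\uDE00` and the percent-encoded form is `%F0%9F%98%80`; for a character such as `é` (U+00E9) no surrogate pair is involved and the ES5 form is simply `\u00E9`.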