A character encoding maps
each character in a character set to a numeric value that a computer
can represent. These numbers can be represented by a single byte
or multiple bytes. For example, the ASCII encoding uses 7 bits to
represent the Latin alphabet, punctuation, and control characters.
You use Japanese encodings, such as Shift-JIS, EUC-JP, and ISO-2022-JP,
to represent Japanese text. These encodings can vary slightly, but
they include a common set of approximately 10,000 characters used
in Japanese.
The following terms apply to character encodings:
- SBCS: Single-byte character set; a character set encoded in one byte per
  character, such as ASCII or ISO 8859-1.
- DBCS: Double-byte character set; a method of encoding a character set in no
  more than 2 bytes, such as Shift-JIS. Many character encoding schemes that
  are referred to as double-byte, including Shift-JIS, allow mixing of
  single-byte and double-byte encoded characters. Others, such as UCS-2, use
  2 bytes for all characters.
- MBCS: Multiple-byte character set; a character set encoded with a variable
  number of bytes per character, such as UTF-8.
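The three categories can be illustrated by counting the bytes that each encoding produces for a single character. The following is a minimal sketch in Python, used here only for illustration (it is not ColdFusion code). The helper name `byte_length` is hypothetical, the codec names are Python's spellings, and UTF-16 without a byte-order mark stands in for UCS-2, which is an assumption that holds only for characters in the Basic Multilingual Plane.

```python
def byte_length(char: str, encoding: str) -> int:
    """Return the number of bytes used to encode a single character."""
    return len(char.encode(encoding))

E_ACUTE = "\u00e9"   # Latin small letter e with acute
KANJI = "\u65e5"     # CJK ideograph meaning "day" or "sun"

# SBCS: every character occupies exactly one byte.
print(byte_length("A", "ascii"))            # 1
print(byte_length(E_ACUTE, "iso-8859-1"))   # 1

# DBCS that mixes widths: one byte for ASCII, two bytes for kanji.
print(byte_length("A", "shift_jis"))        # 1
print(byte_length(KANJI, "shift_jis"))      # 2

# Fixed two-byte encoding (UTF-16 standing in for UCS-2).
print(byte_length("A", "utf-16-le"))        # 2
print(byte_length(KANJI, "utf-16-le"))      # 2

# MBCS: the width varies with the character.
print(byte_length("A", "utf-8"))            # 1
print(byte_length(E_ACUTE, "utf-8"))        # 2
print(byte_length(KANJI, "utf-8"))          # 3
```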
The following table lists
some common character encodings; however, there are many additional
character encodings that browsers and web servers support:
Encoding | Type | Description
ASCII | SBCS | 7-bit encoding used by English and Indonesian Bahasa languages
Latin-1 (ISO 8859-1) | SBCS | 8-bit encoding used for many Western European languages
Shift_JIS | DBCS | 16-bit Japanese encoding. Note: Use an underscore character (_), not a hyphen (-) in the name in CFML attributes.
EUC-KR | DBCS | 16-bit Korean encoding
UCS-2 | DBCS | Two-byte Unicode encoding
UTF-8 | MBCS | Multibyte Unicode encoding. ASCII is 7-bit; non-ASCII characters used in European and many Middle Eastern languages are two-byte; and most Asian characters are three-byte
The World Wide Web Consortium maintains
a list of all character encodings supported by the Internet. You
can find this information at www.w3.org/International/O-charset.html.
Computers
must often convert between character encodings. In particular, the character
encodings most commonly used on the Internet are not the ones that Java or
Windows use internally. Character sets used on the Internet are typically single-byte
or multiple-byte (including DBCS character sets that allow single-byte
characters). These character sets are most efficient for transmitting
data, because each character takes up the minimum necessary number
of bytes. Currently, Latin characters are most frequently used on
the web, and most character encodings used on the web represent
those characters in a single byte.
Computers, however, process
data most efficiently if each character occupies the same number
of bytes. Therefore, Windows and Java both use double-byte encoding
for internal processing.
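The processing advantage of a fixed width can be sketched as follows: with two bytes per character, the position of any character is a simple multiplication, whereas a variable-width buffer must be scanned. This is an illustrative Python sketch, not a description of how Java or Windows implement strings. The function names are hypothetical, and the fixed-width version assumes every character fits in a single 2-byte unit.

```python
def char_at_fixed(buf: bytes, index: int) -> str:
    """Fixed width: character `index` starts at byte offset 2 * index."""
    start = 2 * index
    return buf[start:start + 2].decode("utf-16-le")

def char_at_utf8(buf: bytes, index: int) -> str:
    """Variable width: scan the buffer to find where each character starts."""
    # In UTF-8, continuation bytes have the bit pattern 10xxxxxx.
    starts = [i for i, b in enumerate(buf) if (b & 0xC0) != 0x80]
    start = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(buf)
    return buf[start:end].decode("utf-8")

text = "caf\u00e9 \u65e5\u672c"        # 7 characters: Latin, accented, and kanji
fixed = text.encode("utf-16-le")       # 2 bytes per character
variable = text.encode("utf-8")        # 1 to 3 bytes per character

print(len(fixed), len(variable))       # 14 12
print(char_at_fixed(fixed, 5) == char_at_utf8(variable, 5))  # True
```

The variable-width buffer is smaller, which suits transmission, while the fixed-width buffer allows direct indexing, which suits processing.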
The Java Unicode character encoding
ColdFusion uses the Java Unicode Standard for representing character data internally.
This standard corresponds to UCS-2 encoding of the Unicode character
set. The Unicode character set can represent many languages, including
all major European and Asian character sets. Therefore, ColdFusion
can receive, store, process, and present text from all languages
supported by Unicode.
The Java Virtual Machine (JVM) that is used to process ColdFusion
pages converts from the character encoding used on a ColdFusion
page or other source of information to UCS-2. The page or data encodings
that ColdFusion supports depend on the specific JVM, but include
most encodings used on the web. Similarly, the JVM converts between
its internal UCS-2 representation and the character encoding used
to send the response to the client.
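The two conversions described above amount to a decode step followed by an encode step, with a Unicode string as the intermediate form. The sketch below shows that pattern in Python for illustration only. Python's string type plays the role of the JVM's internal representation here, the function name `transcode` is hypothetical, and the choice of Shift_JIS input and UTF-8 output is just an example.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert bytes from one encoding to another via a Unicode string."""
    text = data.decode(source_encoding)    # source bytes -> internal Unicode
    return text.encode(target_encoding)    # internal Unicode -> response bytes

# Three kanji (the word for "Japanese language") stored as Shift_JIS.
page_bytes = "\u65e5\u672c\u8a9e".encode("shift_jis")
response_bytes = transcode(page_bytes, "shift_jis", "utf-8")

print(len(page_bytes), len(response_bytes))   # 6 9
```

The text is unchanged by the conversion; only its byte representation differs.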
By default, ColdFusion uses UTF-8 to represent text data sent
to a browser. UTF-8 represents the Unicode character set using a
variable-length encoding. ASCII characters are sent using a single
byte. Most European and Middle Eastern characters are sent as 2
bytes, and Japanese, Korean, and Chinese characters are sent as
3 bytes. One advantage of UTF-8 is that it sends ASCII character
set data in a form that is recognized by systems designed to process
only single-byte ASCII characters, while it is flexible enough to
handle multiple-byte character representations.
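Both properties of UTF-8 mentioned above, the variable length and the compatibility with ASCII-only systems, can be checked directly. The following Python sketch is illustrative; the helper name `utf8_lengths` and the sample characters are assumptions chosen for the demonstration.

```python
def utf8_lengths(text: str) -> list:
    """Return the UTF-8 byte count of each character in `text`."""
    return [len(ch.encode("utf-8")) for ch in text]

# a, e-acute, Hebrew alef, a kanji, a Hangul syllable
sample = "a\u00e9\u05d0\u65e5\ud55c"
print(utf8_lengths(sample))   # [1, 2, 2, 3, 3]

# ASCII-only text has identical bytes in ASCII and in UTF-8.
ascii_text = "Hello, world"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))   # True

# Every byte of a multibyte sequence has its high bit set, so an
# ASCII-only parser never mistakes part of such a sequence for ASCII.
print(all(b >= 0x80 for b in "\u65e5".encode("utf-8")))           # True
```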
While the default format of text data returned by ColdFusion
is UTF-8, you can have ColdFusion return a page in any character
set supported by Java. For example, you can return text using the
Japanese language Shift-JIS character set. Similarly, ColdFusion
can handle data that is in many different character sets. For more
information, see Determining the page encoding of server output.
Character encoding conversion issues
Because different character
encodings support different character sets, you can encounter errors
if your application gets text in one encoding and presents it in another
encoding. For example, the Windows Latin-1 character encoding, Windows-1252,
includes characters with hexadecimal representations in the range
80-9F, while ISO 8859-1 does not include printable characters in that range.
As a result, under the following circumstances, characters in the
range 80-9F, such as the euro symbol (€), are not displayed properly:
- A file encoded in Windows-1252 includes characters in the range 80-9F.
- ColdFusion reads the file, specifying the Windows-1252 encoding in the
  cffile tag.
- ColdFusion displays the file contents, specifying ISO-8859-1 in the
  cfcontent tag.
Similar issues can arise if you convert between other character
encodings; for example, if you read files encoded in the Japanese
Windows default encoding and display them using Shift-JIS. To prevent
these problems, ensure that the display encoding is the same as
the input encoding.
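The Windows-1252 mismatch described above can be reproduced in a few lines. This is an illustrative Python sketch rather than ColdFusion code; the function name `reencode` and the sample bytes are hypothetical. Python raises an error for a character the target encoding cannot represent, and other runtimes may instead substitute a placeholder silently, which is what the `errors="replace"` call imitates.

```python
def reencode(raw: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Decode `raw` using `source`, then encode the text using `target`."""
    return raw.decode(source).encode(target, errors=errors)

raw = b"Price: \x80 10"   # byte 0x80 is the euro sign in Windows-1252

# Reading with the correct encoding recovers the euro sign (U+20AC).
print(ascii(raw.decode("cp1252")))   # 'Price: \u20ac 10'

# Writing with ISO 8859-1 fails, because it has no euro sign.
try:
    reencode(raw, "cp1252", "iso-8859-1")
except UnicodeEncodeError:
    print("euro sign cannot be represented in ISO 8859-1")

# A lossy conversion replaces the character with a question mark.
print(reencode(raw, "cp1252", "iso-8859-1", errors="replace"))   # b'Price: ? 10'

# Using the same encoding for input and output preserves the data.
print(reencode(raw, "cp1252", "cp1252") == raw)                  # True
```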