"when transforming from (byte) strings to Unicode, you are decoding your data"
Oh, so in memory they are not bytes anymore, but "code sequences"? It's fair to attempt to clarify the point, but please don't make it even more confusing than it already is.
I guess (this is what I take away from the article, even though it isn't stated explicitly) the actual "transforming" stage only applies to individual characters then: "Unicode" is the mapping from a number to a character, and the encodings (UTF-8 and so on) are different ways to represent that number as bytes?
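(Not from the article, just my own sketch to make the distinction concrete: Unicode assigns the number, the encoding turns that number into bytes.)

```java
import java.nio.charset.StandardCharsets;

public class CodepointVsEncoding {
    public static void main(String[] args) {
        // Unicode assigns 'é' the number U+00E9; the encodings below are
        // just different byte representations of that same number.
        String s = "\u00E9"; // é
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);    // C3 A9
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16LE); // E9 00
        for (byte b : utf8)  System.out.printf("%02X ", b & 0xFF);
        System.out.println();
        for (byte b : utf16) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}
```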
Also, is it true that UTF-16 can represent all of Unicode? Because I was under the impression that it can't?
You're probably thinking of UCS-2, which is very similar to UTF-16, but doesn't have "surrogate pairs" to represent codepoints beyond the 16-bit limit. http://en.wikipedia.org/wiki/UTF-16/UCS-2
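A quick way to see the surrogate pair mechanism in action (my own sketch, in Java since the thread is partly about Java's behavior):

```java
public class SurrogatePair {
    public static void main(String[] args) {
        // U+1F600 is beyond the 16-bit limit, so UTF-16 splits it into
        // two 16-bit code units: a high surrogate and a low surrogate.
        char[] units = Character.toChars(0x1F600);
        System.out.println(units.length);                                 // 2
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00
    }
}
```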
I wasn't, but thanks anyway ;-) I think Java uses UTF-16 and all those years as a Java developer I was under the impression that it can only use two bytes per "letter". Thanks for the clarification.
It's significant because it messes up the length (or size) property of strings, doesn't it?
You were probably under that impression because it used to be true. Java started supporting surrogate pairs -- i.e. it made the switch from UCS-2 to UTF-16 -- in J2SE 5.0
How is it easier in UTF-16? Most "normal" Unicode characters fit in two bytes, sure, but you can't just count bytes and divide by two if you want the right answer. It is just as difficult to implement as UTF-8.
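Right, and in Java specifically, String.length() counts UTF-16 code units rather than codepoints, so you need codePointCount to get the real answer (my own example):

```java
public class LengthVsCodepoints {
    public static void main(String[] args) {
        // 'a' plus U+1F600 written as a surrogate pair: 3 code units, 2 codepoints
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());                      // 3
        System.out.println(s.codePointCount(0, s.length())); // 2
    }
}
```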
UTF-8 uses less memory if your string happens to contain mostly characters that UTF-8 can represent in a single byte (like English text). I expect UTF-8 will actually use more memory if you're working with e.g. Japanese or Chinese, where most characters take three bytes in UTF-8 but only two in UTF-16 (Arabic, at two bytes either way, comes out even).
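You can measure this directly; a sketch of my own in Java (ASCII is one byte per character in UTF-8 versus two in UTF-16, while Japanese kana are three versus two):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String english  = "hello";
        String japanese = "\u3053\u3093\u306B\u3061\u306F"; // こんにちは
        System.out.println(english.getBytes(StandardCharsets.UTF_8).length);     // 5
        System.out.println(english.getBytes(StandardCharsets.UTF_16LE).length);  // 10
        System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);    // 15
        System.out.println(japanese.getBytes(StandardCharsets.UTF_16LE).length); // 10
    }
}
```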