"when transforming from (byte) strings to Unicode, you are decoding your data"
Oh, so in memory they are not bytes anymore, but "code sequences"? It's fair to attempt to clarify the point, but please don't make it even more confusing than it already is.
I guess (this is what I take away from the article, even though it isn't stated explicitly) the actual "transforming" stage only applies to individual characters then: "Unicode" is the mapping from a number to a character, and the encodings (UTF-8 and so on) are different ways to represent that number as bytes?
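(Not from the article, just my own sketch to make the distinction concrete: Unicode assigns the number, the encoding turns that number into bytes.)

```java
import java.nio.charset.StandardCharsets;

public class CodepointVsEncoding {
    public static void main(String[] args) {
        // Unicode assigns 'é' the number U+00E9; the encodings below are
        // just different byte representations of that same number.
        String s = "\u00E9"; // é
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);    // C3 A9
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16LE); // E9 00
        for (byte b : utf8)  System.out.printf("%02X ", b & 0xFF);
        System.out.println();
        for (byte b : utf16) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}
```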
Also, is it true that UTF-16 can represent all of Unicode? Because I was under the impression that it can't?
You're probably thinking of UCS-2, which is very similar to UTF-16, but doesn't have "surrogate pairs" to represent codepoints beyond the 16-bit limit. http://en.wikipedia.org/wiki/UTF-16/UCS-2
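A quick way to see the surrogate pair mechanism in action (my own sketch, in Java since the thread is partly about Java's behavior):

```java
public class SurrogatePair {
    public static void main(String[] args) {
        // U+1F600 is beyond the 16-bit limit, so UTF-16 splits it into
        // two 16-bit code units: a high surrogate and a low surrogate.
        char[] units = Character.toChars(0x1F600);
        System.out.println(units.length);                                 // 2
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D83D DE00
    }
}
```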
I wasn't, but thanks anyway ;-) I think Java uses UTF-16 and all those years as a Java developer I was under the impression that it can only use two bytes per "letter". Thanks for the clarification.
It's significant because it messes up the length (or size) property of strings, doesn't it?
You were probably under that impression because it used to be true. Java started supporting surrogate pairs -- i.e. it made the switch from UCS-2 to UTF-16 -- in J2SE 5.0
How is it easier in UTF-16? Most "normal" Unicode characters fit in two bytes, sure, but you can't just count bytes and divide by two if you want the right answer. It is just as difficult to implement as UTF-8.
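Right, and in Java specifically, String.length() counts UTF-16 code units rather than codepoints, so you need codePointCount to get the real answer (my own example):

```java
public class LengthVsCodepoints {
    public static void main(String[] args) {
        // 'a' plus U+1F600 written as a surrogate pair: 3 code units, 2 codepoints
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());                      // 3
        System.out.println(s.codePointCount(0, s.length())); // 2
    }
}
```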
UTF-8 uses less memory if your string happens to contain mostly characters that UTF-8 can represent in a single byte (like English text). I expect UTF-8 will actually use more memory if you're working with e.g. Japanese or Chinese, where most characters take three bytes in UTF-8 but only two in UTF-16 (Arabic, at two bytes either way, comes out even).
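You can measure this directly; a sketch of my own in Java (ASCII is one byte per character in UTF-8 versus two in UTF-16, while Japanese kana are three versus two):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String english  = "hello";
        String japanese = "\u3053\u3093\u306B\u3061\u306F"; // こんにちは
        System.out.println(english.getBytes(StandardCharsets.UTF_8).length);     // 5
        System.out.println(english.getBytes(StandardCharsets.UTF_16LE).length);  // 10
        System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);    // 15
        System.out.println(japanese.getBytes(StandardCharsets.UTF_16LE).length); // 10
    }
}
```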