A Charset is a class in charge of encoding characters to bytes and decoding bytes back to characters. It mimics Java's java.nio.charset.Charset class; its equivalent in .NET is the Encoding class.
Examples of charsets include UTF-8, ASCII, ISO-8859-1 (LATIN1), and SHIFT_JIS, among others.
kotlinx-io has an object called Charsets that exposes some commonly used charsets (right now only UTF-8):
val charset = Charsets.UTF_8
It is also possible to construct a charset from its name:
val charset = Charset.forName("utf-8")
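Since the Charset class mimics java.nio.charset.Charset, name lookup can be illustrated with the JVM class itself. The following is a minimal sketch that assumes Kotlin on the JVM and uses java.nio.charset.Charset rather than kotlinx-io; lookupCharset is a hypothetical helper, and kotlinx-io may resolve names and aliases differently.

```kotlin
import java.nio.charset.Charset
import java.nio.charset.UnsupportedCharsetException

// Hypothetical helper: resolve a charset by name, or return null
// when the running JVM does not provide a charset with that name.
fun lookupCharset(name: String): Charset? =
    try {
        Charset.forName(name)
    } catch (e: UnsupportedCharsetException) {
        null
    }

fun main() {
    // On the JVM, charset names are matched case-insensitively.
    println(lookupCharset("utf-8") == Charsets.UTF_8)
    // "latin1" is a JVM alias; name() reports the canonical name.
    println(lookupCharset("latin1")?.name())
    // Unknown names yield null instead of an exception here.
    println(lookupCharset("no-such-charset"))
}
```

Returning null keeps the caller in control of the fallback, for example defaulting to UTF-8 when a requested charset is unavailable.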
The easiest way to encode and decode Strings using charsets is to use the extension methods available for String and ByteArray.
For example, if you have a ByteArray representing a UTF-8 encoded string, you can decode it into a String:
val utf8Encoded: ByteArray = byteArrayOf(0xE4.toByte(), 0xBB.toByte(), 0x8A.toByte(), 0xE6.toByte(), 0x97.toByte(), 0xA5.toByte(), 0xE3.toByte(), 0x81.toByte(), 0xAF.toByte(), 0xEF.toByte(), 0xBC.toByte(), 0x81.toByte())
val string = utf8Encoded.toString(Charsets.UTF_8)
You can also use an extension constructor for it:
val string = String(utf8Encoded, charset = Charsets.UTF_8)
val substring = String(utf8Encoded, 0, 9, charset = Charsets.UTF_8) // the first 9 bytes hold the first three characters
And from a String, you can generate a ByteArray representing it in a specific charset:
val string = "今日は!"
val utf8Encoded = string.toByteArray(Charsets.UTF_8)
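The two directions shown above can be combined into a round trip. This is a minimal sketch that assumes Kotlin on the JVM, where the standard library offers toByteArray(charset) and toString(charset) with the same shape as the extensions described here; roundTrip is a hypothetical helper. It also shows that the ASCII '!' and the full-width '！' encode to different numbers of bytes.

```kotlin
// Hypothetical helper: encode to UTF-8 and decode again.
fun roundTrip(text: String): String =
    text.toByteArray(Charsets.UTF_8).toString(Charsets.UTF_8)

fun main() {
    val ascii = "今日は!"      // ends with ASCII '!' (U+0021)
    val fullWidth = "今日は！"  // ends with full-width '！' (U+FF01)

    // Each of the three Japanese characters takes 3 bytes in UTF-8.
    println(ascii.toByteArray(Charsets.UTF_8).size)     // 3 * 3 + 1 = 10
    println(fullWidth.toByteArray(Charsets.UTF_8).size) // 3 * 3 + 3 = 12

    // Encoding and then decoding with the same charset preserves the text.
    println(roundTrip(fullWidth) == fullWidth)
}
```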
To reduce copying, object allocation, and garbage-collection pressure, the Charset API lets you create encoders and decoders that encode a CharSequence directly into an Output and decode an Input back into text.
To encode a slice of a String into an Output without first creating a substring, you can:
val output: Output // must be assigned a real Output before it is used below
val charset = Charsets.UTF_8
val charsetEncoder = charset.newEncoder()
charsetEncoder.encode("HELLO", 1, 2, output)
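The same idea, encoding a range of a CharSequence without creating a substring, can be sketched with the JVM encoder that this API mimics. The example assumes Kotlin on the JVM and uses java.nio.charset.CharsetEncoder with a CharBuffer view instead of the kotlinx-io Output type; encodeSlice is a hypothetical helper and treats toIndex as exclusive.

```kotlin
import java.nio.CharBuffer
import java.nio.charset.CharsetEncoder

// Hypothetical helper: encode input[fromIndex, toIndex) without building a substring.
fun encodeSlice(
    encoder: CharsetEncoder,
    input: CharSequence,
    fromIndex: Int,
    toIndex: Int
): ByteArray {
    // CharBuffer.wrap creates a read-only view over the requested range.
    val view = CharBuffer.wrap(input, fromIndex, toIndex)
    // encode() resets the encoder, so it can be reused across calls.
    val encoded = encoder.encode(view)
    val result = ByteArray(encoded.remaining())
    encoded.get(result)
    return result
}

fun main() {
    val encoder = Charsets.UTF_8.newEncoder()
    val bytes = encodeSlice(encoder, "HELLO", 1, 2)
    println(bytes.toString(Charsets.UTF_8)) // prints "E"
}
```

This sketch still copies the encoded bytes into a fresh ByteArray for convenience; writing straight into a destination, as the Output-based overload does, avoids that copy as well.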
Some charsets represent each character as a single byte, others use a fixed number of several bytes per character, and others use a variable number of bytes. Not all charsets can represent the whole Unicode character set.
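These differences are easy to observe by encoding the same character with several charsets. The sketch below assumes Kotlin on the JVM and its standard Charsets object, which offers more charsets than the kotlinx-io one described here; encodedSize is a hypothetical helper.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: number of bytes needed to encode text in the given charset.
fun encodedSize(text: String, charset: Charset): Int =
    text.toByteArray(charset).size

fun main() {
    val text = "é" // U+00E9, outside 7-bit ASCII

    println(encodedSize(text, Charsets.ISO_8859_1)) // single-byte charset: 1
    println(encodedSize(text, Charsets.UTF_8))      // variable length: 2 for this character
    println(encodedSize(text, Charsets.UTF_16BE))   // 16-bit code units: 2
    println(encodedSize(text, Charsets.UTF_32BE))   // fixed 4 bytes per character: 4

    // US-ASCII cannot represent this character at all.
    println(Charsets.US_ASCII.newEncoder().canEncode('é')) // false
}
```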
The most widely used charset is UTF-8, which represents every 7-bit ASCII character as a single byte while still being able to represent the whole Unicode character set. Because each character is encoded with a variable number of bytes, getting the length in characters of a String represented as UTF-8 requires scanning the whole content, so seeking, slicing, and getting the length all take linear time.
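To make the linear-time cost concrete, here is a minimal sketch that counts the characters (code points) in UTF-8 encoded bytes. It assumes well-formed input and relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx; utf8CodePointCount is a hypothetical helper, not part of kotlinx-io.

```kotlin
// Hypothetical helper: count code points in well-formed UTF-8 bytes.
// Every byte must be inspected, so the cost grows linearly with the input.
fun utf8CodePointCount(bytes: ByteArray): Int =
    // A byte that is not a continuation byte (10xxxxxx) starts a new code point.
    bytes.count { (it.toInt() and 0xC0) != 0x80 }

fun main() {
    val bytes = "今日は！".toByteArray(Charsets.UTF_8)
    println(bytes.size)                 // 12 bytes
    println(utf8CodePointCount(bytes))  // 4 code points
}
```

The byte count is available in constant time, but the character count is not, which is why index-based operations on UTF-8 data need a scan.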
Kotlin Strings expose 16-bit Chars (UTF-16 code units) on the JVM and Native, and JavaScript strings are likewise sequences of 16-bit code units, so code points above U+FFFF are represented as surrogate pairs.
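The effect of surrogate pairs can be seen by comparing a String's length in Chars with its number of code points. This is a minimal sketch assuming Kotlin on the JVM, where codePointCount is available on String; codePoints is a hypothetical helper.

```kotlin
// Hypothetical helper: number of Unicode code points in a String.
fun codePoints(text: String): Int = text.codePointCount(0, text.length)

fun main() {
    // U+1F600 lies above U+FFFF, so it is stored as a surrogate pair.
    val emoji = "\uD83D\uDE00"

    println(emoji.length)                          // 2 Chars
    println(codePoints(emoji))                     // 1 code point
    println(emoji[0].isHighSurrogate())            // true
    println(emoji[1].isLowSurrogate())             // true
    println(emoji.toByteArray(Charsets.UTF_8).size) // 4 bytes in UTF-8
}
```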
For now, kotlinx-io only implements UTF_8, but it allows adding custom charsets by extending the Charset class.
object Charsets {
    val UTF_8: Charset
}
class Charset {
    companion object {
        fun forName(name: String): Charset
    }
    val name: String
    fun newEncoder(): CharsetEncoder
    fun newDecoder(): CharsetDecoder
}
class CharsetEncoder {
    val charset: Charset
    fun encode(input: CharSequence, fromIndex: Int, toIndex: Int, dst: Output)
    fun encodeToByteArray(input: CharSequence, fromIndex: Int = 0, toIndex: Int = input.length): ByteArray
    fun encodeUTF8(input: ByteReadPacket, dst: Output) // Return type. Why UTF8 ??
    fun encode(input: CharSequence, fromIndex: Int = 0, toIndex: Int = input.length) // Return type. Why UTF8 ??
    fun encodeUTF8(input: ByteReadPacket) // Return type. Why UTF8 ??
}
class CharsetDecoder {
    val charset: Charset
    fun decode(input: Input, max: Int = Int.MAX_VALUE): String
    fun decode(input: Input, dst: Appendable, max: Int): Int
    fun decodeExactBytes(input: Input, inputLength: Int): String
}
class MalformedInputException(message: String) : Throwable
// Missing:
// - String.toByteArray(charset): ByteArray
// - ByteArray.toString(charset): String
// Missing registering new charsets available using Charset.forName?
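The summary above lists a MalformedInputException, which presumably signals bytes that are not valid in the decoder's charset. How kotlinx-io raises it is not shown here, so the following sketch illustrates the analogous behavior on the JVM instead, using java.nio.charset.CharsetDecoder; decodeStrict is a hypothetical helper.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction

// Hypothetical helper: decode UTF-8 strictly, returning null on invalid input.
fun decodeStrict(bytes: ByteArray): String? =
    try {
        Charsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString()
    } catch (e: CharacterCodingException) {
        // java.nio's MalformedInputException is a subclass of CharacterCodingException.
        null
    }

fun main() {
    val valid = byteArrayOf(0x68, 0x69)                     // "hi"
    val truncated = byteArrayOf(0xE4.toByte(), 0xBB.toByte()) // first 2 of 3 bytes of one character

    println(decodeStrict(valid))      // hi
    println(decodeStrict(truncated))  // null
    // The lenient String conversion substitutes U+FFFD instead of failing.
    println(truncated.toString(Charsets.UTF_8).contains('\uFFFD'))
}
```

Strict decoding is usually preferable when the bytes come from an untrusted source, because silent substitution can hide truncated or corrupted data.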