Characters 101
TL;DR: A string is a linear sequence of characters.
A character is a symbol such as a letter, digit, or punctuation mark, say 7 or @. Computers ultimately see everything as sequences of 0s and 1s, so we need a way to translate the symbols we humans read into something computers can store and process. This translation is called character encoding.
In computing, you've almost certainly come across the most popular of these encodings: ASCII (American Standard Code for Information Interchange). As a standard, it gives us a way to encode (represent) upper and lower case English letters, digits, punctuation symbols, and a set of control characters as 7-bit numbers, so it can encode at most 128 characters.
On a side note, the folks at what is now ANSI who came up with this encoding did a clever thing. They set the most significant bit (MSB) of the 7-bit code to 1 and started counting from 1 in the five low bits to represent upper case A (1000001), continuing in sequence up to Z (1011010). The same was done for lower case letters, but with an 11 prefix (a is 1100001). This makes it easy to find a letter's position in the alphabet from its five least significant bits. See the illustration below:
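The bit layout described above can be checked with a few lines of Kotlin. This is only a sketch: the helper names alphabetPosition and asciiBits are made up for illustration, and Char.code assumes Kotlin 1.5 or newer (older versions would use toInt() instead).

```kotlin
// Position of an ASCII letter in the alphabet, read from its five low bits.
// Assumption: Char.code needs Kotlin 1.5+; on older versions use toInt().
fun alphabetPosition(letter: Char): Int {
    require(letter in 'A'..'Z' || letter in 'a'..'z') { "Not an ASCII letter: $letter" }
    return letter.code and 0b11111
}

// The 7-bit pattern of an ASCII character as a string of 0s and 1s.
fun asciiBits(c: Char): String = c.code.toString(2).padStart(7, '0')

fun main() {
    println(asciiBits('A'))         // 1000001
    println(asciiBits('a'))         // 1100001
    println(alphabetPosition('A'))  // 1
    println(alphabetPosition('z'))  // 26
}
```

Masking with 0b11111 keeps only the five low bits, so the same function works for both cases without first checking whether the letter is upper or lower case.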
To make things interesting, more and more encodings were created to suit the communication needs of different languages and regions, and this bred incompatibilities when communicating across those boundaries. A common standard was therefore adopted: Unicode, maintained by the Unicode Consortium. It supports a very large character set (ASCII got the privilege of being a subset, covering the first 128 characters). As of this writing, the specification defines 143,859 characters!
Unicode text is usually encoded as UTF-8, UTF-16 or UTF-32, where UTF stands for Unicode Transformation Format and the number is the size in bits of one code unit, not of a whole character: UTF-8 uses one to four 8-bit units per character, UTF-16 uses one or two 16-bit units, and UTF-32 always uses a single 32-bit unit. Being a large character set, Unicode also makes room for ever-evolving characters like emojis. Specification 13.0.0 added 55 new emoji characters!
The Kotlin Char
As of version 1.3.72, the Kotlin standard library exposes the following character sets through the kotlin.text.Charsets object, and they are expected to be available on every implementation of the JVM platform.
- UTF_8: (Eight-bit UCS Transformation Format.)
- UTF_16: (Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark.)
- UTF_16BE: (Sixteen-bit UCS Transformation Format, big-endian byte order.)
- UTF_16LE: (Sixteen-bit UCS Transformation Format, little-endian byte order.)
- US_ASCII: (Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set.)
- ISO_8859_1: (ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1.)
- UTF_32: (32-bit Unicode (or UCS) Transformation Format, byte order identified by an optional byte-order mark.)
- UTF_32LE: (32-bit Unicode (or UCS) Transformation Format, little-endian byte order.)
- UTF_32BE: (32-bit Unicode (or UCS) Transformation Format, big-endian byte order.)
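To see how these charsets differ in practice, here is a small sketch that counts how many bytes the same text needs in each encoding. The helper encodedSize is a name invented for this example. The big-endian variants are used so that no byte-order mark is added to the output.

```kotlin
import java.nio.charset.Charset

// Number of bytes needed to store `text` in the given charset.
// encodedSize is a hypothetical helper, not part of the standard library.
fun encodedSize(text: String, charset: Charset): Int =
    text.toByteArray(charset).size

fun main() {
    // A, e with acute accent, euro sign, and an emoji written as a surrogate pair
    val samples = listOf("A", "\u00E9", "\u20AC", "\uD83D\uDE00")
    for (s in samples) {
        println(
            "UTF-8: ${encodedSize(s, Charsets.UTF_8)}, " +
                "UTF-16: ${encodedSize(s, Charsets.UTF_16BE)}, " +
                "UTF-32: ${encodedSize(s, Charsets.UTF_32BE)}"
        )
    }
}
```

UTF-8 grows from one to four bytes as the characters move further from ASCII, while UTF-32 stays at four bytes throughout.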
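Kotlin characters are represented by the Char type, which holds a single 16-bit UTF-16 code unit. Literals are surrounded by single quotes, e.g. 'F' or, using a Unicode escape of exactly four hexadecimal digits, '\u0046'. Characters above U+FFFF do not fit in a single Char and are stored in a String as a pair of surrogate Chars. With that primer on characters, let's look at Strings.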
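As a rough illustration of how Char literals and larger code points behave on the JVM, consider the sketch below. The helper fromCodePoint is hypothetical. It relies on Java's Character.toChars to do the conversion.

```kotlin
// Builds a String from a Unicode code point; code points above U+FFFF
// need two Chars (a surrogate pair). fromCodePoint is a made-up helper.
fun fromCodePoint(codePoint: Int): String = String(Character.toChars(codePoint))

fun main() {
    val f = '\u0046'                     // same character as 'F'
    println(f == 'F')                    // true

    val domino = fromCodePoint(0x1F069)  // a domino tile, outside the 16-bit range
    println(domino.length)               // 2 (Chars)
    println(domino.codePointCount(0, domino.length)) // 1 (code point)
}
```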
The Kotlin String
This is represented by the type String and is a classic example of an immutable type, i.e. when you have a sequence of characters and need to reorder, insert, remove or otherwise change them, a new instance is created to reflect the change.
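A quick sketch of what immutability means in practice: operations that look like mutations hand back a new String and leave the original alone. The helper spaced is invented for this example.

```kotlin
// Returns a new String with a space between characters;
// the argument itself is never modified. spaced is a hypothetical helper.
fun spaced(text: String): String = text.toList().joinToString(" ")

fun main() {
    val original = "Kotlin"
    val reversed = original.reversed()  // a new String instance
    println(original)                   // Kotlin
    println(reversed)                   // niltoK
    println(spaced(original))           // K o t l i n
}
```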
The String type in Kotlin implements the CharSequence interface. This interface gives us 3 useful members.
public interface CharSequence {
    public val length: Int
    public operator fun get(index: Int): Char
    public fun subSequence(startIndex: Int, endIndex: Int): CharSequence
}
These are self-documenting. Highlighting the second one, get(index) is defined as an operator and allows us to access the Chars in a String using square brackets with a 0-based index:
val name = "SenseiDev"
println(name[6]) // D
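Because these three members live on the interface, code written against CharSequence works for String and for other implementations such as StringBuilder. A minimal sketch, with firstAndRest being a name invented here:

```kotlin
// Works for any CharSequence (String, StringBuilder, ...), using only
// length, get (via []) and subSequence. firstAndRest is a hypothetical helper.
fun firstAndRest(text: CharSequence): Pair<Char, CharSequence> {
    require(text.length > 0) { "text must not be empty" }
    return text[0] to text.subSequence(1, text.length)
}

fun main() {
    val (first, rest) = firstAndRest("SenseiDev")
    println(first)  // S
    println(rest)   // enseiDev

    val (d, _) = firstAndRest(StringBuilder("Dev"))
    println(d)      // D
}
```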
The Kotlin String definition on the JVM is as follows:
public class String : Comparable<String>, CharSequence {
    companion object {}
    public operator fun plus(other: Any?): String
    public override val length: Int
    public override fun get(index: Int): Char
    public override fun subSequence(startIndex: Int, endIndex: Int): CharSequence
    public override fun compareTo(other: String): Int
}
Remember that classes are final by default in Kotlin, so String cannot be subclassed, which helps preserve the immutability highlighted earlier. As seen above, there are some additions to what the CharSequence interface defines that are particular to the String type:
- The plus operator. This allows us to concatenate strings with the + operator, e.g. "SenseiDev" + 5. (NB: the + operator is defined on the String type and hence it is not commutative, i.e. it's a compile error to write 5 + "SenseiDev", unless you have such an extension function on Int!)
- A String can be compared to another String, as allowed by the Comparable interface and its compareTo method.
- An empty companion object.
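The points above about plus and compareTo can be sketched as follows. The Int.plus(String) extension is hypothetical and is defined here only to show what "such an extension function on Int" could look like. It is not part of the standard library.

```kotlin
// Hypothetical extension (not in the standard library) that makes
// Int + String compile by converting the Int to a String first.
operator fun Int.plus(text: String): String = this.toString() + text

fun main() {
    println("SenseiDev" + 5)             // SenseiDev5 (String.plus)
    println(5 + "SenseiDev")             // 5SenseiDev (the extension above)
    println("apple" < "banana")          // true, resolved through compareTo
    println("apple".compareTo("apple"))  // 0
}
```

Without the extension, the second println would not compile, since none of the built-in Int.plus overloads accepts a String.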
The Kotlin compiler has other neat tricks, e.g. it allows you to:
- Create String instances from literals, e.g. val language = "Kotlin"
- Express raw String literals using triple quotes: """
- Add template expressions to String literals, e.g. val displayAmount = "Ksh. ${quantity * basePrice}"
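Here is a small sketch that puts the three literal forms side by side. The function displayAmount and the sample values are assumptions made for the example.

```kotlin
// Formats an amount with a string template. The function name and the
// sample values below are made up for illustration.
fun displayAmount(quantity: Int, basePrice: Int): String = "Ksh. ${quantity * basePrice}"

fun main() {
    val language = "Kotlin"                                 // plain literal
    val raw = """C:\new\table"""                            // raw literal: backslashes are kept as-is
    println("$language has ${language.length} characters")  // Kotlin has 6 characters
    println(raw)                                            // C:\new\table
    println(displayAmount(3, 250))                          // Ksh. 750
}
```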
The JVM Bridge
You might be wondering, "Wait a minute. That's weird. Those are very few members on the String class, while we have much more functionality from the class in our JVM programs, e.g. toUpperCase(), replace(), etc." Well, you're right, my friend. Kotlin has a file several hundred lines long called StringsJVM.kt in the package kotlin.text that defines all your favorite methods as extension functions on the CharSequence interface, the String class or its companion object, and this is the file where most (if not all) of the JVM bridging is defined.
One fascinating highlight in the StringsJVM.kt file is the set of constructor-like functions, single-expression functions that construct a String, e.g. String(chars: CharArray). To remove function call overhead, these functions are declared as inline. This shows how Kotlin language features (top-level functions and inline functions) work together while improving developer productivity.
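To get a feel for the pattern, here is a sketch of a constructor-like function written in the same style. Initials is a hypothetical function defined only for this example. String(CharArray) is the standard library function mentioned above.

```kotlin
// Hypothetical constructor-like function in the style described above:
// a top-level, single-expression, inline function named like a type.
@Suppress("NOTHING_TO_INLINE", "FunctionName")
inline fun Initials(first: String, last: String): String =
    String(charArrayOf(first[0], last[0]))

fun main() {
    println(String(charArrayOf('K', 'o', 't', 'l', 'i', 'n')))  // Kotlin
    println(Initials("Sensei", "Dev"))                          // SD
}
```

At the call site this reads like a constructor call, even though no class named Initials exists.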
Conclusion
We use the String class so often when writing our programs because it's useful for expressing a lot from our real world. String manipulation and the available operations are clearly important to understand as a developer. To dig deeper, I came across an article on Baeldung that discusses String interning in the JVM and some of the changes in how the String class is treated. Check it out.
This article covered some fundamentals of the String class and the APIs exposed by the Kotlin language for representing strings. Happy coding!