A day with Kotlin Strings

Characters 101

TL;DR: A string is a linear sequence of characters.

A character is a symbol such as a letter, a digit, or a punctuation mark, say 7 or @. Computers ultimately see everything as sequences of 0s and 1s, so we need a way to translate the symbols we humans read into numbers a computer can store. That mapping is called character encoding.

In computing, you've almost certainly come across the most popular of these encodings: ASCII (American Standard Code for Information Interchange). As a standard, it gives us a way to encode (represent) upper and lower case English letters, digits, and punctuation symbols, plus a set of control codes, as 7-bit numbers, so it can encode at most 128 characters.

strings.001.jpeg

On a side note, the folks at ANSI (then the American Standards Association) who came up with this encoding did a clever thing. Upper case letters carry the two high bits 10, and the remaining five bits count from 1 for A through 26 for Z, so A is 1000001. Lower case letters use the same five low bits with the prefix 11, so a is 1100001. This makes it easy to read a letter's position in the alphabet from its five least significant bits. See the illustration below:

strings.002.jpeg
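
The bit layout described above can be checked with a short Kotlin sketch. The helper names alphabetPosition and sevenBit are made up for this illustration, and Char.code assumes Kotlin 1.5 or newer (older versions would use toInt() instead).

```kotlin
// Hypothetical helper: position of an ASCII letter in the alphabet,
// read from its five least significant bits.
fun alphabetPosition(letter: Char): Int {
    require(letter in 'A'..'Z' || letter in 'a'..'z') { "Expected an ASCII letter" }
    return letter.code and 0b11111
}

// Hypothetical helper: the 7-bit binary form of an ASCII character.
fun sevenBit(c: Char): String = c.code.toString(radix = 2).padStart(7, '0')

fun main() {
    println(sevenBit('A'))         // 1000001
    println(sevenBit('a'))         // 1100001
    println(alphabetPosition('Z')) // 26
    println(alphabetPosition('z')) // 26
}
```

Masking with 0b11111 keeps only the five low bits, which is why the upper and lower case forms of a letter give the same position.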

To make things interesting, more and more encodings were created over time to suit the needs of individual languages and regions, and this bred incompatibilities when communicating across those boundaries. A common standard was therefore adopted: Unicode, maintained by the Unicode Consortium. It supports a very large character set, and ASCII got the privilege of being a subset of it, occupying the first 128 code points. As of this writing, the specification can represent 143,859 characters!

Unicode text is usually stored using one of three encoding forms, UTF-8, UTF-16 or UTF-32, where UTF stands for Unicode Transformation Format and the number is the size in bits of one code unit, not of a whole character. UTF-8 uses one to four 8-bit units per character, UTF-16 uses one or two 16-bit units, and UTF-32 always uses a single 32-bit unit. Being a large character set, Unicode can accommodate ever-evolving characters like emojis 😎. Version 13.0.0 of the specification added 55 new emoji characters!

The Kotlin Char

As of version 1.3.72, the Kotlin standard library exposes the following character sets as constants on the Charsets object in the package kotlin.text when targeting the JVM platform.

  • UTF_8: (Eight-bit UCS Transformation Format.)
  • UTF_16: (Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark.)
  • UTF_16BE: (Sixteen-bit UCS Transformation Format, big-endian byte order.)
  • UTF_16LE: (Sixteen-bit UCS Transformation Format, little-endian byte order.)
  • US_ASCII: (Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set.)
  • ISO_8859_1: (ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1.)
  • UTF_32: (32-bit Unicode (or UCS) Transformation Format, byte order identified by an optional byte-order mark.)
  • UTF_32LE: (32-bit Unicode (or UCS) Transformation Format, little-endian byte order.)
  • UTF_32BE: (32-bit Unicode (or UCS) Transformation Format, big-endian byte order.)
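
To see how these charsets differ in practice, here is a minimal sketch that encodes a few sample strings and counts the resulting bytes. The helper encodedSize is a name invented for this example. The big-endian variants are used so that no byte-order mark is added to the count.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: number of bytes `text` occupies in the given charset.
fun encodedSize(text: String, charset: Charset): Int = text.toByteArray(charset).size

fun main() {
    val samples = listOf(
        "A",            // U+0041, plain ASCII
        "\u00E9",       // U+00E9, e with an acute accent
        "\uD83D\uDE0E"  // U+1F60E, the sunglasses emoji, written as a surrogate pair
    )
    for (sample in samples) {
        val utf8 = encodedSize(sample, Charsets.UTF_8)
        val utf16 = encodedSize(sample, Charsets.UTF_16BE)
        val utf32 = encodedSize(sample, Charsets.UTF_32BE)
        println("$sample -> UTF-8: $utf8, UTF-16: $utf16, UTF-32: $utf32 bytes")
    }
}
```

The ASCII letter takes 1, 2 and 4 bytes respectively, the accented letter 2, 2 and 4, and the emoji 4 bytes in all three.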

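Kotlin characters are represented by the Char type. Literals are surrounded by single quotes e.g. 'F' or '\u0046' (a Unicode escape takes exactly four hexadecimal digits). A Char holds a single 16-bit UTF-16 code unit, so a character outside the Basic Multilingual Plane such as U+1F069 does not fit in one Char and is stored in a String as a surrogate pair instead. With that primer on characters, let's look at Strings.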
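
As a rough illustration of the Char type, the sketch below compares a plain literal with its escaped form and shows that a character beyond U+FFFF needs two Chars. The helper countCodePoints is a name made up for this example. codePointCount and codePointAt are JVM-only extensions from kotlin.text.

```kotlin
// Hypothetical helper: counts Unicode code points rather than Char units.
fun countCodePoints(text: String): Int = text.codePointCount(0, text.length)

fun main() {
    val letter = 'F'
    val escaped = '\u0046'           // the same character as a four-digit escape
    println(letter == escaped)       // true

    val domino = "\uD83C\uDC69"      // U+1F069 written as a surrogate pair
    println(domino.length)           // 2
    println(countCodePoints(domino)) // 1
    println(domino.codePointAt(0).toString(16)) // 1f069
}
```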

The Kotlin String

This is represented by the type String and is a classic example of an immutable type: whenever you reorder, insert, remove or otherwise change the characters of a sequence, the original is left untouched and a new instance is created to reflect the change.
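
A small sketch of what immutability means in practice: each operation below hands back a new String while the original keeps its value. The helper removeAt is hypothetical and only wraps the standard removeRange function.

```kotlin
// Hypothetical helper: returns a copy of `text` without the character at `index`.
fun removeAt(text: String, index: Int): String = text.removeRange(index, index + 1)

fun main() {
    val original = "Kotlin"
    val reversed = original.reversed()       // a new String
    val patched = original.replace('K', 'C') // another new String
    val shorter = removeAt(original, 0)      // and another

    println(original) // Kotlin
    println(reversed) // niltoK
    println(patched)  // Cotlin
    println(shorter)  // otlin
}
```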


The String type in Kotlin implements the CharSequence interface. This interface gives us 3 useful members.

public interface CharSequence {

    public val length: Int

    public operator fun get(index: Int): Char

    public fun subSequence(startIndex: Int, endIndex: Int): CharSequence
}

These are self-documenting. To highlight the second one, get(index) is declared as an operator, which lets us read the Char at a given position in a String using square brackets with a 0-based index:

val name = "SenseiDev"

println(name[6]) // D
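
Because these three members live on the interface, code written against CharSequence works for any implementation, not just String. The sketch below assumes a made-up helper called summarize and runs it on both a String and a StringBuilder.

```kotlin
// Hypothetical helper that relies only on the three CharSequence members.
fun summarize(text: CharSequence): String {
    require(text.isNotEmpty()) { "Expected at least one character" }
    val first = text[0]                                   // operator get
    val head = text.subSequence(0, minOf(3, text.length)) // end index is exclusive
    return "length=${text.length}, first=$first, head=$head"
}

fun main() {
    println(summarize("SenseiDev"))             // length=9, first=S, head=Sen
    println(summarize(StringBuilder("Kotlin"))) // length=6, first=K, head=Kot
}
```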

The Kotlin String definition on the JVM is as follows:

public class String : Comparable<String>, CharSequence {
    companion object {}

    public operator fun plus(other: Any?): String

    public override val length: Int

    public override fun get(index: Int): Char

    public override fun subSequence(startIndex: Int, endIndex: Int): CharSequence

    public override fun compareTo(other: String): Int
}

Remember that classes are final by default in Kotlin, so String cannot be subclassed, and none of its members modify the instance. Together these back up the immutability highlighted earlier. As seen above, String adds a few things on top of what the CharSequence interface defines.

  1. plus operator. This allows us to concatenate strings with the + operator e.g. "SenseiDev" + 5. (NB: the + operator is defined on the String type, with the String as the left operand, so it is not commutative i.e. it's a compile error to write 5 + "SenseiDev", unless you define such an extension function on Int.)
  2. A String can be compared to another String as allowed by the Comparable interface and the compareTo method.
  3. An empty companion object, which gives extension functions such as String.format something to attach to.
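
The three additions listed above can be tried out in a few lines. This is only a sketch: the Int.plus extension below is a hypothetical one written for this example, not part of the standard library, and it is the only reason the second println compiles.

```kotlin
// Hypothetical extension, not in the standard library: makes Int + String compile.
operator fun Int.plus(text: String): String = this.toString() + text

fun main() {
    println("SenseiDev" + 5)                 // SenseiDev5, uses String.plus(Any?)
    println(5 + "SenseiDev")                 // 5SenseiDev, needs the extension above
    println("apple".compareTo("banana") < 0) // true
    println("apple" < "banana")              // true, the < operator calls compareTo
    println(String.format("%05d", 42))       // 00042, an extension on the companion object
}
```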

The Kotlin compiler has other neat tricks, e.g. it allows you to:

  • Create String instances from literals e.g. val language = "Kotlin"
  • Express raw String literals, in which escape sequences are not processed, using triple quotes: """
  • Add template expressions to String literals e.g. val displayAmount = "Ksh. ${quantity * basePrice}"
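
Here is a minimal sketch of the three features in the list above. The function names displayAmount and rawSample, and the sample values passed to them, are invented for illustration.

```kotlin
// Hypothetical helper built on a string template.
fun displayAmount(quantity: Int, basePrice: Int): String = "Ksh. ${quantity * basePrice}"

// Raw string: backslashes stay literal, and trimIndent() drops the common indentation.
fun rawSample(): String = """
    C:\temp\notes.txt
    no \n escape here
""".trimIndent()

fun main() {
    val language = "Kotlin"        // a String created from a literal
    println(language)              // Kotlin
    println(displayAmount(3, 250)) // Ksh. 750
    println(rawSample())           // two lines, backslashes printed as written
}
```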

The JVM Bridge

You might be wondering, "Wait a minute. That's weird. Those are very few members on the String class, while we have much more functionality from the class in our JVM programs e.g. toUpperCase(), replace(), etc." Well, you're right my friend. Kotlin has a file several hundred lines long called StringsJVM.kt in the package kotlin.text that defines many of your favorite methods as extension functions on the CharSequence interface, the String class or its companion object, and this is the file where most (if not all) of the JVM bridging is defined.

One fascinating highlight in the StringsJVM.kt file is the set of constructor-like functions, written as single-expression functions that construct a String, e.g. String(chars: CharArray). To remove function call overhead, these functions are declared as inline. This shows how Kotlin language features (top-level functions and inline functions) work together to improve developer productivity.
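
The sketch below imitates that style rather than reproducing the library source. vowelCount and repeated are hypothetical names written for this example, while String(chars) is the standard library function mentioned above. Note that uppercase() assumes Kotlin 1.5 or newer. On 1.3.72 the equivalent call is toUpperCase().

```kotlin
// Hypothetical extension written in the style the standard library uses.
fun CharSequence.vowelCount(): Int = count { it in "aeiouAEIOU" }

// Hypothetical inline factory, mirroring the constructor-like functions described above.
@Suppress("NOTHING_TO_INLINE")
inline fun repeated(char: Char, times: Int): String = String(CharArray(times) { char })

fun main() {
    // Looks like a constructor call, but resolves to a top-level function in kotlin.text.
    val built = String(charArrayOf('K', 'o', 't', 'l', 'i', 'n'))
    println(built)                       // Kotlin
    println(built.uppercase())           // KOTLIN
    println(built.replace("lin", "LIN")) // KotLIN
    println(built.vowelCount())          // 2
    println(repeated('*', 3))            // ***
}
```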

Conclusion

We use the String class so often when writing our programs because it's useful for expressing so much of our real world. String manipulation and the available operations are clearly important to understand as a developer. To dig deeper, I came across an article on Baeldung that discusses String interning in the JVM and some of the changes in how the String class is treated. Check it out.

This article covered some fundamentals of the String class and the APIs exposed by the Kotlin language for representing Strings. Happy coding!
