Unicode sorting in SQLite

Date: 2024-08-18

SQLite does not come with Unicode sorting tables. If you want case-aware and locale-aware sorting, you have to provide your own collator.

In Rust, a nice way to do this is with icu4x.

[dependencies]
icu_collator = "1.4.0"
icu_locid = "1.4.0"

Then use the create_collation() method.

use icu_collator::Collator;
use icu_collator::CollatorOptions;
use icu_locid::locale;

let collator = Collator::try_new(&locale!("en")?, CollatorOptions::new())?;

let mut conn: rusqlite::Connection = todo!();
conn.create_collation("UNICODE", move |a, b| collator.compare(a, b))?;

You can then use unicode-aware sort in your queries by:

SELECT *
FROM users
ORDER BY name COLLATE UNICODE;

If you have a big table, using a custom collator can be slow, because SQLite has to scan the whole table. You can create an index with the desired collation:

CREATE INDEX users_name_unicode_aware ON users (name COLLATE UNICODE);

Thread-safety

ICU4X includes locale data internally. By default, its locale provider isn't thread-safe. If you are using SQLite from different threads, for example with a connection pool, you probably want to share the collator between all threads (otherwise, you'd be lugging around multiple copies of that data).

To do so, you can enable the sync feature for icu_provider. It's a transitive dependency of the other ICU crates. You won't need to reference icu_provider in your code, only in Cargo.toml:

[dependencies]
icu_provider = { version = "1.4.0", features = ["sync"] }

To create a single globally shared collator, I like to use a function that has a OnceLock inside. It'll create the collator the first time it's called, and return a shared reference to that same collator from then on:

use std::sync::OnceLock;
use icu_collator::Collator;
use icu_collator::CollatorOptions;
use icu_locid::locale;

fn collator() -> &'static Collator {
    static COLLATOR: OnceLock<Collator> = OnceLock::new();

    COLLATOR.get_or_init(|| {
        Collator::try_new(
            &locale!("en").into(),
            CollatorOptions::new(),
        ).unwrap()
    })
}

// You can now do this whenever you open a connection!
let mut conn: rusqlite::Connection = todo!();

let collator = collator();
conn.create_collation("UNICODE", move |a, b| collator.compare(a, b))?;