
libucd


This library extends the built-in char type with the Codepoint trait, which implements 100 properties of the UCD (Unicode Character Database). It aims to be fast and compact, and to have minimal dependencies: it does not require the Rust standard library, only Rust's core crate.

extern crate ucd;
use ucd::Codepoint;

fn main() {
    let salawat: char = 'ﷺ';
    let decomp: String = salawat.decomposition_map().collect();
    println!("{} -> {}", salawat, decomp);
    // ﷺ -> صلى الله عليه وسلم
}

Though the library is fairly extensive, it is not complete. The properties it lacks are:

  • Character names and aliases
  • Unihan properties
  • Tangut source data
  • Named sequences
  • Standardised variants
  • CJK Radicals
  • Emoji source data

Data is compressed as arrays of codepoint ranges: for each property, the ranges that take each value are stored, and the most common value is left out. Lookup is then a binary search over those ranges, which gives O(log n) access in the number of ranges (not O(1)). A very unscientific test suggests each lookup takes around 100ns on a Core i5 processor.
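
The range-table idea can be sketched as follows. This is a minimal illustration, not the library's actual code: the Width enum, the RANGES table and the lookup function are hypothetical names, and the ranges are sample values rather than real UCD data.

```rust
use std::cmp::Ordering;

// Hypothetical two-valued property; Neutral plays the role of the
// "most common value" that is left out of the table.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Width {
    Neutral,
    Wide,
}

// Sorted, non-overlapping, inclusive ranges (start, end, value).
// Sample values for illustration only.
const RANGES: &[(u32, u32, Width)] = &[
    (0x1100, 0x115F, Width::Wide),
    (0x3041, 0x3096, Width::Wide),
    (0x4E00, 0x9FFF, Width::Wide),
];

// Binary search over the ranges; anything not covered by a range
// falls back to the omitted default value.
fn lookup(c: char) -> Width {
    let cp = c as u32;
    RANGES
        .binary_search_by(|&(start, end, _)| {
            if end < cp {
                Ordering::Less
            } else if start > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .map(|i| RANGES[i].2)
        .unwrap_or(Width::Neutral)
}

fn main() {
    println!("{:?}", lookup('あ')); // U+3042 falls inside the second range
    println!("{:?}", lookup('a')); // not in any range, so the default
}
```

Because only the less common values are stored, the table stays small, and the search cost grows with the logarithm of the number of ranges rather than with the number of codepoints.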

Disclaimer

Please note that this data has been derived from the flat XML version of the UCD. Though this is published by Unicode, it is not the official version of the database. As such, it cannot be guaranteed that this library is entirely consistent with the official database. The unit tests do, however, suggest that it is consistent with at least the XML database.

Reference

Note that in most cases enum values are simply the full name of the property value converted to CamelCase. Perhaps I will eventually create a rustdoc version that will be more helpful.

In the tables below, the first column is the name of the property, and the second column is the method name. The methods are implemented via the ucd::Codepoint trait on char.

| General | Method | Return Type | Note |
| --- | --- | --- | --- |
| Age | age | Option<(u8,u8)> | Return None if the character is unassigned, else a tuple of major and minor age |
| Block | block | Option<UnicodeBlock> | |
| General_Category | category | UnicodeCategory | Return Unassigned if unassigned |
| ISO_Comment (deprecated, stabilized) | iso_comment | &str | All characters are null, i.e. "" is returned |

| Function and Appearance | Method | Return Type | Note |
| --- | --- | --- | --- |
| Alphabetic | Codepoint::is_alphabetic or is_alpha | bool | Rust's char already defines a method is_alphabetic which shadows this implementation and provides slightly different results (possibly by using an outdated version of the UCD). Use the namespace, or the is_alpha alias instead. |
| ASCII_Hex_Digit | is_hex_digit_ascii | bool | |
| Dash | is_dash | bool | |
| Default_Ignorable_Code_Point | is_default_ignorable | bool | |
| Deprecated | is_deprecated | bool | |
| Diacritic | is_diacritic | bool | |
| Extender | is_extender | bool | |
| Hex_Digit | is_hex_digit | bool | |
| Hyphen (deprecated, stabilized) | is_hyphen | bool | |
| Logical_Order_Exception | is_logical_order_exception | bool | |
| Math | is_math | bool | |
| Noncharacter_Code_Point | is_noncharacter | bool | |
| Other_Alphabetic | is_alphabetic_other | bool | |
| Other_Default_Ignorable_Code_Point | is_default_ignorable_other | bool | |
| Other_Math | is_math_other | bool | |
| Prepended_Concatenation_Mark | is_prepended_concatentation_mark | bool | |
| Quotation_Mark | is_quotation_mark | bool | |
| Sentence_Terminal | is_sentence_terminal | bool | |
| Soft_Dotted | is_soft_dotted | bool | |
| Terminal_Punctuation | is_terminal_punctuation | bool | |
| Variation_Selector | is_variation_selector | bool | |
| White_Space | Codepoint::is_whitespace or is_white | bool | Again, char shadows this with its own implementation so use the namespace or alias. |
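
The shadowing mentioned in the notes comes from Rust's method resolution: an inherent method on char wins over a trait method of the same name when called with dot syntax. The sketch below shows this with a hypothetical Demo trait standing in for ucd::Codepoint; its deliberately odd definition of "alphabetic" exists only to make the dispatch visible.

```rust
// Hypothetical stand-in for a trait that adds methods to char.
trait Demo {
    fn is_alphabetic(self) -> bool;
    fn is_alpha(self) -> bool;
}

impl Demo for char {
    // Contrived rule (only '_' counts) so the two implementations
    // are easy to tell apart.
    fn is_alphabetic(self) -> bool {
        self == '_'
    }

    // Alias with a name that char does not define itself.
    fn is_alpha(self) -> bool {
        Demo::is_alphabetic(self)
    }
}

fn main() {
    let c = 'a';
    // Dot syntax picks char's own inherent method.
    println!("{}", c.is_alphabetic());
    // Naming the trait selects the trait implementation.
    println!("{}", Demo::is_alphabetic(c));
    // The alias is unambiguous, so dot syntax reaches the trait.
    println!("{}", c.is_alpha());
}
```

With the library itself, the table suggests the same two routes: the qualified form Codepoint::is_alphabetic, or the is_alpha alias.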

| Numeric | Method | Return Type | Note |
| --- | --- | --- | --- |
| Numeric_Type | numeric_type | Option<NumericType> | |
| Numeric_Value | numeric_value | Option<Number> | The Number type is an enum of Integer(i64) and Rational(i32,u32), to cover values as large as 10^12 and as small as 1/160. |
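
To show how a value of this shape might be consumed, here is a small sketch. The Number enum below is a local re-declaration based only on the description in the note, and to_f64 is a hypothetical helper, not something the library is known to provide.

```rust
// Local stand-in mirroring the shape described above; the real type
// lives in the library and may differ in derives and details.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Number {
    Integer(i64),
    Rational(i32, u32),
}

// Hypothetical helper: flatten either variant to a float.
fn to_f64(n: Number) -> f64 {
    match n {
        Number::Integer(i) => i as f64,
        Number::Rational(num, den) => f64::from(num) / f64::from(den),
    }
}

fn main() {
    // 10^12 does not fit in an i32, hence the i64 payload.
    println!("{}", to_f64(Number::Integer(1_000_000_000_000)));
    // The sign is carried by the numerator; the denominator is unsigned.
    println!("{}", to_f64(Number::Rational(1, 160)));
}
```

Keeping rationals as a numerator and denominator pair avoids rounding until the caller decides a float is acceptable.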

| Identifiers and Syntax | Method | Return Type | Note |
| --- | --- | --- | --- |
| ID_Continue | is_id_continue | bool | |
| ID_Start | is_id_start | bool | |
| Other_ID_Continue | is_id_continue_other | bool | |
| Other_ID_Start | is_id_start_other | bool | |
| Pattern_Syntax | is_pattern_syntax | bool | |
| Pattern_White_Space | is_pattern_whitespace | bool | |
| XID_Continue | is_id_continue_nfkc | bool | |
| XID_Start | is_id_start_nfkc | bool | |

| Scripts | Method | Return Type | Note |
| --- | --- | --- | --- |
| East_Asian_Width | east_asian_width | EastAsianWidth | |
| Hangul_Syllable_Type | hangul_syllable_type | Option<HangulSyllableType> | |
| Ideographic | is_ideograph | bool | |
| IDS_Binary_Operator | is_ideograph_description_sequence_binary_operator | bool | |
| IDS_Trinary_Operator | is_ideograph_description_sequence_trinary_operator | bool | |
| Indic_Positional_Category | indic_positional_category | Option<IndicPositionalCategory> | |
| Indic_Syllabic_Category | indic_syllabic_category | Option<IndicSyllabicCategory> | |
| Jamo_Short_Name | jamo_short_name | Option<&str> | |
| Join_Control | join_control | bool | |
| Joining_Group | joining_group | JoiningGroup | |
| Joining_Type | joining_type | JoiningType | |
| Radical | is_ideograph_description_sequence_radical | bool | |
| Script | script | Option<Script> | |
| Script_Extensions | script_extensions | Option<&[Script]> | |
| Unified_Ideograph | is_ideograph_unified | bool | |

| Bidirectionality | Method | Return Type | Note |
| --- | --- | --- | --- |
| Bidi_Class | bidi_class | BidiClass | |
| Bidi_Control | bidi_is_control | bool | |
| Bidi_Mirrored | bidi_is_mirrored | bool | |
| Bidi_Mirroring_Glyph | bidi_mirror | Option<char> | |
| Bidi_Paired_Bracket | bidi_paired_braclet | char | |
| Bidi_Paired_Bracket_Type | bidi_paired_bracket_type | Option<BidiPairedBracketType> | |

| Case | Method | Return Type | Note |
| --- | --- | --- | --- |
| Case_Folding | casefold | CharIter | CharIter is an iterator over a series of chars, and is used because the library makes no use of std and thus cannot dynamically allocate memory. |
| Case_Ignorable | is_case_ignorable | bool | |
| Cased | is_cased | bool | |
| Changes_When_Casefolded | changes_when_casefolded | bool | |
| Changes_When_Casemapped | changes_when_casemapped | bool | |
| Changes_When_Lowercased | changes_when_lowercased | bool | |
| Changes_When_NFKC_Casefolded | changes_when_casefolded_nfkc | bool | |
| Changes_When_Titlecased | changes_when_titlecased | bool | |
| Changes_When_Uppercased | changes_when_uppercased | bool | |
| FC_NFKC_Closure (deprecated) | casefold_nfkc_closure | CharIter | |
| Lowercase | is_lowercase | bool | Again, char shadows this with its own implementation so use the namespace or alias. |
| Lowercase_Mapping | lowercase | CharIter | |
| NFKC_Casefold | casefold_nfkc | CharIter | |
| Other_Lowercase | is_lowercase_other | bool | |
| Other_Uppercase | is_uppercase_other | bool | |
| Simple_Case_Folding | casefold_simple | char | |
| Simple_Lowercase_Mapping | lowercase_simple | char | |
| Simple_Titlecase_Mapping | titlecase_simple | char | |
| Simple_Uppercase_Mapping | uppercase_simple | char | |
| Titlecase_Mapping | titlecase | CharIter | |
| Uppercase | is_uppercase | bool | Again, char shadows this with its own implementation so use the namespace or alias. |
| Uppercase_Mapping | uppercase | CharIter | |
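
One way to return several chars without heap allocation is an iterator over a small fixed-size buffer. The sketch below illustrates that general technique only; FixedChars, its capacity of four and its constructor are assumptions made for the example, and how the library's CharIter is built internally is not described in this document.

```rust
// Hypothetical allocation-free iterator over at most four chars.
#[derive(Clone, Copy, Debug)]
struct FixedChars {
    buf: [char; 4],
    len: usize,
    pos: usize,
}

impl FixedChars {
    // Copies up to four chars into the inline buffer; extra input is
    // silently truncated in this sketch.
    fn new(chars: &[char]) -> Self {
        let mut buf = ['\0'; 4];
        let len = chars.len().min(4);
        buf[..len].copy_from_slice(&chars[..len]);
        FixedChars { buf, len, pos: 0 }
    }
}

impl Iterator for FixedChars {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pos < self.len {
            let c = self.buf[self.pos];
            self.pos += 1;
            Some(c)
        } else {
            None
        }
    }
}

fn main() {
    // A one-to-many mapping such as 'ß' uppercasing to "SS".
    let upper = FixedChars::new(&['S', 'S']);
    // Collecting into a String is the caller's choice and needs std;
    // the iterator itself does not allocate.
    let s: String = upper.collect();
    println!("{}", s);
}
```

A caller that does have std available can collect such an iterator into a String, as the decomposition_map example near the top of this document does.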

| Normalisation | Method | Return Type | Note |
| --- | --- | --- | --- |
| Canonical_Combining_Class | canonical_combining_class | u8 | |
| Composition_Exclusion | excluded_from_composition | bool | |
| Decomposition_Mapping | decomposition_map | CharIter | |
| Decomposition_Type | decomposition_type | Option<DecompositionType> | |
| Expands_On_NFC (deprecated) | expands_on_nfc | bool | |
| Expands_On_NFD (deprecated) | expands_on_nfd | bool | |
| Expands_On_NFKC (deprecated) | expands_on_nfkc | bool | |
| Expands_On_NFKD (deprecated) | expands_on_nfkd | bool | |
| Full_Composition_Exclusion | excluded_from_composition_fully | bool | |
| NFC_Quick_Check | quick_check_nfc | Trilean | Returns one of Trilean::True, Trilean::Maybe or Trilean::False |
| NFD_Quick_Check | quick_check_nfd | bool | |
| NFKC_Quick_Check | quick_check_nfkc | Trilean | |
| NFKD_Quick_Check | quick_check_nfkd | bool | |
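
A three-valued result is typically folded across a string: any False settles the answer, otherwise any Maybe means a full check is still needed. The sketch below shows only that folding step. The Trilean enum is a local stand-in with the variant names from the note, combine is a hypothetical helper, and the canonical-ordering test that a complete quick check also performs is deliberately left out.

```rust
// Local stand-in using the variant names given in the table note.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Trilean {
    True,
    Maybe,
    False,
}

// Hypothetical helper: fold per-character results into one answer.
// Simplified; it ignores combining-class ordering.
fn combine<I: IntoIterator<Item = Trilean>>(results: I) -> Trilean {
    let mut acc = Trilean::True;
    for r in results {
        match r {
            Trilean::False => return Trilean::False,
            Trilean::Maybe => acc = Trilean::Maybe,
            Trilean::True => {}
        }
    }
    acc
}

fn main() {
    let per_char = vec![Trilean::True, Trilean::Maybe, Trilean::True];
    println!("{:?}", combine(per_char));
}
```

Only a Maybe outcome would require running the full normalisation to get a definite answer.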

| Segmentation | Method | Return Type | Note |
| --- | --- | --- | --- |
| Grapheme_Base | is_grapheme_base | bool | |
| Grapheme_Cluster_Break | grapheme_cluster_break | GraphemeClusterBreak | |
| Grapheme_Extend | is_grapheme_extend | bool | |
| Grapheme_Link (deprecated) | is_grapheme_link | bool | |
| Line_Break | linebreak_class | Option<LinebreakClass> | |
| Other_Grapheme_Extend | is_grapheme_extend_other | bool | |
| Sentence_Break | sentence_break | SentenceBreak | |
| Word_Break | word_break | WordBreak | |