Nathan Willis	9f4b375	2018-10-29 17:10:53 -0500	[diff] [blame]	1	<?xml version="1.0"?>
				2	<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
				3	"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
				4	<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
				5	<!ENTITY version SYSTEM "version.xml">
				6	]>
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	7	<chapter id="shaping-concepts">
				8	<title>Shaping concepts</title>
				9	<section id="text-shaping-concepts">
				10	<title>Text shaping</title>
				11	<para>
				12	Text shaping is the process of transforming a sequence of Unicode
				13	codepoints that represent individual characters (letters,
				14	diacritics, tone marks, numbers, symbols, etc.) into the
				15	orthographically and linguistically correct two-dimensional layout
				16	of glyph shapes taken from a specified font.
				17	</para>
				18	<para>
				19	For some writing systems (or <emphasis>scripts</emphasis>) and
				20	languages, the process is simple, requiring the shaper to do
				21	little more than advance the horizontal position forward by the
				22	correct amount for each successive glyph.
				23	</para>
				24	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	25	But, for other scripts (often unceremoniously called <emphasis>complex scripts</emphasis>), any combination of
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	26	several shaping operations may be required, and the rules for how
				27	and when they are applied vary from script to script. HarfBuzz and
				28	other shaping engines implement these rules.
				29	</para>
				30	<para>
				31	The exact rules and necessary operations for a particular script
				32	constitute a shaping <emphasis>model</emphasis>. OpenType
				33	specifies a set of shaping models that covers all of
				34	Unicode. Other shaping models are available, however, including
				35	Graphite and Apple Advanced Typography (AAT).
				36	</para>
				37	</section>
				38
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	39	<section id="script-specific-shaping">
				40	<title>Script-specific shaping</title>
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	41	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	42	In many scripts, transforming the input
				43	sequence into the final layout requires some combination of
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	44	operations—such as context-dependent substitutions,
				45	context-dependent mark positioning, glyph-to-glyph joining,
				46	glyph reordering, or glyph stacking.
				47	</para>
				48	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	49	In some scripts, the shaping rules require that a text
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	50	run be divided into syllables before the operations can be
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	51	applied. Other scripts may apply shaping operations over
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	52	entire words or over the entire text run, with no subdivision
				53	required.
				54	</para>
				55	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	56	Other scripts do not require these
				57	operations. However, correctly shaping a text run in
				58	any script may still involve Unicode normalization,
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	59	ligature substitutions, mark positioning, kerning, and applying
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	60	other font features.
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	61	</para>
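<para>
  As a small illustration of the Unicode normalization mentioned
  above, the following sketch uses Python's standard
  <literal>unicodedata</literal> module to show that one
  user-perceived character can have a composed and a decomposed
  codepoint representation. The helper name
  <literal>describe</literal> is hypothetical, and the sketch does
  not show how any particular shaping engine normalizes text.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two codepoints, one visible character.
decomposed = "e\u0301"

# NFC composes the pair into the single codepoint U+00E9 where one exists.
composed = unicodedata.normalize("NFC", decomposed)

def describe(text):
    """Hypothetical helper: list the codepoints of a string in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(describe(decomposed))
print(describe(composed))
```
]]></programlisting>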
				62	</section>
				63
				64	<section id="shaping-operations">
				65	<title>Shaping operations</title>
				66	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	67	Shaping a text run involves transforming the
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	68	input sequence of Unicode codepoints with some combination of
				69	operations that is specified in the shaping model for the
				70	script.
				71	</para>
				72	<para>
				73	The specific conditions that trigger a given operation for a
				74	text run vary from script to script, as do the order in which the
				75	operations are performed and which codepoints are
				76	affected. However, the same general set of shaping operations is
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	77	common to all of the script shaping models.
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	78	</para>
				79
				80	<itemizedlist>
				81	<listitem>
				82	<para>
				83	A <emphasis>reordering</emphasis> operation moves a glyph
				84	from its original ("logical") position in the sequence to
				85	some other ("visual") position.
				86	</para>
				87	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	88	The shaping model for a given script might involve
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	89	more than one reordering step.
				90	</para>
				91	</listitem>
				92
				93	<listitem>
				94	<para>
				95	A <emphasis>joining</emphasis> operation replaces a glyph
				96	with an alternate form that is designed to connect with one
				97	or more of the adjacent glyphs in the sequence.
				98	</para>
				99	</listitem>
				100
				101	<listitem>
				102	<para>
				103	A contextual <emphasis>substitution</emphasis> operation
				104	replaces either a single glyph or a subsequence of several
				105	glyphs with an alternate glyph. This substitution is
				106	performed when the original glyph or subsequence of glyphs
				107	occurs in a specified position with respect to the
				108	surrounding sequence. For example, one substitution might be
				109	performed only when the target glyph is the first glyph in
				110	the sequence, while another substitution is performed only
				111	when a different target glyph occurs immediately after a
				112	particular string pattern.
				113	</para>
				114	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	115	The shaping model for a given script might involve
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	116	multiple contextual-substitution operations, each applying
				117	to different target glyphs and patterns, and which are
				118	performed in separate steps.
				119	</para>
				120	</listitem>
				121
				122	<listitem>
				123	<para>
				124	A contextual <emphasis>positioning</emphasis> operation
				125	moves the horizontal and/or vertical position of a
				126	glyph. This positioning move is performed when the glyph
				127	occurs in a specified position with respect to the
				128	surrounding sequence.
				129	</para>
				130	<para>
				131	Many contextual positioning operations are used to place
				132	<emphasis>mark</emphasis> glyphs (such as diacritics, vowel
				133	signs, and tone markers) with respect to
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	134	<emphasis>base</emphasis> glyphs. However, some
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	135	scripts may use contextual positioning operations to
				136	correctly place base glyphs as well, such as
				137	when the script uses <emphasis>stacking</emphasis> characters.
				138	</para>
				139	</listitem>
				140
				141	</itemizedlist>
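<para>
  The reordering operation described above can be sketched on a
  very small scale. The example below is a simplified assumption,
  not the behavior of any real shaping model: it works on
  codepoints rather than glyphs, handles only DEVANAGARI VOWEL SIGN
  I (U+093F), and moves it before the single preceding character
  instead of before a full consonant cluster. The function name
  <literal>reorder_pre_base_vowel</literal> is hypothetical.
</para>
<programlisting language="python"><![CDATA[
```python
DEVANAGARI_VOWEL_SIGN_I = "\u093F"  # written after the consonant, drawn to its left

def reorder_pre_base_vowel(logical):
    """Toy reordering: map logical order to a simplified visual order."""
    visual = []
    for ch in logical:
        if ch == DEVANAGARI_VOWEL_SIGN_I and visual:
            # Place the vowel sign before the character it follows.
            visual.insert(len(visual) - 1, ch)
        else:
            visual.append(ch)
    return visual

logical = ["\u0915", "\u093F"]          # KA, VOWEL SIGN I (logical order)
visual = reorder_pre_base_vowel(logical)
print([f"U+{ord(ch):04X}" for ch in visual])
```
]]></programlisting>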
				142	</section>
				143
				144	<section id="unicode-character-categories">
				145	<title>Unicode character categories</title>
				146	<para>
				147	Shaping models are typically specified with respect to how
				148	scripts are defined in the Unicode standard.
				149	</para>
				150	<para>
				151	Every codepoint in the Unicode Character Database (UCD) is
				152	assigned a <emphasis>Unicode General Category</emphasis> (UGC),
				153	which provides the most fundamental information about the
				154	codepoint: whether the codepoint represents a
				155	<emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
				156	<emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
				157	<emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
				158	or something else (<emphasis>Other</emphasis>).
				159	</para>
				160	<para>
				161	These UGC properties are "Major" categories. Each codepoint is
				162	further assigned to a "minor" category within its Major
				163	category, such as "Letter, uppercase" (<literal>Lu</literal>) or
				164	"Letter, modifier" (<literal>Lm</literal>).
				165	</para>
				166	<para>
				167	Shaping models are concerned primarily with Letter and Mark
				168	codepoints. The minor categories of Mark codepoints are
				169	particularly important for shaping. Marks can be nonspacing
				170	(<literal>Mn</literal>), spacing combining
				171	(<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
				172	</para>
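<para>
  The Major and minor categories can be inspected with Python's
  standard <literal>unicodedata</literal> module, as in the sketch
  below. The helper name <literal>general_category</literal> is
  hypothetical, and the sample codepoints were chosen only to
  cover the Letter and Mark subcategories mentioned above.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata

def general_category(ch):
    """Return (Major, minor) General Category codes, e.g. ('L', 'Lu')."""
    minor = unicodedata.category(ch)
    return minor[0], minor

samples = ["A", "\u02B0", "\u0301", "\u093E", "\u20DD"]
for ch in samples:
    major, minor = general_category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {major} / {minor}")
```
]]></programlisting>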
				173	<para>
				174	In addition to the UGC property, codepoints in the Indic and
				175	Southeast Asian scripts are also assigned
				176	<emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
				177	<emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
Nathan Willis	ed13cad	2018-11-28 13:48:38 -0600	[diff] [blame]	178	properties that provide more detailed information needed for
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	179	shaping.
				180	</para>
				181	<para>
				182	The UISC property sub-categorizes Letters and Marks according to
				183	common script-shaping behaviors. For example, UISC distinguishes
				184	between consonant letters, vowel letters, and vowel marks. The
Nathan Willis	ed13cad	2018-11-28 13:48:38 -0600	[diff] [blame]	185	UIPC property sub-categorizes Mark codepoints by the relative visual
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	186	position that they occupy (above, below, right, left, or in
				187	multiple positions).
				188	</para>
				189	<para>
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	190	Some scripts require that the text run be split into
Nathan Willis	ed13cad	2018-11-28 13:48:38 -0600	[diff] [blame]	191	syllables. What constitutes a valid syllable in these
				192	scripts is specified in regular expressions, formed from the
				193	Letter and Mark codepoints, that take the UISC and UIPC
				194	properties into account.
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	195	</para>
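<para>
  The idea of matching syllables with a regular expression over
  character classes can be sketched as follows. Everything in this
  example is a simplified assumption: the class letters, the
  codepoint ranges (a small part of the Devanagari block), and the
  pattern itself are hypothetical and far coarser than the UISC and
  UIPC data or the expressions used by a real shaping model.
</para>
<programlisting language="python"><![CDATA[
```python
import re

def toy_class(ch):
    """Map a codepoint to a one-letter toy class (not real UISC values)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant letter
    if cp == 0x094D:
        return "H"  # virama (halant)
    if 0x093E <= cp <= 0x094C:
        return "M"  # dependent vowel sign
    if 0x0904 <= cp <= 0x0914:
        return "V"  # independent vowel letter
    if 0x0901 <= cp <= 0x0903:
        return "S"  # candrabindu, anusvara, visarga
    return "X"      # anything else

# Toy syllable: consonants joined by viramas, then an optional vowel sign
# and an optional final sign; or an independent vowel; or any lone character.
SYLLABLE = re.compile(r"(?:CH)*CM?S?|VS?|.")

def split_syllables(text):
    """Split text into toy syllables by matching the class string."""
    classes = "".join(toy_class(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]

word = "\u0939\u093F\u0928\u094D\u0926\u0940"  # HA, I, NA, VIRAMA, DA, II
print(split_syllables(word))
```
]]></programlisting>
<para>
  Because each codepoint maps to exactly one class letter, match
  offsets in the class string can be used directly as offsets in
  the original text.
</para>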
				196
				197	</section>
				198
				199	<section id="text-runs">
				200	<title>Text runs</title>
				201	<para>
				202	Real-world text usually contains codepoints from a mixture of
				203	different Unicode scripts (including punctuation, numbers, symbols,
				204	white-space characters, and other codepoints that do not belong
				205	to any script). Real-world text may also be marked up with
				206	formatting that changes font properties (including the font,
				207	font style, and font size).
				208	</para>
				209	<para>
				210	For shaping purposes, all real-world text streams must first be
				211	segmented into runs that have a uniform set of properties.
				212	</para>
				213	<para>
				214	In particular, shaping models always assume that every codepoint
				215	in a text run has the same <emphasis>direction</emphasis>,
				216	<emphasis>script</emphasis> tag, and
				217	<emphasis>language</emphasis> tag.
				218	</para>
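<para>
  A minimal sketch of segmenting text into runs by script is shown
  below. It is an illustration under stated assumptions only: the
  function <literal>toy_script</literal> is a hypothetical stand-in
  that guesses the script from a few codepoint ranges, characters
  it does not recognize are treated as "Common" and attached to the
  preceding run, and direction and language are ignored. Real
  segmentation relies on the Unicode script property and the
  bidirectional algorithm.
</para>
<programlisting language="python"><![CDATA[
```python
from itertools import groupby

def toy_script(ch):
    """Crude script guess from codepoint ranges (an assumption, not UCD data)."""
    cp = ord(ch)
    if ch.isalpha() and cp < 0x0250:
        return "Latin"
    if 0x0590 <= cp <= 0x05FF:
        return "Hebrew"
    if 0x0600 <= cp <= 0x06FF:
        return "Arabic"
    return "Common"

def segment_runs(text):
    """Split text into (script, substring) runs with a uniform script."""
    resolved = []
    previous = "Common"
    for ch in text:
        script = toy_script(ch)
        if script == "Common":
            script = previous  # spaces, digits, punctuation join the prior run
        resolved.append(script)
        previous = script

    runs = []
    start = 0
    for script, group in groupby(resolved):
        length = len(list(group))
        runs.append((script, text[start:start + length]))
        start += length
    return runs

for script, run in segment_runs("abc \u05D0\u05D1 def"):
    print(script, repr(run))
```
]]></programlisting>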
				219	</section>
				220
				221	<section id="opentype-shaping-models">
				222	<title>OpenType shaping models</title>
				223	<para>
				224	OpenType provides shaping models for the following scripts:
				225	</para>
				226
				227	<itemizedlist>
				228	<listitem>
				229	<para>
				230	The <emphasis>default</emphasis> shaping model handles all
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	231	scripts with no script-specific shaping model, and may also be used as a fallback for
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	232	handling unrecognized scripts.
				233	</para>
				234	</listitem>
				235
				236	<listitem>
				237	<para>
				238	The <emphasis>Indic</emphasis> shaping model handles the Indic
				239	scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
David Corbett	78c5ae3	2022-06-25 13:32:04 -0400	[diff] [blame]	240	Malayalam, Oriya, Tamil, and Telugu.
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	241	</para>
				242	<para>
				243	The Indic shaping model was revised significantly in
				244	2005. To denote the change, a new set of <emphasis>script
				245	tags</emphasis> was assigned for Bengali, Devanagari,
				246	Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
				247	Telugu. For the sake of clarity, the term "Indic2" is
				248	sometimes used to refer to the current, revised shaping
				249	model.
				250	</para>
				251	</listitem>
				252
				253	<listitem>
				254	<para>
				255	The <emphasis>Arabic</emphasis> shaping model supports
				256	Arabic, Mongolian, N'Ko, Syriac, and several other connected
				257	or cursive scripts.
				258	</para>
				259	</listitem>
				260
				261	<listitem>
				262	<para>
				263	The <emphasis>Thai/Lao</emphasis> shaping model supports
				264	the Thai and Lao scripts.
				265	</para>
				266	</listitem>
				267
				268	<listitem>
				269	<para>
				270	The <emphasis>Khmer</emphasis> shaping model supports the
				271	Khmer script.
				272	</para>
				273	</listitem>
				274
				275	<listitem>
				276	<para>
				277	The <emphasis>Myanmar</emphasis> shaping model supports the
				278	Myanmar (or Burmese) script.
				279	</para>
				280	</listitem>
				281
				282	<listitem>
				283	<para>
				284	The <emphasis>Tibetan</emphasis> shaping model supports the
				285	Tibetan script.
				286	</para>
				287	</listitem>
				288
				289	<listitem>
				290	<para>
				291	The <emphasis>Hangul</emphasis> shaping model supports the
				292	Hangul script.
				293	</para>
				294	</listitem>
				295
				296	<listitem>
				297	<para>
				298	The <emphasis>Hebrew</emphasis> shaping model supports the
				299	Hebrew script.
				300	</para>
				301	</listitem>
				302
				303	<listitem>
				304	<para>
				305	The <emphasis>Universal Shaping Engine</emphasis> (USE)
Khaled Hosny	8d36300	2022-06-03 21:00:08 +0200	[diff] [blame]	306	shaping model supports scripts not covered by one of
Nathan Willis	3a27e8f	2018-10-12 18:23:26 -0500	[diff] [blame]	307	the script-specific shaping models above, including
				308	Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
				309	Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
				310	Viet, and many others.
				311	</para>
				312	</listitem>
				313
				314	<listitem>
				315	<para>
				316	Text runs that do not fall under one of the above shaping
				317	models may still require processing by a shaping engine. Of
				318	particular note is <emphasis>Emoji</emphasis> shaping, which
				319	may involve variation-selector sequences and glyph
				320	substitution. Emoji shaping is handled by the default
				321	shaping model.
				322	</para>
				323	</listitem>
				324
				325	</itemizedlist>
				326
				327	</section>
				328
				329	<section id="graphite-shaping">
				330	<title>Graphite shaping</title>
				331	<para>
				332	In contrast to OpenType shaping, Graphite shaping does not
				333	specify a predefined set of shaping models or a set of supported
				334	scripts.
				335	</para>
				336	<para>
				337	Instead, each Graphite font contains a complete set of rules that
				338	implement the required shaping model for the intended
				339	script. These rules include finite-state machines to match
				340	sequences of codepoints to the shaping operations to perform.
				341	</para>
				342	<para>
				343	Graphite shaping can perform the same shaping operations used in
				344	OpenType shaping, as well as other functions that have not been
				345	defined for OpenType shaping.
				346	</para>
				347	</section>
				348
				349	<section id="aat-shaping">
				350	<title>AAT shaping</title>
				351	<para>
				352	In contrast to OpenType shaping, AAT shaping does not specify a
				353	predefined set of shaping models or a set of supported scripts.
				354	</para>
				355	<para>
				356	Instead, each AAT font includes a complete set of rules that
				357	implement the desired shaping model for the intended
				358	script. These rules include finite-state machines that match glyph
				359	sequences and specify the shaping operations to perform.
				360	</para>
				361	<para>
				362	Notably, AAT shaping rules are expressed for glyphs in the font,
				363	not for Unicode codepoints. AAT shaping can perform the same
				364	shaping operations used in OpenType shaping, as well as other
				365	functions that have not been defined for OpenType shaping.
				366	</para>
				367	</section>
				368	</chapter>