<HTML> | |
<HEAD><TITLE>APR Canonical Filenames</TITLE></HEAD> | |
<BODY> | |
<h1>APR Canonical Filename</h1> | |
<h2>Requirements</h2> | |
<p>APR porters need to address the underlying discrepancies between | |
file systems. To achieve a reasonable degree of security, the | |
program depending upon APR needs to know that two paths may be | |
compared, and that a mismatch is guarenteed to reflect that the | |
two paths do not return the same resource</p>. | |
<p>The first discrepancy is in volume roots. Unix and pure deriviates | |
have only one root path, "/". Win32 and OS2 share root paths of | |
the form "D:/", D: is the volume designation. However, this can | |
be specified as "//./D:/" as well, indicating D: volume of the | |
'this' machine. Win32 and OS2 also may employ a UNC root path, | |
of the form "//server/share/" where share is a share-point of the | |
specified network server. Finally, NetWare root paths are of the | |
form "server/volume:/", or the simpler "volume:/" syntax for 'this' | |
machine. All these non-Unix file systems accept volume:path, | |
without a slash following the colon, as a path relative to the | |
current working directory, which APR will treat as ambigious, that | |
is, neither an absolute nor a relative path per se.</p> | |
<p>The second discrepancy is in the meaning of the 'this' directory. | |
In general, 'this' must be eliminated from the path where it occurs. | |
The syntax "path/./" and "path/" are both aliases to path. However, | |
this isn't file system independent, since the double slash "//" has | |
a special meaning on OS2 and Win32 at the start of the path name, | |
and is invalid on those platforms before the "//server/share/" UNC | |
root path is completed. Finally, as noted above, "//./volume/" is | |
legal root syntax on WinNT, and perhaps others.</p> | |
<p>The third discrepancy is in the context of the 'parent' directory. | |
When "parent/path/.." occurs, the path must be unwound to "parent". | |
It's also critical to simply truncate leading "/../" paths to "/", | |
since the parent of the root is root. This gets tricky on the | |
Win32 and OS2 platforms, since the ".." element is invalid before | |
the "//server/share/" is complete, and the "//server/share/../" | |
seqence is the complete UNC root "//server/share/". In relative | |
paths, leading ".." elements are significant, until they are merged | |
with an absolute path. The relative form must only retain the ".." | |
segments as leading segments, to be resolved once merged to another | |
relative or an absolute path.</p> | |
<p>The fourth discrepancy occurs with acceptance of alternate character | |
codes for the same element. Path seperators are not retained within | |
the APR canonical forms. The OS filesystem and APR (slashed) forms | |
can both be returned as strings, to be used in the proper context. | |
Unix, Win32 and Netware all accept slashes and backslashes as the | |
same path seperator symbol, although unix strictly accepts slashes. | |
While the APR form of the name strictly uses slashes, always consider | |
that there could be a platform that actually accepts slashes as a | |
character within a segment name.</p> | |
<p>The fifth and worst discrepancy plauges Win32, OS2, Netware, and some | |
filesystems mounted in Unix. Case insensitivity can permit the same | |
file to slip through in both it's proper case and alternate cases. | |
Simply changing the case is insufficient for any character set beyond | |
ASCII, since various dilectic forms of characters suffer from one to | |
many or many to one translations. An example would be u-umlaut, which | |
might be accepted as a single character u-umlaut, a two character | |
sequence u and the zero-width umlaut, the upper case form of the same, | |
or perhaps even a captial U alone. This can be handled in different | |
ways depending on the purposes of the APR based program, but the one | |
requirement is that the path must be absolute in order to resolve these | |
ambiguities. Methods employed include comparison of device and inode | |
file uniqifiers, which is a fairly fast operation, or quering the OS | |
for the true form of the name, which can be much slower. Only the | |
acknowledgement of the file names by the OS can validate the equality | |
of two different cases of the same filename.</p> | |
<p>The sixth discrepancy, illegal or insignificant characters, is especially | |
significant in non-unix file systems. Trailing periods are accepted | |
but never stored, therefore trailing periods must be ignored for any | |
form of comparison. And all OS's have certain expectations of what | |
characters are illegal (or undesireable due to confusion.)</p> | |
<p>A final warning, canonical functions don't transform or resolve case | |
or character ambiguity issues until they are resolved into an absolute | |
path. The relative canonical path, while useful, while useful for URL | |
or similar identifiers, cannot be used for testing or comparison of file | |
system objects.</p> | |
<hr> | |
<h2>Canonical API</h2> | |
Functions to manipulate the apr_canon_file_t (an opaque type) include: | |
<ul> | |
<li>Create canon_file_t (from char* path and canon_file_t parent path) | |
<li>Merged canon_file_t (from path and parent, both canon_file_t) | |
<li>Get char* path of all or some segments | |
<li>Get path flags of IsRelative, IsVirtualRoot, and IsAbsolute | |
<li>Compare two canon_file_t structures for file equality | |
</ul> | |
<p>The path is corrected to the file system case only if is in absolute | |
form. The apr_canon_file_t should be preserved as long as possible and | |
used as the parent to create child entries to reduce the number of expensive | |
stat and case canonicalization calls to the OS.</p> | |
<p>The comparison operation provides that the APR can postpone correction | |
of case by simply relying upon the device and inode for equivilance. The | |
stat implementation provides that two files are the same, while their | |
strings are not equivilant, and eliminates the need for the operating | |
system to return the proper form of the name.</p> | |
<p>In any case, returning the char* path, with a flag to request the proper | |
case, forces the OS calls to resolve the true names of each segment. Where | |
there is a penality for this operation and the stat device and inode test | |
is faster, case correction is postponed until the char* result is requested. | |
On platforms that identify the inode, device, or proper name interchangably | |
with no penalities, this may occur when the name is initially processed.</p> | |
<hr> | |
<h2>Unix Example</h2> | |
<p>First the simplest case:</p> | |
<pre> | |
Parse Canonical Name | |
accepts parent path as canonical_t | |
this path as string | |
Split this path Segments on '/' | |
For each of this path Segments | |
If first Segment | |
If this Segment is Empty ([nothing]/) | |
Append this Root Segment (don't merge) | |
Continue to next Segment | |
Else is relative | |
Append parent Segments (to merge) | |
Continue with this Segment | |
If Segment is '.' or empty (2 slashes) | |
Discard this Segment | |
Continue with next Segment | |
If Segment is '..' | |
If no previous Segment or previous Segment is '..' | |
Append this Segment | |
Continue with next Segment | |
If previous Segment and previous is not Root Segment | |
Discard previous Segment | |
Discard this Segment | |
Continue with next Segment | |
Append this Relative Segment | |
Continue with next Segment | |
</pre> | |
</BODY> | |
</HTML> |