| # URL syntax and its use in curl |
| |
| ## Specifications |
| |
| The official "URL syntax" is primarily defined in these two different |
| specifications: |
| |
| - [RFC 3986](https://tools.ietf.org/html/rfc3986) (although URL is called "URI" in there) |
| - [The WHATWG URL Specification](https://url.spec.whatwg.org/) |
| |
| RFC 3986 is the earlier one, and curl has always tried to adhere to that one |
| (since it shipped in January 2005). |
| |
| The WHATWG URL spec was written later, is incompatible with RFC 3986 and |
| changes over time. |
| |
| ## Variations |
| |
| URL parsers as implemented in browsers, libraries and tools usually opt to |
| support one of the mentioned specifications. Bugs, differences in |
| interpretations and the moving nature of the WHATWG spec do however make it |
| very unlikely that multiple parsers treat URLs the exact same way! |
| |
| ## Security |
| |
| Due to the inherent differences between URL parser implementations, it is |
| considered a security risk to mix different implementations and assume the |
| same behavior! |
| |
| For example, if you use one parser to check if a URL uses a good host name or |
| the correct auth field, and then pass on that same URL to a *second* parser, |
| there will always be a risk it treats the same URL differently. There is no |
| right and wrong in URL land, only differences of opinions. |
| |
| libcurl offers a separate API to its URL parser for this reason, among others. |
| |
| Applications may at times find it convenient to allow users to specify URLs |
| for various purposes and that string would then end up fed to curl. Getting a |
| URL from an external untrusted party and using it with curl brings several |
| security concerns: |
| |
| 1. If you have an application that runs as or in a server application, getting |
| an unfiltered URL can trick your application into accessing a local resource |
| instead of a remote resource. Protecting yourself against localhost accesses |
| is very hard when accepting user provided URLs. |
| |
| 2. Such custom URLs can access other ports than you planned as port numbers |
| are part of the regular URL format. The combination of a local host and a |
| custom port number can allow external users to play tricks with your local |
| services. |
| |
| 3. Such a URL might use other schemes than you thought of or planned for. |
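
The three concerns above can be turned into an allow-list check. The following is a minimal Python sketch, not anything curl provides: `ALLOWED_SCHEMES`, `ALLOWED_PORTS` and `check_untrusted_url` are hypothetical names, and the policy values are assumptions. It parses with Python's `urllib.parse`, which is not curl's parser, so the parser-mismatch risk described earlier still applies if the same string is later handed to curl.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical policy values; adjust to what the application really needs.
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_PORTS = {80, 443}


def check_untrusted_url(url: str) -> bool:
    """Return True if the URL passes this illustrative allow-list policy."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES:  # urlsplit lowercases the scheme
        return False
    host = parts.hostname
    if not host:
        return False
    if port is None:
        port = 443 if parts.scheme == "https" else 80
    if port not in ALLOWED_PORTS:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A DNS name can still resolve to a local address,
        # so this check alone is not sufficient.
        return True
    return not (address.is_loopback or address.is_private or address.is_link_local)
```

A name that passes this check can still resolve to a loopback or private address, which is one reason the text calls this protection very hard.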
| |
| ## "RFC3986 plus" |
| |
| curl recognizes a URL syntax that we call "RFC 3986 plus". It is grounded on |
| the well established RFC 3986 to make sure previously written command lines and |
| scripts that use curl keep working. |
| |
| curl's URL parser allows a few deviations from the spec in order to |
| inter-operate better with URLs that appear in the wild. |
| |
| ### spaces |
| |
| In particular `Location:` headers that indicate to the client where a resource |
| has been redirected to, sometimes contain spaces. This is a violation of RFC |
| 3986 but is fine in the WHATWG spec. curl handles these by re-encoding them to |
| `%20`. |
| |
| ### non-ASCII |
| |
| Byte values in a provided URL that are outside of the printable ASCII range |
| are percent-encoded by curl. |
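
To illustrate the kind of re-encoding described for spaces and non-ASCII bytes, here is a small Python sketch. It is not curl's code: the function name is hypothetical, and treating the input as UTF-8 and emitting lowercase hex digits are assumptions of the sketch.

```python
def encode_non_printable(url: str) -> str:
    """Percent-encode every byte outside the printable ASCII range 0x21-0x7E."""
    encoded = []
    for byte in url.encode("utf-8"):
        if 0x21 <= byte <= 0x7E:
            encoded.append(chr(byte))  # printable ASCII is kept as-is
        else:
            encoded.append(f"%{byte:02x}")  # space, control and non-ASCII bytes
    return "".join(encoded)


print(encode_non_printable("http://example.com/a b"))
```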
| |
| ### multiple slashes |
| |
| An absolute URL always starts with a "scheme" followed by a colon. For all the |
| schemes curl supports, the colon must be followed by two slashes according to |
| RFC 3986 but not according to the WHATWG spec, which accepts one or more |
| slashes with no upper limit. |
| |
| curl allows one, two or three slashes after the colon to still be considered a |
| valid URL. |
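
The one-to-three rule can be expressed as a simple count of the slashes that follow the scheme and colon. This Python sketch only illustrates the rule as stated here; the function names are hypothetical and the scheme pattern is taken from RFC 3986, not from curl's source.

```python
import re
from typing import Optional

# RFC 3986 scheme syntax, followed by a colon and any run of slashes.
_SCHEME_AND_SLASHES = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(/*)")


def slashes_after_scheme(url: str) -> Optional[int]:
    """Return the number of slashes after 'scheme:', or None without a scheme."""
    match = _SCHEME_AND_SLASHES.match(url)
    if match is None:
        return None
    return len(match.group(2))


def has_accepted_slash_count(url: str) -> bool:
    """True when the URL has one, two or three slashes after the colon."""
    count = slashes_after_scheme(url)
    return count is not None and 1 <= count <= 3
```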
| |
| ### "scheme-less" |
| |
| curl supports "URLs" that do not start with a scheme. This is not supported by |
| any of the specifications. This is a shortcut to entering URLs that was |
| supported by browsers early on and has been mimicked by curl. |
| |
| Based on what the host name starts with, curl will "guess" what protocol to |
| use: |
| |
| - `ftp.` means FTP |
| - `dict.` means DICT |
| - `ldap.` means LDAP |
| - `imap.` means IMAP |
| - `smtp.` means SMTP |
| - `pop3.` means POP3 |
| - anything else means HTTP |
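
The guessing rule in the list above amounts to a prefix lookup. A minimal Python sketch follows; `guess_scheme` is a hypothetical name, and the case-insensitive comparison is an assumption rather than something this document states.

```python
# Host name prefixes and the protocol each one maps to, as listed above.
_PREFIX_TO_SCHEME = [
    ("ftp.", "ftp"),
    ("dict.", "dict"),
    ("ldap.", "ldap"),
    ("imap.", "imap"),
    ("smtp.", "smtp"),
    ("pop3.", "pop3"),
]


def guess_scheme(hostname: str) -> str:
    """Guess a scheme from the start of the host name; default to HTTP."""
    lowered = hostname.lower()  # assumption: the match ignores case
    for prefix, scheme in _PREFIX_TO_SCHEME:
        if lowered.startswith(prefix):
            return scheme
    return "http"
```

With this sketch, `guess_scheme("ftp.example.com")` gives `"ftp"` and `guess_scheme("www.example.com")` gives `"http"`.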
| |
| ### globbing letters |
| |
| The curl command line tool supports "globbing" of URLs. It means that you can |
| create ranges and lists using `[N-M]` and `{one,two,three}` sequences. The |
| letters used for this (`[]{}`) are reserved in RFC 3986 and can therefore not |
| legitimately be part of such a URL. |
| |
| They are however not reserved or special in the WHATWG specification, so |
| globbing can mess up such URLs. Globbing can be turned off for such occasions |
| (using `--globoff`). |
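
To show what globbing does to a URL, here is a deliberately simplified Python sketch. It handles only numeric `[N-M]` ranges and `{one,two,three}` lists; any other range forms the curl tool may accept are left out, and the order of the generated URLs is a choice of this sketch. `expand_globs` is a hypothetical name.

```python
import re
from typing import Iterator

# Matches either a {a,b,c} list or a numeric [N-M] range.
_GLOB = re.compile(r"\{([^{}]*)\}|\[(\d+)-(\d+)\]")


def expand_globs(url: str) -> Iterator[str]:
    """Yield every URL produced by expanding the glob sequences in url."""
    match = _GLOB.search(url)
    if match is None:
        yield url
        return
    head, tail = url[: match.start()], url[match.end():]
    if match.group(1) is not None:
        choices = match.group(1).split(",")
    else:
        low, high = int(match.group(2)), int(match.group(3))
        choices = [str(number) for number in range(low, high + 1)]
    for choice in choices:
        # Expand the remaining sequences to the right of this one.
        for rest in expand_globs(tail):
            yield head + choice + rest
```

This also shows why a URL that legitimately contains `{` or `[` gets altered unless globbing is turned off.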
| |
| # URL syntax details |
| |
| A URL may consist of the following components - many of them are optional: |
| |
| [scheme][divider][userinfo][hostname][port number][path][query][fragment] |
| |
| Each component is separated from the following component with a divider |
| character or string. |
| |
| For example, this could look like: |
| |
| http://user:password@www.example.com:80/index.html?foo=bar#top |
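
As an illustration of how such a URL breaks down into components, this Python sketch splits a URL of the same shape with `urllib.parse`. That is Python's parser, not curl's, so it only demonstrates the component layout; the URL itself is a made-up example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user:password@www.example.com:80/index.html?foo=bar#top")

# Collect the components under the names used in this document.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value}")
```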
| |
| ## Scheme |
| |
| The scheme specifies the protocol to use. A curl build can support a few or |
| many different schemes. You can limit what schemes curl should accept. |
| |
| curl supports the following schemes on URLs specified to transfer. They are |
| matched case insensitively: |
| |
| `dict`, `file`, `ftp`, `ftps`, `gopher`, `gophers`, `http`, `https`, `imap`, |
| `imaps`, `ldap`, `ldaps`, `mqtt`, `pop3`, `pop3s`, `rtmp`, `rtmpe`, `rtmps`, |
| `rtmpt`, `rtmpte`, `rtmpts`, `rtsp`, `smb`, `smbs`, `smtp`, `smtps`, `telnet`, |
| `tftp` |
| |
| When the URL is specified to identify a proxy, curl recognizes the following |
| schemes: |
| |
| `http`, `https`, `socks4`, `socks4a`, `socks5`, `socks5h`, `socks` |
| |
| ## Userinfo |
| |
| The userinfo field can be used to set user name and password for |
| authentication purposes in this transfer. The use of this field is discouraged |
| since it often means passing around the password in plain text and is thus a |
| security risk. |
| |
| URLs for IMAP, POP3 and SMTP also support *login options* as part of the |
| userinfo field. They're provided as a semicolon after the password and then |
| the options. |
| |
| ## Hostname |
| |
| The hostname part of the URL contains the address of the server that you want |
| to connect to. This can be the fully qualified domain name of the server, the |
| local network name of the machine on your network or the IP address of the |
| server or machine represented by either an IPv4 or IPv6 address (within |
| brackets). For example: |
| |
| http://www.example.com/ |
| |
| http://hostname/ |
| |
| http://192.168.0.1/ |
| |
| http://[2001:1890:1112:1::20]/ |
| |
| ### "localhost" |
| |
| Starting in curl 7.77.0, curl will use loopback IP addresses for the name |
| `localhost`: `127.0.0.1` and `::1`. It will not try to resolve the name using |
| the resolver functions. |
| |
| This is done to make sure the host accessed is truly the localhost - the local |
| machine. |
| |
| ### IDNA |
| |
| If curl was built with International Domain Name (IDN) support, it can also |
| handle host names using non-ASCII characters. |
| |
| When built with libidn2, curl uses the IDNA 2008 standard. This is equivalent |
| to the WHATWG URL spec, but differs from certain browsers that use IDNA 2003 |
| Transitional Processing. The two standards have a huge overlap but differ |
| slightly, perhaps most famously in how they deal with the German "double s" |
| (`ß`). |
| |
| When winidn is used, curl uses IDNA 2003 Transitional Processing, like the rest |
| of Windows. |
| |
| ## Port number |
| |
| If there's a colon after the hostname, that should be followed by the port |
| number to use, in the range 1 - 65535. curl also supports a blank port number |
| field - but only if the URL starts with a scheme. |
| |
| If the port number is not specified in the URL, curl will use a default port |
| based on the provided scheme: |
| |
| DICT 2628, FTP 21, FTPS 990, GOPHER 70, GOPHERS 70, HTTP 80, HTTPS 443, |
| IMAP 143, IMAPS 993, LDAP 389, LDAPS 636, MQTT 1883, POP3 110, POP3S 995, |
| RTMP 1935, RTMPS 443, RTMPT 80, RTSP 554, SCP 22, SFTP 22, SMB 445, SMBS 445, |
| SMTP 25, SMTPS 465, TELNET 23, TFTP 69 |
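
The fallback to a default port can be sketched as a table lookup. The Python below is an illustration only: `DEFAULT_PORTS` holds just a few entries from the list above, `effective_port` is a hypothetical name, and parsing is done by `urllib.parse` rather than by curl.

```python
from typing import Optional
from urllib.parse import urlsplit

# A few entries from the list above; extend as needed.
DEFAULT_PORTS = {
    "ftp": 21,
    "http": 80,
    "https": 443,
    "smtp": 25,
    "tftp": 69,
    "mqtt": 1883,
}


def effective_port(url: str) -> Optional[int]:
    """Return the explicit port, else the scheme default, else None."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)
```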
| |
| # Scheme specific behaviors |
| |
| ## FTP |
| |
| The path part of an FTP request specifies the file to retrieve and from which |
| directory. If the file part is omitted then libcurl downloads the directory |
| listing for the directory specified. If the directory is omitted then the |
| directory listing for the root / home directory will be returned. |
| |
| FTP servers typically put the user in its "home directory" after login, which |
| then differs between users. To explicitly specify the root directory of an FTP |
| server, start the path with a double slash `//` or `/%2f` (2F is the |
| hexadecimal value of the ASCII code for the slash). |
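
The difference between a home-relative and a root-relative FTP path can be shown with a small URL builder. This is a Python sketch under stated assumptions: `ftp_url` and its `absolute` parameter are hypothetical, and it uses the `/%2f` form described above for the root directory.

```python
from urllib.parse import quote


def ftp_url(host: str, path: str, absolute: bool = False) -> str:
    """Build an FTP URL whose path is relative to the home or root directory."""
    encoded = quote(path.lstrip("/"))  # keeps "/" and percent-encodes the rest
    prefix = "/%2f" if absolute else "/"
    return f"ftp://{host}{prefix}{encoded}"


print(ftp_url("ftp.example.com", "docs/readme.txt"))
print(ftp_url("ftp.example.com", "etc/motd", absolute=True))
```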
| |
| ## FILE |
| |
| When a `FILE://` URL is accessed on Windows systems, it can be crafted in a |
| way so that Windows attempts to connect to a (remote) machine when curl wants |
| to read or write such a path. |
| |
| curl only allows the hostname part of a FILE URL to be one out of these three |
| alternatives: `localhost`, `127.0.0.1` or blank ("", zero characters). |
| Anything else will make curl fail to parse the URL. |
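
The host name restriction can be mirrored in an application-side check. The Python below is a sketch using `urllib.parse`, not curl's parser; `ALLOWED_FILE_HOSTS` and `file_url_host_ok` are hypothetical names.

```python
from urllib.parse import urlsplit

# The three alternatives named above; "" stands for a blank host name.
ALLOWED_FILE_HOSTS = {"", "localhost", "127.0.0.1"}


def file_url_host_ok(url: str) -> bool:
    """True for a file: URL whose host is blank, localhost or 127.0.0.1."""
    parts = urlsplit(url)
    if parts.scheme != "file":  # urlsplit lowercases the scheme
        return False
    return (parts.hostname or "") in ALLOWED_FILE_HOSTS
```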
| |
| ### Windows-specific FILE details |
| |
| curl accepts that the FILE URL's path starts with a "drive letter". That's a |
| single letter `a` to `z` followed by a colon or a pipe character (`|`). |
| |
| The Windows operating system itself will convert some file accesses to perform |
| network accesses over SMB/CIFS, through several different file path patterns. |
| This way, a `file://` URL passed to curl *might* be converted into a network |
| access inadvertently and unknowingly to curl. This is a Windows feature curl |
| cannot control or disable. |
| |
| ## IMAP |
| |
| The path part of an IMAP request not only specifies the mailbox to list or |
| select, but can also be used to check the `UIDVALIDITY` of the mailbox, to |
| specify the `UID`, `SECTION` and `PARTIAL` octets of the message to fetch and |
| to specify what messages to search for. |
| |
| A top level folder list: |
| |
| imap://user:password@mail.example.com |
| |
| A folder list on the user's inbox: |
| |
| imap://user:password@mail.example.com/INBOX |
| |
| Select the user's inbox and fetch message with uid = 1: |
| |
| imap://user:password@mail.example.com/INBOX/;UID=1 |
| |
| Select the user's inbox and fetch the first message in the mail box: |
| |
| imap://user:password@mail.example.com/INBOX/;MAILINDEX=1 |
| |
| Select the user's inbox, check the `UIDVALIDITY` of the mailbox is 50 and |
| fetch message 2 if it is: |
| |
| imap://user:password@mail.example.com/INBOX;UIDVALIDITY=50/;UID=2 |
| |
| Select the user's inbox and fetch the text portion of message 3: |
| |
| imap://user:password@mail.example.com/INBOX/;UID=3/;SECTION=TEXT |
| |
| Select the user's inbox and fetch the first 1024 octets of message 4: |
| |
| imap://user:password@mail.example.com/INBOX/;UID=4/;PARTIAL=0.1024 |
| |
| Select the user's inbox and check for NEW messages: |
| |
| imap://user:password@mail.example.com/INBOX?NEW |
| |
| Select the user's inbox and search for messages containing "shadows" in the |
| subject line: |
| |
| imap://user:password@mail.example.com/INBOX?SUBJECT%20shadows |
| |
| For more information about the individual components of an IMAP URL please see |
| RFC 5092. |
| |
| ## LDAP |
| |
| The path part of an LDAP request can be used to specify the Distinguished |
| Name, Attributes, Scope, Filter and Extension for an LDAP search. Each field is |
| separated by a question mark and when that field is not required an empty |
| string with the question mark separator should be included. |
| |
| Search for the DN as `My Organisation`: |
| |
| ldap://ldap.example.com/o=My%20Organisation |
| |
| The same search, but returning only postalAddress attributes: |
| |
| ldap://ldap.example.com/o=My%20Organisation?postalAddress |
| |
| Search for an empty DN and request information about the |
| `rootDomainNamingContext` attribute for an Active Directory server: |
| |
| ldap://ldap.example.com/?rootDomainNamingContext |
| |
| For more information about the individual components of an LDAP URL please |
| see [RFC 4516](https://tools.ietf.org/html/rfc4516). |
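
The question-mark separated layout can be illustrated by splitting an LDAP URL into its fields. This Python sketch uses `urllib.parse` and is not how curl handles LDAP URLs; `split_ldap_url` and the dictionary keys are hypothetical names, and omitted fields are returned as empty strings.

```python
from urllib.parse import unquote, urlsplit

# Field order after the Distinguished Name, as described above.
LDAP_FIELDS = ("attributes", "scope", "filter", "extensions")


def split_ldap_url(url: str) -> dict:
    """Split an LDAP URL into the DN and the '?'-separated fields."""
    parts = urlsplit(url)
    result = {"dn": unquote(parts.path.lstrip("/"))}
    values = parts.query.split("?") if parts.query else []
    for index, name in enumerate(LDAP_FIELDS):
        raw = values[index] if index < len(values) else ""
        result[name] = unquote(raw)
    return result


print(split_ldap_url("ldap://ldap.example.com/o=My%20Organisation?postalAddress"))
```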
| |
| ## POP3 |
| |
| The path part of a POP3 request specifies the message ID to retrieve. If the |
| ID is not specified then a list of waiting messages is returned instead. |
| |
| ## SCP |
| |
| The path part of an SCP URL specifies the path and file to retrieve or |
| upload. The file is taken as an absolute path from the root directory on the |
| server. |
| |
| To specify a path relative to the user's home directory on the server, prepend |
| `~/` to the path portion. |
| |
| ## SFTP |
| |
| The path part of an SFTP URL specifies the file to retrieve or upload. If the |
| path ends with a slash (`/`) then a directory listing is returned instead of a |
| file. If the path is omitted entirely then the directory listing for the root |
| / home directory will be returned. |
| |
| ## SMB |
| |
| The path part of an SMB request specifies the file to retrieve and from what |
| share and directory or the share to upload to and as such, may not be omitted. |
| If the user name is embedded in the URL then it must contain the domain name |
| and as such, the backslash must be URL encoded as `%5c`. |
| |
| curl supports SMB version 1 (only). |
| |
| ## SMTP |
| |
| The path part of an SMTP request specifies the host name to present during |
| communication with the mail server. If the path is omitted, then libcurl will |
| attempt to resolve the local computer's host name. However, this may not |
| return the fully qualified domain name that is required by some mail servers |
| and specifying this path allows you to set an alternative name, such as your |
| machine's fully qualified domain name, which you might have obtained from an |
| external function such as gethostname or getaddrinfo. |
| |
| The default SMTP port is 25. Some servers use port 587 as an alternative. |
| |
| ## RTMP |
| |
| There's no official URL spec for RTMP so libcurl uses the URL syntax supported |
| by the underlying librtmp library. It has a syntax where it wants a |
| traditional URL, followed by a space and a series of space-separated |
| `name=value` pairs. |
| |
| While a space is not typically a "legal" character, libcurl accepts it. When a |
| user wants to pass in a `#` (hash) character it will be treated as a fragment |
| and get cut off by libcurl if provided literally. You will instead have to |
| escape it by providing it as backslash and its ASCII value in hexadecimal: |
| `\23`. |