libtld: /home/snapwebsites/snapcpp/contrib/libtld/src/tld.c File Reference

libtld  1.5.13
A library to determine the Top-Level Domain name of any URL.
tld.c File Reference

Implementation of the TLD parser library.

#include "libtld/tld.h"
#include "tld_data.h"
#include <malloc.h>
#include <stdlib.h>
#include <limits.h>
#include <string.h>
#include <ctype.h>

Functions

static int cmp (const char *a, const char *b, int n)
 Compare two strings, one of which is limited by length.

static int h2d (int c)
 Internal function used to transform %XX values.

int search (int i, int j, const char *domain, int n)
 Search for the specified domain.

enum tld_result tld (const char *uri, struct tld_info *info)
 Get information about the TLD for the specified URI.

enum tld_result tld_check_uri (const char *uri, struct tld_info *info, const char *protocols, int flags)
 Check that a URI is valid.

void tld_clear_info (struct tld_info *info)
 Clear the info structure.

const char * tld_version ()
 Return the version of the library.
 

Detailed Description

This file includes all the functions available in the C library of libtld that pertain to the parsing of URIs and extraction of TLDs.

Definition in file tld.c.

Function Documentation

static int cmp (const char *a, const char *b, int n)

This internal function was created to handle a simple string (no locale) comparison with one string being limited in length.

The comparison does not require a locale since all characters are ASCII (a URI with Unicode characters encodes them in UTF-8 and replaces each of those bytes with %XX.)

The length applies to the string in b. This allows us to make use of the input string all the way down to the cmp() function. In other words, we avoid a copy of the string.

The string in a is 'nul' (\0) terminated. This means a may be longer or shorter than b. In other words, the function is capable of returning the correct result with a single call.

If parameter a is "*", then it always matches b.

Parameters
[in]  a  The pointer in an f_tld field of the tld_descriptions.
[in]  b  A pointer directly into the user's domain string.
[in]  n  The number of characters that can be checked in b.
Returns
-1 if a < b, 0 when a == b, and 1 when a > b

Definition at line 328 of file tld.c.

Referenced by search().
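
The comparison rules described above can be sketched as follows. This is only an illustration written for this page, not the code found in tld.c: the name sketch_cmp() is hypothetical, and ordering unequal bytes by their unsigned value is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cmp(): `a` is nul-terminated, while only
 * the first `n` bytes of `b` are considered. */
static int sketch_cmp(const char *a, const char *b, int n)
{
    assert(a != NULL && b != NULL && n >= 0);

    /* a lone "*" is a wildcard that matches any b */
    if(a[0] == '*' && a[1] == '\0')
    {
        return 0;
    }

    for(; n > 0; ++a, ++b, --n)
    {
        if(*a == '\0')
        {
            /* a ended first, so a is a strict prefix of b: a < b */
            return -1;
        }
        if(*a != *b)
        {
            /* assumption: order by unsigned byte value */
            return (unsigned char) *a < (unsigned char) *b ? -1 : 1;
        }
    }

    /* b is exhausted; the strings are equal only if a ends here too */
    return *a == '\0' ? 0 : 1;
}
```

With this sketch, comparing the entry "co" against the first two bytes of "co.uk" reports a match, whereas "com" against the same two bytes reports that a is greater. No copy of the user string is needed in either case.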

static int h2d (int c)

This function transforms a hexadecimal (h) character to (2) a decimal number (d).

Parameters
[in]  c  The hexadecimal character to transform.
Returns
The number the hexadecimal character represents (0 to 15)

Definition at line 686 of file tld.c.

Referenced by tld_check_uri().
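
As an illustration of how such a conversion is typically used to decode a %XX sequence, here is a minimal sketch. The names sketch_h2d() and sketch_decode_percent() are hypothetical, and returning -1 for a character that is not a hexadecimal digit is a convention of this sketch only; the documentation above does not say what h2d() does in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for h2d(): returns 0 to 15 for a hexadecimal
 * digit, or -1 otherwise (the -1 is this sketch's own convention). */
static int sketch_h2d(int c)
{
    if(c >= '0' && c <= '9')
    {
        return c - '0';
    }
    if(c >= 'a' && c <= 'f')
    {
        return c - 'a' + 10;
    }
    if(c >= 'A' && c <= 'F')
    {
        return c - 'A' + 10;
    }
    return -1;
}

/* Decode the three characters "%XX" found at `s` into one byte value.
 * Returns -1 when `s` does not start with a well-formed sequence. */
static int sketch_decode_percent(const char *s)
{
    int hi;
    int lo;

    assert(s != NULL);

    if(s[0] != '%')
    {
        return -1;
    }
    hi = sketch_h2d((unsigned char) s[1]);
    if(hi < 0)
    {
        /* also covers a string that ends right after the '%' */
        return -1;
    }
    lo = sketch_h2d((unsigned char) s[2]);
    if(lo < 0)
    {
        return -1;
    }
    return hi * 16 + lo;
}
```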

int search (int i, int j, const char *domain, int n)

This function executes one search for one domain. The search is binary, which means the tld_descriptions are expected to be fully sorted at every level.

The i and j parameters represent the boundaries of the current level to be checked. For a given TLD, there is a start and an end boundary that is used to define i and j. So except for the top level, the bounds are limited to one TLD, sub-TLD, etc. (For example, .uk has a sub-layer with .co, .ac, etc., and that range is limited to the second-level entries accepted within the .uk TLD.)

This function performs one search at one level. If sub-levels are available for that TLD, it is the responsibility of the caller to call the function again to find out whether one of those sub-domain names is in use.

When the TLD cannot be found, the function returns -1.

Parameters
[in]  i       The start point of the search (inclusive).
[in]  j       The end point of the search (exclusive).
[in]  domain  The domain name to search.
[in]  n       The length of the domain name.
Returns
The offset of the domain found, or -1 when not found.

Definition at line 396 of file tld.c.

References cmp(), tld_description::f_tld, tld(), and tld_descriptions.

Referenced by tld().
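
The following sketch shows a binary search over the half-open range [i, j) of a sorted table, with the searched name limited to n bytes. It is an illustration only: sketch_table and sketch_search() are hypothetical, the table content is made up for the example and is not the real tld_descriptions data, and the "*" wildcard handled by cmp() is left out. The sketch also assumes that domain holds at least n bytes before its terminator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sorted level, standing in for one range of entries */
static const char * const sketch_table[] =
{
    "ac", "co", "gov", "ltd", "org"
};

/* Search entries i (included) to j (excluded) for the first n bytes
 * of `domain`; returns the index found or -1. */
static int sketch_search(int i, int j, const char *domain, int n)
{
    assert(domain != NULL && n >= 0);

    while(i < j)
    {
        int const p = i + (j - i) / 2;
        const char *a = sketch_table[p];
        int r = strncmp(a, domain, (size_t) n);
        if(r == 0 && a[n] != '\0')
        {
            /* the entry continues past n bytes, so it is greater */
            r = 1;
        }
        if(r == 0)
        {
            return p;
        }
        if(r < 0)
        {
            i = p + 1;  /* entry sorts before domain, look right */
        }
        else
        {
            j = p;      /* entry sorts after domain, look left */
        }
    }

    return -1;
}
```

Calling sketch_search(0, 5, "co.uk", 2) looks only at "co" and finds it at index 1. Narrowing the bounds to (2, 5) excludes that entry, which is how a caller would restrict a search to the entries of a single TLD.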

enum tld_result tld (const char *uri, struct tld_info *info)

The tld() function searches for the specified URI in the TLD descriptions. The results are saved in the info parameter for later interpretation (i.e. extraction of the domain name, sub-domains and the exact TLD.)

The function extracts the last extension of the URI. For example, in the following:

    example.co.uk

the function first extracts ".uk". With that extension, it searches the list of official TLDs. If not found, an error is returned and the info parameter is set to unknown.

When found, the function checks whether that TLD (".uk" in our previous example) accepts sub-TLDs (second, third, fourth and fifth level TLDs.) If so, it extracts the next TLD entry (the ".co" in our previous example) and searches for that second level TLD. If found, it tries again with the third level, etc. until all the possible TLDs are exhausted. At that point, it returns the last TLD it found. In the case of ".co.uk", it returns the information of the ".co" entry, the second-level TLD.
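
The right-to-left extraction of labels described above might look like the following. This is a sketch only: struct sketch_label and sketch_label_from_right() are hypothetical names, and the real tld() additionally looks each label up in the TLD descriptions before deciding whether to go one level further.

```c
#include <assert.h>
#include <string.h>

/* Position of one label inside the URI (hypothetical helper type) */
struct sketch_label
{
    int offset;     /* index of the first character, -1 if none */
    int length;     /* number of characters in the label */
};

/* Find the label `level` positions from the right: 0 is the last
 * label ("uk"), 1 the one before it ("co"), and so on. */
static struct sketch_label sketch_label_from_right(const char *uri, int level)
{
    struct sketch_label result = { -1, 0 };
    int end;

    assert(uri != NULL && level >= 0);

    end = (int) strlen(uri);
    while(end > 0)
    {
        /* walk back to the period preceding this label, if any */
        int start = end;
        while(start > 0 && uri[start - 1] != '.')
        {
            --start;
        }
        if(level == 0)
        {
            result.offset = start;
            result.length = end - start;
            return result;
        }
        --level;
        end = start - 1;    /* skip the period itself */
    }

    return result;
}
```

For "example.co.uk", level 0 is "uk" at offset 11, level 1 is "co" at offset 8, and level 2 is "example" at offset 0. A request for level 3 reports that no label is left.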

All the comparisons are done in lowercase. This is because all the data is saved in lowercase and we expect the input of the tld() function to already be in lowercase. If in doubt, and your input may contain uppercase characters, make sure to call the tld_domain_to_lowercase() function first. That function makes a duplicate of your domain name in lowercase. It understands the %XX characters (since the URI is expected to still be encoded) and properly handles UTF-8 characters in order to define the lowercase characters of the input. Note that the function returns a newly allocated pointer that you are responsible for freeing once you are done with it.

Warning
If you call tld() with the pointer returned by tld_domain_to_lowercase(), keep in mind that the tld() function saves pointers into the input string directly in the tld_info structure. In other words, you want to free() that string AFTER you are done with the tld_info structure.

The info structure includes:

  • f_category – the category of the TLD; unless set to TLD_CATEGORY_UNDEFINED, it is considered valid
  • f_status – the status of the TLD; unless set to TLD_STATUS_UNDEFINED, it was defined from the tld_data.xml file. Only TLDs marked as TLD_STATUS_VALID are considered to currently be in use; the other statuses can be used by your software, one way or another, but such TLDs should not be accepted as valid in a URI
  • f_country – if the category is set to TLD_CATEGORY_COUNTRY then this pointer is set to the name of the country
  • f_tld – is set to the full TLD of your domain name; this is a pointer WITHIN your uri string so make sure you keep your URI string valid if you intend to use this f_tld string
  • f_offset – the offset to the first period within the domain name TLD (i.e. in our previous example, it would be the offset to the first period in ".co.uk", so in "example.co.uk" the offset would be 7. Assuming you prepend "www." to have the URI "www.example.co.uk" then the offset would be 11.)
Note
In our previous example, the ".uk" TLD is properly used: it includes a second level domain name (".co"). The URI "example.uk" should have returned TLD_RESULT_INVALID since .uk by itself was not supposed to be acceptable. This changed a few years ago. The good thing is that it resolves some problems, as some companies were given a simple ".uk" TLD and these were exceptions the library does not need to support anymore. There are still some countries, such as ".bd", which do not accept names registered directly at the second level, so "example.bd" does return an error (TLD_RESULT_INVALID).

Assuming that you always get valid URIs, you should get one of these results:

  • TLD_RESULT_SUCCESS – success! the URI is valid and the TLD was properly determined; use the f_tld or f_offset to extract the TLD domain and sub-domains
  • TLD_RESULT_INVALID – known TLD, but not currently valid; this result is returned when we know that the TLD is not to be accepted

Other results are returned when the input string is considered invalid.

Note
The function only accepts a bare URI, in other words: no protocol, no path, no anchor, no query string, and still URI encoded. Also, it should not start and/or end with a period or you are likely to get an invalid response. (i.e. don't use any of ".example.co.uk.", "example.co.uk.", nor ".example.co.uk")
/* TLD library -- TLD example
* Copyright (c) 2011-2019 Made to Order Software Corp. All Rights Reserved
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the
* "Software"), to deal in the Software without restriction, including
* without limitation the rights to use, copy, modify, merge, publish,
* distribute, sublicense, and/or sell copies of the Software, and to
* permit persons to whom the Software is furnished to do so, subject to
* the following conditions:
*
* The above copyright notice and this permission notice shall be included
* in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
#include "libtld/tld.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const char *uri = "WWW.Example.Co.Uk";
    char *uri_lowercase;
    struct tld_info info;
    enum tld_result r;

    if(argc > 1)
    {
        uri = argv[1];
    }

    // if your input may include uppercase characters and you
    // do not have an easy way to compute the lowercase before
    // calling tld(), call the tld_domain_to_lowercase() function
    uri_lowercase = tld_domain_to_lowercase(uri);
    r = tld(uri_lowercase, &info);
    if(r == TLD_RESULT_SUCCESS)
    {
        // search backward for the period that precedes the domain name
        const char *s = uri_lowercase + info.f_offset - 1;
        while(s > uri_lowercase)
        {
            if(*s == '.')
            {
                ++s;
                break;
            }
            --s;
        }
        // here uri_lowercase points to your sub-domains, the length is
        // "s - uri_lowercase"
        // if uri_lowercase == s then there are no sub-domains
        // s points to the domain name, the length is "info.f_tld - s"
        // and info.f_tld points to the TLD
        //
        // When TLD_RESULT_SUCCESS is returned the domain cannot be an
        // empty string; also the TLD cannot be empty, however, there
        // may be no sub-domains.
        printf("Sub-domain(s): \"%.*s\"\n", (int)(s - uri_lowercase), uri_lowercase);
        printf("Domain: \"%.*s\"\n", (int)(info.f_tld - s), s);
        printf("TLD: \"%s\"\n", info.f_tld);
        // info.f_tld points inside uri_lowercase, so free it last
        free(uri_lowercase);
        return 0;
    }

    // the TLD could not be determined
    free(uri_lowercase);
    return 1;
}
// vim: ts=4 sw=4 et
Parameters
[in]   uri   The URI to be checked.
[out]  info  A pointer to a tld_info structure to save the result.
Returns
One of the TLD_RESULT_... enumeration values.

Definition at line 555 of file tld.c.

References tld_description::f_category, tld_info::f_category, tld_info::f_country, tld_description::f_country, tld_description::f_end_offset, tld_description::f_exception_apply_to, tld_description::f_exception_level, tld_info::f_offset, tld_description::f_start_offset, tld_description::f_status, tld_info::f_status, tld_info::f_tld, search(), tld_clear_info(), tld_descriptions, tld_end_offset, tld_max_level, TLD_RESULT_BAD_URI, TLD_RESULT_INVALID, TLD_RESULT_NO_TLD, TLD_RESULT_NOT_FOUND, TLD_RESULT_NULL, TLD_RESULT_SUCCESS, tld_start_offset, TLD_STATUS_EXCEPTION, and TLD_STATUS_VALID.

Referenced by cat_ext(), snap::output_tlds(), tld_email_list::tld_email_t::parse(), PHP_FUNCTION(), snap::read_tlds(), search(), tld_object::set_domain(), tld_check_uri(), and tld_encode().

enum tld_result tld_check_uri (const char *uri, struct tld_info *info, const char *protocols, int flags)

This function very quickly parses a URI to determine whether it is valid.

Note that it does not (currently) support local naming conventions which means that a host such as "localhost" will fail the test.

The protocols parameter can be set to a comma separated list of protocol names that are considered valid. For example, for the HTTP protocol one could use "http,https". To accept any protocol, use an asterisk as in: "*". A protocol name must be composed only of letters, digits, or underscores ([0-9A-Za-z_]+) and must be at least one character long.
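
The protocol rules above can be sketched as follows. This is an illustration and not the code of tld_check_uri(): sketch_protocol_accepted() is a hypothetical name, and the case sensitive comparison against the list is an assumption since this page does not say how case is handled.

```c
#include <assert.h>
#include <string.h>

/* 1 when c is allowed in a protocol name: [0-9A-Za-z_] */
static int sketch_protocol_char(int c)
{
    return (c >= '0' && c <= '9')
        || (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z')
        || c == '_';
}

/* Returns 1 when the first n bytes of `name` form a valid protocol
 * name found in the comma separated `protocols` list, 0 otherwise. */
static int sketch_protocol_accepted(const char *name, int n, const char *protocols)
{
    const char *p;
    int i;

    assert(name != NULL && protocols != NULL);

    if(n <= 0)
    {
        return 0;   /* at least one character is required */
    }
    for(i = 0; i < n; ++i)
    {
        if(!sketch_protocol_char((unsigned char) name[i]))
        {
            return 0;
        }
    }
    if(strcmp(protocols, "*") == 0)
    {
        return 1;   /* any well-formed protocol is accepted */
    }

    for(p = protocols; *p != '\0'; )
    {
        const char *e = strchr(p, ',');
        size_t const len = e == NULL ? strlen(p) : (size_t) (e - p);
        if(len == (size_t) n && strncmp(p, name, len) == 0)
        {
            return 1;
        }
        if(e == NULL)
        {
            break;
        }
        p = e + 1;
    }

    return 0;
}
```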

The flags can be set to the following values; OR them together to set multiple flags at the same time:

  • VALID_URI_ASCII_ONLY – refuse characters that are not 7-bit ASCII (we expect the URI to be UTF-8 encoded and any byte with bit 7 set is considered invalid if this flag is set, including encoded bytes such as %A0)
  • VALID_URI_NO_SPACES – refuse spaces whether they are encoded with + or %20 or verbatim.
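
To make the effect of the two flags concrete, here is a minimal sketch of such a byte check. It is not the implementation of tld_check_uri(): the SKETCH_... constants and their values are hypothetical stand-ins for the real flags, sketch_check_bytes() is a made-up name, and rejecting a malformed %XX sequence is a choice of this sketch only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, NOT the values defined by libtld */
#define SKETCH_ASCII_ONLY   0x0001
#define SKETCH_NO_SPACES    0x0002

/* Value of one hexadecimal digit, or -1 */
static int sketch_hex(int c)
{
    if(c >= '0' && c <= '9') return c - '0';
    if(c >= 'a' && c <= 'f') return c - 'a' + 10;
    if(c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Returns 1 when every byte of `s` passes the flags, 0 otherwise */
static int sketch_check_bytes(const char *s, int flags)
{
    assert(s != NULL);

    for(; *s != '\0'; ++s)
    {
        int c = (unsigned char) *s;
        if(c == '%')
        {
            /* decode %XX so encoded bytes are checked like raw ones */
            int const hi = sketch_hex((unsigned char) s[1]);
            int const lo = hi < 0 ? -1 : sketch_hex((unsigned char) s[2]);
            if(lo < 0)
            {
                return 0;   /* malformed sequence (sketch's choice) */
            }
            c = hi * 16 + lo;
            s += 2;
        }
        else if(c == '+')
        {
            c = ' ';        /* '+' is another way to write a space */
        }
        if((flags & SKETCH_NO_SPACES) != 0 && c == ' ')
        {
            return 0;
        }
        if((flags & SKETCH_ASCII_ONLY) != 0 && c >= 0x80)
        {
            return 0;       /* bit 7 is set */
        }
    }

    return 1;
}
```

Both flags are tested independently, which is why they can be combined with a bitwise OR in a single call.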

The return value is generally TLD_RESULT_BAD_URI when an invalid character is found in the URI string. The TLD_RESULT_NULL is returned if the URI is a NULL pointer or an empty string. Other results may be returned by the tld() function. If a result other than TLD_RESULT_SUCCESS is returned then the info structure may or may not be updated.

Parameters
[in]   uri        The URI whose validity is being checked.
[out]  info       The resulting information about the URI domain and TLD.
[in]   protocols  List of comma separated protocols accepted.
[in]   flags      A set of flags to tell the function what is valid/invalid.
Returns
The result of the operation, TLD_RESULT_SUCCESS if the URI is valid.
See also
tld()
Todo:
The following is WRONG:
  • the domain %XX are not being checked properly, as it stands the characters following % can be anything!
  • the tld() function must be called with the characters still encoded; if you look at the data, you will see that I kept the data encoded (i.e. with the %XX characters)
  • what could be checked (which I guess could be for the entire domain name) is whether the entire string represents valid UTF-8; I don't think I'm currently doing so here. (I have such functions in the tld_domain_to_lowercase() now)

Definition at line 741 of file tld.c.

References tld_info::f_offset, tld_info::f_tld, h2d(), tld(), tld_clear_info(), TLD_RESULT_BAD_URI, TLD_RESULT_NULL, VALID_URI_ASCII_ONLY, and VALID_URI_NO_SPACES.

Referenced by check_uri(), and PHP_FUNCTION().

void tld_clear_info (struct tld_info *info)

This function initializes the info structure with defaults. The different TLD functions that make use of this structure will generally call this function first to represent a failure case.

Note that by default the category and status are set to undefined (TLD_CATEGORY_UNDEFINED and TLD_STATUS_UNDEFINED). Also the country and tld pointer are set to NULL and thus they cannot be used as strings.

Parameters
[out]infoThe tld_info structure to clear.

Definition at line 441 of file tld.c.

References tld_info::f_category, tld_info::f_country, tld_info::f_offset, tld_info::f_status, tld_info::f_tld, TLD_CATEGORY_UNDEFINED, and TLD_STATUS_UNDEFINED.

Referenced by tld(), and tld_check_uri().

const char* tld_version ( )

This function returns the version of this library. The version is defined with three numbers: <major>.<minor>.<patch>.

You should be able to use this version to compare different libtld versions and determine which one is the newest.

Returns
A constant string with the version of the library.

Definition at line 1043 of file tld.c.

References LIBTLD_VERSION.

Referenced by cat_ext(), main(), and tld_encode().
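
Since the version is made of three numbers, two versions have to be compared numerically: as plain strings, "1.5.13" would sort before "1.5.9". The helper below is a sketch of such a comparison; sketch_version_cmp() is a hypothetical name and is not part of libtld, and treating a missing or unreadable field as 0 is an assumption of the sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Compare two "<major>.<minor>.<patch>" strings field by field.
 * Returns -1 if a is older than b, 0 if equal, 1 if a is newer. */
static int sketch_version_cmp(const char *a, const char *b)
{
    int va[3] = { 0, 0, 0 };
    int vb[3] = { 0, 0, 0 };
    int i;

    assert(a != NULL && b != NULL);

    /* fields that cannot be read keep their default of 0 */
    (void) sscanf(a, "%d.%d.%d", &va[0], &va[1], &va[2]);
    (void) sscanf(b, "%d.%d.%d", &vb[0], &vb[1], &vb[2]);

    for(i = 0; i < 3; ++i)
    {
        if(va[i] != vb[i])
        {
            return va[i] < vb[i] ? -1 : 1;
        }
    }

    return 0;
}
```

A program could, for example, pass the string returned by tld_version() as the first argument and the minimum version it requires as the second.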

This document is part of the Snap! Websites Project.

Copyright by Made to Order Software Corp.
