TLD Library (libtld)

Sun
11/20/11

Introduction

The libtld is a C/C++ library used to extract the TLD from a URI. This allows you to extract the exact domain name, sub-domains, and all the TLDs (top level, second level, third level, etc.)

The problem with TLDs is that you cannot know where the domain starts. Some domains can use one top-level domain, others use two, three, etc. (up to five at this time). However, it may be useful to know where the domain is to have the exact list of sub-domains. For example, if you want to force www. at the start of the domain name if no other sub-domains are specified, then you need to know exactly how many TLDs are defined in the URI.

The libtld offers one main function: tld(), which gives you a way to extract the TLD from any URI. The result is the offset where the TLD starts. This gives you enough information to extract everything else you need.

Download

You can download the TLD Library source code from GitHub (source and tarballs).

We also offer ready to install packages for Ubuntu on LaunchPad: Snap CPP (packages).

Support

You got a problem with the library? An idea to improve it? Please post a ticket in the Issue Tracker on Github.

You like the libtld library?
Please spread the word by sharing a post about libtld.

Features

The library offers:

A complete world wide list of all Top-Level Domain names.
A TLD compiler so you can update the list of TLDs at any time as required whenever updates occurs ("in real-time")
One C function to find the Top-Level Domain name (TLD).
One C++ class to handle sub-domains, domains, and TLDs.
A set of C functions to check email address according to RFC 5322.
A C++ class to check lists of email addresses.
One PHP extension to access the power of libtld directly in your PHP scripts (3 functions at this time).
A complete reference of all the C and C++ declarations, functions, etc.
A few manual pages of the important functions.
A full coverage test suite to ensure the validity of the library.

PHP Functions

The PHP extension includes the following functions:

$result = check_tld($domain)

This function generates an associative array from the $domain name.

$result["result"]

The TLD_RESULT_... returned by the C function. This is a number. You can include the php_libtld.php file to get the necessary definition. On success you get TLD_RESULT_SUCCESS.

$result["category"]

The category of the domain name such as TLD_CATEGORY_COUNTRY. You can find the list of categories in php_libtld.php.

$result["status"]

The status of the domain name such as TLD_STATUS_VALID. Some TLDs are marked as TLD_STATUS_PROPOSED and you may want to accept those since they may have been accepted and are functional now.

$result["offset"]

The offset where the TLD starts in your $domain string.

Here is a small sample code that shows how to extract the TLD and domain name without the TLD in PHP:

$domain = "some.tld.here.com";
$result = check_tld($domain);
// IMPORTANT: here test that $result is valid... do not use the offset otherwise
$tld = substr($domain, $result["offset"]);
$domain_name = substr($domain, 0, $result["offset"] - 1);

If you'd like to know of sub-domains, then you can explode $domain_name and check the resulting array:

$segments = explode(".", $domain_name);
if(count($segments) == 2 && $segments[0] == "www")
{
  // do something if sub-domain is "www"
}

Note that in a few cases some domain names require a specific sub-domain. This is particularly true for the domain that allows for creating domain names with people's name. They require a persons first and last name to be included. That means the first name appears as a sub-domain although in that situation it is kind of part of the domain name.

$result["country"]

When available, the country that this domain name is tied to. For example the .co.uk domain name it tied to the United Kingdom.

$result["tld"]

On success this parameter includes the $tld (i.e. ".com", ".lol", etc.)

$result = check_uri($uri, $protocols, $flags);

This function is used to verify the validity of $uri. A URI is a protocol, a domain name, a path, parameters, and an anchor.

The $protocol parameter can be an array defining a list of accepted protocol. For example:

array("http", "https");

The available $flags are defined in php_libtld.php:

TLD_VALID_URI_ASCII_ONLY — only accept ASCII URIs

TLD_VALID_URI_NO_SPACES — return an error if the URI includes one or more spaces

The $result parameter is the same associative array that the check_tld() function above returns.

$result = check_email($emails, $flags);

The check_email() function is very useful to verify that an email address is valid, not only syntactically, but also references a valid domain name with a valid TLD.

Further, it will canonicalize the email address and the $emails parameter can actually include many emails so you can check many all at once. See the reference for further information about the $result parameter of the check_email() function.

Requirements

The development environment required CMake to generate the Makefiles.

Version 2.x uses the tldc compiler to transform the TLD .ini files in a binary that is fast to search (as before). Especially, the strings are now compressed in one superstring so any repeat doesn't waste space. As in version 1.x, the strings do not get copied by the library.

Especially, this compiler allows you to update the list of TLDs without having to recompile the library.

Version 1.x used an XML parser which required you to have Qt 5.x to compile the library from scratch, although I would offer the tld_data.c file which meant you could compile, just not update the list of TLDs.

The library itself does not have any requirements (other than a C or C++ compiler, obviously). It comes with three headers, tld.h being the important one (the other two are for the compiler and the .tld file management if you want to do such). The tld.h header includes the version which is dynamically added by cmake when generating the Makefiles. If you don't have cmake available, then you can always edit this file and replace the version numbers manually and rename the file (mv tld.h.in tld.h).

The PHP extension requires the php5-dev environment (also called Zend) to compile.

Documentation

The libtld has a small documentation since it includes only a very few functions. The documentation is available in the References section of this website.

Since version 2.x I also added manual pages. You can try the following:

man tld
man tldc
man tld.ini

The first one lists the main TLD functions (found in the libtld-doc package). The tldc and tld.ini document how to use the compiler to add your own .ini files and update the existing files (found in the libtld-compiler package).

Compiling 2.x

Like 1.4.5 to 1.6, the library requires the snapcmakemodules package. However, it does not require Qt at all. That dependency was removed 100%.

Also I updated the one line of code that used boost. That way we also do not need boost at all (even though it was only in the tests, it was still an annoyance).

I also updated the test suite to work with the default cmake/scripts/coverage (a.k.a. ./mk -c) script and the ./mk -t option. That way there is no specialization which tends to break over time in a large project like Snap C++. Especially, the new coverage script makes sure that the code is compiled and run with the Sanitizer on top of Debug.

Compiling 1.4.5 to 1.6

Newer versions have been modified to make use of an external set of CMake modules. This version requires you to down the snapcmakemodules_x.y.z.tar.gz. This then needs to be specified on the cmake command line to make it all work:

tar xf snapcmakemodules_x.y.z.tar.gz
tar xf libtld-x.y.z.tar.gz
mkdir BUILD
cd BUILD
cmake -DCMAKE_MODULE_PATH=../snapCMakeModules/Modules ../libtld
make
sudo make install

The files that you extract may include a directory with a version. Adjust the paths accordingly:

cmake -DCMAKE_MODULE_PATH=../snapCMakeModules-1.0.8/Modules ../libtld-1.4.5

At this point, though, the Debian packager does not put a version so we leave it out.

IMPORTANT NOTE: As found in the dev/coverage script, you may need to use these two additional flags to define the location of the Snap! cmake modules:

-DSnapCMakeModules_DIR:PATH="$modules"
-DSnapDoxygen_DIR:PATH="$modules"

I'm not too sure why the CMAKE_MODULE_PATH variable is not always enough for that purpose.

Compiling 1.4.2 and older

To compile the library, you need cmake. If this is your first time with cmake, do not be afraid, it is very easy to use. There are the few steps to get the library compiled:

cd to/directory/with/libtld-1.4.0.tar.gz
tar xf libtld-1.4.0.tar.gz
mkdir build
cd build
cmake ../libtld-1.4.0
make
sudo make install

The last line is not required if you do not want to install the library on your system. To change the installation path, you can use cmake this way:

cmake -DCMAKE_INSTALL_PREFIX=/absolute/path/to/somewhere/else ../libtld-1.4.0

and now the files will go under /absolute/path/tp/somwhere/else.

If you have warnings appearing, edit the main CMakeLists.txt and change the lines setting up the C and CXX flags. I already offer the ones with just -O3. Note that I will look into a way for future versions to offer a better solution (i.e. a cmake flag that can be used to switch the warnings ON instead of having them ON by default.)

Other TLD Projects

Public Suffix List

The Mozilla Foundation keeps a list of top-level domain names as a text file including comments. The project is called Public Suffix List.

I'm thinking to add a test that uses that link to check the libtld against URLs generated using this list. That way we can easily find discrepancies. From a quick look, it seems more complete, but at the same time, it looks like it may include valid URLs.

The list is very specifically used to handle cookies and prevent users from assigning a cookie at the wrong level (i.e. you may assign snapwebsites.org as the domain of a cookie, but not just .org; see Supercookie in wikipedia.)

idn library

The idn library includes support for checking TLDs in a string. Their interface is 100% in C and it includes the possibility to have additional definitions (overrides) of the TLD data.

On an Ubuntu system you can install the development library with:

sudo apt-get install libidn11-dev

At this time I did not test that library, however, I changed the name of the libtld library header from just tld.h to libtld/tld.h to avoid the header conflict (because libidn also called their TLD header tld.h).

You can find manual pages by looking up the existing functions in the /usr/include/tld.h from the libidn11-dev package and use man to find the corresponding pages. For example:

man tld_get_z

It feels like they offer a check of the characters of a domain as per each TLD defined rules.

Coverage Information

For other versions of libtld with coverage that I generated.

Changes

1.6.0

Fixed the search of TLDs with an "*" at the start.
Updated to the latest known TLD from the Public Suffix List.
Renamed master branch as main.

1.5.14

Fixed an issue with QString() & QChar().

1.5.13

Updated the doxy file.

1.5.12

Removed the dependency on libaddr, it's not required.

1.5.11

Updated to the latest known extension from the Public Suffix List.
Updated the tests to include the hex.c to the cmake.
Updated the internal test to work with the new data.
Enhanced my tld_names test so the line number can be written on errors.

1.5.10

Made code -Weffc++ compatible.

1.5.9 (see also)

Fixed the debian/copyright about the public_suffic_list.dat file (the name had changed.)
Fixed the dev/coverage so it works with the new environment.
Updated our tld_data.xml file with all the new domain names.
Fixed our tests so they all pass with the latest version of the library and we get 100% coverage.

1.5.8

Fixed spaces issues in email address names.

1.5.7

Removed space/lf/cr/tab case and now return invalid.

1.5.6

Various clean ups

1.5.5

Renamed effective_tld_names.dat with its new name: public_suffix_list.dat

1.5.4

Updated the TLDs with the newest available information.
Enhanced the test so we can support entries like "*.jp" at any level.
Now the top CMakeLists.txt reads the version from changelog.
Updated the tests accordingly.

1.5.3

Bumped copyright dates.

1.5.2

Bumped version to get my test system to upgrade properly. No changes otherwise.

1.5.1

Added the new .cpp file to dev/libtld-only-CMakeLists.txt

1.5.0

Added the tld_domain_to_lowercase() function so one can test domain names that are upper or lower case
Added a test to make sure that the version is the same in all three files that that have a copy (CMakeLists.txt, debian/changelog, and dev/libtld-only-CMakeLists.txt
Enhanced the documentation.

1.4.22

Added many more new gTLDs and updated quite a few country gTLDs that changed (Poland and Japan, and a few others got their TLD in their language.)
Parser now accepts two more characters that are not normally expected in domain names (as per RFC which I guess is not really respected.)
Updated the documentation with the newest version, although nothing changed in the code, but I wasn't sure which version I had uploaded.

1.4.21

Applied a fix to email handling.

1.4.20

Added even more new gTLDs and updated country gTLDs that changed.
Updated the tests accordingly.
Note: we changed our internal build system to avoid incrementing the patch version of the library, so again we will be moving one version at a time (i.e. next version will be 1.4.21)

1.4.17

Added new gTLDs as defined by the ICANN.
Updated various documents accordingly.
Note: intermediate versions are generated by our build system and do not include additional changes.

1.4.6

Added new gTLDs as defined by the ICANN.
Added a "brand" category.
Added a test to verify all the extensions offered by GoDaddy.
Added a script to build the source using the Debian tool dpkg-source.
Added some documentation in different files (INSTALL.txt, README...)
Updated the coverage test to work in the new environment.
Bumped the version in the CMakeLists.txt to the correct version.
Bumped version for publication to Sourceforge.net.

1.4.5

Got this version in the PPA.

1.4.4

Free the allocated memory as otherwise expected, avoiding memory leaks.

1.4.3

Got the debian directory to work with launchpad

1.4.2

Fixed the licenses in source files to MIT as it was supposed to be.
Moved the ChangeLog file to debian/changelog.
Added a debian/copyright file.
Added the validate_tld command line tool to check TLDs in shell scripts.
Changed the cmake scripts that get included by the main CMakeLists.txt.

1.4.1

Added new TLDs added in the last few months.

1.4.0 (see also)

Added the debian folder to build packages for the PPA (not working yet).
Corrected the installation process with proper components.
Fixed the ChangeLog dates and indentation for Debian.
Reorganized the package with sub-folders.
Fixed a few things so the library compiles under MS-Windows with cl.
Added support to parse email addresses.
Enhanced several of the unit tests.
Added a script to run a full coverage over the library.
Added a command line tool for scripting.
Enhanced documentation as first version was really just about tld().
dev/libtld-only-CMakeLists.txt to compile libtld on other platforms.
Known Bugs:
- The code does not properly escape non-atom characters in the resulting email
- The canonicalized email also does not escape non-atom characters
- Note that these bugs prevent you from using the results as is, but they do not prevent you from validating email addresses.

1.3.0 (see also)

Added the ChangeLog file.
Added a function to check the validity of a complete URI.
Added a C++ class to easily handle URIs in C++.
Added a PHP extension so [check_]tld() can be used in PHP.
Added a static version of the library.
Updated the TLDs as of Feb 2013.
Updated copyright notices.
Updated the tests as required.
Enhanced the tests with better errors and added tests.
Added a target to run all the TLD tests at once.
Fixed the TLD exceptions which now return a valid answer.
Fixed the Doxygen generation so we get the proper documentation.
Fixed/enhanced the documentation as I was at it.
Fixed the references to Qt through the CMakeLists.txt file.
Fixed data test so it doesn't crash if it can find its XML data file.

1.2.0

Added support for exceptions so we now properly support .uk for domains such as nic.uk, but forbid it for all the sites that are not exceptions.
Updated the tests accordingly.
Added a test for the XML file to make sure it respects the DTD.
Fixed the offsets of the data table, since these are unsigned short, -1 is not the best value to use to represent an invalid value. Instead we use USHRT_MAX now.
Completed the .us entries.

1.1.1

Added many TLDs as defined by the Public Suffix List.
Wrote a new test to check our data against the Public Suffix List.
Updated the existing tests to work with the new library.
Added a new category called Entrepreneurial (i.e. official domain names used to resale sub-domains; i.e. .uk.com.)

1.1.0

Added the version inside the tld.h header file.
Added a function to retrieve the version at run time (see tld_version().)
Added the debian packaging capability.
Fixed the documentation so it directly outputs in the correct folder.
Fixed the output folder name and file name for the documentation.
Added a printf() in each test to print out the library version.
Added a logo for the library. Nothing fancy at this point...
Added an SVG file presenting the functionality in one page.

1.0.0

First release.

Current Project

libtld padfile

Snap! Websites
An Open Source CMS System in C++

Contact Us Directly

Recent blog posts

more

Snap! A C++ Open Source CMS