TLD Library (libtld)

Introduction

The libtld is a library used to extract the TLD from a URI. Knowing the TLD lets you extract the exact domain name, the sub-domains, and every level of the TLD (top level, second level, third level, etc.).

The problem with TLDs is that you cannot tell from the URI alone where the domain name starts. Some domains sit under a single top-level domain, others under two levels, and so on. However, you need to know where the domain is to get the exact list of sub-domains. For example, if you want to force www. at the start of the domain name when no other sub-domain is specified, you need to know exactly how many TLD levels the URI contains.

The libtld offers one main function: tld(), which gives you a way to extract the TLD from any URI. The result is the offset where the TLD starts. This gives you enough information to extract everything else you need.
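As a rough illustration of why the TLD offset is enough to recover the rest, here is a minimal sketch in C. It does not use the libtld itself: toy_suffixes, toy_tld_offset() and toy_domain_offset() are hypothetical names invented for this example, and the four-entry table stands in for the complete list the library ships with.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, tiny suffix table for illustration only. Longer
 * suffixes are listed first so the most specific one wins
 * (".co.uk" before ".uk"). */
static const char *const toy_suffixes[] = { ".co.uk", ".com", ".org", ".uk" };

/* Return the offset of the TLD (including its leading period) within
 * host, or -1 if no suffix of the toy table matches. */
static int toy_tld_offset(const char *host)
{
    size_t const len = strlen(host);
    size_t i;

    for (i = 0; i < sizeof(toy_suffixes) / sizeof(toy_suffixes[0]); ++i) {
        size_t const slen = strlen(toy_suffixes[i]);
        /* the host must be longer than the suffix and end with it */
        if (len > slen && strcmp(host + len - slen, toy_suffixes[i]) == 0) {
            return (int) (len - slen);
        }
    }
    return -1;
}

/* Given the TLD offset, return the offset where the domain name
 * starts: right after the last period found before the TLD, or 0
 * when there is no sub-domain. */
static int toy_domain_offset(const char *host, int tld_offset)
{
    int i = tld_offset;

    while (i > 0 && host[i - 1] != '.') {
        --i;
    }
    return i;
}
```

With "www.example.co.uk" the TLD offset is 11 and the domain offset is 4, so everything before offset 4 is the list of sub-domains. With "example.com" the domain offset is 0, which is the case where a caller could decide to prepend www. to the name.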

Download

You can download the TLD Library from SourceForge.net.

Features

The library offers:

  • A complete worldwide list of all Top-Level Domain names.
  • One C function to find the Top-Level Domain name (TLD).
  • One C++ class to handle sub-domains, domains, and TLDs.
  • One PHP extension to access the power of the libtld directly in your PHP scripts.
  • A complete reference of all the C and C++ declarations.
  • A full coverage test suite to ensure the validity of the library.

Requirements

The development environment requires CMake to generate the Makefiles.

The XML parser makes use of C++ and Qt4.

The library itself does not have any requirements (other than a C or C++ compiler, obviously.) It comes with one header (tld.h).

Documentation

The libtld has only a small amount of documentation since it includes very few functions. The documentation is available in the References section of this website.

Other TLD Projects

Public Suffix List

The Mozilla Foundation keeps a list of top-level domain names as a text file including comments. The project is called Public Suffix List.

I'm thinking of adding a test that uses that link to check the libtld against URLs generated from this list. That way we can easily find discrepancies. From a quick look, it seems more complete, but at the same time, it looks like it may include valid URLs.

The list is used specifically to handle cookies and to prevent users from assigning a cookie at the wrong level (i.e. you may assign snapwebsites.org as the domain of a cookie, but not just .org; see Supercookie on Wikipedia.)
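The cookie rule can be sketched in a few lines of C. This is only an illustration under stated assumptions: toy_public_suffixes and toy_cookie_domain_allowed() are hypothetical names, the table holds four sample entries instead of the real Public Suffix List, and the list's wildcard and exception rules are ignored.

```c
#include <string.h>

/* Hypothetical sample of public suffixes, for illustration only. */
static const char *const toy_public_suffixes[] = { "co.uk", "com", "org", "uk" };

/* Return 1 if a cookie may be assigned to this domain, 0 if the
 * domain is empty or is itself a bare public suffix. */
static int toy_cookie_domain_allowed(const char *domain)
{
    size_t i;

    /* cookie domains are often written with a leading period */
    if (*domain == '.') {
        ++domain;
    }
    if (*domain == '\0') {
        return 0;
    }
    for (i = 0; i < sizeof(toy_public_suffixes) / sizeof(toy_public_suffixes[0]); ++i) {
        if (strcmp(domain, toy_public_suffixes[i]) == 0) {
            return 0;
        }
    }
    return 1;
}
```

Here snapwebsites.org is accepted while .org and co.uk are refused, which matches the behavior described above.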

idn library

The idn library includes support for checking TLDs in a string. Its interface is 100% C and it allows additional definitions (overrides) of the TLD data.

On an Ubuntu system you can install the development library with:

apt-get install libidn11-dev

I have not tested that library yet. However, I changed the name of the libtld library header from just tld.h to libtld/tld.h to avoid a header conflict (because libidn also calls its TLD header tld.h).

You can find the manual pages by looking up the existing functions in /usr/include/tld.h from the libidn11-dev package and then using man to read the corresponding pages. For example:

man tld_get_z

It looks like the library checks the characters of a domain name against the rules defined by each TLD.

Changes

1.3.0 (see also)

  • Added the ChangeLog file.
  • Added a function to check the validity of a complete URI.
  • Added a C++ class to easily handle URIs in C++.
  • Added a PHP extension so [check_]tld() can be used in PHP.
  • Added a static version of the library.
  • Updated the TLDs as of Feb 2013.
  • Updated copyright notices.
  • Updated the tests as required.
  • Enhanced the tests with better errors and added tests.
  • Added a target to run all the TLD tests at once.
  • Fixed the TLD exceptions which now return a valid answer.
  • Fixed the Doxygen generation so we get the proper documentation.
  • Fixed/enhanced the documentation as I was at it.
  • Fixed the references to Qt through the CMakeLists.txt file.
  • Fixed the data test so it doesn't crash if it cannot find its XML data file.

1.2.0

  • Added support for exceptions so we now properly support .uk for domains such as nic.uk, but forbid it for all the sites that are not exceptions.
  • Updated the tests accordingly.
  • Added a test for the XML file to make sure it respects the DTD.
  • Fixed the offsets of the data table. Since these are unsigned short, -1 is not the best value to use to represent an invalid value. Instead we use USHRT_MAX now.
  • Completed the .us entries.

1.1.1

  • Added many TLDs as defined by the Public Suffix List.
  • Wrote a new test to check our data against the Public Suffix List.
  • Updated the existing tests to work with the new library.
  • Added a new category called Entrepreneurial (i.e. official domain names used to resell sub-domains; e.g. .uk.com.)

1.1.0

  • Added the version inside the tld.h header file.
  • Added a function to retrieve the version at run time (see tld_version().)
  • Added the debian packaging capability.
  • Fixed the documentation so it directly outputs in the correct folder.
  • Fixed the output folder name and file name for the documentation.
  • Added a printf() in each test to print out the library version.
  • Added a logo for the library. Nothing fancy at this point...
  • Added an SVG file presenting the functionality in one page.

1.0.0

  • First release.

Snap! Websites
An Open Source CMS System in C++