How I went from trying to find the Alexa "top 1 million sites" list to writing an API for the Tranco list instead.
Trying to find the Alexa list
Some time ago I stumbled across a presentation about the usage of Rust at Cloudflare, and one of the slides mentioned that a "Cache Warmer" for 1.1.1.1 was among other things using the "Alexa 1m" list to preload the cache as a performance optimization.
While I was vaguely familiar with the existence of the Alexa top sites list, I had never really had a reason to check it out in more detail, but this triggered my interest.
So, where does one get the file that contains the top-1-million sites? This turned out to be trickier than I had anticipated. I could not find any mention of the file on the official site, where the most likely FAQ entry ("Does Alexa have a list of its top-ranked websites?") has no information regarding such a file.
Turning to Google returned a few results, like this GitHub gist referencing the URL http://s3.amazonaws.com/alexa-static/top-1m.csv.zip, which does exist, but how could I verify it was the real thing? There was also this blog post from 2016 called "Alexa top 1 Million sites is retired. Here's the Majestic Million." telling me the Alexa list had been retired, and suggesting their list instead.
Armed with the filename "top-1m.csv.zip" I could find several pages referencing the same s3 URL but no mention on the main Alexa site. Weird.
So what happened? Well, it appears that in 2016 the file was indeed removed and some time later restored due to customer feedback, though with the caveat that it was only "temporarily available again". To properly validate the file URL I then turned to the Wayback Machine, and it turns out the FAQ article mentioned above, as seen on 2016-04-04, does reference the URL in question.
For now it appears the file is still available from Alexa, even if it is not linked from their own documentation. Whether this lack of a link is due to an oversight during the restoration, or a long-term plan to push people towards the paid API so the file can be dropped more easily in the future, I do not know, but a tweet asking for the restoration of the docs appears to have gone unanswered.
All this left me with a bad feeling. Using an undocumented file that had already been dropped in the past did not feel like a solid choice going forward.
Finding Tranco
At this point I was looking for another list that felt more future-proof. Maybe I could use the Majestic Million, or maybe the Cisco Umbrella list? Which one would be the best? It turns out I was not the first person asking such questions, and I soon landed on the following article: The Tale of Website Popularity Rankings: An Extensive Analysis.
Basically, some researchers have combined several top lists and from these constructed their own list named Tranco. This felt like a reasonable choice for me, since even if the Alexa file (or any other public list) were dropped in the future I would still be able to get the combined world view of any remaining lists.
So now that I had settled on a list to use, the only thing missing was a way to query it without having to download and parse the file every time I wanted to check a ranking. This felt like a fun opportunity to write some Go!
The Tranco list API stack
What I wanted to do was periodically check for the existence of a new list; once one appeared I would ingest it locally, and then I wanted to be able to query the list contents using HTTP.
Local storage
When it came to local storage I was sure some sort of SQL backend was the appropriate solution, but which one? My first idea was to use SQLite since it is great for a self-contained service like this with very little operational overhead. After thinking about it some more I decided on PostgreSQL mostly because it would allow me to move the database to another machine in the future if I wanted. Also my current favourite blog post regarding database backends is this one.
I was initially pondering whether I should keep multiple versions of the list in the database, so the table layout has been designed with support for this in mind, but at this point I only store the latest version to keep disk usage low.
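To give a concrete idea of what I mean, a layout along these lines would support multiple versions by keying each site row on the list ID. The table and column names here are illustrative, not necessarily what tldbwriter actually uses:

package tldb

import (
	"database/sql"

	_ "github.com/lib/pq" // registers the PostgreSQL driver with database/sql
)

// Illustrative schema: one row per (list, rank) pair, keyed on the Tranco
// list ID so that multiple list versions could coexist. These names are
// assumptions, not necessarily the layout tldbwriter actually uses.
const schema = `
CREATE TABLE IF NOT EXISTS lists (
    id         text PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE IF NOT EXISTS sites (
    list_id text    NOT NULL REFERENCES lists (id) ON DELETE CASCADE,
    rank    integer NOT NULL,
    site    text    NOT NULL,
    PRIMARY KEY (list_id, rank)
);`

// createSchema sets up the tables if they do not already exist.
func createSchema(db *sql.DB) error {
	_, err := db.Exec(schema)
	return err
}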
Loading the list into the database
With the database selection out of the way, the next thing to solve was how to get the upstream list into it. The result is tldbwriter, which is a process that periodically checks for the existence of a new Tranco list. When a new list appears it fetches the ZIP file and loads it into PostgreSQL.
Basically it just sits in a loop, checking https://tranco-list.eu/top-1m-id and comparing that ID to what is available in the database to decide if an update is necessary. I have added some jitter to the sleep timer, so that two identically configured tldbwriter instances started at the same time should not hammer the upstream site simultaneously.
You could argue that jitter is implicitly introduced by the start time of the process and that adding randomness to the sleep schedule is only relevant when firing off a request at specific times of day, such as from a cron job. I still feel that spreading the load on the upstream service as much as possible is the decent thing to do for something that is running continuously.
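To illustrate the shape of that loop, here is a rough sketch of how such a check could look in Go. The callbacks for reading the stored ID and performing the import are stand-ins for the real database code, and the interval and jitter values are just examples:

package main

import (
	"io"
	"log"
	"math/rand"
	"net/http"
	"strings"
	"time"
)

// fetchListID asks the Tranco site which list ID is currently published.
func fetchListID() (string, error) {
	resp, err := http.Get("https://tranco-list.eu/top-1m-id")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

// checkLoop compares the upstream list ID with the one already stored and
// triggers an update when they differ. The currentID and update callbacks
// stand in for the real database-facing code in this sketch.
func checkLoop(interval time.Duration, currentID func() string, update func(id string) error) {
	for {
		upstream, err := fetchListID()
		if err != nil {
			log.Printf("fetching list ID: %v", err)
		} else if upstream != currentID() {
			if err := update(upstream); err != nil {
				log.Printf("updating list: %v", err)
			}
		}
		// Add up to a minute of jitter so that identically configured
		// instances drift apart over time.
		time.Sleep(interval + time.Duration(rand.Intn(60))*time.Second)
	}
}

func main() {
	checkLoop(1*time.Hour,
		func() string { return "" }, // stub: would read the stored ID from PostgreSQL
		func(id string) error { // stub: would fetch the ZIP and load it
			log.Printf("would fetch and load list %s", id)
			return nil
		},
	)
}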
When updates do occur I perform them inside an SQL transaction, so that concurrent reads from the HTTP service do not receive partial views of the data when serving requests from users.
Since an update is basically bulk loading a completely new list and then removing the old one, I saw a significant performance increase by replacing INSERT statements with COPY FROM via pq.CopyIn() as is described here.
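Put together, the update path could look roughly like the sketch below: everything happens inside one transaction, the new list is loaded with pq.CopyIn(), and the old list is dropped at the end. The table and column names follow the illustrative schema above rather than the actual tldbwriter layout:

package tldb

import (
	"database/sql"

	"github.com/lib/pq"
)

// loadList replaces the stored list with a new one inside a single
// transaction, so concurrent readers see either the old list or the new
// one, never a mix. Table and column names follow the illustrative schema
// above, not necessarily the real tldbwriter layout.
func loadList(db *sql.DB, listID string, sites []string) error {
	txn, err := db.Begin()
	if err != nil {
		return err
	}
	defer txn.Rollback() // a no-op once Commit has succeeded

	if _, err := txn.Exec("INSERT INTO lists (id) VALUES ($1)", listID); err != nil {
		return err
	}

	// COPY FROM via pq.CopyIn() is much faster than row-by-row INSERTs.
	stmt, err := txn.Prepare(pq.CopyIn("sites", "list_id", "rank", "site"))
	if err != nil {
		return err
	}
	for i, site := range sites {
		if _, err := stmt.Exec(listID, i+1, site); err != nil {
			return err
		}
	}
	// A final Exec without arguments flushes the buffered COPY data.
	if _, err := stmt.Exec(); err != nil {
		return err
	}
	if err := stmt.Close(); err != nil {
		return err
	}

	// Remove the previous list; ON DELETE CASCADE drops its site rows too.
	if _, err := txn.Exec("DELETE FROM lists WHERE id <> $1", listID); err != nil {
		return err
	}
	return txn.Commit()
}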
The HTTP API
With a database full of interesting data, the final step was to make the contents available via HTTP. The solution to this is tlapid, which connects to the database, listens for client requests and, given that the requests are valid, responds to them with JSON documents.
I do not do any pretty-print formatting of the JSON, because I feel it is better to save the extra bytes of data and avoid discussions regarding what amount of whitespace should be used for indentation to qualify as "pretty". I would much rather make people aware of tools such as jq for visually inspecting the information.
The API server has rate limit support based on golang.org/x/time/rate, so going past the configured limits will result in an HTTP 429 response. This is mainly to protect the database backend from a misbehaving client.
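As an illustration (not the actual tlapid code), a single shared limiter can be wired in as ordinary HTTP middleware, something like this:

package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// rateLimit wraps a handler and answers with HTTP 429 once the configured
// request rate is exceeded. A single limiter shared by all clients is the
// simplest way to protect the database backend.
func rateLimit(next http.Handler, limiter *rate.Limiter) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	// Allow an average of 10 requests per second with bursts of up to 20.
	limiter := rate.NewLimiter(rate.Limit(10), 20)
	http.ListenAndServe(":8080", rateLimit(mux, limiter))
}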
For dealing with TLS certificates I found the golang.org/x/crypto/acme/autocert module, which allows me to use Let's Encrypt certificates with a minimal amount of work.
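The basic autocert setup is pleasantly small; a sketch along these lines is roughly all that is needed, with the hostname and cache directory below being placeholder values rather than the real tlapid configuration:

package main

import (
	"log"
	"net/http"

	"golang.org/x/crypto/acme/autocert"
)

func main() {
	// The hostname and cache directory are placeholder values.
	m := &autocert.Manager{
		Prompt:     autocert.AcceptTOS,
		HostPolicy: autocert.HostWhitelist("tlapi.example.com"),
		Cache:      autocert.DirCache("/var/cache/tlapi-certs"),
	}

	srv := &http.Server{
		Addr:      ":https",
		TLSConfig: m.TLSConfig(),
	}

	// Answer ACME http-01 challenges on port 80; everything else there is
	// redirected to HTTPS.
	go http.ListenAndServe(":http", m.HTTPHandler(nil))

	// Certificates are obtained from Let's Encrypt on demand and cached.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}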
Since a client request results in multiple SQL queries, the server also uses transactions so that it does not see any changes committed to the database during a given request. Specifically, the transaction is requested with the Repeatable Read isolation level, which has additional guarantees in PostgreSQL as described here.
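A request handler could then read its data through something like the following sketch, where all queries behind a single response share one Repeatable Read snapshot (again using the illustrative table names from earlier, not the actual tlapid code):

package tlapi

import (
	"context"
	"database/sql"
)

// topSites reads a slice of the list inside a read-only Repeatable Read
// transaction, so the queries behind a single API response all see the
// same snapshot of the data.
func topSites(ctx context.Context, db *sql.DB, start, count int) ([]string, error) {
	txn, err := db.BeginTx(ctx, &sql.TxOptions{
		Isolation: sql.LevelRepeatableRead,
		ReadOnly:  true,
	})
	if err != nil {
		return nil, err
	}
	defer txn.Rollback()

	rows, err := txn.QueryContext(ctx,
		"SELECT site FROM sites WHERE rank >= $1 ORDER BY rank LIMIT $2",
		start, count)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var sites []string
	for rows.Next() {
		var s string
		if err := rows.Scan(&s); err != nil {
			return nil, err
		}
		sites = append(sites, s)
	}
	return sites, rows.Err()
}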
Monitoring
To get some whitebox monitoring going for the API server I use the expvar module which is natively supported by Datadog.
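Using expvar is mostly a matter of publishing some counters and letting the /debug/vars endpoint expose them as JSON. A minimal sketch, with illustrative metric names:

package main

import (
	"expvar"
	"net/http"
)

// Counters published through expvar automatically show up on /debug/vars
// as a JSON document, ready to be scraped by a monitoring agent.
var (
	requestsServed = expvar.NewInt("requests_served")
	dbErrors       = expvar.NewInt("db_errors")
)

func handler(w http.ResponseWriter, r *http.Request) {
	requestsServed.Add(1)
	w.Write([]byte("ok\n"))
}

func main() {
	http.HandleFunc("/", handler)
	// Importing expvar registers the /debug/vars handler on the default mux.
	http.ListenAndServe(":8080", nil)
}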
Testing out the API
The code is currently running at https://tlapi.sigterm.se, here are some example queries:
# Fetching the current top 10 list:
$ curl -s https://tlapi.sigterm.se/sites | jq .
# Fetching a top list with specific starting point and number of results:
$ curl -s 'https://tlapi.sigterm.se/sites?start=10&count=2' | jq .
# Fetching information based on site name:
$ curl -s https://tlapi.sigterm.se/site/google.com | jq .
# Fetching information based on rank placement:
$ curl -s https://tlapi.sigterm.se/rank/1 | jq .
Code
All the code is available on GitHub at eest/tranco-list-api.