I wanted to build something non-trivial in Go to get comfortable with its concurrency model, and a web crawler felt like the right scope — complex enough to be interesting, contained enough to actually finish. Let’s get into what I came up with.
Let’s start by defining web crawlers, MDN says it is “a program, often called a bot or robot, which systematically browses the Web to collect data from webpages. Typically search engines (e.g., Google, Bing, etc.) use crawlers to build indexes.”
Let’s take a look at a simple algorithm of the execution loop of our crawler:
The crawl loop
- Load seeds
- Normalize and dedup seeds
- Create worker pool
- Pop from frontier first link that passes host politeness check
- Fetch robots.txt
- Check popped link against robots.txt
- Parse and enqueue links from sitemaps found in robots.txt
- Send job to workers
- Worker fetches the URL following redirects
- Validate resource type (HTML only)
- Parse body and extract links
- Push unseen links to frontier
- Store page + metadata + outlinks
- Repeat from step 4 until termination condition met (either no more links or crawl fetch limit hit)
Choosing what to crawl
We can’t possibly crawl the whole web, not even the public part of it so let’s start with our selection policy, we need some “metric of importance” to choose which pages we visit first, Google, for example uses the PageRank algorithm to rank pages which in simple terms counts the number (and quality) of links pointing to a website to determine how important it is, to keep things simple we’ll stick to breadth-first search (BFS), plainly, we will visit a link (seed or parent), find outlinks (children), and then repeat the process on the children, BFS means that all the children will be visited before the grandchildren.
We will try (the try will make sense later) to crawl in order of appearance by using a queue as frontier:
type Frontier struct {
queue []Link
seen map[string]struct{}
mu sync.Mutex
}
func NewFrontier() *Frontier {
return &Frontier{
queue: []Link{},
seen: map[string]struct{}{},
}
}
func (f *Frontier) Push(link Link) bool {
// add to end of queue
}
func (f *Frontier) Pop() *Link {
// remove from top of queue
}
func (f *Frontier) Len() int {
// queue's length
}
func (f *Frontier) Peek() *Link {
// check top of queue
}
Seeds and URL normalization
The next step is the loading, normalization and deduplication of seeds. To deduplicate our seeds we need to be able to compare URLs, this naturally leads us to URL normalization Link struct that holds the original and normalized URL, something like:
type Link struct {
Original string
Referrer string
Normalized string
Host string
}
We also store the referrer and host as these will be useful later for politeness checks and to be saved in DB for future reference. We also use the New pattern to normalize links on creation:
func NewLink(original string, opts ...Option) (Link, error) {
l := &Link{
Original: original,
}
// ...
normalized, err := normalizeURL(original)
if err != nil {
return Link{}, err
}
l.Normalized = normalized
// ...
return *l, nil
}
As for the normalization itself we need to keep in mind that some pages offer many different ways to access content. A clear example of this is that many sites use combinations of query parameters to present content, these can vary in serialization, e.g. ?color=red&color=blue&color=green vs ?color[]=red&color[]=blue&color[]=green or even ?color=red,blue,green. Citing RFC 3986
“Because URIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. However, this definition of equivalence is not of much practical use, as there is no way for an implementation to compare two resources unless it has full knowledge or control of them. For this reason, determination of equivalence or difference of URIs is based on string comparison, perhaps augmented by reference to additional rules provided by URI scheme definitions. (…) Even though it is possible to determine that two URIs are equivalent, URI comparison is not sufficient to determine whether two URIs identify different resources.”
Normalization rules are divided in 3 groups, there are rules that: always preserve semantics, usually preserve semantics and change semantics. Even with all normalizations applied to URLs this isn’t a guarantee that we won’t fetch duplicate content, so a lot of this process is trusting that the webmaster of the resource you’re visiting did their due diligence, so we’ll stick to the “safe” normalization rules:
func normalizeURL(url string) (string, error) {
flags := purell.FlagLowercaseScheme |
purell.FlagLowercaseHost |
purell.FlagRemoveDefaultPort |
purell.FlagRemoveFragment |
purell.FlagDecodeUnnecessaryEscapes |
purell.FlagSortQuery |
purell.FlagRemoveDuplicateSlashes |
purell.FlagRemoveDotSegments
return purell.NormalizeURLString(url, flags)
}
Fetching concurrently
Now that we have our frontier let’s move on to our fetching logic, we want to take advantage of go concurrency to visit multiple pages at once, so we’ll go for a coordinator-worker (fan-in fan-out) pool, one coordinator dispatching jobs and collecting results and n workers fetching pages.
coordinator
var pending, pagesProcessed int
for {
limitReached := c.crawlLimit > 0 && pagesProcessed >= c.crawlLimit
if (c.frontier.Len() == 0 || limitReached) && pending == 0 {
if limitReached {
//...
} else {
//...
}
close(jobs)
return nil
}
var (
jobCh chan<- Link
next Link
)
if c.frontier.Len() > 0 && !limitReached {
link := c.frontier.PopElegible(c.politeness)
if link != nil && c.robotsCheck(ctx, link) {
c.hosts[link.Host] = time.Now()
jobCh = jobs
next = *link
} else {
retryTimer = time.After(c.crawlDelay)
}
}
select {
// if ctx is cancelled, end the crawl
case <-ctx.Done():
close(jobs)
return ctx.Err()
// if job ch is nil we skip, otherwise we send next job to workers
case jobCh <- next:
pending++
// if retryTimer ch is nil we skip, otherwise we wait time and loop
case <-retryTimer:
continue
// else we wait for results and add them to frontier
case outlinks := <-out:
pending--
if outlinks != nil {
pagesProcessed++
}
for _, ol := range outlinks {
if !c.frontier.Push(ol) {
continue
}
}
}
}
worker
jobs := make(chan Link, c.workers)
out := make(chan []Link, c.workers)
for range c.workers {
go func() {
for link := range jobs {
result := c.visit(ctx, link)
if result.Err != nil {
out <- nil
continue
}
if result.Body == nil {
out <- nil
continue
}
outstrs := make([]string, len(result.Outlinks))
for i, ol := range result.Outlinks {
outstrs[i] = ol.Normalized
}
if err := c.store.SavePage(ctx, Page{
URL: link.Normalized,
RawURL: link.Original,
Referrer: link.Referrer,
StatusCode: result.StatusCode,
HTML: string(result.Body),
Outlinks: outstrs,
FetchedAt: time.Now(),
Title: result.Title,
}); err != nil {
//...
}
out <- result.Outlinks
}
}()
}
Being polite
But what is PopElegible or robotsCheck you might ask, well we’ve arrived at our politeness policy, unless our goal was to get blocked for a DDoS attack by an angry webmaster, our crawler needs to respect the golden rule, we don’t want to overwhelm a server that isn’t ours with our (now concurrent) requests, therefore we need rules, websites and their owners have a way to express these rules via a robots.txt file like which pages are to be visited by bots and which aren’t (Allow: , Disallow: ), how often (Crawl-delay: ), and what behaviour is acceptable (Content-Signal: ) , they may also offer sitemaps (Sitemap: ) to guide you through their site. However, these policies aren’t enforced by servers but rely on compliance from visitors that is completely voluntary.
We will stick to the allowance rules and find sitemaps to add more links to our crawl, if we can’t find a robots.txt file or we fail to fetch it we’ll assume we’re allowed to visit, but we’ll add a crawl delay of our choosing per-host, that’s were PopElegible comes in, we iterate over our frontier until a link satisfies the elegible fn passed, allowing us to easily extend our politeness policy in the future, but for now limited to the following:
//...
func (f *Frontier) PopElegible(isElegible func(Link) bool) *Link {
f.mu.Lock()
defer f.mu.Unlock()
if len(f.queue) < 1 {
return nil
}
for i, link := range f.queue {
if isElegible(link) {
f.queue = append(f.queue[:i], f.queue[i+1:]...)
return &link
}
}
return nil
}
//...
//...
func (c *Crawler) politeness(link Link) bool {
if t, ok := c.hosts[link.Host]; ok {
if time.Since(t) < c.crawlDelay {
return false
}
}
return true
}
//...
Our crawler keeps a map of hosts with a timestamp of the last visit in memory and checks that it satisfies our crawl delay before proceeding, this of course means that we could find ourselves in a situation where no hosts are ready, or where we have to compare up to O(n) of the frontier every time we’re to dispatch a job, but these are edge cases we can tolerate for now, that aren’t likely to happen (at least not at a rate that will significantly impact performance) and we can minimize by choosing good seeds.
If a link passes this politeness check we move on to the robots check which is the following:
func (c *Crawler) robotsCheck(ctx context.Context, link *Link) bool {
robotsURL, err := robots.Locate(link.Normalized)
if err != nil {
// shouldn't fail because NewLink already fails if URL is invalid
return true
}
var r *robots.Robots
if cached, ok := c.robotsCache[robotsURL]; ok {
r = cached
} else {
r = c.fetchRobots(ctx, robotsURL)
c.robotsCache[robotsURL] = r
if r != nil {
c.sitemaps(ctx, r)
}
}
if r != nil && !r.Test(c.userAgent, link.Normalized) {
return false
}
return true
}
The robots standard has some rules, but generally there’s one robots file per domain at the root of the site’s hierarchy and each subdomain or scheme/port combination can have its own, therefore we fetch the robot once and cache it. So we locate the robots.txt, fetch, find the sitemaps if it has any and add the links found to the frontier and then we test the current link we’re trying to fetch against the robots to see if we’re allowed there. We’ll skip the sitemaps part as it isn’t as relevant for the crawler.
Parsing pages and extracting links
As per the actual fetching and parsing of pages and outlinks it happens as follows:
func (c *Crawler) visit(ctx context.Context, link Link) VisitResult {
start := time.Now()
resp, err := c.fetch(ctx, link.Normalized)
if err != nil {
return VisitResult{Err: err}
}
defer resp.Body.Close()
//...
ct := resp.Header.Get("Content-Type")
if ct != "" && !strings.Contains(strings.ToLower(ct), "text/html") {
return VisitResult{StatusCode: resp.StatusCode}
}
body, err := io.ReadAll(resp.Body)
if err != nil {
return VisitResult{Err: err}
}
title, outlinks, err := extract(bytes.NewReader(body), link)
if err != nil {
return VisitResult{Err: err}
}
return VisitResult{
StatusCode: resp.StatusCode,
Body: body,
Outlinks: outlinks,
Title: title,
}
}
Should be simple enough, we use our fetch function to make a request to the URL, validate the content type since we only want HTML content to extract outlinks from, and finally parse the content and extract said outlinks. Now let’s take a closer look at how we extract the links:
func extract(r io.Reader, referrer Link) (string, []Link, error) {
base, _ := url.Parse(referrer.Normalized)
z := html.NewTokenizer(r)
var links []Link
var title string
var inTitle bool
for {
tt := z.Next()
switch tt {
case html.ErrorToken:
if z.Err() == io.EOF {
return title, links, nil
}
return title, links, z.Err()
case html.TextToken:
if inTitle {
title = string(z.Text())
inTitle = false
}
case html.EndTagToken:
name, _ := z.TagName()
if string(name) == "title" {
inTitle = false
}
case html.StartTagToken, html.SelfClosingTagToken:
name, hasAttr := z.TagName()
tag := string(name)
if tag == "title" {
inTitle = true
}
if !hasAttr {
continue
}
for {
k, v, more := z.TagAttr()
if tag == "base" && string(k) == "href" {
if newBase, err := base.Parse(string(v)); err == nil {
base = newBase
}
}
if tag == "a" && string(k) == "href" {
nl, err := NewLink(resolve(string(v), base), WithReferrer(referrer.Normalized))
if err == nil {
links = append(links, nl)
}
}
if !more {
break
}
}
}
}
}
Instead of loading the complete tree using html.Parse we use the html.NewTokenizer and process each token as we go, we keep all anchor tags and get some metadata as well with <base> and <title> tags, we could also easily gather more metadata by targeting the <meta> tags. Then for each outlink found we create a new Link, updated with the information from the <base> tag and with the originally crawled link as referrer.
Storing and searching with Postgres
After obtaining all this data from a page, we can move on to storing it somewhere, and we’ve chosen Postgres because of its built-in full-text search capabilities. To achieve this we add a generated column of type tsvector, which Postgres docs define as “a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word.”, in short a way to normalize text into
CREATE TABLE IF NOT EXISTS pages (
id SERIAL PRIMARY KEY,
url TEXT UNIQUE NOT NULL,
raw_url TEXT NOT NULL,
title TEXT NOT NULL,
referrer TEXT,
status_code INTEGER,
html TEXT,
outlinks JSONB,
fetched_at TIMESTAMP WITH TIME ZONE,
last_modified TIMESTAMP WITH TIME ZONE
);
ALTER TABLE PAGES ADD COLUMN textsearch tsvector GENERATED ALWAYS AS (setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(url, '')), 'B') ||
setweight(to_tsvector('english', coalesce(html, '')), 'C')) STORED;
CREATE INDEX textsearch_idx ON pages USING GIN (textsearch);
In order to convert the text to tsvector effectively we need to specify the language its written in, were assuming that all the content we fetch is in english, we could extend our crawler to parse the lang property of the html tag or any content-language meta tag present in the head of the document, but we’ll skip it for now. We also assign weights to the different columns to prioritize them during relevance ranking, in this case, results matching the title of the document have higher relevance that ones that match the url or the body of the page. These queries are taken from the migration files, we won’t get into details of running the migrations and setting up the DB connection, these are mostly trivial, but we’re using a combination of
Now that we’ve reviewed the database layer, let’s look at the application layer. First, we decouple the store implementation from the crawler by defining a Store interface to interact with, that way we can have several implementations, and an easier time testing our code, I actually wrote this crawler twice — v1 without TDD, v2 restarting from scratch with TDD. The interface is as follows:
type Store interface {
SavePage(ctx context.Context, p Page) error
}
Simple enough, we pass a Context so we can control the execution of the operation, a Page object and expect an error back, if this was an API we would probably return the saved object to follow standards, for now we only expect an error to be returned. The Postgres implementation of this interface looks like this:
pkg/crawler/postgres_store.go:10
type PostgresStore struct {
db *sql.DB
}
func NewPostgresStore(db *sql.DB) *PostgresStore {
return &PostgresStore{db: db}
}
func (s *PostgresStore) SavePage(ctx context.Context, p Page) error {
outlinks, err := json.Marshal(p.Outlinks)
if err != nil {
return err
}
_, err = s.db.ExecContext(ctx, `
INSERT INTO pages (url, raw_url, title, referrer, status_code, html, outlinks, fetched_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
ON CONFLICT (url) DO UPDATE SET
raw_url = EXCLUDED.raw_url,
title = EXCLUDED.title,
referrer = EXCLUDED.referrer,
status_code = EXCLUDED.status_code,
html = EXCLUDED.html,
outlinks = EXCLUDED.outlinks,
fetched_at = EXCLUDED.fetched_at`,
p.URL, p.RawURL, p.Title, p.Referrer, p.StatusCode, p.HTML, outlinks, p.FetchedAt,
)
return err
}
Nothing too complicated here, we turn the outlinks into json, we pass the context to the query runner, and also make sure to add a strategy to resolve conflicts in case we run our crawler for a second time.
The search query
But how do we retrieve these results, you may ask, we implement another method on this store, not for the the crawler to use but our minimal search interface, it is as follows:
pkg/crawler/postgres_store.go:51
func (s *PostgresStore) Search(ctx context.Context, query string, limit int) (SearchResponse, error) {
// Get total count first
var totalCount int
err := s.db.QueryRowContext(ctx, `
SELECT COUNT(*)
FROM pages, websearch_to_tsquery('english', $1) query
WHERE textsearch @@ query`,
query,
).Scan(&totalCount)
if err != nil {
return SearchResponse{}, err
}
// Get limited results
rows, err := s.db.QueryContext(ctx, `
SELECT
url,
COALESCE(title, ''),
ts_headline('english', COALESCE(html, ''), query, 'StartSel=<mark>, StopSel=</mark>, MaxWords=50, MinWords=25') AS snippet,
ts_rank_cd(textsearch, query, 32) AS rank
FROM pages, websearch_to_tsquery('english', $1) query
WHERE textsearch @@ query
ORDER BY rank DESC
LIMIT $2`,
query, limit,
)
if err != nil {
return SearchResponse{}, err
}
defer rows.Close()
var results []SearchResult
for rows.Next() {
var r SearchResult
if err := rows.Scan(&r.URL, &r.Title, &r.Snippet, &r.Rank); err != nil {
return SearchResponse{}, err
}
results = append(results, r)
}
if err := rows.Err(); err != nil {
return SearchResponse{}, err
}
return SearchResponse{
Results: results,
TotalCount: totalCount,
}, nil
}
We use websearch_to_tsquery function to convert the user-provided search into a tsquery ts_headline, used to create a snippet of the content with the matches highlighted, and ts_rank_cd which we use to determine the rank of each row in terms of our query, we order the results using this rank.
Wrapping up
In closing, I used HTMX and the standard library to build a minimal web UI to search through the results, if you are interested in the details you can refer to the