Building a web crawler in Go

I wanted to build something non-trivial in Go to get comfortable with its concurrency model, and a web crawler felt like the right scope: complex enough to be interesting, contained enough to actually finish. Let’s get into what I came up with.

Let’s start by defining web crawlers, MDN says it is “a program, often called a bot or robot, which systematically browses the Web to collect data from webpages. Typically search engines (e.g., Google, Bing, etc.) use crawlers to build indexes.” ¹ They’ve become a topic of debate recently because of the AI scraping wars or what some law experts call the “Great Scrape” .

Here’s the execution loop at a high level:

The crawl loop

Load seeds
Normalize and dedup seeds
Create worker pool
Pop from frontier first link that passes host politeness check
Fetch robots.txt
Check popped link against robots.txt
Parse and enqueue links from sitemaps found in robots.txt
Send job to workers
Worker fetches the URL following redirects
Validate resource type (HTML only)
Parse body and extract links
Push unseen links to frontier
Store page + metadata + outlinks
Repeat from step 4 until termination condition met (either no more links or crawl fetch limit hit)

Choosing what to crawl

We can’t possibly crawl the whole web, not even the public part of it so let’s start with our selection policy, we need some “metric of importance” to choose which pages we visit first, Google, for example uses the PageRank algorithm to rank pages which in simple terms counts the number (and quality) of links pointing to a website to determine how important it is, to keep things simple we’ll stick to breadth-first search (BFS), plainly, we will visit a link (seed or parent), find outlinks (children), and then repeat the process on the children, BFS means that all the children will be visited before the grandchildren.

BFS visit order

We will try (the try will make sense later) to crawl in order of appearance by using a queue as frontier:

pkg/crawler/frontier.go:5

type Frontier struct {
	queue []Link
	seen  map[string]struct{}
	mu    sync.Mutex
}

func NewFrontier() *Frontier {
	return &Frontier{
		queue: []Link{},
		seen:  map[string]struct{}{},
	}
}

func (f *Frontier) Push(link Link) bool {
	// add to end of queue
}

func (f *Frontier) Pop() *Link {
	// remove from top of queue
}

func (f *Frontier) Len() int {
	// queue's length
}

func (f *Frontier) Peek() *Link {
	// check top of queue
}

Seeds and URL normalization

The next step is the loading, normalization and deduplication of seeds. To deduplicate our seeds we need to be able to compare URLs, this naturally leads us to URL normalization ² so we come up with a Link struct that holds the original and normalized URL, something like:

pkg/crawler/link.go:10

type Link struct {
	Original   string
	Referrer   string
	Normalized string
	Host       string
}

We also store the referrer and host as these will be useful later for politeness checks and to be saved in DB for future reference. We also use the New pattern to normalize links on creation:

pkg/crawler/link.go:25

func NewLink(original string, opts ...Option) (Link, error) {
	l := &Link{
		Original: original,
	}

	//	...

	normalized, err := normalizeURL(original)
	if err != nil {
		return Link{}, err
	}
	l.Normalized = normalized

	//	...

	return *l, nil
}

As for the normalization itself we need to keep in mind that some pages offer many different ways to access content. A clear example of this is that many sites use combinations of query parameters to present content, these can vary in serialization, e.g. ?color=red&color=blue&color=green vs ?color[]=red&color[]=blue&color[]=green or even ?color=red,blue,green. Citing RFC 3986 ³:

“Because URIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. However, this definition of equivalence is not of much practical use, as there is no way for an implementation to compare two resources unless it has full knowledge or control of them. For this reason, determination of equivalence or difference of URIs is based on string comparison, perhaps augmented by reference to additional rules provided by URI scheme definitions. (…) Even though it is possible to determine that two URIs are equivalent, URI comparison is not sufficient to determine whether two URIs identify different resources.”

Normalization rules are divided in 3 groups, there are rules that: always preserve semantics, usually preserve semantics and change semantics. Even with all normalizations applied to URLs this isn’t a guarantee that we won’t fetch duplicate content, so a big part of this process is trusting that the webmaster of the resource you’re visiting did their due diligence, so we’ll stick to the “safe” normalization rules:

pkg/crawler/link.go:49

func normalizeURL(url string) (string, error) {
	flags := purell.FlagLowercaseScheme |
		purell.FlagLowercaseHost |
		purell.FlagRemoveDefaultPort |
		purell.FlagRemoveFragment |
		purell.FlagDecodeUnnecessaryEscapes |
		purell.FlagSortQuery |
		purell.FlagRemoveDuplicateSlashes |
		purell.FlagRemoveDotSegments

	return purell.NormalizeURLString(url, flags)
}

Fetching concurrently

Now that we have our frontier let’s move on to fetching the pages, we want to take advantage of Go’s concurrency patterns to visit multiple pages at once, so we’ll implement a coordinator-worker (fan-in fan-out) pool, one coordinator dispatching jobs and collecting results and n workers fetching pages.

Coordinator-worker interaction

coordinator

pkg/crawler/crawler.go:148

func (c *Crawler) coordinator(ctx context.Context, jobs chan<- Link, out <-chan outcome) error {
	var retryTimer <-chan time.Time
	var pending, pagesProcessed int
	for {
		limitReached := c.crawlLimit > 0 && pagesProcessed >= c.crawlLimit

		if (c.frontier.Len() == 0 || limitReached) && pending == 0 {
			if limitReached {
				// crawl limit reached
			} else {
				// crawl complete
			}
			close(jobs)
			return nil
		}

		var (
			jobCh chan<- Link
			next  Link
		)

		if c.frontier.Len() > 0 && !limitReached {
			link := c.frontier.PeekEligible(c.politeness)
			if link != nil && c.robotsCheck(ctx, link) {
				jobCh = jobs
				next = *link
			} else {
				if link != nil {
					c.frontier.Remove(*link)
				}
				retryTimer = time.After(c.crawlDelay)
			}
		}

		select {
		case <-ctx.Done():
			close(jobs)
			return ctx.Err()

		case jobCh <- next:
			c.frontier.Remove(next)
			c.hosts[next.Host] = time.Now()
			pending++

		case <-retryTimer:
			continue

		case o := <-out:
			pending--
			if o.processed {
				pagesProcessed++
			}
			for _, ol := range o.outlinks {
				if !c.frontier.Push(ol) {
					continue
				}
			}
		}
	}
}

worker

pkg/crawler/crawler.go:211

func (c *Crawler) worker(ctx context.Context, jobs <-chan Link, out chan<- outcome) {
	for link := range jobs {
		result := c.visit(ctx, link)
		if result.Err != nil {
			//...
			out <- outcome{}
			continue
		}

		if result.Body == nil {
			//...
			out <- outcome{}
			continue
		}

		outstrs := make([]string, len(result.Outlinks))
		for i, ol := range result.Outlinks {
			outstrs[i] = ol.Normalized
		}

		if err := c.store.SavePage(ctx, Page{
			URL:        link.Normalized,
			RawURL:     link.Original,
			Referrer:   link.Referrer,
			StatusCode: result.StatusCode,
			HTML:       string(result.Body),
			Outlinks:   outstrs,
			FetchedAt:  time.Now(),
			Title:      result.Title,
		}); err != nil {
			//...
		}
		out <- outcome{outlinks: result.Outlinks, processed: true}
	}
}

Being polite

But what is PeekEligible or robotsCheck you might ask, well we’ve arrived at our politeness policy, unless our goal was to get blocked for a DDoS attack by an angry webmaster, our crawler needs to respect the golden rule, we don’t want to overwhelm a server that isn’t ours with our (now concurrent) requests, therefore we need rules, websites and their owners have a way to express these rules via a robots.txt file like which pages are to be visited by bots and which aren’t (Allow:, Disallow:), how often (Crawl-delay:), and what behaviour is acceptable (Content-Signal:) , they may also offer sitemaps (Sitemap:) to guide you through their site. However, these policies aren’t enforced by servers but rely on compliance from visitors that is completely voluntary. We will stick to the allowance rules and find sitemaps to add more links to our crawl, if we can’t find a robots.txt file or we fail to fetch it we’ll assume we’re allowed to visit, but we’ll add a crawl delay of our choosing per-host, that’s where PeekEligible comes in, we iterate over our frontier until a link satisfies the eligible fn passed, allowing us to easily extend our politeness policy in the future, but for now limited to the following:

pkg/crawler/crawler.go:260

//...
func (c *Crawler) politeness(link Link) bool {
	if t, ok := c.hosts[link.Host]; ok {
		if time.Since(t) < c.crawlDelay {
			return false
		}
	}

	return true
}
//...

PeekEligible doesn’t remove the link from the queue, the coordinator does that explicitly with Remove, only once a job is actually handed off to a worker, or once we know a link is permanently disallowed by robots.txt. If we removed it eagerly inside PeekEligible and the coordinator’s select happened to pick a different case that loop iteration, we’d lose the link for good instead of just retrying it next time around.

pkg/crawler/frontier.go:57

func (f *Frontier) PeekEligible(isEligible func(Link) bool) *Link {
	f.mu.Lock()
	defer f.mu.Unlock()

	for _, link := range f.queue {
		if isEligible(link) {
			l := link
			return &l
		}
	}
	return nil
}

pkg/crawler/frontier.go:70

func (f *Frontier) Remove(link Link) {
	f.mu.Lock()
	defer f.mu.Unlock()

	for i, l := range f.queue {
		if l.Normalized == link.Normalized {
			f.queue = append(f.queue[:i], f.queue[i+1:]...)
			return
		}
	}
}

Our crawler keeps a map of hosts with a timestamp of the last visit in memory and checks that it satisfies our crawl delay before proceeding, this of course means that we could find ourselves in a situation where no hosts are ready, or where we have to compare up to O(n) of the frontier every time we’re to dispatch a job, but these are edge cases we can tolerate for now, that aren’t likely to happen (at least not at a rate that will significantly impact performance) and we can minimize by choosing good seeds.

If a link passes this politeness check we move on to the robots check which is the following:

pkg/crawler/crawler.go:276

func (c *Crawler) robotsCheck(ctx context.Context, link *Link) bool {
	robotsURL, err := robots.Locate(link.Normalized)
	if err != nil {
		// shouldn't fail because NewLink already fails if URL is invalid
		return true
	}

	var r *robots.Robots
	if cached, ok := c.robotsCache[robotsURL]; ok {
		r = cached
	} else {
		r = c.fetchRobots(ctx, robotsURL)
		c.robotsCache[robotsURL] = r
		if r != nil {
			c.sitemaps(ctx, r)
		}
	}

	if r != nil && !r.Test(c.userAgent, link.Normalized) {
		return false
	}

	return true
}

The robots standard has some rules, but generally there’s one robots file per domain at the root of the site’s hierarchy and each subdomain or scheme/port combination can have its own, therefore we fetch the robot once and cache it. So we locate the robots.txt, fetch the sitemaps if it has any, add the sitemap links to the frontier and finally, test the current link we’re trying to fetch against the robots to see if we’re allowed there. We’ll skip the sitemaps parsing logic for the sake of brevity.

Parsing pages and extracting links

As for the actual fetching and parsing of pages and outlinks it happens as follows:

pkg/crawler/crawler.go:390

func (c *Crawler) visit(ctx context.Context, link Link) VisitResult {
	start := time.Now()
	resp, err := c.fetch(ctx, link.Normalized)
	if err != nil {
		return VisitResult{Err: err}
	}
	defer resp.Body.Close()

	//...

	ct := resp.Header.Get("Content-Type")
	if ct != "" && !strings.Contains(strings.ToLower(ct), "text/html") {
		return VisitResult{StatusCode: resp.StatusCode}
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return VisitResult{Err: err}
	}

	title, outlinks, err := extract(bytes.NewReader(body), link)
	if err != nil {
		return VisitResult{Err: err}
	}

	return VisitResult{
		StatusCode:   resp.StatusCode,
		Body:         body,
		Outlinks:     outlinks,
		Title:        title,
	}
}

We use our fetch function to make a request to the URL, validate the content type since we only want HTML content to extract outlinks from, and finally parse the content and extract said outlinks. Now let’s take a closer look at how we extract the links:

pkg/crawler/crawler.go:432

func extract(r io.Reader, referrer Link) (string, []Link, error) {
	base, _ := url.Parse(referrer.Normalized)
	z := html.NewTokenizer(r)
	var links []Link
	var title string
	var inTitle bool

	for {
		tt := z.Next()
		switch tt {
		case html.ErrorToken:
			if z.Err() == io.EOF {
				return title, links, nil
			}
			return title, links, z.Err()

		case html.TextToken:
			if inTitle {
				title = string(z.Text())
				inTitle = false
			}

		case html.EndTagToken:
			name, _ := z.TagName()
			if string(name) == "title" {
				inTitle = false
			}

		case html.StartTagToken, html.SelfClosingTagToken:
			name, hasAttr := z.TagName()
			tag := string(name)

			if tag == "title" {
				inTitle = true
			}

			if !hasAttr {
				continue
			}

			for {
				k, v, more := z.TagAttr()
				if tag == "base" && string(k) == "href" {
					if newBase, err := base.Parse(string(v)); err == nil {
						base = newBase
					}
				}
				if tag == "a" && string(k) == "href" {
					nl, err := NewLink(resolve(string(v), base), WithReferrer(referrer.Normalized))
					if err == nil {
						links = append(links, nl)
					}
				}
				if !more {
					break
				}
			}
		}
	}
}

Instead of loading the complete tree using html.Parse we use the html.NewTokenizer and process each token as we go, we keep all anchor tags and get some metadata as well with <base> and <title> tags, we could also easily gather more metadata by targeting the <meta> tags. Then for each outlink found we create a new Link, updated with the information from the <base> tag and with the originally crawled link as referrer.

Storing and searching with Postgres

After obtaining all this data from a page, we can move on to storing it somewhere, and we’ve chosen Postgres because of its built-in capabilities for full-text search. To achieve this we add a generated column of type tsvector, which Postgres documentation defines as “a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word.”, in short a way to normalize text into lexemes , which collapse morphological variants of the same word into one symbol (run, ran, running, etc.), this alone doesn’t optimize for search, the difference maker is the Generalized Inverted Index (GIN) we create on the column, they function like an alphabetical list in the back of a book, there are words with the locations in the book where they can be found, in the case of GIN it uses the tokenizable data in the column and keeps references to the rows where they appear. Let’s take a look at our database schema:


CREATE TABLE IF NOT EXISTS pages (
    id         SERIAL PRIMARY KEY,
    url        TEXT UNIQUE NOT NULL,
    raw_url    TEXT NOT NULL,
    title TEXT NOT NULL,
    referrer   TEXT,
    status_code INTEGER,
    html       TEXT,
    outlinks   JSONB,
    fetched_at TIMESTAMP WITH TIME ZONE,
    last_modified TIMESTAMP WITH TIME ZONE
);

ALTER TABLE PAGES ADD COLUMN textsearch tsvector GENERATED ALWAYS AS (setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(url, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(html, '')), 'C')) STORED;

CREATE INDEX textsearch_idx ON pages USING GIN (textsearch);

In order to convert the text to tsvector effectively we need to specify the language its written in, we’re assuming that all the content we fetch is in english, we could extend our crawler to parse the lang property of the html tag or any content-language meta tag present in the head of the document, but we’ll skip it for now. We also assign weights to the different columns to prioritize them during relevance ranking, in this case, results matching the title of the document have higher relevance than ones that match the url or the body of the page. These queries are taken from the migration files, we won’t get into details of running the migrations and setting up the DB connection, these are mostly trivial, but we’re using a combination of golang-migrate, , lib/pq , and the std library sql, .

Now that we’ve reviewed the database layer, let’s look at the application layer. First, we decouple the store implementation from the crawler by defining a Store interface to interact with, that way we can have several implementations, and an easier time testing our code, I actually wrote this crawler twice: v1 without TDD, v2 restarting from scratch with TDD. The interface is as follows:

pkg/crawler/crawler.go:58

type Store interface {
	SavePage(ctx context.Context, p Page) error
}

Simple enough, we pass a Context so we can control the execution of the operation, a Page object and expect an error back, if this was an API we would probably return the saved object to follow standards, for now we only expect an error to be returned. The Postgres implementation of this interface looks like this:

pkg/crawler/postgres_store.go:10

type PostgresStore struct {
	db *sql.DB
}

func NewPostgresStore(db *sql.DB) *PostgresStore {
	return &PostgresStore{db: db}
}

func (s *PostgresStore) SavePage(ctx context.Context, p Page) error {
	outlinks, err := json.Marshal(p.Outlinks)
	if err != nil {
		return err
	}
	_, err = s.db.ExecContext(ctx, `
		INSERT INTO pages (url, raw_url, title, referrer, status_code, html, outlinks, fetched_at)
		VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
		ON CONFLICT (url) DO UPDATE SET
			raw_url = EXCLUDED.raw_url,
			title = EXCLUDED.title,
			referrer = EXCLUDED.referrer,
			status_code = EXCLUDED.status_code,
			html = EXCLUDED.html,
			outlinks = EXCLUDED.outlinks,
			fetched_at = EXCLUDED.fetched_at`,
		p.URL, p.RawURL, p.Title, p.Referrer, p.StatusCode, p.HTML, outlinks, p.FetchedAt,
	)

	return err
}

Nothing too complicated here, we turn the outlinks into json, we pass the context to the query runner, and also make sure to add a strategy to resolve conflicts in case we run our crawler multiple times.

The search query

But how do we retrieve these results, you may ask, we implement another method on this store, not for the crawler to use but our search interface, it is as follows:

pkg/crawler/postgres_store.go:51

func (s *PostgresStore) Search(ctx context.Context, query string, limit int) (SearchResponse, error) {
	// Get total count first
	var totalCount int
	err := s.db.QueryRowContext(ctx, `
		SELECT COUNT(*)
		FROM pages, websearch_to_tsquery('english', $1) query
		WHERE textsearch @@ query`,
		query,
	).Scan(&totalCount)
	if err != nil {
		return SearchResponse{}, err
	}

	// Get limited results
	rows, err := s.db.QueryContext(ctx, `
		SELECT
			url,
			COALESCE(title, ''),
			ts_headline('english', COALESCE(html, ''), query, 'StartSel=<mark>, StopSel=</mark>, MaxWords=50, MinWords=25') AS snippet,
			ts_rank_cd(textsearch, query, 32) AS rank
		FROM pages, websearch_to_tsquery('english', $1) query
		WHERE textsearch @@ query
		ORDER BY rank DESC
		LIMIT $2`,
		query, limit,
	)
	if err != nil {
		return SearchResponse{}, err
	}
	defer rows.Close()

	var results []SearchResult
	for rows.Next() {
		var r SearchResult
		if err := rows.Scan(&r.URL, &r.Title, &r.Snippet, &r.Rank); err != nil {
			return SearchResponse{}, err
		}
		results = append(results, r)
	}

	if err := rows.Err(); err != nil {
		return SearchResponse{}, err
	}

	return SearchResponse{
		Results:    results,
		TotalCount: totalCount,
	}, nil
}

We use websearch_to_tsquery function to convert the user-provided search into a tsquery , which is a data type that stores lexemes separated by operators, this simply enables us to search with syntax commonly used with search engines, e.g. the following search ‘“vegetarian” cafes -brooklyn’ would return all the rows that include vegetarian (quoted terms are still stemmed, not exact-matched), cafes and its stemmed variants like cafe, and exclude the ones that mention Brooklyn, you can learn more about it here . Also interesting are the ts_headline, used to create a snippet of the content with the matches highlighted, and ts_rank_cd which we use to determine the rank of each row in terms of our query, we order the results using this rank.

Wrapping up

I chose not to implement certain functionality myself, like URI normalization, HTML document parsing, and robots.txt parsing, because it’s already available and well tested, either in the std library or community packages, and instead chose to focus on the concurrency model, frontier traversal logic and the full-text search. For the search interface I used HTMX and the standard library to build a minimal web UI to search through the results, if you are interested in the details you can refer to the source code, specifically here . This was a fun experience, and my plan after this project is to get into more low-level systems programming, so I’m eyeing either xv6, or maybe write a compiler with either C or Rust. Stay tuned for more.