Building a web-scraper in Postgres
Published on Jan 16, 2025
For a small talk I wanted to demonstrate that you can pack more business logic into Postgres than some people would guess. It might not be a good idea for most cases, but it's certainly fun.
The most important part is the Postgres extension pgsql-http, which lets us make HTTP calls directly from within Postgres. It isn't included by default, so we'll have to build a custom Docker image that adds the dependency. A simple Dockerfile like the following will do.
Install extension
FROM postgres:17
RUN apt-get update \
&& apt-get install -y curl \
&& apt-get -y install postgresql-17-http
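Once the container is up, you can check that the package actually shipped the extension files before trying to create anything:

-- Should return one row if the http extension is available
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'http';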
The extension still has to be created explicitly; note that the target schema (scrapedb here) has to exist first. Afterwards the pg_extension catalog shows whether it was installed successfully.
CREATE EXTENSION http SCHEMA scrapedb;
SELECT * FROM pg_extension;
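With the extension created we can do a quick smoke test. Because it lives in the scrapedb schema, that schema has to be on the search_path (or the call has to be schema-qualified); the URL is just a placeholder.

SET search_path TO scrapedb, public;

-- Any reachable URL will do; we only care that a status code comes back
SELECT status, content_type
FROM http_get('https://example.com');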
To implement a simple scraper we need a few things:
Entrypoint, a URL where the scrape starts.
Tasks, a list of tasks that the scraper should work on. Each of the tasks should have a status like "available", "in_flight", "finished" or "failed".
Title, a place to store the results (movie titles) we want to scrape.
A list of patterns and handlers so we know how to work on each task URL that has the status "available".
Create tables + triggers
We can start by setting up the task and title tables like the following:
CREATE TYPE status_enum AS ENUM ('failed', 'available', 'in_flight', 'finished');
CREATE TABLE task (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    status status_enum NOT NULL DEFAULT 'available',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE title (
    id SERIAL PRIMARY KEY,
    original_title TEXT,
    original_release_year INT,
    director TEXT,
    object_type TEXT NOT NULL
);
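One optional addition: the scraper later identifies tasks by their URL, so a unique constraint keeps the same page from being queued twice.

-- Optional: one task per URL
ALTER TABLE task ADD CONSTRAINT task_url_unique UNIQUE (url);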
To start a scrape we want to call a simple function that then inserts a new initial task into the task table.
CREATE OR REPLACE FUNCTION start_scrape(url_input TEXT)
RETURNS VOID AS $$
BEGIN
    INSERT INTO task (url, status)
    VALUES (url_input, 'available');
END;
$$ LANGUAGE plpgsql;
This alone wouldn't do anything, so we need a trigger that acts on every new insert into the task table. If a task with the status "available" is inserted, we want to process it. This is the main part of the scraper where we fetch the results from the web, check the status code, parse the response and spawn additional tasks if needed.
Before we do that, though, we'll have to set up a basic function that we can then re-use. process_task accepts a task URL, sets the task to in_flight, performs the HTTP request and returns the contents. If the request succeeds we set the task status to "finished" so it won't get processed again; otherwise it's marked "failed".
CREATE OR REPLACE FUNCTION process_task(task_url TEXT)
RETURNS TABLE(content JSONB, status INT) AS $$
DECLARE
    rec RECORD;
BEGIN
    -- Mark the task as taken before doing the request
    UPDATE task
    SET status = 'in_flight'
    WHERE url = task_url;

    FOR rec IN
        SELECT *
        FROM http_get(task_url)
    LOOP
        IF rec.status = 200 THEN
            UPDATE task
            SET status = 'finished'
            WHERE url = task_url;
            RETURN QUERY SELECT rec.content::jsonb, rec.status;
        ELSE
            UPDATE task
            SET status = 'failed'
            WHERE url = task_url;
            RETURN QUERY SELECT '{}'::jsonb, rec.status;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
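Before wiring this into a trigger it's worth testing the function on its own. Note that the ::jsonb cast means the endpoint has to return JSON; the URL below is just the placeholder API used throughout this post.

-- Queue a task by hand, then process it directly
INSERT INTO task (url) VALUES ('https://apis.example.com/object_type/movie?id=1');

SELECT status, content
FROM process_task('https://apis.example.com/object_type/movie?id=1');

-- The task row should now be 'finished' (or 'failed')
SELECT url, status FROM task;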
Now it's time to glue it all together: we need the function that looks at the available tasks in the task table and decides how to handle each of them. In this case we have two handlers, one that acts on the "listing" endpoint and one that fetches details for each id in the listing.
If we wanted to make this a bit more scalable, the URL matching and handler dispatch would probably be split off into a separate table/function, and we'd just do some pattern matching to find the right handler to call.
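A rough sketch of what that could look like (the table and function names here are made up for illustration): every handler gets a row with a URL pattern, and the trigger looks up the first match and dispatches to it dynamically.

-- Hypothetical dispatch table: one row per URL pattern / handler pair
CREATE TABLE handler (
    url_pattern TEXT NOT NULL,
    handler_fn  TEXT NOT NULL
);

INSERT INTO handler (url_pattern, handler_fn) VALUES
    ('%/apis.example.com/titles/top_x/object_type%', 'handle_listing'),
    ('%/apis.example.com/object_type%', 'handle_detail');

-- Inside the trigger, something along the lines of:
--   SELECT handler_fn INTO fn FROM handler WHERE NEW.url LIKE url_pattern LIMIT 1;
--   EXECUTE format('SELECT %I($1)', fn) USING NEW.url;

For the demo, though, both handlers simply live inline in the trigger function: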
CREATE OR REPLACE FUNCTION process_new_task()
RETURNS TRIGGER AS $$
DECLARE
    rec RECORD;
    payload JSONB;
BEGIN
    IF NEW.status = 'available' AND
       NEW.url LIKE '%/apis.example.com/titles/top_x/object_type%' THEN
        -- Listing handler: fetch the top list and queue a detail task per id
        WITH resp AS (
            SELECT content AS resp_body, status
            FROM process_task(NEW.url)
        ), extract_ids AS (
            SELECT array_agg((elem->>'id')::int) AS ids
            FROM (
                SELECT jsonb_array_elements(resp_body) AS elem
                FROM resp
            ) subquery
        ), tasks AS (
            SELECT 'https://apis.example.com/object_type/movie?id=' || unnest(ids) AS url
            FROM extract_ids
        )
        INSERT INTO task (url, status)
        SELECT url, 'available'::status_enum FROM tasks;
    ELSIF NEW.status = 'available' AND
          NEW.url LIKE '%/apis.example.com/object_type%' THEN
        -- Detail handler: fetch a single title and store it (skipping duplicates)
        FOR rec IN
            SELECT content::jsonb AS payload, status
            FROM process_task(NEW.url)
        LOOP
            payload := rec.payload;
            INSERT INTO title (
                original_title, original_release_year, director, object_type
            )
            SELECT
                payload->>'original_title',
                (payload->>'original_release_year')::INT,
                payload->>'director',
                payload->>'object_type'
            WHERE NOT EXISTS (
                SELECT 1
                FROM title
                WHERE object_type = payload->>'object_type'
                  AND id = (payload->>'id')::INT
            );
        END LOOP;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
Now everything is set up so we can create the most important trigger. It fires every time there's a new entry in the task table and calls process_new_task. That function will then either fetch the details for an id and store them in the title table, or match the "listing" route and insert more tasks into the task table...which in turn fires process_new_task again, until there's no "available" task left.
CREATE TRIGGER new_available_task_trigger
AFTER INSERT ON task
FOR EACH ROW
EXECUTE FUNCTION process_new_task();
Starting the scrape
The only step remaining is to test it: call start_scrape() with the entrypoint URL and watch the titles show up in the title table.
select start_scrape('https://apis.example.com/titles/top_x/object_type?genre=action');
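Once the call returns (the whole trigger chain runs synchronously inside it), a couple of quick queries show what happened:

-- Task statuses after the run
SELECT status, count(*) FROM task GROUP BY status;

-- The scraped movie titles
SELECT original_title, original_release_year, director
FROM title
ORDER BY id;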
This is obviously just a quick hack for a presentation, but there's a lot of logic we can pack into modern databases to keep things consistent and organized.