Building a web-scraper in Postgres
Published on Jan 16, 2025
For a small talk I wanted to demonstrate that you can pack more business logic into Postgres than some people would guess. It might not be a good idea for most cases, but it's certainly fun.
The most important part is the Postgres extension pgsql-http, which lets us make HTTP calls directly from within Postgres. It isn't included by default, so we'll have to build a custom Docker image that adds the dependency. A simple Dockerfile like the following will do.
Install extension
FROM postgres:17
RUN apt-get update \
&& apt-get install -y curl \
&& apt-get -y install postgresql-17-http
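Once the container is up, you can check that the package actually shipped the extension files before trying to create anything:

-- Should return one row if the http extension is available
SELECT name, default_version
FROM pg_available_extensions
WHERE name = 'http';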
The extension still has to be created explicitly; note that the target schema (scrapedb here) has to exist first. Afterwards the pg_extension catalog shows whether it was installed successfully.
CREATE EXTENSION http SCHEMA scrapedb;
SELECT * FROM pg_extension;
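With the extension created we can do a quick smoke test. Because it lives in the scrapedb schema, that schema has to be on the search_path (or the call has to be schema-qualified); the URL is just a placeholder.

SET search_path TO scrapedb, public;

-- Any reachable URL will do; we only care that a status code comes back
SELECT status, content_type
FROM http_get('https://example.com');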
To implement a simple scraper we need a few things:
Entrypoint, a URL where the scrape starts.
Tasks, a list of tasks that the scraper should work on. Each of the tasks should have a status like "available", "in_flight", "finished" or "failed".
Title, a place to store the results (movie titles) we want to scrape.
A list of patterns and handlers so we know how to work on each task URL that has the status "available".
Create tables + triggers
We can start by setting up the task and title tables like the following:
CREATE TYPE status_enum AS ENUM ('failed', 'available', 'in_flight', 'finished');
CREATE TABLE task (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    status status_enum NOT NULL DEFAULT 'available',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE title (
    id SERIAL PRIMARY KEY,
    original_title TEXT,
    original_release_year INT,
    director TEXT,
    object_type TEXT NOT NULL
);
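One optional addition: the scraper later identifies tasks by their URL, so a unique constraint keeps the same page from being queued twice.

-- Optional: one task per URL
ALTER TABLE task ADD CONSTRAINT task_url_unique UNIQUE (url);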
To start a scrape we want to call a simple function that then inserts a new initial task into the task table.
CREATE OR REPLACE FUNCTION start_scrape(url_input TEXT)
RETURNS VOID AS $$
BEGIN
    INSERT INTO task (url, status)
    VALUES (url_input, 'available');
END;
$$ LANGUAGE plpgsql;
This alone wouldn't do anything, so we need a trigger that acts on every new insert into the task table. If a task with the status "available" is inserted, we want to process it. This is the main part of the scraper where we fetch the results from the web, check the status code, parse the response and spawn additional tasks if needed.
Before we do that, though, we'll have to set up a basic function that we can then re-use. process_task accepts a task URL, sets the task to in_flight, performs the HTTP request and returns the contents. If the request succeeds we set the task status to "finished" so it won't get processed again; otherwise it's marked "failed".
CREATE OR REPLACE FUNCTION process_task(task_url TEXT)
RETURNS TABLE(content JSONB, status INT) AS $$
DECLARE
    rec RECORD;
BEGIN
    -- Mark the task as taken before doing the request
    UPDATE task
    SET status = 'in_flight'
    WHERE url = task_url;

    FOR rec IN
        SELECT *
        FROM http_get(task_url)
    LOOP
        IF rec.status = 200 THEN
            UPDATE task
            SET status = 'finished'
            WHERE url = task_url;
            RETURN QUERY SELECT rec.content::jsonb, rec.status;
        ELSE
            UPDATE task
            SET status = 'failed'
            WHERE url = task_url;
            RETURN QUERY SELECT '{}'::jsonb, rec.status;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
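Before wiring this into a trigger it's worth testing the function on its own. Note that the ::jsonb cast means the endpoint has to return JSON; the URL below is just the placeholder API used throughout this post.

-- Queue a task by hand, then process it directly
INSERT INTO task (url) VALUES ('https://apis.example.com/object_type/movie?id=1');

SELECT status, content
FROM process_task('https://apis.example.com/object_type/movie?id=1');

-- The task row should now be 'finished' (or 'failed')
SELECT url, status FROM task;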
Now it's time to glue it all together: we need the function that looks at the available tasks in the task table and decides how to handle each of them. In this case we have two handlers, one that acts on the "listing" endpoint and one that fetches details for each id in the listing.
If we wanted to make this a bit more scalable, the URL matching and handler dispatch would probably be split off into a separate table/function, and we'd just do some pattern matching to find the right handler to call.
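A rough sketch of what that could look like (the table and function names here are made up for illustration): every handler gets a row with a URL pattern, and the trigger looks up the first match and dispatches to it dynamically.

-- Hypothetical dispatch table: one row per URL pattern / handler pair
CREATE TABLE handler (
    url_pattern TEXT NOT NULL,
    handler_fn  TEXT NOT NULL
);

INSERT INTO handler (url_pattern, handler_fn) VALUES
    ('%/apis.example.com/titles/top_x/object_type%', 'handle_listing'),
    ('%/apis.example.com/object_type%', 'handle_detail');

-- Inside the trigger, something along the lines of:
--   SELECT handler_fn INTO fn FROM handler WHERE NEW.url LIKE url_pattern LIMIT 1;
--   EXECUTE format('SELECT %I($1)', fn) USING NEW.url;

For the demo, though, both handlers simply live inline in the trigger function: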
CREATE OR REPLACE FUNCTION process_new_task()
RETURNS TRIGGER AS $$
DECLARE
    rec RECORD;
    payload JSONB;
BEGIN
    IF NEW.status = 'available' AND
       NEW.url LIKE '%/apis.example.com/titles/top_x/object_type%' THEN
        -- Listing handler: fetch the top list and queue a detail task per id
        WITH resp AS (
            SELECT content AS resp_body, status
            FROM process_task(NEW.url)
        ), extract_ids AS (
            SELECT array_agg((elem->>'id')::int) AS ids
            FROM (
                SELECT jsonb_array_elements(resp_body) AS elem
                FROM resp
            ) subquery
        ), tasks AS (
            SELECT 'https://apis.example.com/object_type/movie?id=' || unnest(ids) AS url
            FROM extract_ids
        )
        INSERT INTO task (url, status)
        SELECT url, 'available'::status_enum FROM tasks;
    ELSIF NEW.status = 'available' AND
          NEW.url LIKE '%/apis.example.com/object_type%' THEN
        -- Detail handler: fetch a single title and store it (skipping duplicates)
        FOR rec IN
            SELECT content::jsonb AS payload, status
            FROM process_task(NEW.url)
        LOOP
            payload := rec.payload;
            INSERT INTO title (
                original_title, original_release_year, director, object_type
            )
            SELECT
                payload->>'original_title',
                (payload->>'original_release_year')::INT,
                payload->>'director',
                payload->>'object_type'
            WHERE NOT EXISTS (
                SELECT 1
                FROM title
                WHERE object_type = payload->>'object_type'
                  AND id = (payload->>'id')::INT
            );
        END LOOP;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
Now everything is set up so we can create the most important trigger. It fires every time there's a new entry in the task table and calls process_new_task. That function will then either fetch the details for an id and store them in the title table, or match the "listing" route and insert more tasks into the task table...which in turn fires process_new_task again, until there's no "available" task left.
CREATE TRIGGER new_available_task_trigger
AFTER INSERT ON task
FOR EACH ROW
EXECUTE FUNCTION process_new_task();
Starting the scrape
The only step remaining is to test it: call start_scrape() with the entrypoint URL and watch the titles show up in the title table.
select start_scrape('https://apis.example.com/titles/top_x/object_type?genre=action');
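Once the call returns (the whole trigger chain runs synchronously inside it), a couple of quick queries show what happened:

-- Task statuses after the run
SELECT status, count(*) FROM task GROUP BY status;

-- The scraped movie titles
SELECT original_title, original_release_year, director
FROM title
ORDER BY id;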
This is obviously just a quick hack for a presentation, but there's a lot of logic we can pack into modern databases to keep things consistent and organized.