This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Corpus Extraction Pipeline
Purpose and Scope
This document describes the corpus extraction pipeline—the Docker-based system that clones SQLite's official Fossil repository and extracts the sqllogictest corpus into a local filesystem directory. The pipeline consists of a Dockerfile that builds a specialized container image and a bash script (slt-extract) that performs the actual extraction.
For information about the automated workflow that orchestrates this pipeline on a schedule, see Automated Update Workflow. For details on how the extracted tests are organized, see Test Organization Structure.
Pipeline Architecture
The extraction pipeline follows a three-stage architecture: image build, repository cloning, and corpus extraction. The entire process is encapsulated within a Docker container to ensure reproducibility and isolation from the host environment.
Sources: Dockerfile:1-36
flowchart LR
subgraph "Build Stage"
DF["Dockerfile"]
BASE["debian:stable-slim\nbase image"]
DEPS["Dependencies:\nfossil, bash, tcl,\nbuild-essential"]
end
subgraph "Clone Stage"
FOSSIL_CMD["fossil clone\nwww.sqlite.org/sqllogictest"]
FOSSIL_OPEN["fossil open\nsqllogictest.fossil"]
SRC_DIR["/src/test/\ncloned corpus"]
end
subgraph "Extract Stage"
SCRIPT["/usr/local/bin/\nslt-extract"]
CP_CMD["cp -R /src/test\nto /work/test"]
DEST_DIR["/work/test/\nextracted corpus"]
end
DF --> BASE
DF --> DEPS
DEPS --> FOSSIL_CMD
FOSSIL_CMD --> FOSSIL_OPEN
FOSSIL_OPEN --> SRC_DIR
DF --> SCRIPT
SCRIPT --> CP_CMD
SRC_DIR --> CP_CMD
CP_CMD --> DEST_DIR
Docker Image Construction
The extraction pipeline uses a Debian-based Docker image defined in the Dockerfile. The image is built in multiple logical stages, though implemented as a single-stage Dockerfile for simplicity.
Base Image and Dependencies
The image starts from debian:stable-slim and installs the required packages:
| Package | Purpose |
|---|---|
| fossil | Fossil SCM client for cloning the repository |
| bash | Shell for executing the extraction script |
| tcl | Required by some fossil operations |
| build-essential | Compilation tools (legacy requirement) |
| ca-certificates | HTTPS certificate validation |
| curl | HTTP client utilities |
The package installation occurs at Dockerfile:5-12 using apt-get with --no-install-recommends to minimize image size. The /var/lib/apt/lists/* cache is removed after installation to further reduce the final image size.
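Based on the package table and the flags described above, the dependency-installation stage plausibly resembles the following sketch (the exact package ordering and RUN grouping are assumptions; the actual Dockerfile is the authority):

```dockerfile
# Sketch of the base-image and dependency stage described above.
# Package list mirrors the table; flags follow Dockerfile:5-12 as described.
FROM debian:stable-slim

RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      fossil bash tcl build-essential ca-certificates curl \
 && rm -rf /var/lib/apt/lists/*   # drop the apt cache to shrink the image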
Fossil Repository Cloning
The repository cloning process occurs during the Docker build phase, not at runtime. This design decision ensures that the cloned repository is baked into the image, eliminating the need for network access during extraction.
sequenceDiagram
participant Build as Docker Build
participant Fossil as fossil CLI
participant Remote as www.sqlite.org
participant FS as /src filesystem
Build->>FS: WORKDIR /src
Build->>Fossil: fossil clone https://www.sqlite.org/sqllogictest/
Fossil->>Remote: HTTP GET sqllogictest repository
Remote-->>Fossil: repository data
Fossil->>FS: write /src/sqllogictest.fossil
Build->>Fossil: fossil open sqllogictest.fossil
Fossil->>FS: extract to /src/test/
Build->>Fossil: fossil user default root
Fossil->>FS: set default user
The cloning sequence at Dockerfile:14-17 performs three operations:
- fossil clone: Downloads the repository from https://www.sqlite.org/sqllogictest/ to /src/sqllogictest.fossil
- fossil open: Extracts the repository contents to the current working directory (/src)
- fossil user default root: Sets the default user to root for subsequent operations
The --user root flag in both the clone and open commands ensures consistent user attribution within the Fossil repository.
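A sketch of what this clone stage likely looks like in the Dockerfile (the single-RUN grouping is an assumption; the three fossil operations and the --user root flags follow the sequence described above):

```dockerfile
# Sketch of the build-time clone stage (Dockerfile:14-17).
WORKDIR /src
RUN fossil clone --user root https://www.sqlite.org/sqllogictest/ sqllogictest.fossil \
 && fossil open --user root sqllogictest.fossil \
 && fossil user default root
```

Because this runs at build time, the resulting image carries the full corpus and needs no network access when the container executes.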
Sources: Dockerfile:14-17
Extraction Script Implementation
The extraction logic is implemented in a bash script embedded directly in the Dockerfile as a heredoc. The script is written to /usr/local/bin/slt-extract at Dockerfile:20-31 and marked executable at Dockerfile:33.
Script Components
Script Variables:
| Variable | Default Value | Description |
|---|---|---|
| src_root | /src/test | Source directory containing cloned corpus |
| dest_root | ${1:-/work/test} | Destination directory (first argument or default) |
The script uses bash strict mode (set -euo pipefail) at Dockerfile:22 to ensure:
- -e: Exit immediately if any command fails
- -u: Treat unset variables as errors
- -o pipefail: Return the exit status of the last failed command in a pipe
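Putting the variables and strict mode together, the embedded script plausibly looks like the sketch below, wrapped here in a function so it can be called standalone. The SLT_SRC_ROOT environment override is an illustrative addition for running the sketch outside the image; the in-image script hardcodes /src/test:

```shell
#!/usr/bin/env bash
set -euo pipefail   # strict mode: fail fast on errors, unset vars, pipe failures

# slt_extract mirrors the embedded script's logic. The real script runs
# these lines at top level with src_root hardcoded to /src/test.
slt_extract() {
  local src_root="${SLT_SRC_ROOT:-/src/test}"  # illustrative env override
  local dest_root="${1:-/work/test}"           # first argument or default
  mkdir -p "$dest_root"                        # no error if it already exists
  cp -R "$src_root/." "$dest_root/"            # copy contents, incl. hidden files
}
```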
Extraction Operation
The core extraction operation at Dockerfile:28 uses cp -R with the trailing /. syntax:
cp -R "$src_root/." "$dest_root/"
This syntax copies the contents of /src/test/ (including hidden files) rather than the directory itself, resulting in the corpus files appearing directly under dest_root rather than in a nested test/ subdirectory.
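The effect of the trailing /. can be seen with a small self-contained demo, in which temporary directories stand in for /src/test and /work/test:

```shell
# Demonstrate the trailing "/." copy semantics with throwaway paths
# standing in for /src/test and /work/test.
tmp="$(mktemp -d)"
mkdir -p "$tmp/src/test/evidence" "$tmp/work/test"
echo demo > "$tmp/src/test/evidence/sample.test"
touch "$tmp/src/test/.hidden"

# "/." copies the directory's contents (hidden files included) directly
# into the destination, with no nested test/ subdirectory.
cp -R "$tmp/src/test/." "$tmp/work/test/"
```

Without the trailing /., the same command would create /work/test/test/ and bury the corpus one level too deep.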
Sources: Dockerfile:20-35
Container Execution Model
The container is configured with slt-extract as the ENTRYPOINT at Dockerfile:35. This design allows the container to function as a single-purpose executable tool.
flowchart LR
subgraph "Host Filesystem"
HOST_TEST["$PWD/test/"]
end
subgraph "Container Filesystem"
CONTAINER_WORK["/work/test/"]
SRC_TEST["/src/test/\n(baked into image)"]
end
HOST_TEST -.->|-v mount| CONTAINER_WORK
SRC_TEST -->|cp -R| CONTAINER_WORK
CONTAINER_WORK -->|persists to| HOST_TEST
Volume Mounting Strategy
The extraction process relies on Docker volume mounting to persist the extracted corpus to the host filesystem:
When the container runs with -v "$PWD/test:/work/test", the host's test/ directory is mounted at /work/test inside the container. The slt-extract script copies from the image's /src/test to the mounted /work/test, making the files appear in the host's test/ directory.
Sources: README.md:18, Dockerfile:35
Build and Execution Workflow
The complete extraction workflow consists of building the image and running the container:
Image Build Process
The docker build command at README.md:10 creates an image tagged as slt-gen. This tag is referenced in the GitHub Actions workflow (see Automated Update Workflow) and in local usage.
Build stages:
- Pull the debian:stable-slim base image
- Install system packages via apt-get
- Clone the Fossil repository from www.sqlite.org
- Extract the repository to /src/test
- Write the slt-extract script to /usr/local/bin
- Set the script as executable
- Configure the container entrypoint
Extraction Execution
The extraction sequence at README.md:16-18 performs:
- Remove existing directory: rm -rf test ensures a clean slate
- Create empty directory: mkdir test creates the mount point
- Run container: docker run executes the extraction with:
  - --rm: Remove container after execution
  - -v "$PWD/test:/work/test": Mount host directory into container
  - slt-gen: Image name
The container executes slt-extract (the entrypoint), which copies corpus files from /src/test to /work/test, persisting them to the host's test/ directory via the volume mount.
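Assembled from the steps above, the end-to-end invocation likely looks like the following (the slt-gen tag comes from the README; treat this as a sketch rather than the canonical commands):

```shell
# Build the image; the Fossil repository is cloned during this step.
docker build -t slt-gen .

# Extract the corpus into ./test on the host.
rm -rf test                                    # clean slate
mkdir test                                     # mount point for the volume
docker run --rm -v "$PWD/test:/work/test" slt-gen
```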
Sources: README.md:6-19, Dockerfile:35
graph TB
subgraph "Upstream Fossil Repository"
FOSSIL_ROOT["www.sqlite.org/sqllogictest/"]
FOSSIL_TEST["test/ directory"]
FOSSIL_EVIDENCE["test/evidence/"]
FOSSIL_INDEX["test/index/"]
end
subgraph "Docker Image /src"
IMG_SRC["/src/sqllogictest.fossil"]
IMG_TEST["/src/test/"]
IMG_EV["/src/test/evidence/"]
IMG_IDX["/src/test/index/"]
end
subgraph "Container /work"
WORK_TEST["/work/test/"]
WORK_EV["/work/test/evidence/"]
WORK_IDX["/work/test/index/"]
end
subgraph "Host Filesystem"
HOST_TEST["$PWD/test/"]
HOST_EV["$PWD/test/evidence/"]
HOST_IDX["$PWD/test/index/"]
end
FOSSIL_ROOT --> FOSSIL_TEST
FOSSIL_TEST --> FOSSIL_EVIDENCE
FOSSIL_TEST --> FOSSIL_INDEX
FOSSIL_ROOT -->|fossil clone| IMG_SRC
IMG_SRC -->|fossil open| IMG_TEST
IMG_TEST --> IMG_EV
IMG_TEST --> IMG_IDX
IMG_TEST -->|cp -R| WORK_TEST
IMG_EV -->|cp -R| WORK_EV
IMG_IDX -->|cp -R| WORK_IDX
WORK_TEST -->|volume mount| HOST_TEST
WORK_EV -->|volume mount| HOST_EV
WORK_IDX -->|volume mount| HOST_IDX
Directory Structure Mapping
The extraction pipeline maintains a direct mirror of the upstream repository structure:
The pipeline preserves the complete directory hierarchy from upstream, including:
- SQL language specification tests in evidence/
- Query optimization tests in index/
- All subdirectory structures and file naming conventions
Sources: Dockerfile:14-28, README.md:21
Error Handling and Safety
The extraction script includes several safety mechanisms:
| Mechanism | Location | Purpose |
|---|---|---|
| set -e | Dockerfile:22 | Exit on any command failure |
| set -u | Dockerfile:22 | Exit on undefined variable usage |
| set -o pipefail | Dockerfile:22 | Propagate pipe failures |
| mkdir -p | Dockerfile:27 | Create directory if missing, no error if exists |
| --rm flag | README.md:18 | Clean up container after execution |
The bash strict mode ensures that any failure in the extraction process (e.g., missing source directory, write permission issues) causes immediate script termination with a non-zero exit code, which propagates to the Docker container exit status.
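The failure behavior of strict mode can be demonstrated in isolation; each subshell below mimics one failure mode from the table (the examples are generic bash, not code taken from the repository):

```shell
# -e: a failing command aborts the script, so "unreachable" is never printed.
bash -c 'set -euo pipefail; false; echo unreachable' && status_e=0 || status_e=$?

# -u: expanding an unset variable is a fatal error.
bash -c 'set -euo pipefail; echo "$not_defined"' 2>/dev/null && status_u=0 || status_u=$?

# -o pipefail: an early failure in a pipeline is not masked by a later success.
bash -c 'set -euo pipefail; false | true' && status_p=0 || status_p=$?
```

All three subshells exit non-zero, which is exactly how an extraction failure propagates up to the Docker container's exit status.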
Sources: Dockerfile:20-31, README.md:18
Performance Characteristics
The extraction pipeline exhibits the following performance characteristics:
Build Time: The docker build step includes network I/O to clone the Fossil repository. Repository size is typically 50-200 MB, making build time network-dependent (1-5 minutes on typical connections).
Extraction Time: The docker run step performs only filesystem copy operations from the image to the mounted volume. With typical corpus sizes (10,000+ test files), extraction completes in seconds.
Storage Efficiency: The Fossil repository is cloned once during image build and reused for all subsequent extractions. This avoids redundant network operations when running multiple extractions.
The pipeline's design optimizes for extraction speed at the cost of larger image size. This trade-off is appropriate for automated CI/CD environments where fast extraction is critical.
Sources: Dockerfile:15-17, Dockerfile:28