Corpus Extraction Pipeline

Purpose and Scope

This document describes the corpus extraction pipeline—the Docker-based system that clones SQLite's official Fossil repository and extracts the sqllogictest corpus into a local filesystem directory. The pipeline consists of a Dockerfile that builds a specialized container image and a bash script (slt-extract) that performs the actual extraction.

For information about the automated workflow that orchestrates this pipeline on a schedule, see Automated Update Workflow. For details on how the extracted tests are organized, see Test Organization Structure.

Pipeline Architecture

The extraction pipeline follows a three-stage architecture: image build, repository cloning, and corpus extraction. The entire process is encapsulated within a Docker container to ensure reproducibility and isolation from the host environment.

Sources: Dockerfile:1-36

```mermaid
flowchart LR
    subgraph "Build Stage"
        DF["Dockerfile"]
        BASE["debian:stable-slim\nbase image"]
        DEPS["Dependencies:\nfossil, bash, tcl,\nbuild-essential"]
    end

    subgraph "Clone Stage"
        FOSSIL_CMD["fossil clone\nwww.sqlite.org/sqllogictest"]
        FOSSIL_OPEN["fossil open\nsqllogictest.fossil"]
        SRC_DIR["/src/test/\ncloned corpus"]
    end

    subgraph "Extract Stage"
        SCRIPT["/usr/local/bin/\nslt-extract"]
        CP_CMD["cp -R /src/test\nto /work/test"]
        DEST_DIR["/work/test/\nextracted corpus"]
    end

    DF --> BASE
    DF --> DEPS
    DEPS --> FOSSIL_CMD
    FOSSIL_CMD --> FOSSIL_OPEN
    FOSSIL_OPEN --> SRC_DIR
    DF --> SCRIPT
    SCRIPT --> CP_CMD
    SRC_DIR --> CP_CMD
    CP_CMD --> DEST_DIR
```

Docker Image Construction

The extraction pipeline uses a Debian-based Docker image defined in the Dockerfile. The image is built in multiple logical stages, though implemented as a single-stage Dockerfile for simplicity.

Base Image and Dependencies

The image starts from debian:stable-slim and installs the required packages:

| Package | Purpose |
| --- | --- |
| fossil | Fossil SCM client for cloning the repository |
| bash | Shell for executing the extraction script |
| tcl | Required by some fossil operations |
| build-essential | Compilation tools (legacy requirement) |
| ca-certificates | HTTPS certificate validation |
| curl | HTTP client utilities |

The package installation occurs at Dockerfile:5-12 using apt-get with --no-install-recommends to minimize image size. The /var/lib/apt/lists/* cache is removed after installation to further reduce the final image size.
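As a rough sketch, assuming the standard apt-get idioms described above (the actual Dockerfile may differ in detail), the installation layer looks like:

```dockerfile
FROM debian:stable-slim

# Install only what the pipeline needs; --no-install-recommends keeps the
# layer small, and clearing the apt cache shrinks the final image further.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        fossil bash tcl build-essential ca-certificates curl \
    && rm -rf /var/lib/apt/lists/*
```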

Fossil Repository Cloning

The repository cloning process occurs during the Docker build phase, not at runtime. This design decision ensures that the cloned repository is baked into the image, eliminating the need for network access during extraction.

```mermaid
sequenceDiagram
    participant Build as Docker Build
    participant Fossil as fossil CLI
    participant Remote as www.sqlite.org
    participant FS as /src filesystem

    Build->>FS: WORKDIR /src
    Build->>Fossil: fossil clone https://www.sqlite.org/sqllogictest/
    Fossil->>Remote: HTTP GET sqllogictest repository
    Remote-->>Fossil: repository data
    Fossil->>FS: write /src/sqllogictest.fossil
    Build->>Fossil: fossil open sqllogictest.fossil
    Fossil->>FS: extract to /src/test/
    Build->>Fossil: fossil user default root
    Fossil->>FS: set default user
```

The cloning sequence at Dockerfile:14-17 performs three operations:

  1. fossil clone: Downloads the repository from https://www.sqlite.org/sqllogictest/ to /src/sqllogictest.fossil
  2. fossil open: Extracts the repository contents to the current working directory (/src)
  3. fossil user default root: Sets the default user to root for subsequent operations

The --user root flag in both the clone and open commands ensures consistent user attribution within the Fossil repository.
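Assembled from the three operations above, the build-time clone step looks roughly like this (a sketch, not the verbatim Dockerfile):

```dockerfile
WORKDIR /src

# Clone and open at build time so the corpus is baked into the image;
# extraction later requires no network access.
RUN fossil clone --user root https://www.sqlite.org/sqllogictest/ sqllogictest.fossil \
    && fossil open --user root sqllogictest.fossil \
    && fossil user default root
```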

Sources: Dockerfile:14-17

Extraction Script Implementation

The extraction logic is implemented in a bash script embedded directly in the Dockerfile as a heredoc. The script is written to /usr/local/bin/slt-extract at Dockerfile:20-31 and marked executable at Dockerfile:33.

Script Components

Script Variables:

| Variable | Default Value | Description |
| --- | --- | --- |
| src_root | /src/test | Source directory containing the cloned corpus |
| dest_root | ${1:-/work/test} | Destination directory (first argument, or the default) |

The script uses bash strict mode (set -euo pipefail) at Dockerfile:22 to ensure:

  • -e: Exit immediately if any command fails
  • -u: Treat unset variables as errors
  • -o pipefail: Return the exit status of the rightmost failing command in a pipeline
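Putting these pieces together, the embedded script is roughly the following. This is a reconstruction from the variables and operations documented here, not the verbatim heredoc:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Source is baked into the image at build time; the destination defaults
# to /work/test but can be overridden by the first argument.
src_root="/src/test"
dest_root="${1:-/work/test}"

mkdir -p "$dest_root"

# The trailing /. copies the contents of src_root (including hidden
# files) rather than nesting a test/ directory inside dest_root.
cp -R "$src_root/." "$dest_root/"
```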

Extraction Operation

The core extraction operation at Dockerfile:28 uses cp -R with the trailing /. syntax:

```bash
cp -R "$src_root/." "$dest_root/"
```

This syntax copies the contents of /src/test/ (including hidden files) rather than the directory itself, resulting in the corpus files appearing directly under dest_root rather than in a nested test/ subdirectory.
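The difference is easy to demonstrate outside the container (illustrative paths and file names only):

```bash
mkdir -p /tmp/demo/src/test /tmp/demo/a /tmp/demo/b
touch /tmp/demo/src/test/select1.test

cp -R /tmp/demo/src/test /tmp/demo/a    # nested:    /tmp/demo/a/test/select1.test
cp -R /tmp/demo/src/test/. /tmp/demo/b  # flattened: /tmp/demo/b/select1.test
```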

Sources: Dockerfile:20-35

Container Execution Model

The container is configured with slt-extract as the ENTRYPOINT at Dockerfile:35. This design allows the container to function as a single-purpose executable tool.
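Because dest_root comes from the script's first argument, any arguments passed to docker run are forwarded to slt-extract. A hypothetical invocation that extracts to a different mount point:

```bash
# /out and $PWD/corpus are illustrative paths; any writable mount works.
docker run --rm -v "$PWD/corpus:/out" slt-gen /out
```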

```mermaid
flowchart LR
    subgraph "Host Filesystem"
        HOST_TEST["$PWD/test/"]
    end

    subgraph "Container Filesystem"
        CONTAINER_WORK["/work/test/"]
        SRC_TEST["/src/test/\n(baked into image)"]
    end

    HOST_TEST -.->|-v mount| CONTAINER_WORK
    SRC_TEST -->|cp -R| CONTAINER_WORK
    CONTAINER_WORK -->|persists to| HOST_TEST
```

Volume Mounting Strategy

The extraction process relies on Docker volume mounting to persist the extracted corpus to the host filesystem:

When the container runs with -v "$PWD/test:/work/test", the host's test/ directory is mounted at /work/test inside the container. The slt-extract script copies from the image's /src/test to the mounted /work/test, making the files appear in the host's test/ directory.

Sources: README.md:18, Dockerfile:35

Build and Execution Workflow

The complete extraction workflow consists of building the image and running the container:

Image Build Process

The docker build command at README.md:10 creates an image tagged slt-gen. This tag is referenced in the GitHub Actions workflow (see Automated Update Workflow) and in local usage.
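Based on the README, the build is a single command run from the repository root:

```bash
docker build -t slt-gen .
```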

Build stages:

  1. Pull debian:stable-slim base image
  2. Install system packages via apt-get
  3. Clone Fossil repository from www.sqlite.org
  4. Extract repository to /src/test
  5. Write slt-extract script to /usr/local/bin
  6. Set script as executable
  7. Configure container entrypoint

Extraction Execution

The extraction sequence at README.md:16-18 performs:

  1. Remove existing directory: rm -rf test ensures a clean slate
  2. Create empty directory: mkdir test creates the mount point
  3. Run container: docker run executes the extraction with:
    • --rm: Remove the container after execution
    • -v "$PWD/test:/work/test": Mount the host directory into the container
    • slt-gen: Image name

The container executes slt-extract (the entrypoint), which copies corpus files from /src/test to /work/test, persisting them to the host's test/ directory via the volume mount.
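Put together, the local extraction sequence from the README looks like this:

```bash
rm -rf test     # clean slate
mkdir test      # mount point
docker run --rm \
    -v "$PWD/test:/work/test" \
    slt-gen     # entrypoint slt-extract copies /src/test to /work/test
```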

Sources: README.md:6-19, Dockerfile:35

```mermaid
graph TB
    subgraph "Upstream Fossil Repository"
        FOSSIL_ROOT["www.sqlite.org/sqllogictest/"]
        FOSSIL_TEST["test/ directory"]
        FOSSIL_EVIDENCE["test/evidence/"]
        FOSSIL_INDEX["test/index/"]
    end

    subgraph "Docker Image /src"
        IMG_SRC["/src/sqllogictest.fossil"]
        IMG_TEST["/src/test/"]
        IMG_EV["/src/test/evidence/"]
        IMG_IDX["/src/test/index/"]
    end

    subgraph "Container /work"
        WORK_TEST["/work/test/"]
        WORK_EV["/work/test/evidence/"]
        WORK_IDX["/work/test/index/"]
    end

    subgraph "Host Filesystem"
        HOST_TEST["$PWD/test/"]
        HOST_EV["$PWD/test/evidence/"]
        HOST_IDX["$PWD/test/index/"]
    end

    FOSSIL_ROOT --> FOSSIL_TEST
    FOSSIL_TEST --> FOSSIL_EVIDENCE
    FOSSIL_TEST --> FOSSIL_INDEX

    FOSSIL_ROOT -->|fossil clone| IMG_SRC
    IMG_SRC -->|fossil open| IMG_TEST
    IMG_TEST --> IMG_EV
    IMG_TEST --> IMG_IDX

    IMG_TEST -->|cp -R| WORK_TEST
    IMG_EV -->|cp -R| WORK_EV
    IMG_IDX -->|cp -R| WORK_IDX

    WORK_TEST -->|volume mount| HOST_TEST
    WORK_EV -->|volume mount| HOST_EV
    WORK_IDX -->|volume mount| HOST_IDX
```

Directory Structure Mapping

The extraction pipeline maintains a direct mirror of the upstream repository structure:

The pipeline preserves the complete directory hierarchy from upstream, including:

  • SQL language specification tests in evidence/
  • Query optimization tests in index/
  • All subdirectory structures and file naming conventions

Sources: Dockerfile:14-28, README.md:21

Error Handling and Safety

The extraction script includes several safety mechanisms:

| Mechanism | Location | Purpose |
| --- | --- | --- |
| set -e | Dockerfile:22 | Exit on any command failure |
| set -u | Dockerfile:22 | Exit on undefined variable usage |
| set -o pipefail | Dockerfile:22 | Propagate pipe failures |
| mkdir -p | Dockerfile:27 | Create the directory if missing; no error if it exists |
| --rm flag | README.md:18 | Clean up the container after execution |

The bash strict mode ensures that any failure in the extraction process (e.g., missing source directory, write permission issues) causes immediate script termination with a non-zero exit code, which propagates to the Docker container exit status.
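Because a non-zero status propagates to docker run, callers can detect failures directly. A sketch of caller-side handling (not part of the pipeline itself):

```bash
if ! docker run --rm -v "$PWD/test:/work/test" slt-gen; then
    echo "corpus extraction failed" >&2
    exit 1
fi
```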

Sources: Dockerfile:20-31, README.md:18

Performance Characteristics

The extraction pipeline exhibits the following performance characteristics:

Build Time: The docker build step includes network I/O to clone the Fossil repository. Repository size is typically 50-200 MB, making build time network-dependent (1-5 minutes on typical connections).

Extraction Time: The docker run step performs only filesystem copy operations from the image to the mounted volume. With typical corpus sizes (10,000+ test files), extraction completes in seconds.

Storage Efficiency: The Fossil repository is cloned once during image build and reused for all subsequent extractions. This avoids redundant network operations when running multiple extractions.
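For example, once the image is built, repeated extractions involve no network I/O; timing a run makes the filesystem-bound behavior visible (illustrative commands):

```bash
docker build -t slt-gen .    # network cost paid once, at build time

rm -rf test && mkdir test
time docker run --rm -v "$PWD/test:/work/test" slt-gen   # completes in seconds
```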

The pipeline's design optimizes for extraction speed at the cost of larger image size. This trade-off is appropriate for automated CI/CD environments where fast extraction is critical.

Sources: Dockerfile:15-17, Dockerfile:28