Getting Started

This page provides instructions for setting up and running the GovDoc Scanner project.

Prerequisites

Node.js: v18.x or newer (recommended: v20.x) - Download from nodejs.org or install via package manager:

# Ubuntu/Debian
sudo apt install nodejs npm

# macOS (Homebrew)
brew install node

# Windows (Chocolatey)
choco install nodejs

Git: For cloning the repository
Gemini API Key: Required for AI-powered document processing (Get one here)

Quick Start

1. Clone and Install

git clone https://github.com/flexivian/govdoc-scanner.git
cd govdoc-scanner
npm install

2. Environment Setup

cp .env.example .env
# Edit .env and add your Gemini API key:
# GEMINI_API_KEY=your_gemini_api_key_here

Project Applications

The project includes three main applications:

CLI Tool (govdoc): Complete end-to-end interactive workflow (recommended for most users)
Crawler (crawler): Search and download documents from GEMI portal with enhanced date extraction
Doc-Scanner (scanner): Process documents with AI-powered chronological analysis, representative tracking, and automatic change detection

Individual Application Usage

CLI Tool (Recommended)

npm start govdoc
# Interactive mode with guided prompts:
# - File input, manual entry, or random selection
# - Automated crawling and document processing
# - Progress tracking and comprehensive summaries
# - Output saved to ~/.govdoc/cli/output/

Command line mode for automation:

# Process from file
npm start govdoc -- --input ./companies.gds

# Process random companies
npm start govdoc -- --company-random 10

# Show help
npm start govdoc -- --help

Output Structure

After processing, find results in the working directory:

~/.govdoc/
├── cli/
    └── output/
        ├── 123204604000/
        │   ├── 123204604000_final_metadata.json  # Comprehensive company metadata with tracked changes
        │   └── document_downloads/
        │       └── *.pdf, *.docx files
        └── govdoc-output.json  # Summary

Note: If you prefer to run each step separately (crawler -> scanner), make sure to use LOG_LEVEL=DEBUG for detailed output when running the separate apps:

Crawler

npm start crawler
# Search for companies or download by GEMI ID
# Results saved to ~/.govdoc/crawler/downloads/ and search results to ~/.govdoc/crawler/search-results.gds

Doc-Scanner

npm start scanner
# Process documents from input directory
# Requires manual document placement in ~/.govdoc/doc-scanner/input/
# Important: Name files with date prefixes (YYYY-MM-DD) for chronological processing
# Features:
#   - Intelligent processing: skips documents that are already up to date
#   - Change tracking: automatically summarizes significant changes between versions
#   - Comprehensive metadata with representative tracking and ownership history

The metadata file includes:

Current Snapshot: Latest company state with all extracted information
Tracked Changes: Document-by-document history of significant changes
Representative Tracking: Complete ownership and role evolution
Change Summaries: Human-readable summaries of key modifications

Next Steps

Explore Configuration for advanced setup options and customization.
Visit Development Setup for instructions on contributing and running the project in development mode.
Read the GSoC 2025 Overview to learn more about the project's background and goals.

Need Help?

If you encounter any issues, check out our Troubleshooting Guide for common problems and solutions.

Prerequisites​

Quick Start​

1. Clone and Install​

2. Environment Setup​

Project Applications​

Individual Application Usage​

CLI Tool (Recommended)​

Output Structure​

Crawler​

Doc-Scanner​

Next Steps​

Need Help?​