GSoC 2025 Project Overview: GovDoc Scanner

Mentors: Giannis E. Skitsas, Vasilis Christopoulos

Assignee: Eftihis Drakakis

Abstract

The GovDoc Scanner project tackles the challenge of accessing and using public company data from the Greek GEMI portal, where information is typically locked in unstructured PDF/DOC/DOCX files. The project delivers a suite of tools that crawl and download documents, process them with AI (Google's Gemini 2.5 Flash Lite), and produce structured, searchable metadata with chronological tracking. Outputs are organized per company and consolidated for batch runs, improving the transparency and accessibility of public corporate information in Greece.

Main goals for GSoC 2025

The primary objectives for this project were:

Automated Document Crawling: Develop a robust crawler to navigate the GEMI portal, search for companies using various filters, and download all associated public documents with enhanced date extraction for proper organization.
Intelligent Document Processing: Create an advanced document processing pipeline that can handle different file formats (PDF, DOCX, DOC), extract text, and use Google's Gemini 2.5 Flash Lite to extract comprehensive structured metadata with chronological processing and representative tracking.
End-to-End CLI Tool: Build a unified command-line interface to orchestrate the entire workflow, from crawling and downloading to processing and storing the data, complete with interactive prompts, automated batch processing, progress tracking and error handling.
Comprehensive Documentation: Establish a documentation site to provide clear instructions for installation, usage, and development, ensuring the project is accessible to a wide audience.

Development Progress Through Pull Requests

The project was developed through a series of iterative pull requests, each adding significant functionality and improvements:

Core Infrastructure & Setup (PRs #3, #6, #7, #8, #26, #27)

PR #3: Implementation of a batch document processing pipeline for metadata extraction and contextual history generation using Gemini
PR #6: Pipeline for company search and document download
PR #7: Orchestration script for crawling and PDF scanning
PR #8: Updated README and Nx structure - moved .env file to root
PR #26: App updates - Enhanced functionality across all applications
PR #27: CLI Tool - Removal of Nx - Streamlined the CLI tool and removed the Nx dependency for better maintainability

Documentation & Project Management (PRs #9, #10, #35)

PR #9: Docusaurus Setup - Established comprehensive documentation infrastructure
PR #10: GitHub Pages deployment for documentation - Made documentation publicly accessible
PR #35: Docs Updates - Comprehensive documentation updates and improvements

UI/UX & Visual Improvements (PR #29)

PR #29: Logo and homepage images - Added branding and visual elements to improve user experience

Advanced Features & Bug Fixes (PRs #30, #31)

PR #30: Tracked changes feature - Implemented change tracking capabilities for better workflow management
PR #31: Handle fenced JSON responses and enforce raw JSON output - Fixed JSON parsing issues and improved data consistency
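The fenced-JSON problem addressed in PR #31 can be illustrated with a minimal sketch. Language models sometimes wrap a JSON answer in a Markdown code fence even when asked for raw JSON, so a parser has to tolerate both forms. The function name parseModelJson is hypothetical and is not taken from the project's code.

```javascript
// Hypothetical helper, not the project's actual implementation.
// Accepts either raw JSON or JSON wrapped in a Markdown code fence
// (with or without a "json" language tag) and returns the parsed value.
function parseModelJson(responseText) {
  const trimmed = responseText.trim();
  // Capture everything between an opening and a closing fence, if present.
  const fenced = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/i);
  const payload = fenced ? fenced[1] : trimmed;
  // JSON.parse throws a SyntaxError on malformed input; callers decide
  // whether to retry the request or report the failure.
  return JSON.parse(payload);
}
```

Stripping the fence on the client side complements, rather than replaces, instructing the model to return raw JSON: the prompt reduces how often fences appear, and the parser handles the cases that still slip through.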

Infrastructure & Error Handling (PRs #32, #33, #34)

PR #32: Structured Errors, API Key Validation & UX Cleanup - Implemented comprehensive error handling and improved user experience
PR #33: Config Management - Added centralized configuration management system
PR #34: Unified Error Handling & Logging System - Advanced error handling and logging infrastructure

Implementation

Crawler Application

The crawler is a Node.js application responsible for interacting with the GEMI portal.

Web Scraping and Automation: Playwright powers browser automation (forms, navigation, downloads). Cheerio parses portal HTML to extract links and relevant information.
Enhanced Date Extraction: Advanced logic extracts dates from table rows and prepends them to filenames for proper chronological organization.
User Interface: An interactive command-line interface (CLI) was built using inquirer, allowing users to easily specify which companies to search for or which GEMI IDs to download documents for.
Robust Download System: axios handles HTTP downloads with retries and improved file extension detection. greek-utils assists with Greek-specific text; string-similarity suggests potential company-name matches.
Outputs: Search results are saved to apps/crawler/src/ids.txt. Downloads are organized under apps/crawler/src/downloads/{GEMI_ID}/document_downloads/ with date-prefixed filenames.
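The date-prefixing step described above could look roughly like the following sketch. It assumes that a table row's text contains a day-first date such as 05/03/2021 and that the prefix takes the form YYYY-MM-DD followed by an underscore; both are assumptions for illustration, and the function names are hypothetical.

```javascript
// Hypothetical helpers: the portal's real date format and the crawler's
// actual filename scheme are assumptions made for illustration.

// Finds the first DD/MM/YYYY-style date in a table row's text and returns
// it as YYYY-MM-DD, or null when no valid calendar date is present.
function extractIsoDate(rowText) {
  const match = rowText.match(/\b(\d{1,2})[\/.-](\d{1,2})[\/.-](\d{4})\b/);
  if (!match) return null;
  const [day, month, year] = match.slice(1).map(Number);
  // Round-trip through Date to reject impossible dates such as 31/02/2020.
  const date = new Date(Date.UTC(year, month - 1, day));
  const isValid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  if (!isValid) return null;
  const pad = (n) => String(n).padStart(2, "0");
  return `${year}-${pad(month)}-${pad(day)}`;
}

// Prepends the extracted date to a filename; files without a usable date
// keep their original name.
function datePrefixedName(rowText, filename) {
  const isoDate = extractIsoDate(rowText);
  return isoDate ? `${isoDate}_${filename}` : filename;
}
```

An ISO-style prefix has the useful property that plain alphabetical ordering of filenames is also chronological ordering, which keeps later processing simple.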

Document Scanner Application

The doc-scanner application processes the downloaded documents to extract valuable information using advanced AI capabilities.

Advanced AI-Powered Analysis: The scanner uses @google/generative-ai with Google's Gemini 2.5 Flash Lite, enabling specialized analysis of Greek legal documents with prompts for representative identification and Greek corporate terminology.
Chronological Processing: Documents are automatically sorted by date based on filename prefixes, ensuring proper chronological processing to track company evolution over time.
Intelligent Metadata Schema: Expanded schema includes comprehensive company information, detailed representative data (active status, tax IDs, capital shares), and refined field descriptions with duplicate prevention logic.
Modular Prompt Architecture: Specialized prompts are extracted into a dedicated prompts.mjs file, implementing advanced instructions for Greek legal terminology, representative identification, and ownership analysis.
Document Text Extraction: The application supports multiple document formats. mammoth is used to extract raw text from .docx files, and word-extractor handles older .doc files.
Unified Processing Logic: A streamlined processCompanyFiles(files, inputFolder, outputFolder, gemiId, model) function generates a single comprehensive metadata file, merged chronologically. Incremental logic skips processing when no new files are detected.
Environment Management: The dotenv package is used to manage environment variables, keeping sensitive information like API keys out of the source code.
Outputs: When used directly, results are written to apps/doc-scanner/src/data/output/{GEMI_ID}/{GEMI_ID}_final_metadata.json.
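As a rough sketch of the chronological ordering and incremental skipping described above: the code below assumes filenames carry a YYYY-MM-DD_ prefix and that the names of already processed files are available as a list. How the scanner actually records processed files is not stated here, so that part, along with the function names, is hypothetical.

```javascript
// Hypothetical helpers; the prefix format and the "processed files" list
// are assumptions, not the scanner's documented behavior.
const DATE_PREFIX = /^(\d{4}-\d{2}-\d{2})_/;

// Plain code-unit comparison, which orders ISO dates chronologically.
const compareStrings = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Returns a new array sorted by date prefix. Files without a prefix sort
// last, with the filename as a tie-breaker for a stable, repeatable order.
function sortChronologically(filenames) {
  const key = (name) => name.match(DATE_PREFIX)?.[1] ?? "9999-99-99";
  return [...filenames].sort(
    (a, b) => compareStrings(key(a), key(b)) || compareStrings(a, b)
  );
}

// Returns the files that still need processing, in chronological order.
// An empty result means the company can be skipped entirely.
function findNewFiles(filenames, processedFiles) {
  const seen = new Set(processedFiles);
  return sortChronologically(filenames).filter((name) => !seen.has(name));
}
```

Placing undated files last is one possible design choice: it lets dated documents establish the company timeline first, so that undated material cannot overwrite a known sequence of events.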

CLI Tool Application

The CLI tool serves as a unified interface that orchestrates the complete workflow, combining both crawler and doc-scanner functionality.

Interactive Mode: User-friendly prompts guide users through input methods (file input, manual GEMI ID entry, or random selection) with step-by-step workflow automation.
Command-Line Mode: Non-interactive automation support for batch processing and CI/CD integration with comprehensive argument parsing and error handling.
Workflow Orchestration: Seamlessly coordinates the crawler and doc-scanner applications, handling file transfer between components and maintaining processing state.
Progress Tracking: Real-time progress bars and end-of-run summaries are integrated with the logging system.
Advanced Infrastructure:
- Integrated configuration validation and API key verification
- Structured error handling with detailed logging and graceful failure recovery
- Centralized logging system with progress-aware buffering
Usage: Pass arguments using npm’s "--" separator, for example: npm start govdoc -- --input ./companies.gds.
Outputs: Batch outputs are saved under ./output/{GEMI_ID}/ with {GEMI_ID}_final_metadata.json and downloaded documents, plus a consolidated ./output/govdoc-output.json summary.
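The progress-aware log buffering mentioned above can be sketched as follows. The idea is that log lines written while a progress bar is on screen would break its rendering, so they are held back and flushed once the bar finishes. The class and method names are hypothetical and only illustrate the technique; the project's shared logger may be structured differently.

```javascript
// Hypothetical sketch of progress-aware buffering, not the project's
// shared logging module.
class ProgressAwareLogger {
  // "write" is injectable so output can go to the console, a file, or a test.
  constructor(write = (line) => console.log(line)) {
    this.write = write;
    this.buffer = [];
    this.progressActive = false;
  }

  // Call when a progress bar starts rendering.
  startProgress() {
    this.progressActive = true;
  }

  // Emits immediately when no bar is active, otherwise queues the line.
  log(level, message) {
    const line = `[${level.toUpperCase()}] ${message}`;
    if (this.progressActive) {
      this.buffer.push(line);
    } else {
      this.write(line);
    }
  }

  // Call when the bar completes: flushes queued lines in their original order.
  stopProgress() {
    this.progressActive = false;
    for (const line of this.buffer) this.write(line);
    this.buffer = [];
  }
}
```

A typical use would wrap a batch run: call startProgress() before the first company, log freely during processing, and call stopProgress() after the last one so that warnings and errors appear together beneath the completed bar.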

Shared Infrastructure & Monorepo

NPM Workspaces: The entire project is managed as a monorepo using NPM workspaces. This simplifies the management of the different applications (crawler, doc-scanner, cli) and shared dependencies, enabling consistent development practices across all components.
Shared Modules: Common infrastructure is centralized under shared/ including:
- Configuration Management: Centralized environment-aware configuration loading and validation
- Logging System: Module-specific loggers with automatic progress-aware buffering
- Progress Management: Real-time progress bars with integrated log buffering
- Error Handling: Structured error classes for consistent error management across applications
Browser Automation: playwright is used for browser automation within the crawler.
Development Tools: Unified error handling, logging, and progress tracking across all applications.
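To make the idea of structured error classes concrete, here is a minimal sketch. The class names, error codes, and the GEMINI_API_KEY variable name are all assumptions chosen for illustration; the actual classes under shared/ may differ.

```javascript
// Hypothetical error hierarchy; names and codes are illustrative only.
class GovDocError extends Error {
  constructor(message, { code = "GOVDOC_ERROR", details = {} } = {}) {
    super(message);
    // new.target.name yields the subclass name, e.g. "ConfigError".
    this.name = new.target.name;
    this.code = code;
    this.details = details;
  }

  // A plain-object form that a logger can serialize consistently.
  toJSON() {
    return {
      name: this.name,
      code: this.code,
      message: this.message,
      details: this.details,
    };
  }
}

class ConfigError extends GovDocError {
  constructor(message, details) {
    super(message, { code: "CONFIG_INVALID", details });
  }
}

// Example validation step: fails fast with a structured error when the
// (assumed) API key variable is missing from the environment object.
function validateConfig(env) {
  if (!env.GEMINI_API_KEY) {
    throw new ConfigError("Missing API key", { variable: "GEMINI_API_KEY" });
  }
  return { apiKey: env.GEMINI_API_KEY };
}
```

With a shared base class, each application can catch GovDocError once at its top level, branch on the code field to decide whether to retry, skip, or abort, and pass toJSON() output to the logger for uniform records.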

Future Roadmap & TODOs

The following enhancements are planned for the GovDoc Scanner project as part of ongoing development:

Infrastructure & Deployment

#18 - Identify Cloud Hosting Strategy and Provide Target Architecture Diagram
- Design scalable cloud infrastructure for production deployment
- Create architectural diagrams for system components
- Define hosting requirements and cost estimates

Backend Development

#13 - Store Company Metadata Output in OpenSearch Database
- Implement OpenSearch integration for structured data storage
- Design efficient indexing strategies for company metadata
- Enable full-text search capabilities across company data
#14 - Implement REST API Server for Company Metadata Querying and Access Management
- Build RESTful API endpoints for data access
- Implement authentication and authorization
- Create rate limiting and usage monitoring
#15 - Implement Backoffice for User, Throttling, Whitelist, and IP Management
- Develop administrative interface for user management
- Create tools for API throttling and access control
- Build monitoring dashboards for system health

Advanced Features

#16 - Implement Crawling Feature for Admin from Backoffice
- Enable administrative control of crawling operations
- Schedule automated crawling tasks
- Provide crawling status and progress monitoring
#17 - Implement MCP Server and Showcase
- Develop Model Context Protocol (MCP) server integration
- Create showcases demonstrating advanced AI capabilities
- Enable seamless AI model switching and optimization

Documentation & Outreach

#21 - Finalize Documentation, GSoC Delivery URL/Page, and README
- Complete comprehensive project documentation
- Create project delivery pages and demos
- Update README with latest features and usage instructions
#20 - Make Social Buzz (Medium Article, LinkedIn Post, FB Post)
- Write technical articles about the project
- Create social media content for project promotion
- Share project achievements and use cases

All issues are tracked on GitHub and contributions are welcome from the community.

Repository

The repository for this project can be found here.

Quick Start

Development: Set up the development environment as described in the Development guide.
Getting Started: Basic usage instructions are available in the Getting Started guide.
