GSoC 2025 Project: GovDoc Scanner
Mentors: Giannis E. Skitsas, Vasilis Christopoulos
Assignee: Eftihis Drakakis
The Problem
In Greece, essential public company data exists in thousands of unstructured documents across the Γ.Ε.ΜΗ. (GEMI) portal. This creates significant barriers for:
- Citizens seeking transparency in corporate activities
- Researchers analyzing business trends and economic patterns
- Policymakers requiring data-driven insights for legislation
- Journalists investigating corporate structures and ownership
The current format limits transparency and makes systematic analysis nearly impossible.
The Solution
GovDoc Scanner is an open-source tool designed to convert unstructured GEMI portal PDFs into a fully searchable database accessible via a REST API. The GSoC 2025 project successfully delivered a complete implementation that automates the entire document processing pipeline with AI-powered extraction and production-ready infrastructure.
Core Components Delivered
- Automated Document Crawling: Robust
Playwrightpowered crawler with advanced company search filters, intelligent date extraction, and organized document downloads from the GEMI portal - AI-Powered Document Processing: Advanced pipeline using
Google Gemini 2.5 Flash Litefor metadata extraction, chronological processing, and specialized Greek legal document analysis - Unified CLI Orchestration: Complete
command-line interfacewith interactive prompts, batch processing, progress tracking, and comprehensive error handling that combines all workflow components - Production Infrastructure: Full
OpenSearch integrationwith production-readyREST API serverfeaturing authentication, rate limiting, and comprehensive Swagger documentation for scalable data access - Comprehensive Documentation: Complete
Docusaurus-powered documentation sitewith installation guides, code examples, development instructions, and API references
Development Progress Through Pull Requests
The project was developed through a series of iterative pull requests, each adding significant functionality and improvements:
Core Infrastructure & Setup (PRs #3, #6, #7, #8)
- PR #3: Implementation of a batch document processing pipeline for metadata extraction and contextual history generation using Gemini
- PR #6: Pipeline for company search and document download
- PR #7: Orchestration script for crawling and pdf scanning
- PR #8: Updated README and Nx structure - moved .env file to root
- PR #26: App updates - Enhanced functionality across all applications
- PR #27: CLI Tool - Removal of NX - Streamlined the CLI tool and removed NX dependency for better maintainability
Advanced Features & Bug Fixes (PRs #30, #31)
- PR #30: Tracked changes feature - Implemented change tracking capabilities for better workflow management
- PR #31: Handle fenced JSON responses and enforce raw JSON output - Fixed JSON parsing issues and improved data consistency
- PR #44: Metadata enhancements - Enhanced metadata extraction with separate tracking for company changes and economic changes, new fields for financial data, and improved representative ownership tracking with capital amounts and percentages
- PR #47: Real-Time Logging in Debug Mode - Enhanced debugging capabilities with real-time log output when LOG_LEVEL=debug in CLI application, disabling progress bars for better development experience
Infrastructure & Error Handling (PRs #32, #33, #34)
- PR #32: Structured Errors, API Key Validation & UX Cleanup - Implemented comprehensive error handling and improved user experience
- PR #33: Config Management - Added centralized configuration management system
- PR #34: Unified Error Handling & Logging System - Advanced error handling and logging infrastructure
- PR #45: Structural Refactor - Project restructuring with migration of core applications to
apps/directory and introduction of configurableWORKING_DIRenvironment variable for unified output management
Production Infrastructure & REST API (PRs #41, #43)
- PR #41: Production OpenSearch Setup - Complete production-ready OpenSearch deployment with security, monitoring, and backup features including one-command setup, SSL certificates, user management, and health monitoring
- PR #43: Rest API Setup - Production-ready REST API service with Fastify server, OpenSearch integration, Swagger documentation, authentication, rate limiting, and containerized deployment
Documentation & Project Management (PRs #9, #10, #35)
- PR #9: Docusaurus Setup - Established comprehensive documentation infrastructure
- PR #10: GitHub Pages deployment for documentation - Made documentation publicly accessible
- PR #35: Docs Updates - Comprehensive documentation updates and improvements
- PR #46: Final GSoC Doc Updates
UI/UX & Visual Improvements (PR #29)
- PR #29: Logo and homepage images - Added branding and visual elements to improve user experience
Technical Implementation
Architecture Overview
The project follows a monorepo structure with NPM workspaces managing five specialized applications under the apps/ directory, supported by shared infrastructure libraries:
Application Structure:
apps/
├── crawler/ # Playwright automation for GEMI portal
├── doc-scanner/ # AI document processing pipeline
├── cli/ # Unified command-line interface
├── api/ # Fastify REST API server
└── opensearch/ # Search infrastructure setup
Technical Innovation
AI-Powered Greek Legal Processing:
- Custom prompts engineered for Greek corporate document structures (ΦΕΚ, ΓΕΜΗ formats)
- Chronological document sorting with intelligent date extraction from multiple formats
- Representative role identification using Greek legal terminology and ownership patterns
- Change detection comparing document versions for company evolution tracking
Production-Ready Infrastructure:
- Zero-downtime deployment: Opensearch and REST API docker containerization with health checks and graceful shutdowns
- Security first: OpenSearch with TLS encryption, RBAC user management, and dedicated service accounts
- Scalable data pipeline: Batch processing with configurable chunk sizes and retry mechanisms
- Monitoring & observability: Health check endpoints, backup automation, and cluster monitoring
Future Development Roadmap
Planned Enhancements:
Production Infrastructure
- Cloud Deployment Strategy: Scalable cloud architecture with cost optimization and monitoring
- Administrative Backoffice: User management, API throttling, access control, and system health dashboards
- Automated Crawling: Scheduled crawling operations with administrative control and progress monitoring
Advanced AI Integration
- Model Context Protocol (MCP) Server: Enhanced AI capabilities with model switching and optimization
- Advanced Analytics: Pattern recognition and trend analysis across corporate data
All development is tracked on GitHub Issues with community contributions welcome.
Getting Started
Quick Links:
- Complete Documentation - Comprehensive guides and API references
- Getting Started Guide - Basic usage and setup instructions
Repository: github.com/flexivian/govdoc-scanner