What is GovDoc Scanner?
The Problem
In Greece, essential public company data exists in thousands of unstructured documents across the Γ.Ε.ΜΗ. (GEMI) portal. This creates significant barriers for:
- Citizens seeking transparency in corporate activities
- Researchers analyzing business trends and economic patterns
- Policymakers requiring data-driven insights for legislation
- Journalists investigating corporate structures and ownership
The current format limits transparency and makes systematic analysis nearly impossible.
The Solution
GovDoc Scanner is an open-source tool designed to convert unstructured GEMI portal PDFs into a fully searchable database accessible via a REST API. It automates the complete document processing pipeline with AI-powered extraction and production-ready infrastructure:
- Smart Crawling: Automated document discovery and download from the GEMI portal with advanced filtering
- AI Extraction: Google Gemini 2.5 Flash processes Greek legal documents with specialized prompts
- Structured Data: Comprehensive metadata extraction including representatives, ownership, and change tracking
- Full-Text Search: OpenSearch integration with Greek language analyzers for powerful querying
- REST API: Production-ready server with authentication, rate limiting, and comprehensive documentation
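To make the full-text search item above more concrete, here is a minimal sketch of an index definition and query body that rely on OpenSearch's built-in `greek` language analyzer. The field names (`company_name`, `document_text`, `gemi_id`) and both helper functions are assumptions made for illustration; they are not taken from the project's actual schema.

```typescript
// Hypothetical index definition: field names are illustrative only,
// not the project's real mapping.
interface FieldMapping {
  type: string;
  analyzer?: string;
}

interface IndexDefinition {
  mappings: { properties: Record<string, FieldMapping> };
}

function buildCompanyIndex(): IndexDefinition {
  return {
    mappings: {
      properties: {
        // "greek" is a built-in OpenSearch language analyzer
        // (Greek stop-word removal and stemming).
        company_name: { type: "text", analyzer: "greek" },
        document_text: { type: "text", analyzer: "greek" },
        // Identifier matched exactly, so no text analysis is applied.
        gemi_id: { type: "keyword" },
      },
    },
  };
}

// Build a full-text query body that searches both analyzed fields.
function buildSearchQuery(text: string, size: number = 10) {
  return {
    size,
    query: {
      multi_match: {
        query: text,
        fields: ["company_name", "document_text"],
      },
    },
  };
}

console.log(JSON.stringify(buildCompanyIndex(), null, 2));
console.log(JSON.stringify(buildSearchQuery("ανώνυμη εταιρεία"), null, 2));
```

The first object would be sent as the body of an index-creation request and the second as the body of a search request. Keeping them as plain objects makes the sketch independent of any particular client library.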
Architecture & Components
GovDoc Scanner follows a modern monorepo architecture with five specialized applications:
Core Applications
- cli: A unified command-line interface that orchestrates the complete workflow, combining crawling and scanning with interactive prompts and automated batch processing (recommended for most users).
- doc-scanner: Processes .pdf, .doc and .docx documents for a given GEMI company, extracting comprehensive metadata with chronological processing and intelligent representative tracking using Gemini 2.5 Flash Lite.
- crawler: Scrapes the GEMI portal to search for companies using advanced filters and downloads all available public documents with enhanced date extraction, intelligent file management, and robust retry mechanisms.
- api: Fastify-based REST API server providing search endpoints for companies and representatives with OpenSearch integration.
- opensearch: Complete OpenSearch integration with development and production configurations for searchable data indexing.
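The "chronological processing" and "representative tracking" mentioned for doc-scanner can be pictured as replaying a company's documents oldest-first and folding each change into a running set of active representatives. The sketch below shows that idea only. The `ScannedDocument` and `RepresentativeChange` shapes, the `currentRepresentatives` function, and the sample names are all hypothetical, and the real extraction schema may well differ.

```typescript
// Hypothetical shapes for illustration; not the project's actual schema.
interface RepresentativeChange {
  name: string;
  role: string;
  action: "appointed" | "removed";
}

interface ScannedDocument {
  date: string; // ISO 8601 date, e.g. "2021-03-15"
  changes: RepresentativeChange[];
}

// Replay documents oldest-first so that later filings override earlier ones.
function currentRepresentatives(docs: ScannedDocument[]): Map<string, string> {
  const active = new Map<string, string>(); // name -> role
  // Copy before sorting so the caller's array is not mutated.
  // ISO dates sort correctly as plain strings.
  const ordered = docs
    .slice()
    .sort((a, b) => (a.date < b.date ? -1 : a.date > b.date ? 1 : 0));
  for (const doc of ordered) {
    for (const change of doc.changes) {
      if (change.action === "appointed") {
        active.set(change.name, change.role);
      } else {
        active.delete(change.name);
      }
    }
  }
  return active;
}

// Sample input deliberately listed out of order.
const docs: ScannedDocument[] = [
  {
    date: "2023-06-01",
    changes: [{ name: "A. Papadopoulos", role: "CEO", action: "removed" }],
  },
  {
    date: "2020-01-10",
    changes: [
      { name: "A. Papadopoulos", role: "CEO", action: "appointed" },
      { name: "M. Georgiou", role: "Board Member", action: "appointed" },
    ],
  },
];

console.log(Array.from(currentRepresentatives(docs).entries()));
// prints [ [ 'M. Georgiou', 'Board Member' ] ]
```

Sorting before replaying is what makes the result independent of download order: the 2023 removal is applied after the 2020 appointment even though it appears first in the input.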
Technology Stack
Core Technologies:
- Node.js v20+: Modern JavaScript runtime with ES modules
- Google Gemini 2.5 Flash: Specialized AI for Greek legal document processing
- OpenSearch 3.1+: Full-text search with Greek language analyzers
- Fastify: High-performance web framework with built-in validation
- Playwright: Robust web automation for document crawling
- Docker: Containerized deployment and development environments
Development & Operations:
- NPM Workspaces: Monorepo management for multiple applications
- ESLint & Prettier: Code quality and formatting standards
- Docusaurus: Documentation site with live reloading
- GitHub Actions: Automated testing and deployment workflows
Getting Started
Ready to process Greek company documents? Go ahead:
Something Missing?
If something is missing from the documentation, or if you found a part of it confusing, please click "Edit this page" and open a PR with your improvement. We love contributions!