OpenSearch Integration

This guide shows how to set up and use OpenSearch 3.1+ with govdoc-scanner for searchable company data indexing.

Overview

The apps/opensearch/ directory provides complete OpenSearch integration with:

  • Development environment: Quick local setup for testing
  • Production environment: Secure, scalable deployment with authentication
  • Index templates: Pre-configured mappings for company data
  • Dashboard setup: Ready-to-use visualizations and index patterns

Prerequisites

  • Docker and Docker Compose
  • Node.js 18+ (20+ recommended)

Development Setup

For local development and testing:

1. Start Development Cluster

cd apps/opensearch/development
cp .env.template .env
# Edit .env and set a strong admin password (8+ characters; OpenSearch may also require mixed case, a digit, and a symbol)
docker compose up -d
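
The cluster usually needs some time after `docker compose up -d` before it accepts requests. The following is a minimal sketch, assuming Node.js 18+ with its built-in `fetch`, that polls the standard `_cluster/health` endpoint until the cluster is usable. The helper names and retry values are illustrative and not part of govdoc-scanner. Node will likely reject the development cluster's self-signed certificate unless verification is relaxed, for example by running with `NODE_TLS_REJECT_UNAUTHORIZED=0` (development only).

```javascript
// Hypothetical helpers; only the _cluster/health endpoint is standard OpenSearch.

// Build an HTTP Basic auth header value from the credentials in .env.
function basicAuth(username, password) {
  return "Basic " + Buffer.from(`${username}:${password}`).toString("base64");
}

// A single-node cluster commonly reports "yellow" (replicas cannot be assigned),
// so both "green" and "yellow" are treated as ready here.
function isReady(health) {
  return health.status === "green" || health.status === "yellow";
}

// Poll until the cluster is ready or the attempts run out.
// fetchImpl is injectable so the function can be exercised without a live cluster.
async function waitForCluster(baseUrl, authHeader, options = {}) {
  const { attempts = 30, delayMs = 2000, fetchImpl = fetch } = options;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchImpl(`${baseUrl}/_cluster/health`, {
        headers: { Authorization: authHeader },
      });
      if (res.ok && isReady(await res.json())) return true;
    } catch {
      // Connection and TLS errors are expected while the node is still booting.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

A call such as `waitForCluster("https://localhost:9200", basicAuth("admin", "yourAdminPassword"))` resolves to `true` once the cluster answers, or `false` after the attempts are exhausted.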

2. Configure Application

Update your root .env file:

OPENSEARCH_PUSH=true
OPENSEARCH_URL=https://localhost:9200
OPENSEARCH_USERNAME=admin
OPENSEARCH_PASSWORD=yourAdminPassword
OPENSEARCH_INSECURE=true
OPENSEARCH_INDEX=govdoc-companies-000001
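
A missing or mistyped variable is easier to catch before the first push. The sketch below validates the settings listed above; it is only an illustration, and govdoc-scanner's own configuration loader may behave differently.

```javascript
// Illustrative only: the variable names come from the root .env shown above,
// the function and the returned object shape are hypothetical.
function readOpenSearchConfig(env) {
  const required = [
    "OPENSEARCH_URL",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
    "OPENSEARCH_INDEX",
  ];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing OpenSearch settings: ${missing.join(", ")}`);
  }
  const url = new URL(env.OPENSEARCH_URL); // throws on a malformed URL
  return {
    push: env.OPENSEARCH_PUSH === "true",
    url: url.origin,
    username: env.OPENSEARCH_USERNAME,
    password: env.OPENSEARCH_PASSWORD,
    index: env.OPENSEARCH_INDEX,
    // Skipping certificate verification is only reasonable against a local dev cluster.
    insecure: env.OPENSEARCH_INSECURE === "true",
  };
}
```

In a script, `readOpenSearchConfig(process.env)` would be called after the .env file has been loaded into the environment.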

3. Create Index Template

# Run from the repository root so the relative template path resolves
curl -k -u admin:yourAdminPassword -X PUT "https://localhost:9200/_index_template/govdoc-company-template" \
  -H "Content-Type: application/json" \
  -d @apps/opensearch/shared/templates/company-index-template.json

4. Create Initial Index

curl -k -u admin:yourAdminPassword -X PUT "https://localhost:9200/govdoc-companies-000001"

Verify setup:

# Check if template was created
curl -k -u admin:yourAdminPassword "https://localhost:9200/_index_template/govdoc-company-template?pretty"

# Check index mappings
curl -k -u admin:yourAdminPassword "https://localhost:9200/govdoc-companies-000001/_mapping?pretty"

5. Test Data Ingestion

npm start govdoc -- --input ./companies.gds --push
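
For orientation, this is roughly what a request to OpenSearch's standard `_bulk` API looks like. The sketch is not govdoc-scanner's actual implementation: the function name, the sample document, and the use of gemi_id as the document `_id` are assumptions, chosen so that re-ingesting a company replaces its existing document.

```javascript
// Sketch only: document shape and _id choice are assumptions, not confirmed tool behavior.
function buildBulkBody(index, companies) {
  const lines = [];
  for (const company of companies) {
    // Action line: "index" creates the document or replaces one with the same _id.
    lines.push(JSON.stringify({ index: { _index: index, _id: company.gemi_id } }));
    // Source line: the document itself.
    lines.push(JSON.stringify(company));
  }
  // The _bulk API requires the body to end with a newline.
  return lines.join("\n") + "\n";
}

const body = buildBulkBody("govdoc-companies-000001", [
  { gemi_id: "000000000000", company_name: "EXAMPLE A.E." }, // made-up sample values
]);
// The body would be sent as POST <OPENSEARCH_URL>/_bulk
// with the header Content-Type: application/x-ndjson.
```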

Access Dashboards

Open OpenSearch Dashboards (port 5601 by default) and create the index pattern manually:

  1. Go to Discover -> Create Index Pattern
  2. Create pattern: govdoc-companies-*
  3. Set time field: scan_date
  4. Explore data in Discover tab

Shut down the development cluster (containers are removed; data volumes are kept):

cd apps/opensearch/development
docker compose down

Reset the development environment (also deletes the data volumes):

cd apps/opensearch/development
docker compose down --volumes --remove-orphans

Production Setup

For production deployments with security and monitoring:

Quick Setup

cd apps/opensearch/production
./setup-production.sh

This automatically:

  1. Generates secure passwords and certificates
  2. Creates security configuration (users, roles, mappings)
  3. Starts production OpenSearch cluster
  4. Initializes security configuration with proper authentication
  5. Creates test data to verify bulk operations work

Important: After setup completes, passwords are stored in apps/opensearch/production/.env. Copy the govdoc_ingest password to your root .env file.

Manual Setup

For step-by-step control:

cd apps/opensearch/production
# Step 1: Run security setup (creates .env file automatically)
./scripts/setup-security.sh
# Step 2: Start production cluster
docker compose -f docker-compose.prod.yml up -d
# Step 3: Initialize security configuration (loads YAML files into OpenSearch)
./scripts/initialize-security.sh
# Step 4: Initialize indices and templates
./scripts/initialize-cluster.sh
# Step 5: Setup dashboards
./scripts/setup-dashboards.sh

Security Note: Production uses a dedicated govdoc_ingest user with minimal permissions (bulk write access to govdoc-companies-* indices only). Admin credentials are separate and should be stored securely.

Configure your application by copying the generated values from apps/opensearch/production/.env into your root .env file:

OPENSEARCH_URL=https://localhost:9200
OPENSEARCH_USERNAME=govdoc_ingest
OPENSEARCH_PASSWORD=govdoc_ingest_password
OPENSEARCH_INDEX=govdoc-companies-write
OPENSEARCH_BATCH_SIZE=500
# Set OPENSEARCH_INSECURE to false when using proper certificates
OPENSEARCH_INSECURE=true
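
OPENSEARCH_BATCH_SIZE presumably controls how many documents go into each bulk request. The sketch below shows the general batching idea together with a check that is easy to miss: a `_bulk` call can return HTTP 200 even when individual items failed. Both helper names are hypothetical; the response fields (`errors`, `items`, `error.reason`) follow the standard bulk response format.

```javascript
// Split documents into batches of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Collect per-item failures from a parsed _bulk response body.
function failedItems(bulkResponse) {
  if (!bulkResponse.errors) return [];
  return bulkResponse.items
    .map((item) => Object.values(item)[0]) // each item is keyed by its action, e.g. { index: {...} }
    .filter((result) => result.error)
    .map((result) => ({
      id: result._id,
      status: result.status,
      reason: result.error.reason,
    }));
}
```

With a batch size of 500, 1,200 documents would be sent as three requests of 500, 500, and 200 documents.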

Shut down the production cluster:

cd apps/opensearch/production
docker compose -f docker-compose.prod.yml down

Reset production environment:

cd apps/opensearch/production
./cleanup-production.sh

Production Maintenance

Health Monitoring

Check cluster health and status:

cd apps/opensearch/production
./scripts/health-check.sh

This script monitors:

  • Cluster status (green/yellow/red) with shard distribution
  • Node health, heap memory usage, and JVM statistics
  • Index statistics and document counts for govdoc-companies-* indices
  • Disk usage (both container and host)
  • Recent snapshot status and backup health
  • Security configuration (HTTPS and authentication status)
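
For custom alerting, similar checks can be derived from OpenSearch's standard `_cluster/health` and `_nodes/stats/jvm` responses. This is a sketch, not a description of what health-check.sh does internally; the exit-code convention and the 85% threshold are assumptions.

```javascript
// Map cluster status to a monitoring-style exit code (assumed convention:
// 0 = ok, 1 = warning, 2 = critical; unknown values count as critical).
function statusToExitCode(status) {
  const codes = { green: 0, yellow: 1, red: 2 };
  return Object.hasOwn(codes, status) ? codes[status] : 2;
}

// List nodes whose JVM heap usage is at or above the threshold.
// Expects the parsed body returned by GET _nodes/stats/jvm.
function nodesOverHeapThreshold(nodesStats, thresholdPercent = 85) {
  return Object.values(nodesStats.nodes)
    .filter((node) => node.jvm.mem.heap_used_percent >= thresholdPercent)
    .map((node) => ({
      name: node.name,
      heapUsedPercent: node.jvm.mem.heap_used_percent,
    }));
}
```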

Data Backup

Create backups of your data:

cd apps/opensearch/production
./scripts/backup.sh

Features:

  • Creates timestamped snapshots (govdoc-daily-YYYYMMDD-HHMMSS)
  • Backs up govdoc-companies-* indices with metadata
  • Automatic cleanup of old snapshots (30-day retention)
  • Repository verification and integrity checks
  • Progress monitoring and detailed reporting
  • Support for --list-only, --cleanup-only, --verify-only options
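
The naming scheme and the 30-day retention rule can be illustrated as follows. This is a sketch rather than the logic of backup.sh: the function names are hypothetical, and the use of UTC is an assumption (the script may use the host's local time zone).

```javascript
// Format a Date as govdoc-daily-YYYYMMDD-HHMMSS (UTC assumed).
function snapshotName(date) {
  const stamp = date
    .toISOString()          // e.g. 2025-01-15T03:04:05.000Z
    .slice(0, 19)           // drop milliseconds and the trailing Z
    .replace(/[-:]/g, "")   // 20250115T030405
    .replace("T", "-");     // 20250115-030405
  return `govdoc-daily-${stamp}`;
}

// Parse the timestamp back out of a snapshot name; returns null for other names.
function snapshotDate(name) {
  const m = /^govdoc-daily-(\d{4})(\d{2})(\d{2})-(\d{2})(\d{2})(\d{2})$/.exec(name);
  if (!m) return null;
  const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Select snapshots older than the retention window.
function expiredSnapshots(names, now, retentionDays = 30) {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return names.filter((name) => {
    const taken = snapshotDate(name);
    return taken !== null && taken.getTime() < cutoff;
  });
}
```

Names that do not match the pattern are never selected, so manually created snapshots would be left alone by this kind of cleanup.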

Access Production Dashboards

Index patterns are automatically created. You can:

  • Explore data in Discover
  • Create visualizations in Visualize
  • Build dashboards in Dashboard
  • Monitor health in Stack Management (labeled Dashboards Management in recent OpenSearch Dashboards releases)

Troubleshooting

OpenSearch Production Startup Issues

If OpenSearch production fails to start, check the logs first:

cd apps/opensearch/production
docker compose -f docker-compose.prod.yml logs opensearch

Common Issues:

  1. Insufficient Memory: The most common issue is insufficient RAM allocation

    • Solution: Reduce memory settings from 4GB to 2GB in docker-compose.prod.yml:

        environment:
          - "OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g"

  2. Port Conflicts: Ports 9200 or 5601 already in use

    • Check: sudo netstat -tulpn | grep :9200
    • Solution: Stop conflicting services or change ports in docker-compose.prod.yml

  3. Permission Issues: Container cannot write to mounted volumes

    • Solution: Fix ownership: sudo chown -R 1000:1000 apps/opensearch/production/

Environment Differences

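| Feature      | Development         | Production                 |
|--------------|---------------------|----------------------------|
| Memory       | 512MB heap          | 4GB heap                   |
| Security     | Basic auth          | Full TLS + RBAC            |
| Persistence  | Docker volumes      | Named volumes + backup     |
| Monitoring   | Basic health checks | Health checks + monitoring |
| Certificates | Auto-generated      | Demo certs                 |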

CLI Integration

Interactive Mode

npm start govdoc
# Automatically pushes if OPENSEARCH_PUSH=true

Command Mode with Flags

# Development
npm start govdoc -- --input ./companies.gds \
--push \
--os.endpoint https://localhost:9200 \
--os.username admin \
--os.password yourDevPassword \
--os.index govdoc-companies-000001 \
--os.insecure \
--os.batch-size 500

# Production
npm start govdoc -- --input ./companies.gds \
--push \
--os.endpoint https://localhost:9200 \
--os.username govdoc_ingest \
--os.password yourProdPassword \
--os.index govdoc-companies-write \
--os.insecure \
--os.batch-size 500
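
When the command is launched from a script, passing the arguments as an array avoids shell quoting problems with passwords that contain special characters. A minimal sketch: the flag names are taken from the commands above, while the function and the shape of the `os` object are hypothetical.

```javascript
// Build the argument list for `npm start govdoc -- ...` from a settings object.
function buildPushArgs(inputPath, os) {
  const args = [
    "start", "govdoc", "--",
    "--input", inputPath,
    "--push",
    "--os.endpoint", os.endpoint,
    "--os.username", os.username,
    "--os.password", os.password,
    "--os.index", os.index,
    "--os.batch-size", String(os.batchSize ?? 500),
  ];
  // Only add the flag for clusters with self-signed or demo certificates.
  if (os.insecure) args.push("--os.insecure");
  return args;
}
```

The result could be handed to Node's `child_process.spawn("npm", args)`. Command-line arguments are visible in the process list, so supplying the password through the .env file is the safer option on shared machines.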

Data Model

The index template (apps/opensearch/shared/templates/company-index-template.json) defines:

  • Index pattern: govdoc-companies-*
  • Dynamic mapping: false (unknown fields are kept in _source but not indexed or searchable; only dynamic: strict rejects them)
  • Document structure: One document per company (gemi_id)

Key Fields:

  • gemi_id, company_tax_id (keyword)
  • company_name (text + keyword subfield)
  • creation_date, scan_date, document_date (date)
  • representatives (nested array)
  • tracked_changes_history (nested array with company_changes, economic_changes per document)
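
To make the field list more concrete, here is an approximation of how such a mapping is expressed in a composable index template. It is a sketch built only from the fields named above and in the query examples; the real company-index-template.json is authoritative and likely contains more fields and settings. The `keyword` subfield name and the contents of the nested objects are assumptions.

```javascript
// Illustrative approximation, not a copy of company-index-template.json.
const companyTemplateSketch = {
  index_patterns: ["govdoc-companies-*"],
  template: {
    mappings: {
      dynamic: false,
      properties: {
        gemi_id: { type: "keyword" },
        company_tax_id: { type: "keyword" },
        // Full-text search on the name, exact matching and sorting on the subfield.
        company_name: { type: "text", fields: { keyword: { type: "keyword" } } },
        creation_date: { type: "date" },
        scan_date: { type: "date" },
        document_date: { type: "date" },
        // "nested" keeps each representative's fields together when querying.
        representatives: {
          type: "nested",
          properties: {
            name: { type: "text" },
            is_active: { type: "boolean" },
          },
        },
        tracked_changes_history: { type: "nested" },
      },
    },
  },
};

console.log(JSON.stringify(companyTemplateSketch, null, 2));
```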

Query Examples

Search by company name:

curl -k -u admin:yourPassword -X POST "https://localhost:9200/govdoc-companies-000001/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "match": { "company_name": "ΤΕΧΝΙΚΗ" }
    }
  }'

Filter by region and aggregate cities:

curl -k -u admin:yourPassword -X POST "https://localhost:9200/govdoc-companies-000001/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "size": 0,
    "query": {
      "term": { "region": "ΑΤΤΙΚΗΣ" }
    },
    "aggs": {
      "cities": { "terms": { "field": "city" } }
    }
  }'
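
With `"size": 0` the response carries no hits, only the aggregation result under `aggregations.cities.buckets`. A small sketch of reading it from the parsed response (the function name is hypothetical):

```javascript
// Convert the "cities" terms aggregation of a _search response into [city, count] pairs.
function cityCounts(searchResponse) {
  const buckets = searchResponse.aggregations?.cities?.buckets ?? [];
  return buckets.map((bucket) => [bucket.key, bucket.doc_count]);
}
```

A terms aggregation returns only the top 10 buckets by default; adding `"size"` inside the `terms` object raises that limit.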

Find active representatives:

curl -k -u admin:yourPassword -X POST "https://localhost:9200/govdoc-companies-000001/_search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "nested": {
        "path": "representatives",
        "query": {
          "bool": {
            "must": [
              { "match": { "representatives.name": "ΓΕΩΡΓΙΟΣ" } },
              { "term": { "representatives.is_active": true } }
            ]
          }
        }
      }
    }
  }'
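
The nested query matters here because both conditions must hold for the same representative; without it, a company with an inactive ΓΕΩΡΓΙΟΣ and some other active representative could match. The sketch below builds the same request body in Node.js and adds the standard `inner_hits` option so the response shows which representatives matched. The function name is hypothetical.

```javascript
// Build the request body for "companies with an active representative named <name>".
function activeRepresentativeQuery(name) {
  return {
    query: {
      nested: {
        path: "representatives",
        query: {
          bool: {
            must: [
              { match: { "representatives.name": name } },
              { term: { "representatives.is_active": true } },
            ],
          },
        },
        // Ask OpenSearch to return the matching nested entries for each hit.
        inner_hits: {},
      },
    },
  };
}

const requestBody = JSON.stringify(activeRepresentativeQuery("ΓΕΩΡΓΙΟΣ"));
// requestBody would be sent as POST <OPENSEARCH_URL>/<index>/_search
// with the header Content-Type: application/json.
```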

Directory Structure

apps/opensearch/
├── README.md                            # Quick start guide
├── development/                         # Development environment
│   ├── docker-compose.yml               # Dev Docker Compose
│   └── .env.template                    # Environment template
├── production/                          # Production environment
│   ├── docker-compose.prod.yml          # Production Docker Compose
│   ├── setup-production.sh              # One-click setup script
│   ├── cleanup-production.sh            # Reset script
│   ├── config/                          # OpenSearch configuration
│   └── scripts/                         # Setup automation scripts
└── shared/                              # Shared resources
    └── templates/                       # Index templates
        └── company-index-template.json  # Company data mapping