# Crawler Examples
The crawler application searches the GEMI portal for companies and downloads their documents.
## Basic Usage

### Interactive Search
```bash
npm start crawler
# Select: "Search for companies"
# Enter company name: "ALPHA BANK"
# Apply filters as needed
# Results saved to apps/crawler/src/ids.txt
```
### Direct Download by GEMI ID
```bash
npm start crawler
# Select: "Download documents by GEMI ID"
# Option 1: Enter a single ID: 123204604000
# Option 2: Provide a file path with multiple IDs
```
## Programmatic Usage

### Download Documents for Specific IDs
```javascript
// Run from a script in the repo root
import { runCrawlerForGemiIds } from "./apps/crawler/src/id_crawler.mjs";
import { createLogger } from "./shared/logging/index.mjs";
import { validateConfig, validateApiKey } from "./shared/config/validator.mjs";

const logger = createLogger("CRAWLER-SCRIPT");

async function downloadDocuments(gemiIds, outputDir) {
  try {
    // Validate configuration before touching the network
    validateConfig();

    const apiResult = await validateApiKey();
    if (!apiResult.ok) {
      logger.error(`API validation failed: ${apiResult.reason}`);
      throw new Error(`API validation failed: ${apiResult.reason}`);
    }

    logger.info(`Starting download for ${gemiIds.length} companies`);
    const results = await runCrawlerForGemiIds(gemiIds, outputDir);
    logger.info("Download completed successfully");
    return results;
  } catch (error) {
    logger.error("Download failed", error);
    throw error;
  }
}

// Usage (top-level await is fine in an .mjs module)
const gemiIds = ["123204604000", "144340502000"];
const results = await downloadDocuments(gemiIds, "./downloads");
```
### Batch Processing with File Input
```bash
# Create input file
echo "123204604000" > company-ids.txt
echo "144340502000" >> company-ids.txt

# Run crawler with file
npm start crawler
# Select file option and provide path
```
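If you prefer to skip the interactive prompts, the same file-driven flow can be scripted. A minimal sketch, assuming a newline-delimited ID file and reusing `runCrawlerForGemiIds` from the section above:

```javascript
import { readFile } from "node:fs/promises";
import { runCrawlerForGemiIds } from "./apps/crawler/src/id_crawler.mjs";

// Read a newline-delimited ID file, skipping blank lines and comments
const raw = await readFile("company-ids.txt", "utf8");
const gemiIds = raw
  .split("\n")
  .map((line) => line.trim())
  .filter((line) => line.length > 0 && !line.startsWith("#"));

await runCrawlerForGemiIds(gemiIds, "./downloads");
```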
## Output Structure
Downloaded documents are organized with date prefixes for chronological processing:
```text
apps/crawler/src/downloads/
├── 123204604000/
│   └── document_downloads/
│       ├── 2019-09-23_90189.pdf
│       ├── 2020-11-03_2334237.pdf
│       └── 2021-12-13_2747556.pdf
└── 144340502000/
    └── document_downloads/
        └── 2019-01-17_77417.pdf
```
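Because file names begin with an ISO date (`YYYY-MM-DD`), a plain lexicographic sort is also a chronological sort. A small sketch (the helper name is hypothetical):

```javascript
import { readdir } from "node:fs/promises";
import path from "node:path";

// List a company's documents in chronological order.
// The YYYY-MM-DD_ prefix means a lexicographic sort is also a date sort.
async function listDocumentsChronologically(gemiId) {
  const dir = path.join("apps/crawler/src/downloads", gemiId, "document_downloads");
  const files = await readdir(dir);
  return files.filter((f) => f.endsWith(".pdf")).sort();
}

console.log(await listDocumentsChronologically("123204604000"));
```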
## Search Features

### Company Name Search
- Fuzzy matching with "Did you mean?" suggestions
- Support for Greek and Latin characters
- Filter by legal form, status, and location
### Advanced Filters
- Legal form (AE, EPE, OE, etc.)
- Company status (Active, Inactive, etc.)
- Registration date ranges
- Geographic location
## Common Use Cases

### Research Workflow
```bash
# 1. Search for companies
npm start crawler
# Search for: "TELECOMMUNICATIONS"
# Apply filters: Active companies only

# 2. Review ids.txt results (saved in the crawler src directory)
cat apps/crawler/src/ids.txt

# 3. Download documents for selected IDs
npm start crawler
# Use file option with filtered IDs
```
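Step 2 usually means trimming `ids.txt` down to the companies you care about before downloading. A filtering sketch, assuming the file holds one GEMI ID per line; the hand-picked subset here is illustrative:

```javascript
import { readFile, writeFile } from "node:fs/promises";

// Keep only the IDs you actually want before bulk downloading
const ids = (await readFile("apps/crawler/src/ids.txt", "utf8"))
  .split("\n")
  .map((line) => line.trim())
  .filter(Boolean);

const wanted = new Set(["123204604000", "144340502000"]); // hand-picked subset
const filtered = ids.filter((id) => wanted.has(id));

await writeFile("filtered-ids.txt", filtered.join("\n") + "\n");
```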
### Bulk Download
```bash
# Prepare large ID list
cat > bulk-companies.txt << EOF
123204604000
144340502000
148851015000
152034008000
EOF

# Run bulk download
npm start crawler
# Select file option
# Enter: bulk-companies.txt
```
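For very large lists you may want to drive the download programmatically in chunks, so one failed chunk does not abort the rest. A sketch under that assumption; the chunk size is arbitrary:

```javascript
import { runCrawlerForGemiIds } from "./apps/crawler/src/id_crawler.mjs";

// Process a large ID list in chunks so a single failure
// doesn't abort the whole run. Chunk size is illustrative.
async function downloadInChunks(gemiIds, outputDir, chunkSize = 25) {
  for (let i = 0; i < gemiIds.length; i += chunkSize) {
    const chunk = gemiIds.slice(i, i + chunkSize);
    try {
      await runCrawlerForGemiIds(chunk, outputDir);
    } catch (error) {
      console.error(`Chunk starting at index ${i} failed:`, error);
      // continue with the next chunk
    }
  }
}
```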
## Error Handling
The crawler handles common issues automatically, logging each failure and retry:
```javascript
import { createLogger } from "./shared/logging/index.mjs";
import {
  DocumentDownloadError,
  BrowserAutomationError, // not used in this snippet
} from "./shared/errors/index.mjs";

const logger = createLogger("CRAWLER-ERROR-HANDLER");

// Example error handling in crawler operations.
// `downloadFile` stands in for whatever download helper you use.
async function safeDownload(url, retries = 3) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      logger.debug(`Download attempt ${attempt}/${retries} for ${url}`);
      const result = await downloadFile(url);
      logger.info(`✅ Successfully downloaded ${url}`);
      return result;
    } catch (error) {
      logger.warn(`⚠️ Attempt ${attempt} failed for ${url}: ${error.message}`);
      if (attempt === retries) {
        logger.error(`❌ All ${retries} attempts failed for ${url}`, error);
        throw new DocumentDownloadError(`Failed after ${retries} attempts`, url);
      }
      // Exponential backoff: wait 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
    }
  }
}
```
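A caller can then fan out across many documents without one failure stopping the rest, for example with `Promise.allSettled` (the URLs below are placeholders):

```javascript
const urls = [
  "https://example.invalid/doc1.pdf",
  "https://example.invalid/doc2.pdf",
];

// allSettled keeps both successes and failures for later inspection
const outcomes = await Promise.allSettled(urls.map((url) => safeDownload(url)));
const failed = outcomes.filter((o) => o.status === "rejected").length;
console.log(`${urls.length - failed} succeeded, ${failed} failed`);
```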
Built-in error handling includes:
- Network timeouts, retried automatically
- Rate limiting, handled with exponential backoff
- Invalid document formats, skipped and logged
- Browser crashes, recovered by restarting the browser and tracking the error
- Failed downloads, retried with detailed logging
## Tips
- Use specific company names for better search results
- Check `ids.txt` before bulk downloading
- Monitor disk space during large batch downloads
- Files are automatically named with date prefixes for chronological processing
- Re-running downloads is safe: existing files are skipped automatically (see the sketch below)
- A stable internet connection is recommended for best results
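For the skip-on-rerun behavior, a minimal sketch of what the check can look like; this is illustrative, not the crawler's exact logic:

```javascript
import { existsSync } from "node:fs";
import path from "node:path";

// Skip a document if a file with the same name already exists.
// Illustrative only; the crawler's own check may differ.
function shouldDownload(outputDir, fileName) {
  return !existsSync(path.join(outputDir, fileName));
}

if (shouldDownload("./downloads/123204604000/document_downloads", "2019-09-23_90189.pdf")) {
  // ...download the file...
}
```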