# Crawler Examples

The crawler application searches for companies on the GEMI portal and downloads their documents.

## Basic Usage

```bash
npx nx run crawler:start
# Select: "Search for companies"
# Enter company name: "ALPHA BANK"
# Apply filters as needed
# Results saved to ids.txt
```
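The resulting `ids.txt` can also be consumed from code. A minimal sketch, assuming the file contains one GEMI ID per line (`parseIdsFile` is a hypothetical helper, not part of the crawler):

```javascript
// Hypothetical helper: parse an ids.txt produced by a search run.
// Assumes one GEMI ID per line; blank lines and #-comments are skipped.
import { readFileSync } from 'node:fs';

export function parseIdsFile(path) {
  return readFileSync(path, 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#'));
}
```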

## Direct Download by GEMI ID

```bash
npx nx run crawler:start
# Select: "Download documents by GEMI ID"
# Option 1: Enter single ID: 123204604000
# Option 2: Provide file path with multiple IDs
```

## Programmatic Usage

### Download Documents for Specific IDs

```javascript
import { runCrawlerForGemiIds } from './apps/crawler/src/id_crawler.mjs';

async function downloadDocuments(gemiIds, outputDir) {
  try {
    const results = await runCrawlerForGemiIds(gemiIds, outputDir);
    console.log('Download results:', results);
    return results;
  } catch (error) {
    console.error('Download failed:', error);
    throw error;
  }
}

// Usage
const gemiIds = ['123204604000', '144340502000'];
const results = await downloadDocuments(gemiIds, './downloads');
```
### Batch Processing with File Input

```bash
# Create input file
echo "123204604000" > company-ids.txt
echo "144340502000" >> company-ids.txt

# Run crawler with file
npx nx run crawler:start
# Select file option and provide path
```

## Output Structure

Downloaded documents are organized as:

```
apps/crawler/src/downloads/
├── 123204604000/
│   └── document_downloads/
│       ├── 2019-09-23_90189.pdf
│       ├── 2020-11-03_2334237.pdf
│       └── 2021-12-13_2747556.pdf
└── 144340502000/
    └── document_downloads/
        └── 2019-01-17_77417.pdf
```

## Search Features

- Fuzzy matching with "Did you mean?" suggestions
- Support for Greek and Latin characters
- Filter by legal form, status, and location

### Advanced Filters

- Legal form (AE, EPE, OE, etc.)
- Company status (Active, Inactive, etc.)
- Registration date ranges
- Geographic location

## Common Use Cases

### Research Workflow

```bash
# 1. Search for companies
npx nx run crawler:start
# Search for: "TELECOMMUNICATIONS"
# Apply filters: Active companies only

# 2. Review ids.txt results
cat apps/crawler/ids.txt

# 3. Download documents for selected IDs
npx nx run crawler:start
# Use file option with filtered IDs
```

### Bulk Download

```bash
# Prepare large ID list
cat > bulk-companies.txt << EOF
123204604000
144340502000
148851015000
152034008000
EOF

# Run bulk download
npx nx run crawler:start
# Select file option
# Enter: bulk-companies.txt
```
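Large runs can also be driven programmatically in smaller batches, so one failure does not abort the whole job. A sketch where `downloadBatch` is any function that downloads one batch of IDs (for example, `runCrawlerForGemiIds` bound to an output directory); the batch size of 10 is an arbitrary assumption:

```javascript
// Sketch: process a large GEMI ID list in fixed-size batches,
// logging and skipping batches that fail rather than aborting.
export function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

export async function bulkDownload(gemiIds, downloadBatch, batchSize = 10) {
  const results = [];
  for (const batch of chunk(gemiIds, batchSize)) {
    try {
      results.push(await downloadBatch(batch));
    } catch (error) {
      console.error('Batch failed, continuing:', error);
    }
  }
  return results;
}
```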

## Error Handling

The crawler handles common issues automatically:

- Network timeouts with retry logic
- Rate limiting with exponential backoff
- Invalid document formats (skipped)
- Browser crashes (automatic restart)
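For reference, the retry-with-exponential-backoff pattern named above looks roughly like this; an illustration of the general pattern, not the crawler's actual implementation:

```javascript
// Illustration of retry with exponential backoff: the delay between
// attempts doubles each time (baseDelayMs, 2x, 4x, ...).
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

export async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i += 1) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError; // all attempts exhausted
}
```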

## Tips

- Use specific company names for better search results
- Check ids.txt before bulk downloading
- Monitor disk space for large batch downloads
- Use a stable internet connection for large downloads