Skip to main content

Doc-Scanner Examples

The doc-scanner application processes documents (PDF, DOC, DOCX) to extract metadata and generate contextual histories.

Basic Usage

Process Documents Interactively

# 1. Place documents in input directory
mkdir -p apps/doc-scanner/src/data/input/123204604000
cp /path/to/documents/*.pdf apps/doc-scanner/src/data/input/123204604000/

# 2. Run the scanner
npx nx run doc-scanner:start
# Enter GEMI ID: 123204604000

# 3. Find results in output directory
ls apps/doc-scanner/src/data/output/123204604000/

Programmatic Usage

Process Single Document

import { processSingleFile } from "./apps/doc-scanner/src/processing-logic.mjs";
import { getMetadataModel } from "./apps/doc-scanner/src/gemini-config.mjs";

async function processDocument(filePath, outputDir, fileName) {
const model = getMetadataModel();

try {
const result = await processSingleFile(
filePath,
outputDir,
fileName,
model
);

if (result.status === "success") {
console.log("Processing successful:", result.data);
return result.data;
} else {
console.error("Processing failed:", result.reason);
throw new Error(result.reason);
}
} catch (error) {
console.error("Error processing document:", error);
throw error;
}
}

// Usage
const result = await processDocument(
"./input/document.pdf",
"./output",
"document.pdf"
);

Batch Document Processing

import fs from "fs/promises";
import path from "path";
import pLimit from "p-limit";
import { processSingleFile } from "./apps/doc-scanner/src/processing-logic.mjs";
import { getMetadataModel } from "./apps/doc-scanner/src/gemini-config.mjs";

async function batchProcessDocuments(inputDir, outputDir, concurrency = 5) {
const files = await fs.readdir(inputDir);
const documentFiles = files.filter((f) => /\.(pdf|docx?)$/i.test(f));

const limit = pLimit(concurrency);
const model = getMetadataModel();

const tasks = documentFiles.map((file) =>
limit(() =>
processSingleFile(path.join(inputDir, file), outputDir, file, model)
)
);

const results = await Promise.allSettled(tasks);
return results.map((result, index) => ({
file: documentFiles[index],
result: result.status === "fulfilled" ? result.value : result.reason,
}));
}

// Usage
const results = await batchProcessDocuments(
"./input/123204604000",
"./output/123204604000"
);

Output Structure

Processed documents generate structured outputs:

apps/doc-scanner/src/data/output/123204604000/
├── document_metadata/
│ ├── document1.json
│ ├── document2.json
│ └── document3.json
└── 123204604000_contextual_document_histories.json

Supported Document Types

  • PDF: Company registration, financial reports
  • DOCX: Modern Word documents
  • DOC: Legacy Word documents

Extracted Metadata

Each document produces structured JSON with:

{
"companyName": "ALPHA BANK AE",
"legalForm": "AE",
"registrationNumber": "000123456789",
"documentType": "INCORPORATION_DOCUMENT",
"documentDate": "2019-09-23",
"address": {
"street": "Stadiou 40",
"city": "Athens",
"postalCode": "10564"
}
}

Configuration

Custom Gemini Settings

import { GoogleGenerativeAI } from "@google/generative-ai";

export function getCustomModel(options = {}) {
const {
modelName = "gemini-2.0-flash-lite",
temperature = 0.1,
maxTokens = 8192,
} = options;

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

return genAI.getGenerativeModel({
model: modelName,
generationConfig: {
temperature,
maxOutputTokens: maxTokens,
responseMimeType: "application/json",
},
});
}

Environment Configuration

# .env settings for doc-scanner
GEMINI_API_KEY=your_api_key_here
GEMINI_CONCURRENCY_LIMIT=15
NODE_ENV=production

Tips

  • Ensure Gemini API key is valid and has sufficient quota
  • Use appropriate concurrency limits (default: 15)
  • Monitor API usage to avoid rate limits
  • Check file permissions for input/output directories
  • Large documents may take longer to process (5-20 seconds each)