AI Tools

Overview

DataGen provides AI tools for content analysis, structured data extraction, and text generation. They use language models to turn unstructured data into structured formats, write content, and interpret multimedia files (documents, images, audio, and video).

OpenAI-Powered Tools

Extract Structured Output

Transform unstructured content into structured data using AI-powered analysis with the extract_structured_output tool. This tool is particularly useful for classification, data extraction, and parameter extraction.

Input Parameters

Parameter	Type	Required	Description
`instruction_prompt`	string	Yes	Instruction describing fields to extract and constraints
`content`	string	Yes	The content to extract structured data from
`structured_output_type`	type[BaseModel]	Yes	Pydantic model type defining the output structure

Output Data

Returns an instance of the provided structured_output_type, populated with the data extracted from the content.

Use Cases

Content Classification

Example: Classify LinkedIn comments by engagement level

from pydantic import BaseModel

# Output schema: the fields the tool should fill in
class EngagementType(BaseModel):
    is_strong_engagement: bool
    confidence_score: float

result = extract_structured_output(
    instruction_prompt="Classify the comment into strong engagement or not",
    content="I love your post, it's amazing!",
    structured_output_type=EngagementType
)

Data Extraction

Example: Extract person information from text

from pydantic import BaseModel

class PersonInfo(BaseModel):
    name: str
    age: int
    email: str

result = extract_structured_output(
    instruction_prompt="Extract the information from the unstructured data",
    content="John Doe is 25 years old and his email is john.doe@example.com",
    structured_output_type=PersonInfo
)

Parameter Extraction

Example: Extract tool parameters from user queries

from pydantic import BaseModel

class WeatherRequest(BaseModel):
    location: str
    date: str

result = extract_structured_output(
    instruction_prompt="Extract the parameters for the Weather tool, which requires a location and a date",
    content="What is the weather in San Francisco on 2025-01-01?",
    structured_output_type=WeatherRequest
)

AI Writer

Generate text from an instruction prompt and source content using the ai_writer tool. It is commonly used for emails, summaries, and reports.

Input Parameters

Parameter	Type	Required	Description
`instruction_prompt`	string	Yes	Instructions for what type of content to write
`content`	string	Yes	Source content or context for writing

Output Data

Returns a string containing the AI-generated content based on your instructions and input content.

Use Cases

Email Writing

Example: Generate personalized emails

result = ai_writer(
    instruction_prompt="Write a professional follow-up email to a potential client",
    content="Client: TechCorp, discussed AI automation solutions, next meeting in 2 weeks"
)

Content Summarization

Example: Summarize research findings

result = ai_writer(
    instruction_prompt="Write a concise executive summary of the research findings",
    content="[Long research document content...]"
)

Report Generation

Example: Create analytical reports

result = ai_writer(
    instruction_prompt="Generate a quarterly performance report with key insights",
    content="Q3 sales data: 15% growth, 200 new customers, top products..."
)

Extract Tool Parameters

A specialized tool that extracts the parameters other tools need, exposed as the extract_tool_params function. It is a wrapper around extract_structured_output, tuned for tool parameter extraction.

Input Parameters

Parameter	Type	Required	Description
`instruction`	string	Yes	Instructions for parameter extraction
`query`	string	Yes	User query to extract parameters from
`tool_params_type`	type[BaseModel]	Yes	Pydantic model defining expected parameters

Example Usage

from pydantic import BaseModel

class LinkedInSearchParams(BaseModel):
    company_name: str
    job_title: str
    location: str

params = extract_tool_params(
    instruction="Extract LinkedIn search parameters from the user query",
    query="Find software engineers at Google in San Francisco",
    tool_params_type=LinkedInSearchParams
)

Gemini-Powered Understanding Tools

Document Understanding

Analyze and extract structured data from documents (HTML, PDF, Text) using Google’s Gemini model with the doc_understanding_tool.

Input Parameters

Parameter	Type	Required	Description
`instruction_prompt`	string	Yes	Instructions for data extraction and constraints
`url`	string	Yes	Document URL (HTML/PDF/Text)
`file_type`	string	Yes	Document type: PDF, HTML, or Text
`structured_output_type`	type[BaseModel]	Yes	Pydantic model for structured output

Supported Formats

PDF: Application documents, reports, research papers
HTML: Web pages, online articles, documentation
Text: Plain text files, transcripts, notes
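
Because file_type has to be passed explicitly, it can be convenient to derive it from the URL. The following is a minimal sketch, assuming the extension in the URL path reflects the actual content; infer_doc_file_type and its extension mapping are hypothetical and not part of DataGen.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from URL path extension to the file_type values listed above.
_DOC_TYPES = {".pdf": "PDF", ".html": "HTML", ".htm": "HTML", ".txt": "Text"}


def infer_doc_file_type(url: str) -> str:
    """Guess the file_type argument from the extension of the URL's path."""
    # urlparse separates the path from any query string or fragment.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix not in _DOC_TYPES:
        raise ValueError(f"Cannot infer file_type from URL: {url}")
    return _DOC_TYPES[suffix]
```

URLs without a recognizable extension raise ValueError, so the caller can fall back to setting file_type by hand.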

Example Usage

from pydantic import BaseModel

class CompanyInfo(BaseModel):
    company_name: str
    revenue: str
    employees: int
    founded_year: int

result = doc_understanding_tool(
    instruction_prompt="Extract company information from the annual report",
    url="https://example.com/annual-report.pdf",
    file_type="PDF",
    structured_output_type=CompanyInfo
)

Image Understanding

Analyze images and extract structured data using the image_understanding_tool. Supports JPEG and PNG formats.

Input Parameters

Parameter	Type	Required	Description
`instruction_prompt`	string	Yes	Instructions for image analysis
`url`	string	Yes	Image URL (JPEG/PNG only)
`file_type`	string	Yes	Image type: JPEG, JPG, or PNG
`structured_output_type`	type[BaseModel]	Yes	Pydantic model for structured output

Use Cases

Chart Analysis

Extract data from charts and graphs

from typing import Any, Dict, List

from pydantic import BaseModel

class ChartData(BaseModel):
    chart_type: str
    data_points: List[Dict[str, Any]]
    trends: List[str]

result = image_understanding_tool(
    instruction_prompt="Analyze this sales chart and extract key data points",
    url="https://example.com/sales-chart.png",
    file_type="PNG",
    structured_output_type=ChartData
)

Document Scanning

Extract text and data from scanned documents

from pydantic import BaseModel

class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor: str

result = image_understanding_tool(
    instruction_prompt="Extract invoice details from this scanned document",
    url="https://example.com/invoice.jpg",
    file_type="JPEG",
    structured_output_type=InvoiceData
)

Audio Understanding

Analyze audio files and extract structured information using the audio_understanding_tool. Supports MP3 and WAV formats.

Input Parameters

Parameter	Type	Required	Description
`instruction_prompt`	string	Yes	Instructions for audio analysis
`url`	string	Yes	Audio file URL
`file_type`	string	Yes	Audio type: MP3 or WAV
`structured_output_type`	type[BaseModel]	Yes	Pydantic model for structured output

Use Cases

Meeting Transcription

Extract key information from meeting recordings

from typing import List

from pydantic import BaseModel

class MeetingNotes(BaseModel):
    attendees: List[str]
    key_decisions: List[str]
    action_items: List[str]
    next_meeting: str

result = audio_understanding_tool(
    instruction_prompt="Extract meeting notes and action items",
    url="https://example.com/meeting.mp3",
    file_type="MP3",
    structured_output_type=MeetingNotes
)

Customer Call Analysis

Analyze customer service calls

from typing import List

from pydantic import BaseModel

class CallAnalysis(BaseModel):
    sentiment: str
    main_issues: List[str]
    resolution_status: str
    follow_up_required: bool

result = audio_understanding_tool(
    instruction_prompt="Analyze customer call for sentiment and issues",
    url="https://example.com/customer-call.wav",
    file_type="WAV",
    structured_output_type=CallAnalysis
)

Video Understanding

Analyze YouTube videos and extract structured data using the video_understanding_tool. Currently supports YouTube URLs only.

Input Parameters

Parameter	Type	Required	Description
`instruction_prompt`	string	Yes	Instructions for video analysis
`url`	string	Yes	YouTube video URL
`structured_output_type`	type[BaseModel]	Yes	Pydantic model for structured output

Supported URLs

youtube.com/watch?v=...
youtu.be/...
youtube.com/shorts/...
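
Checking a URL against these three shapes before calling the tool avoids spending a request on an unsupported link. This is a minimal sketch under stated assumptions: is_supported_youtube_url is a hypothetical helper, it requires a scheme such as https://, and it accepts only the bare and www. hosts, which may be stricter or looser than what the tool itself accepts.

```python
from urllib.parse import parse_qs, urlparse


def is_supported_youtube_url(url: str) -> bool:
    """Return True if the URL matches one of the three supported shapes."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path
    if host == "youtube.com":
        if path == "/watch":
            # Standard watch URLs carry the video ID in the "v" query parameter.
            return bool(parse_qs(parsed.query).get("v"))
        if path.startswith("/shorts/"):
            # Shorts URLs carry the video ID after the /shorts/ prefix.
            return len(path) > len("/shorts/")
    if host == "youtu.be":
        # Short links carry the video ID as the path itself.
        return len(path) > 1
    return False
```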

Use Cases

Content Analysis

Extract key information from educational or business videos

from typing import List

from pydantic import BaseModel

class VideoSummary(BaseModel):
    title: str
    main_topics: List[str]
    key_takeaways: List[str]
    duration_minutes: int

result = video_understanding_tool(
    instruction_prompt="Summarize this educational video",
    url="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    structured_output_type=VideoSummary
)

Competitive Analysis

Analyze competitor product demos

from typing import List

from pydantic import BaseModel

class ProductAnalysis(BaseModel):
    product_features: List[str]
    pricing_mentioned: bool
    target_audience: str
    competitive_advantages: List[str]

result = video_understanding_tool(
    instruction_prompt="Analyze this product demo for competitive intelligence",
    url="https://www.youtube.com/watch?v=example",
    structured_output_type=ProductAnalysis
)

Configuration & Performance

Model Configuration

OpenAI Tools: Uses the GPT-5 series to balance speed and accuracy
Gemini Tools: Uses Gemini for multimodal understanding
Temperature: Set to 0.1-0.2 for consistent, factual outputs
Response Format: Enforced JSON structure for reliable parsing

Rate Limits & Usage

Rate Limiting: 60 calls per minute across all AI tools
Daily Limits: 1,000 calls per day per tool
Credit System: 1 credit per request
Caching: 1-hour TTL for repeated queries
Retry Logic: 3 attempts with exponential backoff
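
To stay under the per-minute limit, a caller can throttle on its own side. The sketch below is one possible approach, a sliding-window limiter; SlidingWindowLimiter is a hypothetical class, and whether DataGen counts calls over a rolling or a fixed window is an assumption not confirmed here.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls: int = 60, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # Injectable so the limiter can be tested without waiting.
        self._calls: Deque[float] = deque()

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def record(self) -> None:
        """Register that a call was just made."""
        self._calls.append(self.clock())
```

In use, call time.sleep(limiter.wait_time()) before each tool call and limiter.record() right after it.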

Best Practices

Effective Prompting

Guidelines for Better Results:

Be specific and detailed in instructions
Provide clear examples when possible
Define constraints and expected formats
Use consistent terminology
Test with sample data first
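
One way to keep prompts specific and consistent is to assemble them from named parts. The helper below is a hypothetical sketch (build_instruction_prompt is not a DataGen function); it only builds a string that can be passed as instruction_prompt.

```python
from typing import Optional, Sequence


def build_instruction_prompt(task: str, fields: Sequence[str],
                             constraints: Sequence[str] = (),
                             example: Optional[str] = None) -> str:
    """Assemble an instruction prompt from a task, target fields, and constraints."""
    lines = [task.strip(), "", "Fields to extract:"]
    lines += [f"- {field}" for field in fields]
    if constraints:
        lines += ["", "Constraints:"]
        lines += [f"- {rule}" for rule in constraints]
    if example:
        lines += ["", f"Example: {example}"]
    return "\n".join(lines)
```

For instance, build_instruction_prompt("Extract contact details.", ["name", "email"], ["Leave a field empty if it is not mentioned"]) yields a prompt that names every field and states what to do with missing data.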

Pydantic Models

Model Design Tips:

Use descriptive field names
Add field descriptions and constraints
Include default values where appropriate
Use appropriate data types
Consider optional vs required fields
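
The tips above can be sketched with Pydantic's Field. LeadInfo and its fields are hypothetical and serve only to illustrate descriptions, constraints, defaults, and optional fields; how much weight the underlying model gives to field descriptions is not specified in this documentation.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class LeadInfo(BaseModel):
    """Hypothetical extraction target illustrating the tips above."""

    # Required field with a description of what to extract.
    full_name: str = Field(..., description="Person's full name as written in the text")
    # Optional field: stays None when the content does not mention it.
    email: Optional[str] = Field(None, description="Email address, if one is mentioned")
    # Constrained field: values outside [0, 1] fail validation.
    seniority_score: float = Field(
        0.0, ge=0.0, le=1.0,
        description="Estimated seniority from 0 (junior) to 1 (executive)",
    )
    # Default factory avoids sharing one list between instances.
    skills: List[str] = Field(default_factory=list, description="Skills mentioned explicitly")
```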

Error Handling

Robust Implementation:

Validate input parameters
Handle API timeouts gracefully
Implement fallback strategies
Log errors for debugging
Test edge cases thoroughly
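
A client-side retry wrapper is one way to handle transient failures gracefully. This is a minimal sketch, not DataGen's own retry logic: call_with_retry is a hypothetical helper, and the exception types the tools actually raise are not documented here, so the retry_on default is an assumption to adjust.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on the given exceptions with delays of base_delay * 2**n."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Wrapping a tool call in a lambda, as in call_with_retry(lambda: ai_writer(instruction_prompt=prompt, content=text)), keeps the wrapper independent of any particular tool.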

Performance Optimization

Efficiency Tips:

Cache repeated operations
Batch similar requests
Use appropriate model sizes
Optimize prompt length
Monitor usage patterns
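
Caching repeated operations locally can save both credits and round trips. The sketch below is a small in-memory cache with a time-to-live; TTLCache is a hypothetical class, and the one-hour default simply mirrors the TTL mentioned under Rate Limits & Usage rather than any client setting.

```python
import hashlib
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def make_key(*parts: str) -> str:
        """Hash the call arguments into a fixed-length key."""
        digest = hashlib.sha256()
        for part in parts:
            # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
            digest.update(f"{len(part)}:".encode("utf-8"))
            digest.update(part.encode("utf-8"))
        return digest.hexdigest()

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() on a miss or expiry."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value
```

A call such as cache.get_or_compute(TTLCache.make_key(prompt, text), lambda: ai_writer(instruction_prompt=prompt, content=text)) then reaches the tool only on a miss.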

Integration Examples

Multi-Step Processing Pipeline

from typing import List

from pydantic import BaseModel

# Step 1: Extract structured data from document
class ContractData(BaseModel):
    parties: List[str]
    terms: List[str]
    expiration_date: str

contract_info = doc_understanding_tool(
    instruction_prompt="Extract key contract information",
    url="https://example.com/contract.pdf",
    file_type="PDF",
    structured_output_type=ContractData
)

# Step 2: Generate summary
summary = ai_writer(
    instruction_prompt="Write an executive summary of this contract",
    content=f"Contract parties: {contract_info.parties}, Terms: {contract_info.terms}"
)

# Step 3: Classify risk level
class RiskAssessment(BaseModel):
    risk_level: str
    concerns: List[str]
    recommendations: List[str]

risk_analysis = extract_structured_output(
    instruction_prompt="Assess the risk level of this contract",
    content=summary,
    structured_output_type=RiskAssessment
)

Content Processing Workflow

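# Process multiple content types in sequence
from typing import List

from pydantic import BaseModel

# Placeholder models for illustration; define the fields your content needs
class TrainingContent(BaseModel):
    key_points: List[str]

class SupportingInfo(BaseModel):
    details: List[str]
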
# Analyze video content
video_data = video_understanding_tool(
    instruction_prompt="Extract key points from this training video",
    url="https://youtube.com/watch?v=training_video",
    structured_output_type=TrainingContent
)

# Analyze supporting documents
doc_data = doc_understanding_tool(
    instruction_prompt="Extract supplementary information",
    url="https://example.com/training_guide.pdf",
    file_type="PDF",
    structured_output_type=SupportingInfo
)

# Generate comprehensive summary
combined_summary = ai_writer(
    instruction_prompt="Create a comprehensive training summary",
    content=f"Video content: {video_data}\nDocument content: {doc_data}"
)

Troubleshooting

Model validation errors

Common causes:

Pydantic model doesn’t match extracted data structure
Required fields are missing from the content
Data types don’t match field definitions

Solutions:

Review and adjust your Pydantic model
Make fields optional when data might be missing
Add validation and default values
Test with simpler models first
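
The effect of making fields optional can be reproduced locally by validating a plain dict. In this sketch StrictPerson and LenientPerson are hypothetical models; how the tools themselves surface a validation failure is not specified here, so the try/except only demonstrates Pydantic's own behavior.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class StrictPerson(BaseModel):
    name: str
    age: int
    email: str


class LenientPerson(BaseModel):
    name: str
    # Optional with a default: a missing value no longer fails validation.
    age: Optional[int] = None
    email: Optional[str] = None


partial = {"name": "John Doe"}  # Content that mentions neither age nor email.

try:
    StrictPerson(**partial)
    missing_fields = []
except ValidationError as exc:
    # Each error entry carries the path of the offending field in "loc".
    missing_fields = sorted(err["loc"][0] for err in exc.errors())

person = LenientPerson(**partial)
```

Here missing_fields names the fields that blocked the strict model, which is usually the quickest hint about which fields to relax.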

Timeout errors with large files

Optimization strategies:

Reduce file size when possible
Use more specific instruction prompts
Process in smaller chunks
Increase timeout settings for large files
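
For text passed through the content parameter, processing in smaller chunks can be sketched as below. chunk_text is a hypothetical helper, and the default sizes are arbitrary placeholders rather than DataGen limits; files referenced by URL would need to be split before upload instead.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most max_chars, each overlapping the previous one."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # The chunk just added reached the end of the text.
        # Step forward, keeping `overlap` characters of shared context.
        start += max_chars - overlap
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated input.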

File size limits:

Documents: Recommended < 10MB
Images: Recommended < 5MB
Audio: Recommended < 25MB
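
A pre-flight check against these recommendations might look like the following sketch. The category names and within_recommended_size are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the unit is not defined above.

```python
from typing import Dict

MB = 1024 * 1024  # Assumption: the limits are interpreted as mebibytes.

# Recommended upper bounds from the list above, keyed by a hypothetical category name.
RECOMMENDED_MAX_BYTES: Dict[str, int] = {
    "document": 10 * MB,
    "image": 5 * MB,
    "audio": 25 * MB,
}


def within_recommended_size(category: str, size_bytes: int) -> bool:
    """Return True if size_bytes is below the recommended limit for the category."""
    return size_bytes < RECOMMENDED_MAX_BYTES[category]
```

The size of a remote file can often be read from the Content-Length header of a HEAD request, although not every server reports it.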

API key configuration issues

Setup requirements:

OpenAI tools require OPENAI_API_KEY
Gemini tools require GEMINI_API_KEY or GOOGLE_API_KEY
Ensure API keys have proper permissions
Check rate limits and quotas
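
A startup check can report missing keys before any tool is called. This is a minimal sketch: missing_api_keys is a hypothetical helper, and it verifies only that the variables named above are set, not that the keys are valid or have the right permissions.

```python
import os
from typing import List, Mapping, Optional


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a description of each required key that is unset or empty."""
    env = os.environ if env is None else env
    missing = []
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY")
    # Either variable satisfies the Gemini requirement.
    if not (env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")):
        missing.append("GEMINI_API_KEY or GOOGLE_API_KEY")
    return missing
```

Passing a dict instead of reading os.environ makes the check easy to exercise in tests.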

What’s Next?

Web Research

Combine AI tools with web research capabilities

LinkedIn Tools

Use AI tools to process LinkedIn data

Use Cases

See real-world AI automation examples

Deployment

Deploy AI workflows as production APIs
