
Overview

DataGen provides AI tools for content analysis, structured data extraction, and content generation. These tools use language models to turn unstructured data into structured formats, write content, and interpret documents, images, audio, and video.

OpenAI-Powered Tools

Extract Structured Output

Transform unstructured content into structured data using AI-powered analysis with the extract_structured_output tool. This tool is particularly useful for classification, data extraction, and parameter extraction.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| instruction_prompt | string | Yes | Instruction describing fields to extract and constraints |
| content | string | Yes | The content to extract structured data from |
| structured_output_type | type[BaseModel] | Yes | Pydantic model type defining the output structure |
Returns an instance of the provided structured_output_type, populated with the extracted data.
Example: Classify LinkedIn comments by engagement level
from pydantic import BaseModel

# Output schema: one boolean label plus a confidence value
class EngagementType(BaseModel):
    is_strong_engagement: bool
    confidence_score: float

result = extract_structured_output(
    instruction_prompt="Classify the comment into strong engagement or not",
    content="I love your post, it's amazing!",
    structured_output_type=EngagementType
)
Example: Extract person information from text
class PersonInfo(BaseModel):
    name: str
    age: int
    email: str

result = extract_structured_output(
    instruction_prompt="Extract the information from the unstructured data",
    content="John Doe is 25 years old and his email is john.doe@example.com",
    structured_output_type=PersonInfo
)
Example: Extract tool parameters from user queries
class WeatherRequest(BaseModel):
    location: str
    date: str

result = extract_structured_output(
    instruction_prompt="Extract the parameters for the Weather tool, which requires a location and a date",
    content="What is the weather in San Francisco on 2025-01-01?",
    structured_output_type=WeatherRequest
)

AI Writer

Generate high-quality content based on instruction prompts using the ai_writer tool. Commonly used for writing emails, summaries, and other text content.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| instruction_prompt | string | Yes | Instructions for what type of content to write |
| content | string | Yes | Source content or context for writing |
Returns the generated content as a string.
Example: Generate personalized emails
result = ai_writer(
    instruction_prompt="Write a professional follow-up email to a potential client",
    content="Client: TechCorp, discussed AI automation solutions, next meeting in 2 weeks"
)
Example: Summarize research findings
result = ai_writer(
    instruction_prompt="Write a concise executive summary of the research findings",
    content="[Long research document content...]"
)
Example: Create analytical reports
result = ai_writer(
    instruction_prompt="Generate a quarterly performance report with key insights",
    content="Q3 sales data: 15% growth, 200 new customers, top products..."
)

Extract Tool Parameters

Extract the parameters that another tool needs using the extract_tool_params function. It is a wrapper around extract_structured_output tuned for tool parameter extraction.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| instruction | string | Yes | Instructions for parameter extraction |
| query | string | Yes | User query to extract parameters from |
| tool_params_type | type[BaseModel] | Yes | Pydantic model defining expected parameters |
class LinkedInSearchParams(BaseModel):
    company_name: str
    job_title: str
    location: str

params = extract_tool_params(
    instruction="Extract LinkedIn search parameters from the user query",
    query="Find software engineers at Google in San Francisco",
    tool_params_type=LinkedInSearchParams
)

Gemini-Powered Understanding Tools

Document Understanding

Analyze and extract structured data from documents (HTML, PDF, Text) using Google’s Gemini model with the doc_understanding_tool.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| instruction_prompt | string | Yes | Instructions for data extraction and constraints |
| url | string | Yes | Document URL (HTML/PDF/Text) |
| file_type | string | Yes | Document type: PDF, HTML, or Text |
| structured_output_type | type[BaseModel] | Yes | Pydantic model for structured output |
Supported file types:
  • PDF: Application documents, reports, research papers
  • HTML: Web pages, online articles, documentation
  • Text: Plain text files, transcripts, notes
class CompanyInfo(BaseModel):
    company_name: str
    revenue: str
    employees: int
    founded_year: int

result = doc_understanding_tool(
    instruction_prompt="Extract company information from the annual report",
    url="https://example.com/annual-report.pdf",
    file_type="PDF",
    structured_output_type=CompanyInfo
)

Image Understanding

Analyze images and extract structured data using the image_understanding_tool. Supports JPEG and PNG formats.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| instruction_prompt | string | Yes | Instructions for image analysis |
| url | string | Yes | Image URL (JPEG/PNG only) |
| file_type | string | Yes | Image type: JPEG, JPG, or PNG |
| structured_output_type | type[BaseModel] | Yes | Pydantic model for structured output |
Example: Extract data from charts and graphs
from typing import Any, Dict, List

class ChartData(BaseModel):
    chart_type: str
    data_points: List[Dict[str, Any]]
    trends: List[str]

result = image_understanding_tool(
    instruction_prompt="Analyze this sales chart and extract key data points",
    url="https://example.com/sales-chart.png",
    file_type="PNG",
    structured_output_type=ChartData
)
Example: Extract text and data from scanned documents
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor: str

result = image_understanding_tool(
    instruction_prompt="Extract invoice details from this scanned document",
    url="https://example.com/invoice.jpg",
    file_type="JPEG",
    structured_output_type=InvoiceData
)
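
The file_type argument can often be derived from the image URL instead of being typed by hand. The helper below is a minimal sketch of that idea. The function name `image_file_type_from_url` is hypothetical, and mapping both `.jpg` and `.jpeg` to "JPEG" is an assumption based on the invoice example above.

```python
import os
from urllib.parse import urlparse

# Extension -> file_type value. Assumption: ".jpg" is passed as "JPEG",
# mirroring the invoice example; the parameter table also lists "JPG".
IMAGE_FILE_TYPES = {".jpg": "JPEG", ".jpeg": "JPEG", ".png": "PNG"}


def image_file_type_from_url(url: str) -> str:
    """Infer the file_type argument from the extension in the URL path."""
    path = urlparse(url).path  # ignores any query string or fragment
    ext = os.path.splitext(path)[1].lower()
    if ext not in IMAGE_FILE_TYPES:
        raise ValueError(f"Unsupported image extension: {ext or '(none)'}")
    return IMAGE_FILE_TYPES[ext]
```

Rejecting unknown extensions before calling the tool avoids spending a request on a format the tool does not list as supported.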

Audio Understanding

Analyze audio files and extract structured information using the audio_understanding_tool. Supports MP3 and WAV formats.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| instruction_prompt | string | Yes | Instructions for audio analysis |
| url | string | Yes | Audio file URL |
| file_type | string | Yes | Audio type: MP3 or WAV |
| structured_output_type | type[BaseModel] | Yes | Pydantic model for structured output |
Example: Extract key information from meeting recordings
class MeetingNotes(BaseModel):
    attendees: List[str]
    key_decisions: List[str]
    action_items: List[str]
    next_meeting: str

result = audio_understanding_tool(
    instruction_prompt="Extract meeting notes and action items",
    url="https://example.com/meeting.mp3",
    file_type="MP3",
    structured_output_type=MeetingNotes
)
Example: Analyze customer service calls
class CallAnalysis(BaseModel):
    sentiment: str
    main_issues: List[str]
    resolution_status: str
    follow_up_required: bool

result = audio_understanding_tool(
    instruction_prompt="Analyze customer call for sentiment and issues",
    url="https://example.com/customer-call.wav",
    file_type="WAV",
    structured_output_type=CallAnalysis
)

Video Understanding

Analyze YouTube videos and extract structured data using the video_understanding_tool. Currently supports YouTube URLs only.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| instruction_prompt | string | Yes | Instructions for video analysis |
| url | string | Yes | YouTube video URL |
| structured_output_type | type[BaseModel] | Yes | Pydantic model for structured output |
Supported URL formats:
  • youtube.com/watch?v=...
  • youtu.be/...
  • youtube.com/shorts/...
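
Because only these three URL shapes are listed, it can help to check a URL before sending it to video_understanding_tool. The following is a sketch under that assumption. `is_supported_youtube_url` is a hypothetical helper, not part of DataGen, and the tool's own validation may be stricter or looser.

```python
from urllib.parse import parse_qs, urlparse


def is_supported_youtube_url(url: str) -> bool:
    """Return True if url matches one of the three YouTube URL shapes listed above."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host == "youtu.be":
        # youtu.be/<video id>
        return len(parsed.path.strip("/")) > 0
    if host == "youtube.com":
        if parsed.path == "/watch":
            # youtube.com/watch?v=<video id>
            return bool(parse_qs(parsed.query).get("v"))
        if parsed.path.startswith("/shorts/"):
            # youtube.com/shorts/<video id>
            return len(parsed.path[len("/shorts/"):].strip("/")) > 0
    return False
```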
Example: Extract key information from educational or business videos
class VideoSummary(BaseModel):
    title: str
    main_topics: List[str]
    key_takeaways: List[str]
    duration_minutes: int

result = video_understanding_tool(
    instruction_prompt="Summarize this educational video",
    url="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    structured_output_type=VideoSummary
)
Example: Analyze competitor product demos
class ProductAnalysis(BaseModel):
    product_features: List[str]
    pricing_mentioned: bool
    target_audience: str
    competitive_advantages: List[str]

result = video_understanding_tool(
    instruction_prompt="Analyze this product demo for competitive intelligence",
    url="https://www.youtube.com/watch?v=example",
    structured_output_type=ProductAnalysis
)

Configuration & Performance

Model Configuration

  • OpenAI Tools: Uses GPT-5 series for optimal balance of speed and accuracy
  • Gemini Tools: Uses Gemini for multimodal understanding
  • Temperature: Set to 0.1-0.2 for consistent, factual outputs
  • Response Format: Enforced JSON structure for reliable parsing

Rate Limits & Usage

  • Rate Limiting: 60 calls per minute across all AI tools
  • Daily Limits: 1,000 calls per day per tool
  • Credit System: 1 credit per request
  • Caching: 1-hour TTL for repeated queries
  • Retry Logic: 3 attempts with exponential backoff
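
The list above does not say whether retries happen inside the tools or are left to the caller. If you need the same behavior around your own calls, a client-side version of "3 attempts with exponential backoff" might look like the sketch below. `call_with_backoff` and the 1-second base delay are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays of base_delay * 2**n seconds."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

A call would be wrapped as `call_with_backoff(lambda: ai_writer(instruction_prompt=..., content=...))`. The `sleep` parameter exists so the delays can be replaced in tests.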

Best Practices

Effective Prompting

Guidelines for Better Results:
  • Be specific and detailed in instructions
  • Provide clear examples when possible
  • Define constraints and expected formats
  • Use consistent terminology
  • Test with sample data first
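
One way to keep instructions specific and consistently worded is to assemble them from explicit parts. The helper below is a hypothetical sketch. `build_instruction_prompt` and its layout are assumptions, and any string is a valid instruction_prompt.

```python
from typing import Sequence


def build_instruction_prompt(
    task: str,
    fields: Sequence[str],
    constraints: Sequence[str] = (),
) -> str:
    """Combine a task sentence, the fields to extract, and constraints into one prompt."""
    lines = [task.strip(), "", "Fields to extract:"]
    lines += [f"- {field}" for field in fields]
    if constraints:
        lines += ["", "Constraints:"]
        lines += [f"- {constraint}" for constraint in constraints]
    return "\n".join(lines)


prompt = build_instruction_prompt(
    task="Classify the LinkedIn comment by engagement level.",
    fields=[
        "is_strong_engagement: true if the comment adds a question or an opinion",
        "confidence_score: a number between 0 and 1",
    ],
    constraints=["Base the answer only on the comment text"],
)
```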

Pydantic Models

Model Design Tips:
  • Use descriptive field names
  • Add field descriptions and constraints
  • Include default values where appropriate
  • Use appropriate data types
  • Consider optional vs required fields
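
The tips above map directly onto Pydantic's `Field`. The model below is an illustrative sketch. `LeadInfo` and its fields are hypothetical, and whether DataGen passes field descriptions on to the language model is not stated in this document.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class LeadInfo(BaseModel):
    """Hypothetical output model; field names are illustrative only."""

    # Required field with a description
    full_name: str = Field(..., description="Person's full name as written in the text")
    # Required field with a numeric constraint
    confidence_score: float = Field(..., ge=0.0, le=1.0, description="Confidence from 0 to 1")
    # Optional field: the source text may not contain an email
    email: Optional[str] = Field(None, description="Email address, if one is present")
    # Default value for a field that is often empty
    tags: List[str] = Field(default_factory=list, description="Free-form labels")


ok = LeadInfo(full_name="John Doe", confidence_score=0.9)

try:
    LeadInfo(full_name="John Doe", confidence_score=1.5)  # violates le=1.0
    out_of_range_rejected = False
except ValidationError:
    out_of_range_rejected = True
```

Making `email` optional means a text without an email still validates, while the range constraint on `confidence_score` rejects values outside 0 to 1.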

Error Handling

Robust Implementation:
  • Validate input parameters
  • Handle API timeouts gracefully
  • Implement fallback strategies
  • Log errors for debugging
  • Test edge cases thoroughly
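
A fallback strategy with logging can be as small as a wrapper around the primary call. This is a sketch only. `call_with_fallback` and the logger name are hypothetical, and the exception types raised by the DataGen tools are not documented here, so the wrapper catches `Exception` broadly.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("ai_tools")  # hypothetical logger name


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary(); if it raises, log the traceback and return fallback()."""
    try:
        return primary()
    except Exception:
        # Assumption: any failure (timeout, validation error, ...) triggers the fallback.
        logger.exception("Primary call failed; using fallback")
        return fallback()
```

For example, the fallback could return a default model instance or call the tool again with a simpler Pydantic model.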

Performance Optimization

Efficiency Tips:
  • Cache repeated operations
  • Batch similar requests
  • Use appropriate model sizes
  • Optimize prompt length
  • Monitor usage patterns
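
Caching repeated operations on the client side can be sketched with a small time-based cache. `TTLCache` below is a hypothetical helper, and its one-hour default is chosen only to mirror the TTL listed under Rate Limits & Usage.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache values per key and recompute them once they are older than ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the expensive call
        value = compute()
        self._entries[key] = (now, value)
        return value
```

A natural key is the tuple `(instruction_prompt, content)`, so that an identical request within the TTL reuses the earlier result.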

Integration Examples

Multi-Step Processing Pipeline

from typing import List

from pydantic import BaseModel

# Step 1: Extract structured data from document
class ContractData(BaseModel):
    parties: List[str]
    terms: List[str]
    expiration_date: str

contract_info = doc_understanding_tool(
    instruction_prompt="Extract key contract information",
    url="https://example.com/contract.pdf",
    file_type="PDF",
    structured_output_type=ContractData
)

# Step 2: Generate summary
summary = ai_writer(
    instruction_prompt="Write an executive summary of this contract",
    content=f"Contract parties: {contract_info.parties}, Terms: {contract_info.terms}"
)

# Step 3: Classify risk level
class RiskAssessment(BaseModel):
    risk_level: str
    concerns: List[str]
    recommendations: List[str]

risk_analysis = extract_structured_output(
    instruction_prompt="Assess the risk level of this contract",
    content=summary,
    structured_output_type=RiskAssessment
)

Content Processing Workflow

# Process multiple content types in sequence.
# TrainingContent and SupportingInfo are Pydantic models you define for your own data.

# Analyze video content
video_data = video_understanding_tool(
    instruction_prompt="Extract key points from this training video",
    url="https://youtube.com/watch?v=training_video",
    structured_output_type=TrainingContent
)

# Analyze supporting documents
doc_data = doc_understanding_tool(
    instruction_prompt="Extract supplementary information",
    url="https://example.com/training_guide.pdf",
    file_type="PDF",
    structured_output_type=SupportingInfo
)

# Generate comprehensive summary
combined_summary = ai_writer(
    instruction_prompt="Create a comprehensive training summary",
    content=f"Video content: {video_data}\nDocument content: {doc_data}"
)

Troubleshooting

Common causes:
  1. Pydantic model doesn’t match extracted data structure
  2. Required fields are missing from the content
  3. Data types don’t match field definitions
Solutions:
  • Review and adjust your Pydantic model
  • Make fields optional when data might be missing
  • Add validation and default values
  • Test with simpler models first
Optimization strategies:
  1. Reduce file size when possible
  2. Use more specific instruction prompts
  3. Process in smaller chunks
  4. Increase timeout settings for large files
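
For text passed through the content parameter, "process in smaller chunks" can be sketched as a whitespace-aware splitter. `chunk_text` is a hypothetical helper and the 4000-character default is an arbitrary assumption. This document does not state a maximum input length.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than max_chars is hard-split.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate request and the partial results merged afterwards. Note that the splitter normalizes runs of whitespace to single spaces.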
File size limits:
  • Documents: Recommended < 10MB
  • Images: Recommended < 5MB
  • Audio: Recommended < 25MB
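
The recommended sizes above can be checked before a file is submitted. The sketch below assumes 1 MB means 1024 × 1024 bytes, which this document does not specify. `within_recommended_size` is a hypothetical helper.

```python
# Recommended limits from the list above, in bytes (assuming 1 MB = 1024 * 1024 bytes).
RECOMMENDED_MAX_BYTES = {
    "document": 10 * 1024 * 1024,
    "image": 5 * 1024 * 1024,
    "audio": 25 * 1024 * 1024,
}


def within_recommended_size(kind: str, size_bytes: int) -> bool:
    """Return True if size_bytes is below the recommended limit for this kind of file."""
    if kind not in RECOMMENDED_MAX_BYTES:
        raise ValueError(f"Unknown file kind: {kind!r}")
    return size_bytes < RECOMMENDED_MAX_BYTES[kind]
```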
Setup requirements:
  1. OpenAI tools require OPENAI_API_KEY
  2. Gemini tools require GEMINI_API_KEY or GOOGLE_API_KEY
  3. Ensure API keys have proper permissions
  4. Check rate limits and quotas
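
A quick startup check can confirm that the environment variables named above are set. This sketch only tests for presence. It cannot verify permissions or quotas, and `missing_api_keys` is a hypothetical helper.

```python
import os
from typing import List, Mapping, Optional


def missing_api_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required API key variables that are unset or empty."""
    env = os.environ if env is None else env
    missing: List[str] = []
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY")
    # The Gemini tools accept either of the two variables.
    if not (env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")):
        missing.append("GEMINI_API_KEY or GOOGLE_API_KEY")
    return missing
```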

What’s Next?

Web Research

Combine AI tools with web research capabilities

LinkedIn Tools

Use AI tools to process LinkedIn data

Use Cases

See real-world AI automation examples

Deployment

Deploy AI workflows as production APIs