Vantager Technical Assessment

Extended Multi-Needle in a Haystack

Your challenge is to develop a solution that finds and extracts specific pieces of information ("needles") from a large text corpus ("haystack"). You'll receive a text file containing the entire Dune Universe (~3.5 million words), within which are hidden "needles" containing information about fictional technology companies. Each needle includes details such as company name, location, employee count, founding year, public/private status, valuation, and primary focus.

Example:

"Ryoshi, based in Neo Tokyo, Japan, is a private quantum computing firm founded in 2031, currently valued at $8.7 billion with 1,200 employees focused on quantum cryptography."

Task

Your task is to find all needles, extract the relevant information, and present the data in a structured table. You have 48 hours to complete this challenge, but expected time commitment is 3-6 hours. Use Python with any packages/tools you find suitable. The key is to optimize your solution for both accuracy and processing speed, given the input size.

Please create a generalizable function that accepts three inputs:

A Pydantic schema defining the structure of needles to be extracted
A dynamic text corpus (haystack)
An list of example sentences (needles)

from typing import List, Type, TypeVar
from pydantic import BaseModel

T = TypeVar('T', bound=BaseModel)

def extract_multi_needle(schema: Type[T], haystack: str, example_needles: List[str]) -> List[T]:
    """
    Extracts and structures information from a large text corpus based on a given schema and examples.

    Args:
    schema (Type[T]): A Pydantic model defining the structure of the needle to be extracted.
    haystack (str): The large text corpus to search through (haystack).
    example_needles (List[str]): A list of example sentences (needles).

    Returns:
    List[T]: A list of extracted needles conforming to the provided schema.
    """
    # Implementation goes here
    extracted_needles = []
    return extracted_needles

The function should adapt to various Pydantic schemas and text inputs. For example:

from typing import Optional
from pydantic import BaseModel, Field

class TechCompany(BaseModel):
    name: Optional[str] = Field(default=None, description="The full name of the technology company")
    location: Optional[str] = Field(default=None, description="City and country where the company is headquartered")
    employee_count: Optional[int] = Field(default=None, description="Total number of employees")
    founding_year: Optional[int] = Field(default=None, description="Year the company was established")
    is_public: Optional[bool] = Field(default=None, description="Whether the company is publicly traded (True) or privately held (False)")
    valuation: Optional[float] = Field(default=None, description="Company's valuation in billions of dollars")
    primary_focus: Optional[str] = Field(default=None, description="Main area of technology or industry the company focuses on")

Your function should return extracted data based on the provided schema. Demonstrate its ability to handle different data structures efficiently and accurately. Note that your solution will be tested on a new needle and haystack example to evaluate its adaptability and performance.

Data

haystack.txt

</aside>

Submission

For submission, provide your code via a GitHub repository or zip file and deliver the extracted data in a structured format (CSV). Additionally, create a short video walkthrough (max 5 minutes) explaining your solution and approach. Submit all materials to [email protected].

Evaluation

Accuracy in finding and extracting needles
Processing speed and cost
Code quality and organization
Clarity of video explanation
Questions asked, both during and after the assessment </aside>