Your challenge is to develop a solution that finds and extracts specific pieces of information ("needles") from a large text corpus ("haystack"). You'll receive a text file containing the entire Dune Universe (~3.5 million words), within which are hidden "needles" containing information about fictional technology companies. Each needle includes details such as company name, location, employee count, founding year, public/private status, valuation, and primary focus.
Example:
"Ryoshi, based in Neo Tokyo, Japan, is a private quantum computing firm founded in 2031, currently valued at $8.7 billion with 1,200 employees focused on quantum cryptography."
Your task is to find all needles, extract the relevant information, and present the data in a structured table. You have 48 hours to complete this challenge, but the expected time commitment is 3-6 hours. Use Python with any packages or tools you find suitable. The key is to optimize your solution for both accuracy and processing speed, given the input size.
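Given the input size, one way to keep processing fast is to discard most of the haystack cheaply before running any expensive extraction step. The sketch below is only an illustration of that idea, not a required approach: the helper name `candidate_sentences` is hypothetical, and the two signals it checks (a four-digit year and a dollar amount) are assumptions taken from the single example needle above, so they would need to change for a different needle type.

```python
import re
from typing import Iterator

# Assumed signals, taken from the example needle in this brief:
# a founding year and a dollar valuation. Not a general rule.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
MONEY = re.compile(r"\$\s?\d")


def candidate_sentences(haystack: str) -> Iterator[str]:
    """Yield only sentences that contain both a year and a dollar amount."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    for sentence in SENTENCE_END.split(haystack):
        if YEAR.search(sentence) and MONEY.search(sentence):
            yield sentence.strip()
```

Only the surviving sentences would then be passed to the slower, more accurate extraction step, whatever that step is.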
Please create a generalizable function that accepts three inputs:
from typing import List, Type, TypeVar

from pydantic import BaseModel

T = TypeVar('T', bound=BaseModel)


def extract_multi_needle(schema: Type[T], haystack: str, example_needles: List[str]) -> List[T]:
    """
    Extracts and structures information from a large text corpus based on a given schema and examples.

    Args:
        schema (Type[T]): A Pydantic model defining the structure of the needle to be extracted.
        haystack (str): The large text corpus to search through (haystack).
        example_needles (List[str]): A list of example sentences (needles).

    Returns:
        List[T]: A list of extracted needles conforming to the provided schema.
    """
    # Implementation goes here
    extracted_needles = []
    return extracted_needles
The function should adapt to various Pydantic schemas and text inputs. For example:
from typing import Optional

from pydantic import BaseModel, Field


class TechCompany(BaseModel):
    name: Optional[str] = Field(default=None, description="The full name of the technology company")
    location: Optional[str] = Field(default=None, description="City and country where the company is headquartered")
    employee_count: Optional[int] = Field(default=None, description="Total number of employees")
    founding_year: Optional[int] = Field(default=None, description="Year the company was established")
    is_public: Optional[bool] = Field(default=None, description="Whether the company is publicly traded (True) or privately held (False)")
    valuation: Optional[float] = Field(default=None, description="Company's valuation in billions of dollars")
    primary_focus: Optional[str] = Field(default=None, description="Main area of technology or industry the company focuses on")
Your function should return extracted data based on the provided schema. Demonstrate its ability to handle different data structures efficiently and accurately. Note that your solution will be tested on a new needle and haystack example to evaluate its adaptability and performance.
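To make the mapping from needle text to schema fields concrete, here is a minimal sketch that parses the example sentence from this brief into a dictionary keyed by the `TechCompany` field names. The pattern `NEEDLE` and the helper `parse_needle` are hypothetical and fitted to that one example sentence only; actual needles may be phrased differently, and a regex like this will not adapt to a new schema, so treat it as a baseline illustration rather than a solution.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern, fitted to the single example needle in the brief.
NEEDLE = re.compile(
    r"(?P<name>[^,]+), based in (?P<location>.+?), is a (?P<status>public|private) "
    r".*?founded in (?P<founding_year>\d{4}), currently valued at "
    r"\$(?P<valuation>[\d.]+) billion with (?P<employee_count>[\d,]+) employees "
    r"focused on (?P<primary_focus>[^.]+)\."
)


def parse_needle(sentence: str) -> Optional[Dict[str, object]]:
    """Return schema-shaped fields for one sentence, or None if it does not match."""
    match = NEEDLE.search(sentence)
    if match is None:
        return None
    return {
        "name": match["name"].strip(),
        "location": match["location"],
        # "1,200" -> 1200
        "employee_count": int(match["employee_count"].replace(",", "")),
        "founding_year": int(match["founding_year"]),
        # "private" -> False, "public" -> True
        "is_public": match["status"] == "public",
        # Kept in billions of dollars, as the field description specifies
        "valuation": float(match["valuation"]),
        "primary_focus": match["primary_focus"].strip(),
    }
```

The resulting dictionary can be passed to the schema as keyword arguments, for example `TechCompany(**fields)`, so that Pydantic validates the types.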
For submission, provide your code via a GitHub repository or zip file and deliver the extracted data in a structured format (CSV). Additionally, create a short video walkthrough (max 5 minutes) explaining your solution and approach. Submit all materials to [email protected].
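For the CSV deliverable, the extracted records can be written with the standard library alone. The sketch below assumes each record has already been converted to a plain dictionary (with Pydantic v2 that would be `model_dump()`, with v1 `dict()`); the helper name `rows_to_csv` is hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def rows_to_csv(rows: Iterable[Dict[str, object]], fieldnames: List[str]) -> str:
    """Serialize dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        # Missing keys and None values are written as empty cells.
        writer.writerow(row)
    return buffer.getvalue()
```

To write a file instead of a string, the same `csv.DictWriter` calls work on a handle opened with `open(path, "w", newline="")`.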
<aside>