Data is often called the new oil. But for many developers and businesses, much of that oil is trapped in rock: unstructured text, messy HTML, and inconsistent documents. The traditional tools for extracting this value, such as regex, custom parsers, and complex ETL pipelines, are often brittle, slow to build, and hard to maintain.
What if you could define your data needs just as you define your application's logic or infrastructure? What if your data definitions were version-controlled, testable, and lived right alongside your code?
This is the promise of 'Data as Code': an approach that shifts data processing from complex, imperative scripting to simple, declarative definitions. Let's explore what this concept means and how platforms like extract.do put it into practice.
If you've ever been tasked with extracting information, you know the drill.
The common thread here is brittleness. Traditional methods are tightly coupled to the structure of the source data. Even a small change, such as a renamed attribute or a reordered element, can cause extraction to fail outright, leading to data loss and frantic maintenance cycles.
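To illustrate that coupling, here is a minimal sketch of a regex-based extractor. The two markup snippets are hypothetical stand-ins for the same page before and after a redesign; the pattern only works while the value sits inside one specific tag.

```typescript
// Hypothetical markup for the same contact before and after a site redesign.
const oldMarkup = '<span id="email">j.doe@example.com</span>';
const newMarkup = '<a class="contact-link" href="mailto:j.doe@example.com">Email John</a>';

// A pattern coupled to the exact tag and id that wrap the value.
const emailBySpanId = /<span id="email">([^<]+)<\/span>/;

function extractEmail(html: string): string | null {
  const match = emailBySpanId.exec(html);
  return match?.[1] ?? null;
}

const beforeRedesign = extractEmail(oldMarkup);
const afterRedesign = extractEmail(newMarkup);

console.log(beforeRedesign); // j.doe@example.com
console.log(afterRedesign); // null
```

The email address is still present in the redesigned markup, but the extractor returns nothing because it was written against the wrapper, not the data.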
'Data as Code' flips the script. Instead of telling the machine how to find the data step-by-step, you simply declare what data you want.
Inspired by movements like Infrastructure as Code (IaC), 'Data as Code' treats your desired data structure as a core, version-controlled asset of your application. You define your target data schema using a familiar format, like a JSON Schema or a TypeScript interface, and an intelligent agent handles the rest.
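As a rough sketch of what such a definition can look like, here is one target shape written both ways. The ProductListing name and its fields are hypothetical, chosen only for illustration; the schema object uses standard JSON Schema keywords (type, properties, required, additionalProperties).

```typescript
// Hypothetical target shape, expressed as a TypeScript interface.
interface ProductListing {
  title: string;
  priceUsd: number;
  inStock: boolean;
}

// The same shape expressed as a JSON Schema document.
const productListingSchema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    priceUsd: { type: 'number' },
    inStock: { type: 'boolean' },
  },
  required: ['title', 'priceUsd', 'inStock'],
  additionalProperties: false,
} as const;

// A sample value that satisfies the interface at compile time.
const sample: ProductListing = { title: 'Desk lamp', priceUsd: 24.5, inStock: true };

// Collect any required schema keys that are absent from the sample.
const missingKeys = productListingSchema.required.filter((key) => !(key in sample));
console.log(missingKeys.length); // 0
```

Either form is a plain file that can be committed, reviewed, and diffed like any other source code.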
This makes your data extraction workflows:
extract.do is a platform built from the ground up on the 'Data as Code' principle. It uses advanced AI agents to turn unstructured text, documents, and websites into clean, structured JSON with a simple API call.
Let's see it in action. Imagine you want to extract contact information from a block of text. Instead of writing messy regex, you can do this:
import { DO } from '@do-inc/sdk';

// Initialize the .do client
const secret = process.env.DO_SECRET;
const digo = new DO({ secret });

// Define the source text and desired data structure
const sourceText = 'Contact John Doe at j.doe@example.com. He is the CEO of Acme Inc.';

// This interface is your 'Data as Code' definition!
interface ContactInfo {
  name: string;
  email: string;
  company: string;
}

// Run the extraction agent
const extractedData = await digo
  .agent<ContactInfo>('extract')
  .run({
    source: sourceText,
    description: 'Extract the full name, email, and company from the text.'
  });

console.log(extractedData);
// {
//   "name": "John Doe",
//   "email": "j.doe@example.com",
//   "company": "Acme Inc."
// }
The ContactInfo interface is the heart of this workflow. You declare the exact fields and types you need, and the extract.do agent intelligently parses the sourceText to populate that structure. No CSS selectors, no XPath, no regex. Just a clear, readable definition of your desired output.
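Because the definition is ordinary code, it can also be exercised in a test. TypeScript interfaces are erased at compile time, so checking a value at runtime needs a small guard. The sketch below assumes a hand-written helper, isContactInfo, which is hypothetical and not part of the SDK; whether the platform validates responses itself is not covered here. The sample objects stand in for agent output, and no network call is made.

```typescript
interface ContactInfo {
  name: string;
  email: string;
  company: string;
}

// Hypothetical helper: confirms at runtime that a value has the declared shape.
function isContactInfo(value: unknown): value is ContactInfo {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.name === 'string' &&
    typeof record.email === 'string' &&
    typeof record.company === 'string'
  );
}

// Stand-ins for agent responses, used only to exercise the guard.
const goodResult: unknown = {
  name: 'John Doe',
  email: 'j.doe@example.com',
  company: 'Acme Inc.',
};
const badResult: unknown = { name: 'John Doe', email: null };

console.log(isContactInfo(goodResult)); // true
console.log(isContactInfo(badResult)); // false
```

A guard like this can sit in the same repository as the interface, so a change to the definition and a change to its test are reviewed together.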
Adopting this approach offers a wealth of benefits over traditional methods.
Instead of building and maintaining a complex ETL pipeline, you write a few lines of code. This dramatically reduces development time and allows you to focus on using the data, not just fighting to get it.
AI agents aren't looking for a specific <span> with id="email". They recognize that "j.doe@example.com" is an email address belonging to "John Doe". Because extraction relies on context rather than on where a value sits in the markup, your code is far less likely to break when a website's layout changes.
The same 'Data as Code' approach works whether your source is raw text, the HTML from a URL, or text extracted from a document. extract.do is designed to be source-agnostic, allowing you to build a single, consistent workflow for all your data extraction needs.
While it's fantastic for web scraping, this paradigm is a comprehensive solution for any unstructured data problem:
'Data as Code' represents a fundamental shift in how we interact with information. It replaces brittle, imperative scripts with intelligent, declarative definitions that live as an integral part of your application.
By treating data schemas as code, you empower your team to move faster, build more resilient systems, and unlock the value trapped in unstructured data with unprecedented ease.
Ready to stop wrestling with data and start defining it? Visit extract.do to see how you can implement 'Data as Code' today.