Web scraping has long been a powerful tool for gathering data, but let's be honest: it can be a major headache. You write a scraper that meticulously targets specific CSS selectors or XPath queries, only for it to break the moment a developer pushes a minor UI update. The constant maintenance, the brittle code, and the complexity of handling different page layouts can turn a simple data extraction task into a perpetual engineering chore.
But what if you could scrape a website without writing a scraper? What if you could simply describe the data you want—like "product name," "price," and "specifications"—and let an AI figure out how to find it?
This isn't a futuristic dream; it's a new reality powered by AI-driven data extraction. In this tutorial, we'll show you how to use the extract.do API to pull clean, structured JSON from any website's HTML with a single, simple API call.
Traditional web scraping tools rely on a rigid set of instructions. You tell them: "Go to this page, find the <h1> element with the class product-title, and get its text."
This approach has several critical flaws:

- It's brittle. The moment a class name changes or the markup is restructured, your selectors stop matching and the scraper breaks.
- It doesn't generalize. Every page layout needs its own set of rules, so each new site (or redesign) means writing new extraction logic.
- It's high-maintenance. You're permanently on the hook for patching selectors every time the target site ships an update.

This fragility means you spend more time fixing scrapers than you do using the data they provide.
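To see why, here's roughly what a selector-based scraper looks like. This is a minimal sketch using the cheerio library; the URL and selectors are invented for illustration.

import * as cheerio from 'cheerio';

// Every selector below is a hard-coded assumption about the page's markup.
// The URL is a placeholder for whatever page you're targeting.
const html = await (await fetch('https://example.com/products/some-laptop')).text();
const $ = cheerio.load(html);

const product = {
  productName: $('h1.product-title').text().trim(),
  price: parseFloat($('span.price').text().replace('$', '')),
};

The moment the site renames product-title or moves the price into a different element, those selectors silently come back empty and product fills up with blank strings and NaN.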
Instead of relying on a fragile map of the website's structure, extract.do uses AI to understand the meaning (the semantics) of the content. You don't tell it where the data is; you tell it what the data is.
The process is incredibly simple:

1. Grab the raw HTML (or any other text) you want to process.
2. Define a simple JSON schema describing the fields and data types you need.
3. Send both to extract.do and get clean, structured JSON back.

This approach is more resilient, more flexible, and infinitely simpler.
Let's walk through a real-world example. Imagine we want to pull key details from a product page on an e-commerce site.
First, you need the raw HTML content of the page you want to scrape. You can get this programmatically using a library like axios or fetch in your language of choice. For this example, we'll use a sample HTML string.
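If you do want to fetch a live page, that step might look something like this (a minimal sketch; the URL is a placeholder):

// Download the raw HTML of the target page. Swap in the real URL you want to scrape.
const response = await fetch('https://example.com/products/supercharge-pro');
const pageHtml = await response.text();

Here's the sample HTML we'll work with: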
<!-- Fictional Product Page HTML -->
<main class="product-details">
  <div class="gallery">...</div>
  <div class="info">
    <h1>SuperCharge Pro Laptop</h1>
    <span class="brand">By TechCorp</span>
    <p class="description">The fastest laptop for professionals on the go.</p>
    <div class="pricing">
      <span class="price">$1499.99</span>
      <button>Add to Cart</button>
    </div>
    <div class="specs">
      <h3>Specifications</h3>
      <ul>
        <li>CPU: M3 Ultra Chip</li>
        <li>Memory: 32GB Unified RAM</li>
        <li>Storage: 1TB SSD</li>
      </ul>
    </div>
  </div>
</main>
This is where the magic happens. Instead of writing CSS selectors, you simply define a JSON object that represents the data you want. You specify the field name and the expected data type.
For our product page, the schema would look like this:
const productSchema = {
  productName: 'string',
  brand: 'string',
  price: 'number',
  specifications: {
    cpu: 'string',
    memory: 'string',
    storage: 'string',
  }
};
Notice how we can even define nested objects. The AI is smart enough to understand the hierarchy and find the related specifications.
Now, combine the HTML and your schema in a single API call to extract.do.
import { Do } from '@do-sdk/core';

// The unstructured HTML content from the product page
const productHtml = `
  <main class="product-details">
    <div class="gallery">...</div>
    <div class="info">
      <h1>SuperCharge Pro Laptop</h1>
      <span class="brand">By TechCorp</span>
      <p class="description">The fastest laptop for professionals on the go.</p>
      <div class="pricing">
        <span class="price">$1499.99</span>
        <button>Add to Cart</button>
      </div>
      <div class="specs">
        <h3>Specifications</h3>
        <ul>
          <li>CPU: M3 Ultra Chip</li>
          <li>Memory: 32GB Unified RAM</li>
          <li>Storage: 1TB SSD</li>
        </ul>
      </div>
    </div>
  </main>
`;

// Simply define the data structure you want
const structuredProduct = await Do.extract('extract.do', {
  text: productHtml,
  schema: {
    productName: 'string',
    brand: 'string',
    price: 'number',
    specifications: {
      cpu: 'string',
      memory: 'string',
      storage: 'string',
    }
  }
});

console.log(JSON.stringify(structuredProduct, null, 2));
Running this code will produce the following structured output, ready to be used in your application:
{
  "productName": "SuperCharge Pro Laptop",
  "brand": "TechCorp",
  "price": 1499.99,
  "specifications": {
    "cpu": "M3 Ultra Chip",
    "memory": "32GB Unified RAM",
    "storage": "1TB SSD"
  }
}
That's it. No XPath. No regex. No brittle selector logic. You got perfectly structured data even though the price was prefixed with a "$" and the specifications were inside an unordered list. The AI handled it all.
extract.do can process virtually any unstructured text source, including raw text, emails, PDFs, Word documents, HTML content from websites, and more. Just provide the text content, and our AI will do the rest.
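For example, pulling structured fields out of a plain-text email works exactly the same way as the HTML walkthrough above. The email body and field names below are made up purely for illustration; the call itself reuses the Do import from earlier.

const emailText = `
Hi team,

Please ship order #48213 to Jane Smith at 42 Harbor Lane, Portland, OR by Friday.

Thanks,
Sam
`;

const orderDetails = await Do.extract('extract.do', {
  text: emailText,
  schema: {
    orderNumber: 'string',
    customerName: 'string',
    shippingAddress: 'string',
    deadline: 'string',
  }
});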
You provide a simple JSON schema describing the fields and data types you need. The AI uses this schema as a guide to find and structure the relevant information from the source text, requiring no complex rules or templates.
Unlike brittle regex or scrapers that break with layout changes, extract.do uses AI to understand the semantic context of the data. This makes it far more resilient, flexible, and capable of handling varied data formats without manual rule-setting.
extract.do is built on a scalable architecture designed to handle high-volume, real-time data processing. It's ideal for powering applications, enriching user profiles, or feeding data into your analytics pipelines.
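As a rough sketch of what that can look like in practice, you might fan extraction out across a batch of pages. The URLs and the simple Promise.all concurrency here are illustrative assumptions, not a prescribed pattern; the code reuses the Do import and productSchema defined earlier.

// Hypothetical list of product pages to process in parallel.
const urls = [
  'https://example.com/products/laptop-a',
  'https://example.com/products/laptop-b',
  'https://example.com/products/laptop-c',
];

const products = await Promise.all(
  urls.map(async (url) => {
    const html = await (await fetch(url)).text();
    return Do.extract('extract.do', { text: html, schema: productSchema });
  })
);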
By shifting from structural scraping to semantic extraction, you can drastically reduce development time and eliminate maintenance headaches. Your code becomes decoupled from a website's fickle design, allowing you to focus on building features, not fixing scrapers.
Ready to turn any website into a clean, structured API? Get started with extract.do today and experience the future of data extraction.