Data is the lifeblood of modern applications, but it rarely comes in the clean, structured format we need. This is especially true in the real estate industry, where valuable listing information is scattered across thousands of websites, PDFs, and emails, each with its own unique layout.
Traditionally, developers have relied on brittle web scrapers and complex regular expressions (regex) to parse this data. This approach is a maintenance nightmare. A tiny change to a website's HTML can break your entire data pipeline, leading to hours of frustrating debugging.
But what if you could bypass this fragility? What if you could simply describe the data you want and let an AI figure out how to get it?
This is where intelligent data extraction comes in. In this guide, we'll walk you through building a powerful real-estate listing parser using extract.do, an AI-powered API that turns unstructured text into clean, developer-ready JSON.
Let's look at two typical real estate listings.
Listing 1: A simple text description
For sale: A charming 2-bedroom, 1.5-bath bungalow located at 456 Oak Street, Springfield. This beautiful home boasts a newly renovated kitchen and a spacious backyard. Asking price is $350,000. For more details, contact our agent.
Listing 2: A snippet from a different website
<div>
  <h2>Condo with a View! - $475,000</h2>
  <p>789 Pine Ave, Metro City</p>
  <ul>
    <li>Beds: 3</li>
    <li>Baths: 2</li>
    <li>Amenities: Rooftop pool, gym, concierge</li>
  </ul>
</div>
Parsing these with a single, traditional scraper is nearly impossible. The data labels are different (For sale vs. a price in the <h2>), the structure is inconsistent, and the order of information varies. You'd need to write and maintain separate, complex rules for each source.
extract.do uses a fundamentally different approach. Instead of writing rigid rules, you provide a simple schema that describes your desired output. The AI then reads the source text, understands the context and semantics, and intelligently maps the information to your schema.
It doesn't care if the price is in a headline or a paragraph; it understands what a "price" is. It can identify "2-bedroom" and "Beds: 3" as the same type of data. This makes it incredibly resilient and flexible.
We'll build a simple Node.js application to demonstrate how easy this is.
First, make sure you have Node.js installed. Then, create a new project and install the extract.do SDK.
mkdir real-estate-parser
cd real-estate-parser
npm init -y
npm install @do-sdk/core
Defining your schema is the most crucial step. Think about the perfect JSON object for a real estate listing. What information do you need? Let's define a schema for it. We want the address, price, number of bedrooms and bathrooms, and a list of amenities.
const listingSchema = {
  address: 'string',
  city: 'string',
  price: 'number',
  bedrooms: 'number',
  bathrooms: 'number',
  amenities: 'string[]', // An array of strings
};
This simple object tells the AI exactly what to look for and what data type each field should be.
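Because the schema is just a plain object of type names, you can also sanity-check the AI's output at runtime before trusting it downstream. The helper below is not part of the extract.do SDK — it's a minimal local sketch that verifies each field of an extracted object matches the type declared in the schema:

```typescript
// Supported type names in our simple schema format
type FieldType = 'string' | 'number' | 'string[]';

// Returns true only if every schema field exists in `data`
// with the declared type.
function matchesSchema(
  data: Record<string, unknown>,
  schema: Record<string, FieldType>,
): boolean {
  return Object.entries(schema).every(([key, expected]) => {
    const value = data[key];
    if (expected === 'string[]') {
      return Array.isArray(value) && value.every((v) => typeof v === 'string');
    }
    return typeof value === expected;
  });
}
```

A guard like this is a cheap safety net: if the model ever returns a price as a string, you can catch it at the boundary instead of deep inside your application.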
Now, let's put it all together. We'll use the code from our first listing example and pass it to the extract.do API along with our schema.
// parser.ts
import { Do } from '@do-sdk/core';

// Any unstructured text from documents, emails, or websites
const listingText = `
For sale: A charming 2-bedroom, 1.5-bath bungalow located at 456 Oak Street, Springfield.
This beautiful home boasts a newly renovated kitchen and a spacious backyard.
Asking price is $350,000. For more details, contact our agent.
`;

// The schema tells the AI what data to extract
const listingSchema = {
  address: 'string',
  city: 'string',
  price: 'number',
  bedrooms: 'number',
  bathrooms: 'number',
  amenities: 'string[]',
};

async function parseListing(text: string) {
  console.log('--- Processing Listing ---');
  console.log(text);

  try {
    const structuredData = await Do.extract('extract.do', {
      text: text,
      schema: listingSchema,
    });

    console.log('\n--- Extracted JSON ---');
    console.log(JSON.stringify(structuredData, null, 2));
  } catch (error) {
    console.error('Error extracting data:', error);
  }
}

// Run the function with our first listing
parseListing(listingText);
Save the file as parser.ts and run it (you'll need ts-node or compile it first).
npx ts-node parser.ts
You'll see the following output:
--- Processing Listing ---
For sale: A charming 2-bedroom, 1.5-bath bungalow located at 456 Oak Street, Springfield.
This beautiful home boasts a newly renovated kitchen and a spacious backyard.
Asking price is $350,000. For more details, contact our agent.
--- Extracted JSON ---
{
  "address": "456 Oak Street",
  "city": "Springfield",
  "price": 350000,
  "bedrooms": 2,
  "bathrooms": 1.5,
  "amenities": [
    "newly renovated kitchen",
    "spacious backyard"
  ]
}
Look at that! Clean, structured, and ready to be stored in a database or used in an application. The AI correctly identified all the fields, converted the price and bathroom count to numbers, and even extracted key features into the amenities array.
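Once the data matches a known shape, persisting it is straightforward. As one hedged sketch (the `Listing` interface and `toRow` helper below are our own, not part of the SDK), you might flatten a listing into a row shape for a relational table, joining the amenities array into a single text column:

```typescript
// Our own interface mirroring the extraction schema
interface Listing {
  address: string;
  city: string;
  price: number;
  bedrooms: number;
  bathrooms: number;
  amenities: string[];
}

// Hypothetical helper: flatten a listing into a row of
// primitive columns, ready for an INSERT statement.
function toRow(listing: Listing): Record<string, string | number> {
  return {
    address: listing.address,
    city: listing.city,
    price: listing.price,
    bedrooms: listing.bedrooms,
    bathrooms: listing.bathrooms,
    amenities: listing.amenities.join(', '),
  };
}
```

Whether you store amenities as joined text, a JSON column, or a separate table is a schema-design choice; the point is that the extracted object is already regular enough to map mechanically.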
Now for the real test. Let's feed it our second, completely different listing format without changing a single line of our extraction logic.
Simply update the listingText variable and run it again:
// ... (keep the same schema and function)

const listingText2 = `
<div>
  <h2>Condo with a View! - $475,000</h2>
  <p>789 Pine Ave, Metro City</p>
  <ul>
    <li>Beds: 3</li>
    <li>Baths: 2</li>
    <li>Amenities: Rooftop pool, gym, concierge</li>
  </ul>
</div>
`;

parseListing(listingText2);
Output:
--- Extracted JSON ---
{
  "address": "789 Pine Ave",
  "city": "Metro City",
  "price": 475000,
  "bedrooms": 3,
  "bathrooms": 2,
  "amenities": [
    "Rooftop pool",
    "gym",
    "concierge"
  ]
}
It worked perfectly again. This is the power of AI-driven data extraction. The same code handles vastly different inputs because it understands the meaning of the data, not just its position on a page.
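Because every source is normalized to the same schema, downstream code stops caring where a listing came from. Using the two extracted objects above (the `Listing` interface is our own, mirroring the schema), ordinary filtering and sorting just works across both:

```typescript
// Interface mirroring the extraction schema
interface Listing {
  address: string;
  city: string;
  price: number;
  bedrooms: number;
  bathrooms: number;
  amenities: string[];
}

// The two results from our wildly different sources,
// now in one uniform array.
const listings: Listing[] = [
  {
    address: '456 Oak Street',
    city: 'Springfield',
    price: 350000,
    bedrooms: 2,
    bathrooms: 1.5,
    amenities: ['newly renovated kitchen', 'spacious backyard'],
  },
  {
    address: '789 Pine Ave',
    city: 'Metro City',
    price: 475000,
    bedrooms: 3,
    bathrooms: 2,
    amenities: ['Rooftop pool', 'gym', 'concierge'],
  },
];

// Plain array operations now work across every source.
const affordable = listings.filter((l) => l.price < 400000);
```

One free-text description and one HTML snippet end up as interchangeable records in the same query.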
This simple experiment highlights a major shift in how we handle unstructured data: you describe the output you want, and the AI figures out how to get it from whatever input it's given.
Ready to stop wrestling with brittle scrapers and start building? Get started with extract.do and turn data chaos into developer-ready output with a single API call.