Data is the lifeblood of modern applications, but it rarely comes in the clean, structured format we need. This is especially true in the real estate industry, where valuable listing information is scattered across thousands of websites, PDFs, and emails, each with its own unique layout.
Traditionally, developers have relied on brittle web scrapers and complex regular expressions (regex) to parse this data. This approach is a maintenance nightmare. A tiny change to a website's HTML can break your entire data pipeline, leading to hours of frustrating debugging.
But what if you could bypass this fragility? What if you could simply describe the data you want and let an AI figure out how to get it?
This is where intelligent data extraction comes in. In this guide, we'll walk you through building a powerful real-estate listing parser using extract.do, an AI-powered API that turns unstructured text into clean, developer-ready JSON.
Let's look at two typical real estate listings.
Listing 1: A simple text description
For sale: A charming 2-bedroom, 1.5-bath bungalow located at 456 Oak Street, Springfield. This beautiful home boasts a newly renovated kitchen and a spacious backyard. Asking price is $350,000. For more details, contact our agent.
Listing 2: A snippet from a different website
<div>
  <h2>Condo with a View! - $475,000</h2>
  <p>789 Pine Ave, Metro City</p>
  <ul>
    <li>Beds: 3</li>
    <li>Baths: 2</li>
    <li>Amenities: Rooftop pool, gym, concierge</li>
  </ul>
</div>
Parsing these with a single, traditional scraper is nearly impossible. The data labels are different (For sale vs. a price in the <h2>), the structure is inconsistent, and the order of information varies. You'd need to write and maintain separate, complex rules for each source.
extract.do uses a fundamentally different approach. Instead of writing rigid rules, you provide a simple schema that describes your desired output. The AI then reads the source text, understands the context and semantics, and intelligently maps the information to your schema.
It doesn't care if the price is in a headline or a paragraph; it understands what a "price" is. It can identify "2-bedroom" and "Beds: 3" as the same type of data. This makes it incredibly resilient and flexible.
We'll build a simple Node.js application to demonstrate how easy this is.
First, make sure you have Node.js installed. Then, create a new project and install the extract.do SDK.
mkdir real-estate-parser
cd real-estate-parser
npm init -y
npm install @do-sdk/core
Defining your schema is the most crucial step. Think about the perfect JSON object for a real estate listing. What information do you need? Let's define a schema for it. We want the address, price, number of bedrooms and bathrooms, and a list of amenities.
const listingSchema = {
  address: 'string',
  city: 'string',
  price: 'number',
  bedrooms: 'number',
  bathrooms: 'number',
  amenities: 'string[]', // An array of strings
};
This simple object tells the AI exactly what to look for and what data type each field should be.
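Because the schema is just a plain object of type names, you can also sanity-check the AI's output at runtime before trusting it downstream. The helper below is not part of the extract.do SDK — it's a minimal local sketch that verifies each field of an extracted object matches the type declared in the schema:

```typescript
// Supported type names in our simple schema format
type FieldType = 'string' | 'number' | 'string[]';

// Returns true only if every schema field exists in `data`
// with the declared type.
function matchesSchema(
  data: Record<string, unknown>,
  schema: Record<string, FieldType>,
): boolean {
  return Object.entries(schema).every(([key, expected]) => {
    const value = data[key];
    if (expected === 'string[]') {
      return Array.isArray(value) && value.every((v) => typeof v === 'string');
    }
    return typeof value === expected;
  });
}
```

A guard like this is a cheap safety net: if the model ever returns a price as a string, you can catch it at the boundary instead of deep inside your application.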
Now, let's put it all together. We'll use the code from our first listing example and pass it to the extract.do API along with our schema.
// parser.ts
import { Do } from '@do-sdk/core';

// Any unstructured text from documents, emails, or websites
const listingText = `
For sale: A charming 2-bedroom, 1.5-bath bungalow located at 456 Oak Street, Springfield.
This beautiful home boasts a newly renovated kitchen and a spacious backyard.
Asking price is $350,000. For more details, contact our agent.
`;

// The schema tells the AI what data to extract
const listingSchema = {
  address: 'string',
  city: 'string',
  price: 'number',
  bedrooms: 'number',
  bathrooms: 'number',
  amenities: 'string[]',
};

async function parseListing(text: string) {
  console.log('--- Processing Listing ---');
  console.log(text);

  try {
    const structuredData = await Do.extract('extract.do', {
      text: text,
      schema: listingSchema,
    });

    console.log('\n--- Extracted JSON ---');
    console.log(JSON.stringify(structuredData, null, 2));
  } catch (error) {
    console.error('Error extracting data:', error);
  }
}

// Run the function with our first listing
parseListing(listingText);
Save the file as parser.ts and run it (you'll need ts-node or compile it first).
npx ts-node parser.ts
You'll see the following output:
--- Processing Listing ---
For sale: A charming 2-bedroom, 1.5-bath bungalow located at 456 Oak Street, Springfield.
This beautiful home boasts a newly renovated kitchen and a spacious backyard.
Asking price is $350,000. For more details, contact our agent.
--- Extracted JSON ---
{
  "address": "456 Oak Street",
  "city": "Springfield",
  "price": 350000,
  "bedrooms": 2,
  "bathrooms": 1.5,
  "amenities": [
    "newly renovated kitchen",
    "spacious backyard"
  ]
}
Look at that! Clean, structured, and ready to be stored in a database or used in an application. The AI correctly identified all the fields, converted the price and bathroom count to numbers, and even extracted key features into the amenities array.
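Once the data matches a known shape, persisting it is straightforward. As one hedged sketch (the `Listing` interface and `toRow` helper below are our own, not part of the SDK), you might flatten a listing into a row shape for a relational table, joining the amenities array into a single text column:

```typescript
// Our own interface mirroring the extraction schema
interface Listing {
  address: string;
  city: string;
  price: number;
  bedrooms: number;
  bathrooms: number;
  amenities: string[];
}

// Hypothetical helper: flatten a listing into a row of
// primitive columns, ready for an INSERT statement.
function toRow(listing: Listing): Record<string, string | number> {
  return {
    address: listing.address,
    city: listing.city,
    price: listing.price,
    bedrooms: listing.bedrooms,
    bathrooms: listing.bathrooms,
    amenities: listing.amenities.join(', '),
  };
}
```

Whether you store amenities as joined text, a JSON column, or a separate table is a schema-design choice; the point is that the extracted object is already regular enough to map mechanically.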
Now for the real test. Let's feed it our second, completely different listing format without changing a single line of our extraction logic.
Simply update the listingText variable and run it again:
// ... (keep the same schema and function)

const listingText2 = `
<div>
  <h2>Condo with a View! - $475,000</h2>
  <p>789 Pine Ave, Metro City</p>
  <ul>
    <li>Beds: 3</li>
    <li>Baths: 2</li>
    <li>Amenities: Rooftop pool, gym, concierge</li>
  </ul>
</div>
`;

parseListing(listingText2);
Output:
--- Extracted JSON ---
{
  "address": "789 Pine Ave",
  "city": "Metro City",
  "price": 475000,
  "bedrooms": 3,
  "bathrooms": 2,
  "amenities": [
    "Rooftop pool",
    "gym",
    "concierge"
  ]
}
It worked perfectly again. This is the power of AI-driven data extraction. The same code handles vastly different inputs because it understands the meaning of the data, not just its position on a page.
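Because every source is normalized to the same schema, downstream code stops caring where a listing came from. Using the two extracted objects above (the `Listing` interface is our own, mirroring the schema), ordinary filtering and sorting just works across both:

```typescript
// Interface mirroring the extraction schema
interface Listing {
  address: string;
  city: string;
  price: number;
  bedrooms: number;
  bathrooms: number;
  amenities: string[];
}

// The two results from our wildly different sources,
// now in one uniform array.
const listings: Listing[] = [
  {
    address: '456 Oak Street',
    city: 'Springfield',
    price: 350000,
    bedrooms: 2,
    bathrooms: 1.5,
    amenities: ['newly renovated kitchen', 'spacious backyard'],
  },
  {
    address: '789 Pine Ave',
    city: 'Metro City',
    price: 475000,
    bedrooms: 3,
    bathrooms: 2,
    amenities: ['Rooftop pool', 'gym', 'concierge'],
  },
];

// Plain array operations now work across every source.
const affordable = listings.filter((l) => l.price < 400000);
```

One free-text description and one HTML snippet end up as interchangeable records in the same query.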
This simple experiment highlights a major shift in how we handle unstructured data: you describe the output you want, and the AI figures out how to get it from whatever input it's given.
Ready to stop wrestling with brittle scrapers and start building? Get started with extract.do and turn data chaos into developer-ready output with a single API call.