June 8, 2023

Five Years of Training Data Builds CRE’s Most Powerful Generative AI Real Estate Solution

Prophia data scientists have always known that the road to building one of the first advanced AI models for CRE would be long. Five years later, and 100,000 CRE documents smarter, Prophia is one of the most comprehensive CRE AI models on the market.

Generative AI is a hot topic lately, but it has been our mainstay since 2018, when Prophia was founded by Cameron Steele and Chris Hsu. Since then, our machine learning algorithms and ML models have come a long way, which is a feat of AI engineering and data science when you consider how fiercely private the CRE industry tends to be with its data.

So how were we able to build this powerful generative AI model? And where do we go from here as generative AI ramps up in industries beyond CRE? In our exploration below, we’ll reveal how Prophia built one of the most powerful CRE training datasets for its AI on direct customer relationships, trust, and CRE expertise.

Jump To A Section in This Article
Inherent Challenges of Machine Learning in CRE
Building Prophia's Training Datasets
The Dataset Grows 10X
What's on the Horizon?

Generative AI in Real Estate & the Inherent Challenges of Machine Learning in CRE

One reason we all love ChatGPT is its ability to quickly process unimaginable amounts of data and generate a human-like response in seconds. This is the power of training an AI model on the vast amounts of publicly available data on the Internet: the quality of the algorithm’s output depends on the accessibility and quality of the source data and training dataset.

In CRE, some of the most critical data associated with commercial buildings is private. Contracts, such as lease agreements, are privately held and owned by the contracting parties, specifically landlords and tenants, and are subject to non-disclosure agreements. What’s more, these contracts not only contain private information about every party involved in a deal, they also hold sensitive, building-specific information such as rent amounts, tenant information, and contractual rights that is not readily available to the public.

The reason for this privacy is twofold. First, leases contain sensitive financial information and contractual terms that neither party wishes to share. Second, keeping CRE contracts private has historically given firms, landlords, and large tenants a competitive edge in the market. There’s private and then there’s CRE industry level private.

Because the accessibility of strategic CRE data is inherently low, data scientists face challenges when creating and tuning algorithms and building a training database for an AI. That is, until now.

Five Years in the Making: Building Prophia’s Training Data Set on Trust and Direct Customer Relationships

It can take months to build and refine an advanced system capable of large-scale predictive modeling when training that algorithm on public data. But when machine learning requires access to private, unstructured data to build a dataset, development and training can take years. So when Prophia set out to change the world of CRE in 2018, compiling our training data actually began in the conference rooms of some of the largest CRE firms in the country.

The significance of trusted direct customer relationships

In CRE, where contract confidentiality is the status quo, acquiring enough documents and data to sufficiently train machine learning models requires a human element as much as it calls for advanced computation. That’s why nurturing trusted, direct customer relationships played a significant role in developing Prophia’s customer base as well as the CRE data from which our first AI NLP algorithms were built.

Building trusted customer relationships was key to gaining access to the critical, proprietary data held by commercial property owners and investors. That access ensured our machine learning algorithms were built on a set of verified annotations derived from the unique characteristics and requirements of commercial real estate.

With this uniquely accurate dataset and a business model that trained our NLP algorithms with every new tenant onboarded, Prophia’s AI matured at a fast pace. Our machine learning models are continuously tested and trained on private data, with explicit permission from our customers. This accuracy quickly earned Prophia a reputation for providing CRE professionals and stakeholders with an accurate source of truth for portfolio data and critical documents.

Continuing to prioritize client data privacy

In addition, as we continue to develop Prophia as an AI commercial real estate solution, client privacy and confidentiality remain at the forefront of every conversation with investors, prospective clients, and industry experts. Using strict software development standards and protocols that include following the OWASP Top 10 for web application security as well as adherence to SOC 2 data encryption standards, we continue to center client confidentiality as we fine-tune our AI solution.

These protocols ensure that all data provided by customers is handled with the utmost care and in compliance with applicable data protection regulations. Prophia's robust data security measures safeguard against unauthorized access, breaches, and data misuse. And the continued commitment to this level of security means Prophia’s AI model continues to refine itself with the most up-to-date and secure CRE data in the Proptech industry.

The symbiotic relationship between client trust and data modeling

After five years of this relationship between Prophia, our CRE partners, and their data, the data scientists on our team have continued to enhance Prophia’s generative AI solution, and we have many ideas for AI applications to come. And as our customer base and dataset grow, the AI models will continue to improve.

Access to trustworthy and proprietary data enables Prophia's generative AI models to generate outputs that accurately reflect the nuances and details associated with some of the world’s largest and most complex commercial properties. This improved performance results in highly accessible, trusted and valuable content for Prophia's customers, enhancing their decision-making processes and operational efficiency.

Building long-term, lasting partnerships

By building trust through responsible data handling, audited security and tangible results, Prophia continues to establish long-term partnerships with its customers. These partnerships foster collaboration, feedback exchange, and continuous improvement, allowing Prophia to stay at the forefront of generative AI innovation within the CRE industry.

And those relationships have greater impacts than simply our ability to develop our machine learning models. When partnering with Urban Renaissance Group on a deal with an extremely short due diligence period, URG’s ability to leverage contract data through Prophia allowed them to quickly and accurately abstract hundreds of leases to perform a risk assessment. The exercise left them with the insight they needed to move forward confidently with the deal and Prophia with a due diligence success story in our brand’s history.

Today, our client base includes some of the foremost institutions in commercial real estate, and they bring a litany of other CRE-specific challenges like those faced by URG. In addition to due diligence, firms like RXR, FD Stonewater, and Nuveen are categorically changing the ways they complete administrative tasks, automating many of them with Prophia’s generative AI.

Prophia’s Training Database Grows from 10,000 CRE Documents to nearly 100,000 in Two Years

The client relationship piece of Prophia’s history is nothing new to the world of startup funding, and the strategy paid off doubly in the creation of the AI. As of May 2020, Prophia’s training dataset comprised approximately 10,000 CRE documents. Thanks to growth in our customer base, our document dataset has grown 10x since those days and now supports nearly 1M verified annotations. Today, the current AI model’s capabilities range from lease abstract automation to tracking tenant rights and encumbrances for negotiations, in addition to data processing, extraction, and visualization capabilities:

Data Preprocessing:

  • Document ingestion, parsing, and cleaning
  • Optical Character Recognition (OCR) for text extraction

Natural Language Processing (NLP) Techniques:

  • Tokenization, Named Entity Recognition (NER), POS tagging, and dependency parsing

Generative AI Models for Document Understanding:

  • Language models (e.g., GPT-3.5…) and sequence-to-sequence models

Data Extraction:

  • Data extraction and matching to predefined templates or schemas
  • Information extraction from structured data (e.g., tables, forms)
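
To make the idea of matching extracted data to a predefined schema more concrete, here is a minimal sketch in Python. It is not Prophia's implementation. The LeaseAbstract schema, the field labels, and the sample text are all hypothetical, and simple regular expressions stand in for the OCR and NLP models a production system would rely on.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseAbstract:
    """Hypothetical target schema; a real lease abstract has many more fields."""
    tenant: Optional[str] = None
    monthly_rent: Optional[float] = None
    expiration: Optional[str] = None


def clean(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines, as OCR text often needs."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# One illustrative pattern per schema field (assumed labels, not a real lease format).
FIELD_PATTERNS = {
    "tenant": re.compile(r"^Tenant: *(.+)$", re.MULTILINE),
    "monthly_rent": re.compile(r"^Monthly Rent: *\$([\d,]+(?:\.\d{2})?)$", re.MULTILINE),
    "expiration": re.compile(r"^Expiration Date: *(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
}


def extract(raw: str) -> LeaseAbstract:
    """Clean the text, then fill each schema field whose pattern matches."""
    text = clean(raw)
    abstract = LeaseAbstract()
    for field_name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue  # leave the field as None when nothing is found
        value = match.group(1)
        if field_name == "monthly_rent":
            value = float(value.replace(",", ""))
        setattr(abstract, field_name, value)
    return abstract


sample = """
Tenant:    Acme   Coffee LLC

Monthly Rent:  $12,500.00
Expiration Date: 2025-06-30
"""
result = extract(sample)
print(result)
```

Fields that cannot be found stay empty rather than being guessed, which matters when the output is meant to serve as a verified source of truth.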

While Prophia’s generative outputs may not yet resemble the human-to-machine conversation the public recognizes with ChatGPT, our technology uses advanced preprocessing and data extraction techniques to synthesize portfolio data and “respond” to users’ data uploads in the form of tags and hyperlinks. This makes Prophia’s proprietary AI a digital conduit, facilitating a conversation between CRE professionals and the ever-changing, critical data points in a living, breathing lease agreement.

What’s on the Horizon for Prophia?

AI solutions continue to improve, and new use cases and applications are popping up every day, Prophia’s among them. Over the next few years, as technology adoption becomes more widespread across the industry, those industry leaders who successfully streamline and expedite business functions will ultimately come out on top and hold the greatest influence in the market. For the engineers, data scientists, and CRE experts at Prophia, it’s our job to listen and respond to those industry leaders, shaping our AI model to serve the ever-changing CRE industry.

Incorporating chatbot capabilities

ChatGPT really was a breakthrough in generative AI, no question. But it is still a relatively new model, and AI becomes more intelligent and accurate with time. Working with our extensive training dataset, Prophia engineers are already looking into the feasibility of training and building the first AI chatbot for commercial real estate. This prospective interface improvement will make it possible for CRE professionals to interact with Prophia the same way we interact with ChatGPT today and the way humans interacted with ELIZA in the 1960s.

For instance, instead of opening a tenant document or the stacking plan, Prophia users will be able to interact directly with the AI and ask it a question about their portfolio data. Let’s say you want to know how many of your office tenants have vacancies coming up in 2025. You would be able to query Prophia and get an immediate and accurate response, turning insights gathering into a literal conversation with our AI and the structured portfolio data on the platform.
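
Behind a conversational answer like that, the question ultimately has to become a filter over structured portfolio data. The sketch below illustrates only that final filtering step in Python. The Lease record, the sample portfolio, and the expiring_in helper are hypothetical, and it assumes an upcoming vacancy means a lease whose expiration date falls in the given year. How a chatbot would translate natural language into such a query is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Lease:
    """Hypothetical structured record for one lease in a portfolio."""
    tenant: str
    property_type: str
    expiration: date


def expiring_in(leases: List[Lease], year: int, property_type: str) -> List[Lease]:
    """Return leases of the given property type whose expiration falls in `year`."""
    return [
        lease
        for lease in leases
        if lease.property_type == property_type and lease.expiration.year == year
    ]


# Made-up sample data for illustration only.
portfolio = [
    Lease("Acme Coffee LLC", "retail", date(2025, 6, 30)),
    Lease("Northwind Legal", "office", date(2025, 3, 31)),
    Lease("Globex Analytics", "office", date(2027, 12, 31)),
    Lease("Initech", "office", date(2025, 11, 30)),
]

matches = expiring_in(portfolio, 2025, "office")
print(f"{len(matches)} office lease(s) expire in 2025:")
for lease in matches:
    print(f"  {lease.tenant} ({lease.expiration.isoformat()})")
```

Keeping the filter separate from the language layer means the same structured query could serve a chat interface, a dashboard, or a report.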

Enhanced user interaction

Along with the conversational interface, shaping the AI’s data-processing capabilities into an interface like ChatGPT requires developing Natural Language Processing (NLP) techniques. Luckily, we have the most comprehensive repository of CRE data in Proptech and many of these NLP efforts have been underway since we first assembled our extensive training data.

The prospect of developing a CRE conversational query interface or chatbot opens the industry to further benefits, including immediate access to insights and important information. What’s more, a query-based interface promotes greater exploration of and inquiry into portfolio data, unlocking pathways to insights that haven’t been possible, or at least easy, with manual data scouring.

It’s an exciting prospect and a path forward that uses AI to enhance the abilities of people, and it all started with the first 10,000 CRE documents less than five years ago.


Hannah Overhiser

Hannah is Prophia's Content Marketing Manager and a seasoned B2B and B2C marketer. Her career began in eCommerce consulting with a focus on code testing. This technical expertise transferred seamlessly to SEO and she started working agency-side as an SEO and Content Strategist. Today, her home is Prophia, and she puts...
