
Using machine learning and big data to predict real estate returns

March 19, 2025 By Anh Tran


The ability to predict financial returns has captivated investors and scholars for many years. Return prediction is essential to asset pricing and portfolio management. Predicting commercial real estate returns is notoriously difficult, but serial correlation and fundamental relationships make it possible within limits. Classical econometric methods like linear regression have been widely used in financial forecasting, but their limitations in handling nonlinear relationships and large datasets restrict their prediction accuracy. Now, two transformative technologies, machine learning (ML) and big data, are beginning to reshape the landscape of financial forecasting, including in the commercial real estate (CRE) sector. By using ML algorithms, researchers can analyze enormous datasets, uncover hidden relationships, and make more accurate predictions than traditional tools allow.

To assess ML’s ability to forecast returns in the private CRE market, we conducted a comparative analysis of ML and traditional linear regression methods. Drawing insights from a massive dataset of over 5,000 variables spanning 1978 through the first half of 2024, we examined the advantages and opportunities for data-driven real estate forecasting using ML.

Forecasting models

In real estate, excess returns — returns above the risk-free rate — are of significant interest to investors as they represent the compensation for taking on risk beyond a no-risk benchmark. We conducted a large-scale empirical analysis to investigate the degree of predictability of excess returns in private commercial real estate markets.

With private real estate’s excess return as our target, we collected total quarterly returns of the National Council of Real Estate Investment Fiduciaries (NCREIF) Property Index (also known as NPI) and the rate of three-month Treasury bills to use as a proxy for a risk-free rate. We obtained private CRE excess returns by subtracting the three-month Treasury bill rate from the NPI return. To forecast the real estate excess return for the quarter ahead, we used the current information set, which includes the most recent quarterly characteristics at the end of the current quarter and information from the previous 20 quarters. Similarly, we forecast the private excess return for two quarters (a half year), four quarters (one year) and up to 20 quarters (five years) ahead.
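As a minimal sketch of the target construction described above, the following Python computes quarterly excess returns and the cumulative H-quarter excess return used as the forecasting target. The function names and sample numbers are hypothetical. The sketch assumes both series are already per-quarter decimal returns and that cumulative returns are simple sums (as in the Methodology formula); the article does not say how the Treasury bill rate was converted to a quarterly figure.

```python
def excess_returns(npi_returns, tbill_rates):
    """Quarterly excess return: NPI total return minus the T-bill rate.

    Assumption: both inputs are per-quarter decimals on the same dates.
    """
    if len(npi_returns) != len(tbill_rates):
        raise ValueError("series must have the same length")
    return [r - rf for r, rf in zip(npi_returns, tbill_rates)]


def cumulative_excess(excess, horizon):
    """Sum of the next `horizon` quarterly excess returns for each origin t.

    result[t] = excess[t+1] + ... + excess[t+horizon]; origins without a
    full horizon of future data are dropped.
    """
    return [sum(excess[t + 1:t + 1 + horizon])
            for t in range(len(excess) - horizon)]


# Illustrative numbers only, not NCREIF or Treasury data.
npi = [0.020, 0.015, -0.010, 0.025, 0.030]
tbill = [0.010, 0.010, 0.012, 0.012, 0.011]
quarterly = excess_returns(npi, tbill)
four_quarter_target = cumulative_excess(quarterly, horizon=4)
```

Aligning each target with its forecast origin `t` keeps the later estimation step from accidentally using returns that would not yet have been observed.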

In our analysis, we compared three models:

Model 1: A simple regression model using the returns of publicly traded REITs as the single predictor.
Model 2: A multivariate model using REIT returns along with fundamental predictors from NCREIF (the earnings-to-price (EP) ratio of the properties covered by the NPI, the growth rate of capital expenditures (CAPEX) and income returns) and economic predictors (S&P 500 index returns and a measure of U.S. consumer sentiment).
Model 3: An ML model (XGBoost) using more than 5,000 predictors, including all predictors of Model 1 and Model 2, plus additional economic predictors and their lagged variables.

We recursively estimated the three models in a rolling-window fashion to produce out-of-sample (OOS) forecasts of the excess returns spanning 1990 through the second quarter of 2024. A rolling window forecast mimics real-time forecasting by sequentially updating the available data and re-estimating the model to make predictions at each time step, similar to how forecasts are conducted as new information becomes available. After obtaining predictions from different models, we evaluated the predictability of models through the root mean squared forecast error (RMSE) and the out-of-sample R-square (R²-OOS).
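The rolling-window procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's code: a one-predictor least-squares fit stands in for whichever model is being evaluated, the function names are hypothetical, and the data are a toy series. The key point is the alignment: at each forecast origin, the model is re-estimated only on observations whose targets have already been realized.

```python
from math import sqrt


def fit_ols(x, y):
    """Intercept and slope of a one-predictor least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def rolling_forecasts(x, target, window, horizon):
    """Out-of-sample forecasts from a fixed-length rolling window.

    x[t] is the predictor known at the end of quarter t; target[t] is the
    return realized over quarters t+1..t+horizon. At origin s, only pairs
    whose target is already observed (t + horizon <= s) are used.
    """
    forecasts, actuals = [], []
    for s in range(window + horizon - 1, len(target)):
        last = s - horizon            # newest origin with a realized target
        first = last - window + 1
        intercept, slope = fit_ols(x[first:last + 1], target[first:last + 1])
        forecasts.append(intercept + slope * x[s])
        actuals.append(target[s])
    return forecasts, actuals


def rmse(actuals, forecasts):
    """Root mean squared forecast error."""
    errors = [(a - f) ** 2 for a, f in zip(actuals, forecasts)]
    return sqrt(sum(errors) / len(errors))


# Toy data with an exact linear link, so the forecast error is close to 0.
x = [float(t) for t in range(10)]
target = [2.0 * xi + 1.0 for xi in x]
forecasts, actuals = rolling_forecasts(x, target, window=4, horizon=1)
oos_rmse = rmse(actuals, forecasts)
```

Swapping `fit_ols` for a multivariate regression or a tree ensemble would leave the loop unchanged, which is what makes the three models comparable on the same out-of-sample periods.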

The bottom line: We found significantly enhanced predictive accuracy when using ML compared with linear regression models, particularly over the intermediate and long-term forecast horizons. Model 3, on average, reduced the forecasting error 68% over Model 1 and 26% over Model 2. Below are the details of our comparison.

Model 1: Simple linear regression

Linear regression is a simple model that predicts outcomes based on a straight-line relationship between inputs and outputs, making it straightforward to interpret. The first model considered was a simple regression using one predictor, public CRE market returns, to forecast private CRE returns. The relationship between private and public CRE returns has been the subject of a vast body of finance literature.

The ability of public REIT returns to predict private CRE returns is rooted in the fundamental link between the two markets and the differences in how they are priced and reported. REITs, being publicly traded, provide real-time pricing based on investor expectations, market sentiment and economic conditions. This allows REITs to react quickly to new information, such as changes in interest rates, macroeconomic indicators, or property market fundamentals. In contrast, private real estate returns are based on appraised values, which are typically reported with significant lags (e.g., quarterly or annually). These valuations are slower to reflect changes in market conditions, often making private real estate returns appear smoother and delayed compared with REITs. Because REITs are forward-looking and incorporate real-time expectations, they can serve as leading indicators for private real estate performance.

In addition, studies have typically shown a weak contemporaneous correlation between the returns on private and public real estate investments, but the relationship strengthens over longer periods. In the long term, private and public CRE returns are integrated: a common real estate factor drives the returns on both, and they behave similarly. This relationship is summarized in Table 1, which shows the correlations between private and public CRE returns over various time horizons in our data sample. The correlations increase as we move from quarterly to annual and five-year returns.

Table 1 – Correlation of private and public CRE returns over time horizons

1-quarter	2-quarter	4-quarter	8-quarter	12-quarter	16-quarter	20-quarter
0.1295	0.1244	0.1995	0.3456	0.392	0.4363	0.5034
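One way horizon correlations like those in Table 1 could be computed is sketched below. The article does not state whether multi-quarter returns were overlapping or how they were aggregated, so this sketch assumes overlapping windows and simple sums; the function names and the sample series are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def rolling_sums(returns, horizon):
    """Overlapping horizon-quarter cumulative returns (simple sums)."""
    return [sum(returns[i:i + horizon])
            for i in range(len(returns) - horizon + 1)]


def horizon_correlation(private, public, horizon):
    """Correlation of private and public returns at a given horizon."""
    return pearson(rolling_sums(private, horizon),
                   rolling_sums(public, horizon))


# Made-up quarterly returns, for illustration only.
private = [0.01, 0.02, -0.01, 0.03, 0.00, 0.02]
public = [0.04, -0.03, 0.05, 0.01, -0.02, 0.06]
corr_1q = horizon_correlation(private, public, horizon=1)
corr_4q = horizon_correlation(private, public, horizon=4)
```

Overlapping windows use the sample more fully but induce serial correlation in the aggregated series, so significance tests on such correlations need care.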

Table 2 reports the out-of-sample performance of the first regression model, measured by the root mean squared error (RMSE) and the out-of-sample R-square (R²-OOS).

Table 2 – Simple regression out-of-sample performance

Metric	1-quarter	2-quarter	4-quarter	8-quarter	12-quarter	16-quarter	20-quarter
RMSE	0.024	0.048	0.094	0.173	0.254	0.330	0.405
R²-OOS	-0.578	-0.594	-0.711	-1.589	-3.251	-4.897	-7.053

Note that the RMSE increases and the R²-OOS decreases as the forecast horizon lengthens, reflecting the greater difficulty of forecasting excess returns over the medium and long term. The R²-OOS is positive when a model's OOS forecasts are more accurate than a benchmark forecast equal to the historical sample mean. The negative R²-OOS values in Table 2 therefore mean that the model using only REIT returns performs worse than simply expecting the historical sample mean (see the R²-OOS formula under Methodology and sources).

Relying on the stronger long-term correlation between public and private returns, we chart the predictions from the REIT regression model against the actual private returns four years out of sample, or 16 quarters ahead. Despite the relatively high correlation at that horizon (0.4363 in Table 1), Figure 1 reveals very little predictability, as the lines move independently.

Note: U.S. recessions determined by the Business Cycle Dating Committee of the National Bureau of Economic Research. Source: UF Bergstrom Real Estate Center

Model 2: Multivariate linear regression

Following the research literature, we extended the first REIT model to a multivariate regression. The model adds real estate fundamentals from the NCREIF dataset: the earnings-to-price (EP) ratio, computed as aggregate net operating income (NOI) divided by the total market value of the properties covered by the index; the growth rate of capital expenditures (CAPEX); and the income return. It also includes public REIT returns and two economic predictors, S&P 500 index returns and consumer sentiment, from FRED (the database of the Federal Reserve Bank of St. Louis). Results of this forecast model are presented in Table 3.

Table 3 – Multivariate regression out-of-sample performance

Metric	1-quarter	2-quarter	4-quarter	8-quarter	12-quarter	16-quarter	20-quarter
RMSE	0.010	0.020	0.045	0.091	0.115	0.127	0.115
R²-OOS	0.559	0.503	0.293	0.045	-0.014	0.080	0.276

The multivariate model improves on the first model, producing smaller forecast errors and higher R²-OOS at all forecast horizons. However, as with the first model, predictability fades at longer horizons: the R²-OOS falls to near zero at eight quarters (0.045) and turns slightly negative at 12 quarters (-0.014), before recovering at 16 and 20 quarters. Returning to a four-year projection, Figure 2 reveals a much closer relationship between model results and actual returns than shown in Figure 1. Projected results are more volatile than actual returns, which is consistent with appraisal smoothing in private valuations.

Note: U.S. recessions determined by the Business Cycle Dating Committee of the National Bureau of Economic Research. Source: UF Bergstrom Real Estate Center

The limited predictability of these traditional regression models stems mainly from two sources. The first is the linearity assumption, which imposes a straight-line relationship between predictors and returns and ignores nonlinear dynamics in real estate markets. The second is limited data handling: these models typically focus on a small set of variables. The more advanced statistical tools in machine learning can help overcome these limitations.

Model 3: AI and machine learning using big data


Artificial intelligence (AI) is the technology that enables computers and machines to simulate human learning, comprehension, problem-solving, decision-making, creativity and autonomy. Machine learning is a branch of AI that enables computers to learn patterns from data and make predictions without being explicitly programmed. Unlike traditional regression, ML algorithms can process massive datasets, identify nonlinear relationships and adapt as new data becomes available.

Several characteristics make ML techniques appealing for analyzing real estate returns and give them a notable advantage over traditional empirical methods in predicting financial returns. First, forecasting is essentially a prediction problem, and since ML methods are primarily geared toward prediction, they are particularly well-equipped to handle the measurement of excess returns. Second, ML accommodates a much broader array of variables and more intricate and detailed specifications of functional forms. Generally, ML has offered increased predictive accuracy over traditional models at the cost of interpretability: it is often impossible to explain why an ML model produces a particular forecast, other than that it seems to work.


In our experiment with ML, we relied on big data — an extremely large and complex dataset of over 5,000 economic and real estate variables and their lags. Traditional tools cannot process such a large and complex dataset. In addition to the predictors in models 1 and 2, we used the economic research database from the Federal Reserve Bank of St. Louis (FRED-QD) of 248 quarterly series. FRED classifies the data into 14 groups: national income and product accounts; industrial production; employment and unemployment; housing; inventories, orders, and sales; prices; earnings and productivity; interest rates; money and credit; household balance sheets; exchange rates; stock markets; non-household balance sheets; and other.
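To show how a set of quarterly series and their lags expands into a wide predictor matrix, here is a small Python sketch. The variable names, values and the helper `lagged_features` are hypothetical, and the study's actual feature construction may differ.

```python
def lagged_features(series, n_lags):
    """Build one feature row per quarter from current values and lags.

    `series` maps a variable name to its quarterly values (oldest first).
    The row for quarter t holds name_lag0 .. name_lag{n_lags} for every
    variable; the first n_lags quarters are skipped because their lags
    are incomplete.
    """
    length = len(next(iter(series.values())))
    rows = []
    for t in range(n_lags, length):
        row = {}
        for name, values in series.items():
            for lag in range(n_lags + 1):
                row[f"{name}_lag{lag}"] = values[t - lag]
        rows.append(row)
    return rows


# Hypothetical variable names and made-up values.
data = {
    "ep_ratio": [0.05, 0.06, 0.055, 0.058],
    "reit_return": [0.02, -0.01, 0.03, 0.01],
}
rows = lagged_features(data, n_lags=2)
n_columns = len(rows[0])  # 2 variables x 3 columns each
```

The column count grows as the number of variables times the number of lags plus one, so a few hundred series with 20 lags each quickly reaches the thousands of predictors described above.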

We report performance using XGBoost, one of several ML methods, to forecast private real estate excess returns. XGBoost, which stands for Extreme Gradient Boosting, is a powerful ensemble method that builds and combines multiple decision trees to handle complex, nonlinear relationships and interactions between variables. The ML model shows improved predictive accuracy, measured as the reduction in the RMSE and the increase in the R²-OOS (Table 4). In particular, the ML model, on average, reduced the forecasting error by 68% relative to Model 1 and by 26% relative to Model 2. The positive R²-OOS at all forecast horizons means the ML model yields out-of-sample predictive performance that is stronger than that of the historical sample mean. The ML model also has the highest R²-OOS among the three models over the medium and long term (from four to 20 quarters ahead).
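The idea of building and combining many trees can be illustrated with a toy gradient-boosting loop. This is not XGBoost itself, which adds regularization, multi-feature trees and many engineering refinements. It is a minimal sketch that fits one-split trees (stumps) to the residuals of the running prediction, on a single made-up feature with a nonlinear target. All names are hypothetical, and the fit is evaluated in sample purely to show the mechanism.

```python
def fit_stump(x, residuals):
    """Best one-split regression tree (stump) by squared error."""
    best = None
    cuts = sorted(set(x))
    for low, high in zip(cuts, cuts[1:]):
        threshold = (low + high) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_trees=100, learning_rate=0.3):
    """Fit stumps sequentially, each one to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, xi):
    """Base value plus the shrunken sum of all stump outputs."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if xi <= threshold else right
        for threshold, left, right in stumps)


def mse(actual, fitted):
    return sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual)


# A symmetric U-shaped target: the best straight line here is flat.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]
model = boost(x, y)
mse_boosted = mse(y, [predict(model, xi) for xi in x])
mse_mean = mse(y, [sum(y) / len(y)] * len(y))
```

On this toy target a linear regression can do no better than the mean, because the slope is zero by symmetry, while the boosted stumps track the curve. That contrast is the kind of nonlinear relationship the ML model is meant to capture.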

Table 4 – ML method out-of-sample performance

Metric	1-quarter	2-quarter	4-quarter	8-quarter	12-quarter	16-quarter	20-quarter
RMSE	0.011	0.020	0.035	0.061	0.063	0.063	0.067
R²-OOS	0.437	0.431	0.491	0.479	0.621	0.651	0.675
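The average error reductions quoted for the ML model can be checked against the published RMSE values. The article does not state how the averages were formed, so the sketch below assumes a simple mean, across the seven horizons, of the relative RMSE reduction. The RMSE figures are copied from Tables 2, 3 and 4, and the function name is hypothetical.

```python
rmse_model1 = [0.024, 0.048, 0.094, 0.173, 0.254, 0.330, 0.405]  # Table 2
rmse_model2 = [0.010, 0.020, 0.045, 0.091, 0.115, 0.127, 0.115]  # Table 3
rmse_model3 = [0.011, 0.020, 0.035, 0.061, 0.063, 0.063, 0.067]  # Table 4


def average_reduction(baseline, candidate):
    """Mean across horizons of 1 - candidate RMSE / baseline RMSE."""
    reductions = [1 - c / b for b, c in zip(baseline, candidate)]
    return sum(reductions) / len(reductions)


vs_model1 = average_reduction(rmse_model1, rmse_model3)
vs_model2 = average_reduction(rmse_model2, rmse_model3)
print(f"vs Model 1: {vs_model1:.1%}, vs Model 2: {vs_model2:.1%}")
```

Under that assumption the rounded table values give roughly 68.5% and 26.1%, in line with the reported 68% and 26%. The per-horizon terms also show that the gain over Model 2 comes entirely from horizons of four quarters and longer: at one quarter Model 2 has the lower RMSE, and at two quarters the two are tied.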

Figure 3 shows the forecasts for 16 quarters ahead using the XGBoost ML model. XGBoost appears to have detected the downturns early in the 2000 and 2008 recession periods. One explanation for the superior forecasts is that the ML model can learn from difficult years within its most recent 45-quarter rolling window and avoid similar missteps in future predictions. For instance, XGBoost learns the pattern from the downturn of the 1990s to predict the downturn in the 2000s, learns from the 2000s to predict 2008, and learns from the 2008 downturn to predict 2020, a year that turned suddenly challenging for the economy because of the COVID-19 pandemic. (See Table 5 for a comparison of the benefits and limitations of the forecasting models.)

Note: U.S. recessions determined by the Business Cycle Dating Committee of the National Bureau of Economic Research. Source: UF Bergstrom Real Estate Center

XGBoost identified the most successful predictors as (i) fundamental measures: income return and EP ratio; and (ii) other economic measures: capacity utilization, real private fixed investment, real imports of goods and services, Moody’s bond spread, real gross private domestic investment, unfilled orders for durable goods industries and manufacturing output.

Table 5 – Summary of benefits and limitations of forecasting models

Model	Benefits	Limitations
Single regression using only REIT returns	Intuitive — uses continuously observed market real estate data to predict unobserved private returns.	Limited in scope as it incorporates non-real estate components from the stock market and ownership structure.
Multivariate regression	Intuitive — provides a description of important variables and their magnitude of impact, including more relevant factors.	Assumes a linear relationship among multiple variables; key factors must be pre-specified.
Machine learning	Provides more accurate predictions by capturing nonlinear relationships among high-dimensional variables.	Not intuitive; the underlying mechanism is often unclear (“black box” nature).

Put it together

The combination of machine learning and big data can help improve our understanding of the predictability of commercial real estate excess returns. We found that the ML model outperforms the traditional linear regression models in forecasting private excess returns over various time horizons. The ML model, on average, reduced the forecasting error by 68% relative to the simple regression model and by 26% relative to the multivariate regression model. By leveraging advanced algorithms and diverse datasets, investors can make more informed decisions, optimize portfolios and manage risks more effectively. Future research can explore the predictability of different real estate sectors across various industries and geographic locations to fully unlock the potential of real estate portfolios. ML models are at an early stage of development and require significant computing resources. As these models become more widely used, developers, investors and lenders may take advantage of the improved predictive accuracy they offer.

Anh Tran, Ph.D., is a real estate researcher at the UF Bergstrom Real Estate Center.


Methodology and sources

We forecast the private CRE returns over various quarterly periods: one, two, four and up to 20 quarters (H quarters in the formulas below). We describe an asset’s excess return as an additive prediction error model:

$R_{t + H} = E_{t} [R_{t + H}] + ε_{t + H}$

where $R_{t + H} = \sum_{i = 1}^{H} R_{t + i}$ is the cumulative H-period excess return on private real estate (the return in excess of the risk-free rate) between $t + 1$ and $t + H$, and the conditional expected return $E_{t} [R_{t + H}]$ is a flexible function of predictors:

$E_{t} [R_{t + H}] = g (R_{t})$

R-squared out-of-sample (Campbell and Thompson (2008))

$R^{2}_{OOS} = 1 - \frac{\sum_{t = H}^{T} {(R_{t + 1 : t + H} - {\hat{R}}_{t + 1 : t + H | t} (M, H))}^{2}}{\sum_{t = H}^{T} {(R_{t + 1 : t + H} - {\bar{R}}_{t + 1 : t + H | t} (H))}^{2}} = 1 - \frac{MSE (M, H)}{MSE (\text{sample mean}, H)}$

where $R_{t + 1 : t + H}$ is the cumulative excess return over the period $[t + 1, t + H]$, ${\hat{R}}_{t + 1 : t + H | t} (M, H)$ is its forecast based on information up to time $t$ under model $M$, and ${\bar{R}}_{t + 1 : t + H | t} (H)$ is the sample mean estimated up to time $t$. The formula implies that $R^{2}_{OOS}$ is positive if $MSE (M, H) < MSE (\text{sample mean}, H)$, that is, when the predictive model $M$ yields better predictions than simply using the historical sample mean.
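The out-of-sample R-squared defined above translates directly into code. The sketch below is a minimal illustration with hypothetical function names and made-up numbers. For simplicity it assumes a one-quarter horizon (H = 1), so that every earlier return is already observed when the historical-mean benchmark is formed.

```python
def r2_oos(actuals, forecasts, benchmark):
    """1 - SSE(model) / SSE(benchmark), the same ratio as 1 - MSE / MSE."""
    sse_model = sum((a - f) ** 2 for a, f in zip(actuals, forecasts))
    sse_bench = sum((a - b) ** 2 for a, b in zip(actuals, benchmark))
    return 1 - sse_model / sse_bench


def historical_mean_benchmark(realized, first):
    """Benchmark forecast for each item j >= first: mean of realized[:j].

    Assumption: H = 1, so every earlier return is observed before j.
    """
    return [sum(realized[:j]) / j for j in range(first, len(realized))]


# Made-up returns, for illustration only.
realized = [1.0, 2.0, 3.0, 4.0]
benchmark = historical_mean_benchmark(realized, first=2)
actuals = realized[2:]

perfect = r2_oos(actuals, actuals, benchmark)      # forecasts with no error
no_skill = r2_oos(actuals, benchmark, benchmark)   # same as the benchmark
worse = r2_oos(actuals, [0.0, 0.0], benchmark)     # worse than the benchmark
```

A perfect forecast gives 1, matching the benchmark gives 0, and anything less accurate than the historical mean gives a negative value with no lower bound, which is how Table 2 can report values well below -1.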

References

Our study draws on work in academic literature: Boudry, W. I., Coulson, N. E., Kallberg, J. G., and Liu, C. H. (2012); Campbell, J. Y. and Thompson, S. B. (2008); Fan, Y. and Yavas, A. (2023); Gu, S., Kelly, B., and Xiu, D. (2020); Guidolin, M., Pedio, M., and Petrova, M. T. (2020); Ling, D. C. and Naranjo, A. (2015); Oikarinen, E., Hoesli, M., and Serrano, C. (2011); Pagliari Jr, J. L., Scherer, K. A., and Monopoli, R. T. (2005).