
Hotel Pricing Model

View on Github

Introduction

Problem: Choosing a price for a hotel room arbitrarily isn't an efficient way to maximize profit. It can work, but an ML model that prices a room based on existing room prices can be both beneficial and time-saving.

Approach: This project addresses that problem by building several linear regression models that take in different parameters, comparing them, and choosing the best one for our purposes.

Dataset: Consists of 1,000 fictional hotel rooms. Each room has 4 features:

  • Number of stars (1 to 5)
  • Distance to tourist attractions, in km (the DistanceToTurism column)
  • Capacity, or the number of people this room can house
  • Price (in dollars)

Tools used: Python, pandas, NumPy, Seaborn, Matplotlib, statsmodels, scikit-learn
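
For reference, the cells below assume a setup along these lines (the original import cell isn't shown in this export, so this is a best-guess reconstruction; train_test_split is imported where it's used):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.metrics import r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor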

Taking a look at the first few rows of the dataset

In [89]:
df = pd.read_csv('hotels.csv')
df.head()
Out[89]:
Stars DistanceToTurism Capacity Price
0 5 9.301565 3 506.275452
1 1 1.785891 1 246.363458
2 4 15.504293 3 325.873550
3 4 4.173188 3 521.343284
4 4 9.443685 1 252.587087

Checking the dataset's information so we can see whether there are null values and whether we need to make type conversions. Luckily, this dataset appears to be complete

In [90]:
df.info()
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Stars             1000 non-null   int64  
 1   DistanceToTurism  1000 non-null   float64
 2   Capacity          1000 non-null   int64  
 3   Price             1000 non-null   float64

Looking at a statistical description of the dataset to see if we find anything unusual. We found that there are hotel rooms with negative prices, which is suspicious.

We should then clean up the DataFrame to remove these negative values. We will also remove any hotel room priced below $50, as that is not realistic. Furthermore, we can set these unrealistic rows aside, predict prices for them later, and check whether the values our future model comes up with are more realistic (we revisit this after evaluating the final model).

In [91]:
df.describe()
Out[91]:
Stars DistanceToTurism Capacity Price
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 3.008000 7.650878 2.519000 396.611361
std 1.407095 5.870137 1.108543 171.742433
min 1.000000 0.013850 1.000000 -220.208705
25% 2.000000 3.034775 2.000000 283.590980
50% 3.000000 6.430035 3.000000 401.743527
75% 4.000000 10.863295 4.000000 516.097856
max 5.000000 31.709748 4.000000 836.261308
In [92]:
df_price_under_50 = df[df['Price'] < 50].copy()
df = df[df['Price'] >= 50].copy()

Using Seaborn's pairplot to take a look at which variables are correlated with the price of a hotel room.

We can see that the distance to tourist attractions seems highly correlated with the price, and there also seems to be a correlation between stars/capacity and the price.

In [129]:
sns.pairplot(df, x_vars=['Stars', 'DistanceToTurism', 'Capacity'], y_vars='Price')
plt.show()
[Figure: pairplots of Stars, DistanceToTurism, and Capacity against Price]
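
To put numbers on what the pairplot suggests, a quick check we could run (a sketch, not one of the original cells):

# Pearson correlation of each feature with Price
df.corr()['Price'].sort_values()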

Separating the data into train/test splits to train our model. We will use 80% of the data for training and 20% for testing

In [94]:
from sklearn.model_selection import train_test_split
x = df.drop(columns='Price')
y = df['Price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=136)

Creating the training DataFrame

In [95]:
df_train = pd.DataFrame(data=x_train)
df_train['Price'] = y_train
df_train.head()
Out[95]:
Stars DistanceToTurism Capacity Price
360 5 23.466470 1 61.495168
700 1 1.592505 2 298.057859
353 3 2.444435 3 582.630609
671 3 19.223248 3 220.208619
262 2 11.541840 4 297.254357

Creating the first model and evaluating it.

An R² score of 0.43 indicates that the distance to tourist attractions does correlate with the price, although it doesn't fully explain the price by itself (as expected).

In [97]:
model = ols('Price ~ DistanceToTurism', data=df_train).fit()

model.rsquared
Out[97]:
np.float64(0.43067515674525736)
In [98]:
model.resid
Out[98]:
360    -40.298231
700   -214.875867
353     85.709651
671     38.660135
262    -28.672899
          ...    
56      84.778015
661    -47.532610
983   -154.795662
240    249.319655
230    -31.831201
Length: 778

Using the first model to make predictions on the test data

Since the test R² (0.32) falls noticeably below the training R² (0.43), we can't say this model is consistent. We should then look at creating other models.

In [100]:
y_predict = model.predict(x_test)

r2_score(y_test, y_predict)
Out[100]:
0.3192107616205031

Adding a constant (intercept column) to the training features

In [101]:
x_train = sm.add_constant(x_train)
x_train.head()
Out[101]:
const Stars DistanceToTurism Capacity
360 1.0 5 23.466470 1
700 1.0 1 1.592505 2
353 1.0 3 2.444435 3
671 1.0 3 19.223248 3
262 1.0 2 11.541840 4

Creating other models with more variables so we can compare them and see which one is better

In [102]:
model_2 = sm.OLS(
    y_train,
    x_train[['const', 'Stars', 'DistanceToTurism', 'Capacity']]
).fit()

model_3 = sm.OLS(
    y_train,
    x_train[['const', 'Stars', 'DistanceToTurism']]
).fit()
model_4 = sm.OLS(
    y_train,
    x_train[['const', 'Stars', 'Capacity']]
).fit()

Comparing all models to select the best one

Results: Model 2 achieved R² = 0.91, meaning 91% of the variability in price is explained by the selected variables. There is also no indication of multicollinearity (we'll check for that more thoroughly later)

In [103]:
model.summary()
Out[103]:
OLS Regression Results
Dep. Variable: Price R-squared: 0.431
Model: OLS Adj. R-squared: 0.430
Method: Least Squares F-statistic: 587.0
Date: Thu, 13 Mar 2025 Prob (F-statistic): 5.23e-97
Time: 16:46:43 Log-Likelihood: -4815.8
No. Observations: 778 AIC: 9636.
Df Residuals: 776 BIC: 9645.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 542.8662 7.079 76.684 0.000 528.969 556.763
DistanceToTurism -18.7959 0.776 -24.228 0.000 -20.319 -17.273
Omnibus: 14.039 Durbin-Watson: 2.075
Prob(Omnibus): 0.001 Jarque-Bera (JB): 8.130
Skew: -0.005 Prob(JB): 0.0172
Kurtosis: 2.499 Cond. No. 15.4

In [104]:
model_2.summary()
Out[104]:
OLS Regression Results
Dep. Variable: Price R-squared: 0.913
Model: OLS Adj. R-squared: 0.913
Method: Least Squares F-statistic: 2704.
Date: Thu, 13 Mar 2025 Prob (F-statistic): 0.00
Time: 16:46:50 Log-Likelihood: -4085.5
No. Observations: 778 AIC: 8179.
Df Residuals: 774 BIC: 8198.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 196.8588 5.985 32.894 0.000 185.111 208.607
Stars 50.3854 1.186 42.496 0.000 48.058 52.713
DistanceToTurism -19.5439 0.304 -64.281 0.000 -20.141 -18.947
Capacity 78.9019 1.512 52.195 0.000 75.934 81.869
Omnibus: 1.482 Durbin-Watson: 1.885
Prob(Omnibus): 0.477 Jarque-Bera (JB): 1.510
Skew: -0.062 Prob(JB): 0.470
Kurtosis: 2.823 Cond. No. 35.9

In [105]:
model_3.summary()
Out[105]:
OLS Regression Results
Dep. Variable: Price R-squared: 0.606
Model: OLS Adj. R-squared: 0.605
Method: Least Squares F-statistic: 596.9
Date: Thu, 13 Mar 2025 Prob (F-statistic): 1.25e-157
Time: 16:46:54 Log-Likelihood: -4672.3
No. Observations: 778 AIC: 9351.
Df Residuals: 775 BIC: 9365.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 402.8791 9.557 42.154 0.000 384.118 421.641
Stars 46.7724 2.515 18.599 0.000 41.836 51.709
DistanceToTurism -18.9726 0.646 -29.390 0.000 -20.240 -17.705
Omnibus: 114.275 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.954
Skew: -0.009 Prob(JB): 8.51e-07
Kurtosis: 2.072 Cond. No. 26.3

In [106]:
model_4.summary()
Out[106]:
OLS Regression Results
Dep. Variable: Price R-squared: 0.448
Model: OLS Adj. R-squared: 0.447
Method: Least Squares F-statistic: 314.5
Date: Thu, 13 Mar 2025 Prob (F-statistic): 1.02e-100
Time: 16:46:56 Log-Likelihood: -4803.8
No. Observations: 778 AIC: 9614.
Df Residuals: 775 BIC: 9628.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 66.6932 14.169 4.707 0.000 38.878 94.508
Stars 49.1060 2.983 16.463 0.000 43.251 54.961
Capacity 75.4037 3.801 19.838 0.000 67.942 82.865
Omnibus: 42.160 Durbin-Watson: 2.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 47.995
Skew: -0.607 Prob(JB): 3.78e-11
Kurtosis: 3.075 Cond. No. 14.8
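
The four summary tables above are dense; one compact way to put the headline numbers side by side (a sketch, not one of the original cells):

# Compare fit statistics across the candidate models
for name, m in [('model', model), ('model_2', model_2),
                ('model_3', model_3), ('model_4', model_4)]:
    print(f'{name}: R² = {m.rsquared:.3f}, adj. R² = {m.rsquared_adj:.3f}, AIC = {m.aic:.0f}')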

Looking at model_2's parameters, we can see how much the price of a hotel room is affected by a change in each parameter (everything else remaining constant).

For example, for an increase of 1 in the room's capacity, we can expect the price to increase by about $79 on average.
Insights: star rating and capacity are the key price drivers, while distance to tourist attractions pushes the price down (about $19.50 per km).

In [108]:
model_2.params
Out[108]:
const               196.858766
Stars                50.385416
DistanceToTurism    -19.543904
Capacity             78.901909
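
In other words, the fitted equation is approximately Price = 196.86 + 50.39·Stars − 19.54·DistanceToTurism + 78.90·Capacity. As a quick sanity check (a sketch, not one of the original cells), we can apply the coefficients by hand:

# Manually apply model_2's coefficients to a 5-star room,
# 0.3 km from the attractions, with capacity for 4 people
p = model_2.params
manual_price = (p['const'] + p['Stars'] * 5
                + p['DistanceToTurism'] * 0.3 + p['Capacity'] * 4)
manual_price  # ≈ 758.53, matching model_2.predict() for the same room below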

Adding a constant and predicting the values on the test split using our best model

In [109]:
x_test = sm.add_constant(x_test)

predict_2 = model_2.predict(x_test[['const', 'Stars', 'DistanceToTurism', 'Capacity']])

Checking that the R² is similar between training and testing.

It is similar (0.91 in training vs. 0.90 in testing), indicating that the model is consistent.

In [110]:
print(r2_score(y_test, predict_2), model_2.rsquared)
0.8976636624423547 0.9129107017167504
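
R² alone doesn't tell us the typical size of the error in dollars. A quick complement (a sketch, not one of the original cells):

# Root mean squared error of model_2 on the test split, in dollars
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, predict_2) ** 0.5
rmse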

Next, let's price a new hotel room using our two best models so we can see the difference.

We can see that there is over $100 of difference between the two models' prices for the same hotel room. This matters, as a $100 difference can decide whether a person will book a room or not.

In [120]:
new_hotel = pd.DataFrame({
    'const':[1],
    'Stars':[5],
    'DistanceToTurism':[0.3],
    'Capacity':[4]
})

model_2.predict(new_hotel)[0]
Out[120]:
np.float64(758.5303113576596)
In [121]:
new_hotel = new_hotel.drop(columns='Capacity')
model_3.predict(new_hotel)[0]
Out[121]:
np.float64(631.0491442885145)
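
As mentioned when cleaning the data, we set aside the rooms priced under $50. Following up on that idea (a sketch, not one of the original cells), we can ask model_2 what it thinks those rooms are worth:

# Predict prices for the unrealistic (< $50) rooms removed earlier
x_outliers = sm.add_constant(df_price_under_50.drop(columns='Price'))
df_price_under_50['PredictedPrice'] = model_2.predict(
    x_outliers[['const', 'Stars', 'DistanceToTurism', 'Capacity']]
)
df_price_under_50[['Price', 'PredictedPrice']].head()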

The next step is to check for multicollinearity in model_2.

A VIF below 5 is considered indicative of no multicollinearity. Our model has a very low VIF for each variable, which is ideal (the constant's higher VIF reflects the intercept and isn't interpreted).

In [122]:
variables = ['const', 'Stars', 'DistanceToTurism', 'Capacity']

vif = pd.DataFrame()

vif['variables'] = variables

vif['vif'] = [variance_inflation_factor(x_train[variables], i) for i in range(len(variables))]

vif
Out[122]:
variables vif
0 const 13.004992
1 Stars 1.003638
2 DistanceToTurism 1.001515
3 Capacity 1.004657

Making a predicted price vs. real price graph. Analyzing the graph, we can see a linear trend, which is good; it indicates our model predicts prices that are feasible for a hotel room, even if not perfectly accurate

In [123]:
y_predicted_train = model_2.predict(x_train)
sns.set_style('whitegrid')
ax = sns.scatterplot(x=y_predicted_train, y=y_train)
ax.set_title("Hotel rooms: Model 2 Predicted Price x Real Hotel Price")
ax.set_xlabel("Model 2 Predicted Price")
ax.set_ylabel("Real Hotel Price")
plt.show()
[Figure: scatterplot of Model 2's predicted price vs. the real hotel price]

Making a predicted price vs. residuals graph to analyze heteroscedasticity. Residuals scattered randomly around zero, with no trend or funnel shape, indicate homoscedasticity: the model's error variance is consistent across the price interval we have.

We can see our model is homoscedastic, as the residuals are spread evenly along the whole length of the x-axis

In [124]:
residuals = model_2.resid
In [125]:
ax = sns.scatterplot(x=y_predicted_train, y=residuals)
ax.set_title('Residuals x Price prediction')
ax.set_xlabel('Price prediction')
ax.set_ylabel('Residual')
plt.show()
[Figure: scatterplot of residuals vs. predicted price]
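
The visual check can be backed by a formal test. A sketch using statsmodels' Breusch-Pagan test (not one of the original cells; its null hypothesis is homoscedasticity, so a large p-value is consistent with the plot):

from statsmodels.stats.diagnostic import het_breuschpagan
# Test model_2's residuals against the training design matrix
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    residuals, x_train[['const', 'Stars', 'DistanceToTurism', 'Capacity']]
)
lm_pvalue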

Conclusion/Next steps

We can conclude that model 2 would be the best model for our purposes of predicting a hotel room price.

The next step would be to implement this in a real hotel pricing API. Of course, this dataset is fictional, but similar results can be achieved with a real dataset, adjusting for different variables.
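
As a minimal sketch of what serving the model could look like (a hypothetical interface, not part of the original notebook; a real API would add input validation, model persistence, and so on):

# Hypothetical helper wrapping model_2 for a pricing endpoint
def price_room(stars: int, distance_km: float, capacity: int) -> float:
    row = pd.DataFrame({
        'const': [1.0],
        'Stars': [stars],
        'DistanceToTurism': [distance_km],
        'Capacity': [capacity],
    })
    return float(model_2.predict(row)[0])

price_room(stars=5, distance_km=0.3, capacity=4)  # ≈ 758.53, as predicted above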

This project was made as an effort to learn about linear regression models and apply them to useful situations. I learned a lot, as this was my first project of this kind, and I am eager to improve, learn more, and apply that knowledge to real-world problems.

Thank you so much for reading!






To contact me: estefanotuyama@gmail.com