Hotel Pricing Model¶
Introduction¶
Problem: Arbitrarily choosing a price for a hotel room is not an efficient way to maximize profit. It can work, but an ML model that prices a room based on existing hotel room data can be both more reliable and time-saving.¶
Approach: This project tackles that problem by building several linear regression models with different sets of parameters, comparing them, and choosing the one best suited to our purposes.¶
Dataset: Consists of 1000 fictional hotel rooms. Each room has 4 features:¶
- Number of stars (1 to 5)
- Distance to tourist attractions, in km (the DistanceToTurism column)
- Capacity, or the number of people this room can house
- Price (in dollars)
Tools used: Python, Pandas, NumPy, Seaborn, Matplotlib, Statsmodels, Scikit-learn¶
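The code cells below rely on a few imports that are not shown in this export; a minimal set they appear to assume (using the conventional aliases, and possibly organized differently in the original notebook) would be:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import r2_score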
Taking a look at the first few rows of the dataset¶
df = pd.read_csv('hotels.csv')
df.head()
|   | Stars | DistanceToTurism | Capacity | Price |
|---|---|---|---|---|
0 | 5 | 9.301565 | 3 | 506.275452 |
1 | 1 | 1.785891 | 1 | 246.363458 |
2 | 4 | 15.504293 | 3 | 325.873550 |
3 | 4 | 4.173188 | 3 | 521.343284 |
4 | 4 | 9.443685 | 1 | 252.587087 |
Checking the DataFrame's information to see whether there are null values or whether we need to make type conversions. Luckily, this dataset is complete and the dtypes are appropriate¶
df.info()
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Stars             1000 non-null   int64
 1   DistanceToTurism  1000 non-null   float64
 2   Capacity          1000 non-null   int64
 3   Price             1000 non-null   float64
Looking at a statistical description of the dataset to check for anything unusual. It turns out there are hotel rooms with negative prices, which makes no sense.¶
We should therefore clean up the DataFrame to remove those negative values. We will also remove any room priced below $50, as that is not realistic. Furthermore, we can set these unrealistic rows aside, predict prices for them later, and check whether the values our future model comes up with are more plausible.
df.describe()
|   | Stars | DistanceToTurism | Capacity | Price |
|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 3.008000 | 7.650878 | 2.519000 | 396.611361 |
std | 1.407095 | 5.870137 | 1.108543 | 171.742433 |
min | 1.000000 | 0.013850 | 1.000000 | -220.208705 |
25% | 2.000000 | 3.034775 | 2.000000 | 283.590980 |
50% | 3.000000 | 6.430035 | 3.000000 | 401.743527 |
75% | 4.000000 | 10.863295 | 4.000000 | 516.097856 |
max | 5.000000 | 31.709748 | 4.000000 | 836.261308 |
df_price_under_50 = df[df['Price'] < 50].copy()
df = df[df['Price'] >= 50].copy()
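As a quick sanity check (a hypothetical follow-up, not part of the original notebook), we can report how many rows were kept and how many were set aside:

# Count how many unrealistic rows were filtered out of the working DataFrame
print(f"Rows kept: {len(df)} | Rows set aside (price < $50): {len(df_price_under_50)}")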
Using Seaborn's pairplot to take a look at which variables are correlated with the room price.¶
We can see that distance to tourist attractions appears strongly (and negatively) correlated with price, and stars and capacity also appear correlated with price.
sns.pairplot(df, x_vars=['Stars', 'DistanceToTurism', 'Capacity'], y_vars='Price')
plt.show()
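To complement the visual inspection with numbers, we could also look at the pairwise correlations of each feature with Price (an optional extra, not in the original notebook):

# Numeric correlations between each column and Price, from weakest to strongest
df.corr()['Price'].sort_values()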
Splitting the data into train and test sets to train our model. We will use 80% of the data for training and 20% for testing¶
from sklearn.model_selection import train_test_split
x = df.drop(columns='Price')
y = df['Price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=136)
Creating the training DataFrame¶
df_train = pd.DataFrame(data=x_train)
df_train['Price'] = y_train
df_train.head()
|   | Stars | DistanceToTurism | Capacity | Price |
|---|---|---|---|---|
360 | 5 | 23.466470 | 1 | 61.495168 |
700 | 1 | 1.592505 | 2 | 298.057859 |
353 | 3 | 2.444435 | 3 | 582.630609 |
671 | 3 | 19.223248 | 3 | 220.208619 |
262 | 2 | 11.541840 | 4 | 297.254357 |
Creating the first model and evaluating it.¶
An R² of 0.43 indicates that distance to tourist attractions does have a relationship with the price, although it doesn't explain the price by itself (as expected).
model = ols('Price ~ DistanceToTurism', data=df_train).fit()
model.rsquared
np.float64(0.43067515674525736)
model.resid
360    -40.298231
700   -214.875867
353     85.709651
671     38.660135
262    -28.672899
          ...
56      84.778015
661    -47.532610
983   -154.795662
240    249.319655
230    -31.831201
Length: 778
Using the first model to make predictions on the test data¶
Since the test R² (0.32) is noticeably lower than the training R² (0.43), we can't say this model is consistent. We should then look into building other models.
y_predict = model.predict(x_test)
r2_score(y_test, y_predict)
0.3192107616205031
Adding a constant to the training DataFrame, since statsmodels' OLS does not include an intercept automatically¶
x_train = sm.add_constant(x_train)
x_train.head()
|   | const | Stars | DistanceToTurism | Capacity |
|---|---|---|---|---|
360 | 1.0 | 5 | 23.466470 | 1 |
700 | 1.0 | 1 | 1.592505 | 2 |
353 | 1.0 | 3 | 2.444435 | 3 |
671 | 1.0 | 3 | 19.223248 | 3 |
262 | 1.0 | 2 | 11.541840 | 4 |
Creating additional models with different combinations of explanatory variables, so we can compare them and see which one performs best (a side-by-side comparison sketch follows the model definitions below)¶
model_2 = sm.OLS(
y_train,
x_train[['const', 'Stars', 'DistanceToTurism', 'Capacity']]
).fit()
model_3 = sm.OLS(
y_train,
x_train[['const', 'Stars', 'DistanceToTurism']]
).fit()
model_4 = sm.OLS(
y_train,
x_train[['const', 'Stars', 'Capacity']]
).fit()
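As hinted above, before reading each full summary we can line the candidate models up in a single comparison table. This is a convenience sketch, not part of the original notebook; the comparison name is hypothetical, and the values come straight from the statsmodels results objects:

# Compare the fitted models side by side (R², adjusted R², AIC)
comparison = pd.DataFrame({
    'model': ['model (Distance only)', 'model_2 (all variables)', 'model_3 (Stars + Distance)', 'model_4 (Stars + Capacity)'],
    'r_squared': [m.rsquared for m in (model, model_2, model_3, model_4)],
    'adj_r_squared': [m.rsquared_adj for m in (model, model_2, model_3, model_4)],
    'aic': [m.aic for m in (model, model_2, model_3, model_4)],
})
comparison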
model.summary()
Dep. Variable: | Price | R-squared: | 0.431 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.430 |
Method: | Least Squares | F-statistic: | 587.0 |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 5.23e-97 |
Time: | 16:46:43 | Log-Likelihood: | -4815.8 |
No. Observations: | 778 | AIC: | 9636. |
Df Residuals: | 776 | BIC: | 9645. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
Intercept | 542.8662 | 7.079 | 76.684 | 0.000 | 528.969 | 556.763 |
DistanceToTurism | -18.7959 | 0.776 | -24.228 | 0.000 | -20.319 | -17.273 |
Omnibus: | 14.039 | Durbin-Watson: | 2.075 |
---|---|---|---|
Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 8.130 |
Skew: | -0.005 | Prob(JB): | 0.0172 |
Kurtosis: | 2.499 | Cond. No. | 15.4 |
model_2.summary()
Dep. Variable: | Price | R-squared: | 0.913 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.913 |
Method: | Least Squares | F-statistic: | 2704. |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 0.00 |
Time: | 16:46:50 | Log-Likelihood: | -4085.5 |
No. Observations: | 778 | AIC: | 8179. |
Df Residuals: | 774 | BIC: | 8198. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
const | 196.8588 | 5.985 | 32.894 | 0.000 | 185.111 | 208.607 |
Stars | 50.3854 | 1.186 | 42.496 | 0.000 | 48.058 | 52.713 |
DistanceToTurism | -19.5439 | 0.304 | -64.281 | 0.000 | -20.141 | -18.947 |
Capacity | 78.9019 | 1.512 | 52.195 | 0.000 | 75.934 | 81.869 |
Omnibus: | 1.482 | Durbin-Watson: | 1.885 |
---|---|---|---|
Prob(Omnibus): | 0.477 | Jarque-Bera (JB): | 1.510 |
Skew: | -0.062 | Prob(JB): | 0.470 |
Kurtosis: | 2.823 | Cond. No. | 35.9 |
model_3.summary()
Dep. Variable: | Price | R-squared: | 0.606 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.605 |
Method: | Least Squares | F-statistic: | 596.9 |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 1.25e-157 |
Time: | 16:46:54 | Log-Likelihood: | -4672.3 |
No. Observations: | 778 | AIC: | 9351. |
Df Residuals: | 775 | BIC: | 9365. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
const | 402.8791 | 9.557 | 42.154 | 0.000 | 384.118 | 421.641 |
Stars | 46.7724 | 2.515 | 18.599 | 0.000 | 41.836 | 51.709 |
DistanceToTurism | -18.9726 | 0.646 | -29.390 | 0.000 | -20.240 | -17.705 |
Omnibus: | 114.275 | Durbin-Watson: | 1.999 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 27.954 |
Skew: | -0.009 | Prob(JB): | 8.51e-07 |
Kurtosis: | 2.072 | Cond. No. | 26.3 |
model_4.summary()
Dep. Variable: | Price | R-squared: | 0.448 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.447 |
Method: | Least Squares | F-statistic: | 314.5 |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 1.02e-100 |
Time: | 16:46:56 | Log-Likelihood: | -4803.8 |
No. Observations: | 778 | AIC: | 9614. |
Df Residuals: | 775 | BIC: | 9628. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
const | 66.6932 | 14.169 | 4.707 | 0.000 | 38.878 | 94.508 |
Stars | 49.1060 | 2.983 | 16.463 | 0.000 | 43.251 | 54.961 |
Capacity | 75.4037 | 3.801 | 19.838 | 0.000 | 67.942 | 82.865 |
Omnibus: | 42.160 | Durbin-Watson: | 2.004 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 47.995 |
Skew: | -0.607 | Prob(JB): | 3.78e-11 |
Kurtosis: | 3.075 | Cond. No. | 14.8 |
Looking at model_2's parameters, we can see how much the room price is affected by a unit change in each parameter (everything else remaining constant).¶
For example, for an increase of 1 in the room capacity, we can expect the price to increase by roughly $79 on average.
Insights: star rating and capacity are key drivers that push the price up, while a greater distance to tourist attractions lowers it.
model_2.params
const               196.858766
Stars                50.385416
DistanceToTurism    -19.543904
Capacity             78.901909
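As a sanity check on this interpretation, we can reproduce a prediction by hand from these coefficients. For a 5-star room 0.3 km from the attractions with capacity 4 (the same example priced further below), the sum matches model_2.predict:

# Manual prediction from model_2's coefficients for Stars=5, DistanceToTurism=0.3, Capacity=4
price = 196.858766 + 50.385416 * 5 - 19.543904 * 0.3 + 78.901909 * 4
print(price)  # ~758.53, matching model_2.predict for the same room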
Adding a constant and predicting the values on the test split using our best model¶
x_test = sm.add_constant(x_test)
predict_2 = model_2.predict(x_test[['const', 'Stars', 'DistanceToTurism', 'Capacity']])
Checking that the R² is similar between training and testing.¶
It is similar (0.91 on training vs 0.90 on testing), indicating that the model is consistent.
print(r2_score(y_test, predict_2), model_2.rsquared)
0.8976636624423547 0.9129107017167504
Next, let's price a new hotel room using our two best models so we can see the difference.¶
We can see that there is over \$100 of difference between the two models' prices for the same room. This matters to us, because a \$100 gap can determine whether a guest books the room or not.
new_hotel = pd.DataFrame({
'const':[1],
'Stars':[5],
'DistanceToTurism':[0.3],
'Capacity':[4]
})
model_2.predict(new_hotel)[0]
np.float64(758.5303113576596)
new_hotel = new_hotel.drop(columns='Capacity')
model_3.predict(new_hotel)[0]
np.float64(631.0491442885145)
The next step is to check model_2 for multicollinearity.¶
A VIF below 5 is generally taken to indicate no problematic multicollinearity. Each of our explanatory variables has a VIF very close to 1, which is ideal (the constant's higher VIF can be ignored, as the intercept is not interpreted this way).
variables = ['const', 'Stars', 'DistanceToTurism', 'Capacity']
vif = pd.DataFrame()
vif['variables'] = variables
vif['vif'] = [variance_inflation_factor(x_train[variables], i) for i in range(len(variables))]
vif
|   | variables | vif |
|---|---|---|
0 | const | 13.004992 |
1 | Stars | 1.003638 |
2 | DistanceToTurism | 1.001515 |
3 | Capacity | 1.004657 |
Making a predicted price vs. real price graph. Analyzing the graph, we can see a linear trend, which is good: it indicates our model predicts prices that are feasible for a hotel room, even if it isn't perfectly accurate¶
y_predicted_train = model_2.predict(x_train)
sns.set_style('whitegrid')
ax = sns.scatterplot(x=y_predicted_train, y=y_train)
ax.set_title("Hotel rooms: Model 2 Predicted Price x Real Hotel Price")
ax.set_xlabel("Model 2 Predicted Price")
ax.set_ylabel("Real Hotel Price")
plt.show()
Making a predicted price vs. residuals graph to check for heteroscedasticity. What we want is residuals scattered randomly around zero, with no pattern such as a funnel shape; that indicates homoscedasticity, meaning the model is consistent across the whole price range we have.¶
We can see that our model is homoscedastic, as the residuals are spread evenly around zero across the full range of predicted prices.
residuals = model_2.resid
ax = sns.scatterplot(x=y_predicted_train, y=residuals)
ax.set_title('Residuals x Price prediction')
ax.set_xlabel('Price prediction')
ax.set_ylabel('Residual')
plt.show()
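The visual check can be complemented with a formal test such as Breusch-Pagan (an optional extra, not part of the original notebook); a high p-value would support the homoscedasticity reading:

# Breusch-Pagan test for heteroscedasticity on model_2's residuals
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model_2.resid, x_train[['const', 'Stars', 'DistanceToTurism', 'Capacity']])
print(f"LM p-value: {lm_pvalue:.3f}, F p-value: {f_pvalue:.3f}")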
Conclusion/Next steps¶
We can conclude that model_2 is the best of the models we built for predicting hotel room prices.
The next step would be to serve this model through a real hotel pricing API. Of course, this dataset is fictional, but similar results can be achieved with a real dataset, adjusting for different variables.
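As a first step in that direction, here is a minimal sketch of how model_2 could be persisted and wrapped in a small pricing helper (names such as predict_price and hotel_price_model.pkl are hypothetical, not part of this project):

# Persist the fitted model and expose a small pricing helper (illustrative sketch)
import pickle

with open('hotel_price_model.pkl', 'wb') as f:
    pickle.dump(model_2, f)

def predict_price(stars: int, distance_to_turism: float, capacity: int) -> float:
    """Return the predicted nightly price for a room, using the saved model."""
    with open('hotel_price_model.pkl', 'rb') as f:
        fitted = pickle.load(f)
    room = pd.DataFrame({
        'const': [1.0],
        'Stars': [stars],
        'DistanceToTurism': [distance_to_turism],
        'Capacity': [capacity],
    })
    return float(fitted.predict(room)[0])

# Example usage
print(predict_price(5, 0.3, 4))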
This project was made as an effort to learn about linear regression models and apply them to practical situations. I learned a lot, as this was my first project of this kind, and I am eager to keep improving, learn more, and apply this knowledge to real-world problems.
Thank you so much for reading!