Hotel Pricing Model¶
Introduction¶
Problem: Arbitrarily choosing a price for a hotel room is not an efficient way to maximize profit. It can work, but an ML model that prices a room based on existing hotel room data can be both more reliable and time-saving.¶
Approach: This project tackles that problem by building several linear regression models with different sets of parameters, comparing them, and choosing the one best suited to our purposes.¶
Dataset: Consists of 1000 fictional hotel rooms. Each room has 4 features:¶
- Number of stars (1 to 5)
- Distance to tourist attractions, in km (the DistanceToTurism column)
- Capacity, or the number of people this room can house
- Price (in dollars)
Tools used: Python, Pandas, NumPy, Seaborn, Matplotlib, Statsmodels, Scikit-learn¶
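The code cells below rely on a few imports that are not shown in this export; a minimal set they appear to assume (using the conventional aliases, and possibly organized differently in the original notebook) would be:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import r2_score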
Taking a look at the first few rows of the dataset¶
df = pd.read_csv('hotels.csv')
df.head()
|   | Stars | DistanceToTurism | Capacity | Price |
|---|---|---|---|---|
0 | 5 | 9.301565 | 3 | 506.275452 |
1 | 1 | 1.785891 | 1 | 246.363458 |
2 | 4 | 15.504293 | 3 | 325.873550 |
3 | 4 | 4.173188 | 3 | 521.343284 |
4 | 4 | 9.443685 | 1 | 252.587087 |
Checking the DataFrame's information to see whether there are null values or whether we need to make type conversions. Luckily, this dataset is complete and the dtypes are appropriate¶
df.info()
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Stars             1000 non-null   int64
 1   DistanceToTurism  1000 non-null   float64
 2   Capacity          1000 non-null   int64
 3   Price             1000 non-null   float64
Looking at a statistical description of the dataset to check for anything unusual. It turns out there are hotel rooms with negative prices, which makes no sense.¶
We should therefore clean up the DataFrame to remove those negative values. We will also remove any room priced below $50, as that is not realistic. Furthermore, we can set these unrealistic rows aside, predict prices for them later, and check whether the values our future model comes up with are more plausible.
df.describe()
|   | Stars | DistanceToTurism | Capacity | Price |
|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | 3.008000 | 7.650878 | 2.519000 | 396.611361 |
std | 1.407095 | 5.870137 | 1.108543 | 171.742433 |
min | 1.000000 | 0.013850 | 1.000000 | -220.208705 |
25% | 2.000000 | 3.034775 | 2.000000 | 283.590980 |
50% | 3.000000 | 6.430035 | 3.000000 | 401.743527 |
75% | 4.000000 | 10.863295 | 4.000000 | 516.097856 |
max | 5.000000 | 31.709748 | 4.000000 | 836.261308 |
df_price_under_50 = df[df['Price'] < 50].copy()
df = df[df['Price'] >= 50].copy()
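As a quick sanity check (a hypothetical follow-up, not part of the original notebook), we can report how many rows were kept and how many were set aside:

# Count how many unrealistic rows were filtered out of the working DataFrame
print(f"Rows kept: {len(df)} | Rows set aside (price < $50): {len(df_price_under_50)}")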
Using Seaborn's pairplot to take a look at which variables are correlated with the room price.¶
We can see that distance to tourist attractions appears strongly (and negatively) correlated with price, and stars and capacity also appear correlated with price.
sns.pairplot(df, x_vars=['Stars', 'DistanceToTurism', 'Capacity'], y_vars='Price')
plt.show()
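To complement the visual inspection with numbers, we could also look at the pairwise correlations of each feature with Price (an optional extra, not in the original notebook):

# Numeric correlations between each column and Price, from weakest to strongest
df.corr()['Price'].sort_values()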
Splitting the data into train and test sets to train our model. We will use 80% of the data for training and 20% for testing¶
from sklearn.model_selection import train_test_split
x = df.drop(columns='Price')
y = df['Price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=136)
Creating the training DataFrame¶
df_train = pd.DataFrame(data=x_train)
df_train['Price'] = y_train
df_train.head()
|   | Stars | DistanceToTurism | Capacity | Price |
|---|---|---|---|---|
360 | 5 | 23.466470 | 1 | 61.495168 |
700 | 1 | 1.592505 | 2 | 298.057859 |
353 | 3 | 2.444435 | 3 | 582.630609 |
671 | 3 | 19.223248 | 3 | 220.208619 |
262 | 2 | 11.541840 | 4 | 297.254357 |
Creating the first model and evaluating it.¶
An R² of 0.43 indicates that distance to tourist attractions does have a relationship with the price, although it doesn't explain the price by itself (as expected).
model = ols('Price ~ DistanceToTurism', data=df_train).fit()
model.rsquared
np.float64(0.43067515674525736)
model.resid
360    -40.298231
700   -214.875867
353     85.709651
671     38.660135
262    -28.672899
          ...
56      84.778015
661    -47.532610
983   -154.795662
240    249.319655
230    -31.831201
Length: 778
Using the first model to make predictions on the test data¶
Since the test R² (0.32) is noticeably lower than the training R² (0.43), we can't say this model is consistent. We should then look into building other models.
y_predict = model.predict(x_test)
r2_score(y_test, y_predict)
0.3192107616205031
Adding a constant to the training DataFrame, since statsmodels' OLS does not include an intercept automatically¶
x_train = sm.add_constant(x_train)
x_train.head()
|   | const | Stars | DistanceToTurism | Capacity |
|---|---|---|---|---|
360 | 1.0 | 5 | 23.466470 | 1 |
700 | 1.0 | 1 | 1.592505 | 2 |
353 | 1.0 | 3 | 2.444435 | 3 |
671 | 1.0 | 3 | 19.223248 | 3 |
262 | 1.0 | 2 | 11.541840 | 4 |
Creating additional models with different combinations of explanatory variables, so we can compare them and see which one performs best (a side-by-side comparison sketch follows the model definitions below)¶
model_2 = sm.OLS(
y_train,
x_train[['const', 'Stars', 'DistanceToTurism', 'Capacity']]
).fit()
model_3 = sm.OLS(
y_train,
x_train[['const', 'Stars', 'DistanceToTurism']]
).fit()
model_4 = sm.OLS(
y_train,
x_train[['const', 'Stars', 'Capacity']]
).fit()
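As hinted above, before reading each full summary we can line the candidate models up in a single comparison table. This is a convenience sketch, not part of the original notebook; the comparison name is hypothetical, and the values come straight from the statsmodels results objects:

# Compare the fitted models side by side (R², adjusted R², AIC)
comparison = pd.DataFrame({
    'model': ['model (Distance only)', 'model_2 (all variables)', 'model_3 (Stars + Distance)', 'model_4 (Stars + Capacity)'],
    'r_squared': [m.rsquared for m in (model, model_2, model_3, model_4)],
    'adj_r_squared': [m.rsquared_adj for m in (model, model_2, model_3, model_4)],
    'aic': [m.aic for m in (model, model_2, model_3, model_4)],
})
comparison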
model.summary()
Dep. Variable: | Price | R-squared: | 0.431 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.430 |
Method: | Least Squares | F-statistic: | 587.0 |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 5.23e-97 |
Time: | 16:46:43 | Log-Likelihood: | -4815.8 |
No. Observations: | 778 | AIC: | 9636. |
Df Residuals: | 776 | BIC: | 9645. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
Intercept | 542.8662 | 7.079 | 76.684 | 0.000 | 528.969 | 556.763 |
DistanceToTurism | -18.7959 | 0.776 | -24.228 | 0.000 | -20.319 | -17.273 |
Omnibus: | 14.039 | Durbin-Watson: | 2.075 |
---|---|---|---|
Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 8.130 |
Skew: | -0.005 | Prob(JB): | 0.0172 |
Kurtosis: | 2.499 | Cond. No. | 15.4 |
model_2.summary()
Dep. Variable: | Price | R-squared: | 0.913 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.913 |
Method: | Least Squares | F-statistic: | 2704. |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 0.00 |
Time: | 16:46:50 | Log-Likelihood: | -4085.5 |
No. Observations: | 778 | AIC: | 8179. |
Df Residuals: | 774 | BIC: | 8198. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
const | 196.8588 | 5.985 | 32.894 | 0.000 | 185.111 | 208.607 |
Stars | 50.3854 | 1.186 | 42.496 | 0.000 | 48.058 | 52.713 |
DistanceToTurism | -19.5439 | 0.304 | -64.281 | 0.000 | -20.141 | -18.947 |
Capacity | 78.9019 | 1.512 | 52.195 | 0.000 | 75.934 | 81.869 |
Omnibus: | 1.482 | Durbin-Watson: | 1.885 |
---|---|---|---|
Prob(Omnibus): | 0.477 | Jarque-Bera (JB): | 1.510 |
Skew: | -0.062 | Prob(JB): | 0.470 |
Kurtosis: | 2.823 | Cond. No. | 35.9 |
model_3.summary()
Dep. Variable: | Price | R-squared: | 0.606 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.605 |
Method: | Least Squares | F-statistic: | 596.9 |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 1.25e-157 |
Time: | 16:46:54 | Log-Likelihood: | -4672.3 |
No. Observations: | 778 | AIC: | 9351. |
Df Residuals: | 775 | BIC: | 9365. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
const | 402.8791 | 9.557 | 42.154 | 0.000 | 384.118 | 421.641 |
Stars | 46.7724 | 2.515 | 18.599 | 0.000 | 41.836 | 51.709 |
DistanceToTurism | -18.9726 | 0.646 | -29.390 | 0.000 | -20.240 | -17.705 |
Omnibus: | 114.275 | Durbin-Watson: | 1.999 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 27.954 |
Skew: | -0.009 | Prob(JB): | 8.51e-07 |
Kurtosis: | 2.072 | Cond. No. | 26.3 |
model_4.summary()
Dep. Variable: | Price | R-squared: | 0.448 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.447 |
Method: | Least Squares | F-statistic: | 314.5 |
Date: | Thu, 13 Mar 2025 | Prob (F-statistic): | 1.02e-100 |
Time: | 16:46:56 | Log-Likelihood: | -4803.8 |
No. Observations: | 778 | AIC: | 9614. |
Df Residuals: | 775 | BIC: | 9628. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
const | 66.6932 | 14.169 | 4.707 | 0.000 | 38.878 | 94.508 |
Stars | 49.1060 | 2.983 | 16.463 | 0.000 | 43.251 | 54.961 |
Capacity | 75.4037 | 3.801 | 19.838 | 0.000 | 67.942 | 82.865 |
Omnibus: | 42.160 | Durbin-Watson: | 2.004 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 47.995 |
Skew: | -0.607 | Prob(JB): | 3.78e-11 |
Kurtosis: | 3.075 | Cond. No. | 14.8 |
Looking at model_2's parameters, we can see how much the room price is affected by a unit change in each parameter (everything else remaining constant).¶
For example, for an increase of 1 in the room capacity, we can expect the price to increase by roughly $79 on average.
Insights: star rating and capacity are key drivers that push the price up, while a greater distance to tourist attractions lowers it.
model_2.params
const               196.858766
Stars                50.385416
DistanceToTurism    -19.543904
Capacity             78.901909
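As a sanity check on this interpretation, we can reproduce a prediction by hand from these coefficients. For a 5-star room 0.3 km from the attractions with capacity 4 (the same example priced further below), the sum matches model_2.predict:

# Manual prediction from model_2's coefficients for Stars=5, DistanceToTurism=0.3, Capacity=4
price = 196.858766 + 50.385416 * 5 - 19.543904 * 0.3 + 78.901909 * 4
print(price)  # ~758.53, matching model_2.predict for the same room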
Adding a constant and predicting the values on the test split using our best model¶
x_test = sm.add_constant(x_test)
predict_2 = model_2.predict(x_test[['const', 'Stars', 'DistanceToTurism', 'Capacity']])
Checking that the R² is similar between training and testing.¶
It is similar (0.91 on training vs 0.90 on testing), indicating that the model is consistent.
print(r2_score(y_test, predict_2), model_2.rsquared)
0.8976636624423547 0.9129107017167504
Next, let's price a new hotel room using our two best models so we can see the difference.¶
We can see that there is over \$100 of difference between the two models' prices for the same room. This matters to us, because a \$100 gap can determine whether a guest books the room or not.
new_hotel = pd.DataFrame({
'const':[1],
'Stars':[5],
'DistanceToTurism':[0.3],
'Capacity':[4]
})
model_2.predict(new_hotel)[0]
np.float64(758.5303113576596)
new_hotel = new_hotel.drop(columns='Capacity')
model_3.predict(new_hotel)[0]
np.float64(631.0491442885145)
The next step is to check model_2 for multicollinearity.¶
A VIF below 5 is generally taken to indicate no problematic multicollinearity. Each of our explanatory variables has a VIF very close to 1, which is ideal (the constant's higher VIF can be ignored, as the intercept is not interpreted this way).
variables = ['const', 'Stars', 'DistanceToTurism', 'Capacity']
vif = pd.DataFrame()
vif['variables'] = variables
vif['vif'] = [variance_inflation_factor(x_train[variables], i) for i in range(len(variables))]
vif
|   | variables | vif |
|---|---|---|
0 | const | 13.004992 |
1 | Stars | 1.003638 |
2 | DistanceToTurism | 1.001515 |
3 | Capacity | 1.004657 |
Making a predicted price vs. real price graph. Analyzing the graph, we can see a linear trend, which is good: it indicates our model predicts prices that are feasible for a hotel room, even if it isn't perfectly accurate¶
y_predicted_train = model_2.predict(x_train)
sns.set_style('whitegrid')
ax = sns.scatterplot(x=y_predicted_train, y=y_train)
ax.set_title("Hotel rooms: Model 2 Predicted Price x Real Hotel Price")
ax.set_xlabel("Model 2 Predicted Price")
ax.set_ylabel("Real Hotel Price")
plt.show()
Making a predicted price vs. residuals graph to check for heteroscedasticity. What we want is residuals scattered randomly around zero, with no pattern such as a funnel shape; that indicates homoscedasticity, meaning the model is consistent across the whole price range we have.¶
We can see that our model is homoscedastic, as the residuals are spread evenly around zero across the full range of predicted prices.
residuals = model_2.resid
ax = sns.scatterplot(x=y_predicted_train, y=residuals)
ax.set_title('Residuals x Price prediction')
ax.set_xlabel('Price prediction')
ax.set_ylabel('Residual')
plt.show()
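The visual check can be complemented with a formal test such as Breusch-Pagan (an optional extra, not part of the original notebook); a high p-value would support the homoscedasticity reading:

# Breusch-Pagan test for heteroscedasticity on model_2's residuals
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model_2.resid, x_train[['const', 'Stars', 'DistanceToTurism', 'Capacity']])
print(f"LM p-value: {lm_pvalue:.3f}, F p-value: {f_pvalue:.3f}")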
Conclusion/Next steps¶
We can conclude that model_2 is the best of the models we built for predicting hotel room prices.
The next step would be to serve this model through a real hotel pricing API. Of course, this dataset is fictional, but similar results can be achieved with a real dataset, adjusting for different variables.
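As a first step in that direction, here is a minimal sketch of how model_2 could be persisted and wrapped in a small pricing helper (names such as predict_price and hotel_price_model.pkl are hypothetical, not part of this project):

# Persist the fitted model and expose a small pricing helper (illustrative sketch)
import pickle

with open('hotel_price_model.pkl', 'wb') as f:
    pickle.dump(model_2, f)

def predict_price(stars: int, distance_to_turism: float, capacity: int) -> float:
    """Return the predicted nightly price for a room, using the saved model."""
    with open('hotel_price_model.pkl', 'rb') as f:
        fitted = pickle.load(f)
    room = pd.DataFrame({
        'const': [1.0],
        'Stars': [stars],
        'DistanceToTurism': [distance_to_turism],
        'Capacity': [capacity],
    })
    return float(fitted.predict(room)[0])

# Example usage
print(predict_price(5, 0.3, 4))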
This project was made as an effort to learn about linear regression models and apply them to practical situations. I learned a lot, as this was my first project of this kind, and I am eager to keep improving, learn more, and apply this knowledge to real-world problems.
Thank you so much for reading!