Bike Rental Analysis and Forecasting Model
This project explores bike rental patterns in London and builds a predictive model to forecast future demand. By combining exploratory data analysis (EDA) with time-series forecasting techniques, we uncover seasonal trends and identify key factors influencing bike rentals.
Why?
Identifying consumer trends is an important step towards understanding a business and making informed decisions that maximize profit.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Bike_data.csv")
1. Data Acquisition and Cleaning
Overview
We begin with a dataset containing over 17,000 hourly records of bike rentals. The data includes variables such as date and time, temperature, thermal sensation, humidity, wind speed, weather conditions, holiday indicators, weekend flags, and season.
Data Cleaning
Handling Missing Values and Duplicates:
The dataset contains only 23 missing values, so we removed the affected rows rather than interpolating, which avoids introducing estimated readings. Duplicate records were also dropped.
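Before dropping anything, it can help to count what will be removed. The following is a minimal sketch on a tiny synthetic frame (the values and the name `sample` are made up for illustration, not taken from Bike_data.csv):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the real data (values are made up).
sample = pd.DataFrame({
    "date_time": ["2015-01-04 00:00", "2015-01-04 01:00",
                  "2015-01-04 01:00", "2015-01-04 02:00"],
    "count": [182, 138, 138, 134],
    "temperature": [3.0, 3.0, 3.0, np.nan],
})

# Missing values per column, and the number of rows containing at least one
missing_per_column = sample.isna().sum()
rows_with_missing = int(sample.isna().any(axis=1).sum())

# Fully duplicated rows (every column equal to an earlier row)
duplicate_rows = int(sample.duplicated().sum())

# Same cleaning steps as in the notebook, without modifying the frame in place
cleaned = sample.dropna().drop_duplicates()

print(missing_per_column)
print(rows_with_missing, duplicate_rows, len(cleaned))
```

Comparing the row count before and after cleaning confirms how many records were actually lost.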
Feature Refinement
The irrelevant column (an unnamed index, Unnamed: 0) was removed, and the date_time field was converted to a proper datetime type for time-based analysis.
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
df.drop(columns='Unnamed: 0', inplace=True)
df['date_time'] = pd.to_datetime(df['date_time'])
df.info()
Index: 17406 entries, 0 to 17428
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   date_time          17406 non-null  datetime64[ns]
 1   count              17406 non-null  int64
 2   temperature        17406 non-null  float64
 3   thermal_sensation  17406 non-null  float64
 4   humidity           17406 non-null  float64
 5   wind_speed         17406 non-null  float64
 6   weather            17406 non-null  object
 7   holiday            17406 non-null  int64
 8   weekend            17406 non-null  int64
 9   season             17406 non-null  object
2. Exploratory Data Analysis (EDA)
Initial Insights:
Descriptive Statistics:
The summary statistics reveal that some hourly records show zero rentals, while others see significant activity. Temperature and thermal sensation show similar patterns, while humidity exhibits the largest variation.
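Summary statistics of this kind can be produced with `DataFrame.describe()`. Here is a small sketch on made-up values (the frame `sample` is hypothetical and only mirrors two of the dataset's column names):

```python
import pandas as pd

# Made-up hourly records; only the column names follow the dataset.
sample = pd.DataFrame({
    "count": [0, 120, 450, 3000],
    "humidity": [40.0, 65.0, 80.0, 95.0],
})

# One row per variable, keeping the statistics discussed in the text
summary = sample.describe().T[["min", "mean", "max", "std"]]

# Share of records with zero rentals
zero_share = (sample["count"] == 0).mean()

print(summary)
print(f"Share of records with zero rentals: {zero_share:.2%}")
```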
Distribution Analysis:
Histograms for temperature, thermal sensation, humidity, and wind speed suggest:
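Temperature & Thermal Sensation: Follow a near-normal distribution, so very hot or very cold hours are rare in the data. Whether such extremes reduce rentals cannot be read from these histograms; the correlation analysis below addresses that.
Humidity: The distribution is right-skewed. A histogram of humidity alone says nothing about rentals, and the correlation analysis below finds that more humid conditions go with fewer rentals (-0.46), not more.
Wind Speed: The histogram shows how often each wind speed occurs; any link between stronger wind and fewer rentals has to be checked against the rental counts.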
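Visual impressions such as "near-normal" or "right-skewed" can be backed by a number. As a rough sketch, pandas' `skew()` returns a value near zero for a symmetric distribution, a positive value for a right-skewed one, and a negative value for a left-skewed one. The columns below are illustrative toy data, not the dataset's variables:

```python
import pandas as pd

# Toy columns chosen to show the three cases (not real measurements).
sample = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 10],   # long tail towards large values
    "left_skewed": [1, 9, 10, 10, 10],  # long tail towards small values
})

# Sample skewness per column
skewness = sample.skew()
print(skewness)
```

Applied to the real data, the same call on temperature, thermal_sensation, humidity, and wind_speed would show which way each histogram leans.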
sns.set_theme()  # set the theme before creating the figure so it applies to these axes
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(data=df, x='temperature', ax=axes[0,0], bins=20)
axes[0,0].set_title("Temperature distribution")
sns.histplot(data=df, x='thermal_sensation', ax=axes[0,1], bins=20, color='green')
axes[0,1].set_title("Thermal sensation distribution")
sns.histplot(data=df, x='humidity', ax=axes[1,0], bins=20, color='purple')
axes[1,0].set_title("Humidity distribution")
sns.histplot(data=df, x='wind_speed', ax=axes[1,1], bins=20, color='pink')
axes[1, 1].set_title("Wind speed distribution")
sns.despine()
plt.tight_layout()
plt.show()
3. Correlation Analysis
To assess relationships between variables, we generated a heatmap of the correlation matrix. Most correlations are moderate; two observations stand out:
Temperature: Positively correlated with bike rentals (0.39), suggesting warmer days drive demand.
Humidity: Negatively correlated (-0.46), indicating that overly humid days might deter rentals.
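Alongside the heatmap, the correlations with the target can be listed directly and sorted, which makes the strongest relationships easier to spot. This sketch uses invented numbers, so the resulting coefficients are not those of the dataset:

```python
import pandas as pd

# Invented values; only the column names follow the dataset.
sample = pd.DataFrame({
    "count": [100, 200, 300, 400, 500],
    "temperature": [5.0, 9.0, 12.0, 18.0, 21.0],
    "humidity": [90.0, 85.0, 70.0, 60.0, 55.0],
})

# Pearson correlation of every numeric column with the rental count,
# sorted from most negative to most positive
corr_with_count = (
    sample.corr(numeric_only=True)["count"]
    .drop("count")
    .sort_values()
)
print(corr_with_count)
```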
plt.figure(figsize=(8,4))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='Blues')
plt.show()
weather_df = df.groupby('weather')['count'].sum().reset_index().sort_values(by='count', ascending=False)
plt.figure(figsize=(8, 5))
sns.barplot(data=weather_df, x='count', y='weather', hue='weather')
plt.title("Bikes rented per weather category")
plt.xlabel("Bikes rented (in millions)")
plt.ylabel("Weather")
plt.show()
season_df = df.groupby('season')['count'].sum().reset_index().sort_values(by='count', ascending=False)
plt.figure(figsize=(8, 5))
sns.barplot(data=season_df, x='season', y='count', hue='season', hue_order=['Winter', 'Fall', 'Spring', 'Summer'], palette='coolwarm')
plt.title("Bike rentals per season")
plt.xlabel("Season")
plt.ylabel("Bike rentals (in millions)")
plt.show()
# Extract month and hour before reducing the timestamp to a date
df['month'] = df['date_time'].dt.month
df['hour'] = df['date_time'].dt.hour
# Keep only the calendar date (midnight timestamps) and rename the column
df['date_time'] = df['date_time'].dt.normalize()
df = df.rename(columns={'date_time': 'date'})
df_month = df.groupby('month')['count'].sum().reset_index().sort_values(by='month')
plt.figure(figsize=(8, 5))
sns.barplot(data=df_month, x='month', y='count', hue='month', legend=False)
plt.title('Bike rentals per month (in millions)')
plt.xlabel("Month")
plt.ylabel("Bikes rented (in millions)")
plt.show()
Weekday vs. Weekend Bike Rental Behavior:
The two charts below compare hourly bike rentals on weekdays and weekends. This comparison can help the bike rental company plan its strategy to maximize profit.
Weekday vs. Weekend:
Weekdays: Peak rental times likely coincide with commute hours.
Weekends: Rentals are more evenly distributed, indicating leisure usage.
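One caveat when comparing the two groups: there are more weekdays than weekend days, so hourly sums are not on the same scale. A possible alternative, sketched here on made-up rows, is to compare the mean rentals per hour instead of the sum:

```python
import pandas as pd

# Made-up hourly records; column names follow the dataset.
sample = pd.DataFrame({
    "hour": [8, 8, 8, 13, 13, 13],
    "weekend": [0, 0, 1, 0, 0, 1],
    "count": [900, 1100, 200, 300, 500, 600],
})

# Average rentals per hour, with one column per day type
hourly_mean = (
    sample.groupby(["weekend", "hour"])["count"]
    .mean()
    .unstack("weekend")
)
hourly_mean.columns = ["weekday", "weekend"]  # 0 -> weekday, 1 -> weekend
print(hourly_mean)
```

With means, a taller weekday bar reflects busier hours rather than simply more days.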
df_weekday = df[df['weekend'] == 0]
df_weekend = df[df['weekend'] == 1]
df_weekday_hour = df_weekday.groupby('hour')['count'].sum().reset_index()
df_weekend_hour = df_weekend.groupby('hour')['count'].sum().reset_index()
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
sns.barplot(data=df_weekday_hour, x='hour', y='count', hue='hour', legend=False, ax=axes[0])
axes[0].set_title("Bike rentals per hour on WEEK DAYS")
axes[0].set_xlabel("Hour")
axes[0].set_ylabel("Bikes rented (in millions)")
sns.barplot(data=df_weekend_hour, x='hour', y='count', hue='hour', legend=False, ax=axes[1])
axes[1].set_title("Bike rentals per hour on WEEKENDS")
axes[1].set_xlabel("Hour")
axes[1].set_ylabel("Bikes rented (in millions)")
plt.show()
# Prophet expects a date column named 'ds' and a target column named 'y'
df_prophet = df[['date', 'count']].rename(columns={'date': 'ds', 'count': 'y'})
# Aggregate hourly counts into daily totals
df_prophet = df_prophet.groupby('ds')['y'].sum().reset_index()
# Split chronologically: the first 80% of days for training (584 out of 730 total)
df_train = df_prophet.iloc[:584].copy()
# The remaining 20% (146 days) for testing
df_test = df_prophet.iloc[584:].copy()
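The split above hard-codes the cutoff row. As a sketch of the same idea with the cutoff derived from a fraction, the helper below (`chronological_split` is a hypothetical name, and the start date is made up) keeps the earliest days for training so that the model never sees the future:

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split a Prophet-style frame by time: earliest rows train, latest rows test."""
    ordered = frame.sort_values("ds").reset_index(drop=True)
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Synthetic daily series with the same length as the notebook's data (730 days);
# the start date and the y values are placeholders.
daily = pd.DataFrame({
    "ds": pd.date_range("2015-01-04", periods=730, freq="D"),
    "y": range(730),
})

train, test = chronological_split(daily)
print(len(train), len(test))
```

A random shuffle split would leak future days into training, which is why time-series splits are usually positional.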
Initial Forecasting and Evaluation:
A Prophet model with yearly seasonality was trained on the training set. The model forecasted 150 days into the future.
Performance:
The initial model yielded an RMSE of approximately 6,145 bikes per day. Because RMSE squares the errors before averaging, it describes a typical error size that weights large misses more heavily, not a simple average miss.
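To illustrate how RMSE behaves, the sketch below compares it with the mean absolute error (MAE) on four invented daily totals. A single large miss pulls RMSE well above MAE:

```python
import numpy as np

# Invented daily totals and predictions, for illustration only.
y_true = np.array([20000.0, 25000.0, 30000.0, 35000.0])
y_pred = np.array([21000.0, 24000.0, 30000.0, 41000.0])

errors = y_pred - y_true  # 1000, -1000, 0, 6000

rmse = np.sqrt(np.mean(errors ** 2))  # square root of the mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error

print(f"RMSE: {rmse:.1f}, MAE: {mae:.1f}")
```

Reporting both metrics would show whether a high RMSE comes from consistently mediocre predictions or from a few very bad days.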
import numpy as np
from prophet import Prophet
np.random.seed(4482)
model1 = Prophet(yearly_seasonality=True)
# Training the model
model1.fit(df_train)
# Creating a dataframe for future predictions
future = model1.make_future_dataframe(periods=150, freq='D')
# Making the prediction
prediction = model1.predict(future)
12:00:13 - cmdstanpy - INFO - Chain [1] start processing
12:00:13 - cmdstanpy - INFO - Chain [1] done processing
from sklearn.metrics import mean_squared_error
df_prediction = prediction[['ds', 'yhat']]
df_comparation = pd.merge(df_prediction, df_test, on='ds')
mse = mean_squared_error(df_comparation['y'], df_comparation['yhat'])
rmse = np.sqrt(mse)
print(f'MSE:{mse}, RMSE:{rmse}')
MSE:37755631.75189984, RMSE:6144.561152100273
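Improving Model Accuracy by Treating Outliers:
Outliers were then addressed to improve forecast accuracy. A Prophet model was fit on the full daily series, and days whose rental counts fell outside its uncertainty interval (yhat_lower to yhat_upper) were removed. The forecasting model was then retrained on the filtered data:
Improved Performance:
The refined model’s RMSE dropped to approximately 3,840, a substantial reduction. Because outliers were removed from the test set as well as the training set, the two RMSE values are computed on different test days, so some of the drop may reflect the easier evaluation set.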
np.random.seed(4482)
model2 = Prophet()
model2.fit(df_prophet)
future = model2.make_future_dataframe(periods=0)
prediction = model2.predict(future)
12:00:14 - cmdstanpy - INFO - Chain [1] start processing
12:00:14 - cmdstanpy - INFO - Chain [1] done processing
df_no_outliers = df_prophet[(df_prophet['y'] > prediction['yhat_lower']) & (df_prophet['y'] < prediction['yhat_upper'])]
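The filter above works because both frames share the same row index. A slightly more defensive sketch aligns them on ds explicitly before comparing. The frames `observed` and `forecast` below are hand-made stand-ins for df_prophet and the Prophet output, and all values are invented:

```python
import pandas as pd

# Hand-made stand-ins for the observed series and a Prophet forecast.
observed = pd.DataFrame({
    "ds": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"]),
    "y": [20000, 72000, 21000, 1500],
})
forecast = pd.DataFrame({
    "ds": observed["ds"],
    "yhat_lower": [15000, 16000, 15500, 15000],
    "yhat_upper": [30000, 31000, 30500, 30000],
})

# Align on the date instead of relying on matching row positions
merged = observed.merge(
    forecast[["ds", "yhat_lower", "yhat_upper"]], on="ds", how="inner"
)

# Keep only days strictly inside the uncertainty interval
inside = merged["y"].between(
    merged["yhat_lower"], merged["yhat_upper"], inclusive="neither"
)
no_outliers = merged.loc[inside, ["ds", "y"]].reset_index(drop=True)
print(no_outliers)
```

Note that Prophet's default interval_width is 0.80, so roughly a fifth of ordinary days can be expected to fall outside the interval; a wider interval would remove fewer days.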
After treating outliers, we split the data into training and test sets again and fit a new model.
# Positional split of the filtered data: first 505 days for training, the rest for testing
df_train = df_no_outliers.iloc[:505].copy()
df_test = df_no_outliers.iloc[505:].copy()
np.random.seed(4482)
model3 = Prophet(yearly_seasonality=True)
model3.fit(df_train)
future = model3.make_future_dataframe(periods=365, freq='D')
prediction = model3.predict(future)
12:00:14 - cmdstanpy - INFO - Chain [1] start processing
12:00:14 - cmdstanpy - INFO - Chain [1] done processing
fig1 = model3.plot(prediction)
plt.title("Prophet Model prediction of future bike rentals")
plt.xlabel("Date")
plt.ylabel("Bikes rented")
plt.plot(df_test['ds'], df_test['y'], '.r')
plt.show()
Refined model’s RMSE
df_pred = prediction[['ds', 'yhat']]
df_compar = pd.merge(df_pred, df_test, on='ds')
mse = mean_squared_error(df_compar['y'], df_compar['yhat'])
rmse = np.sqrt(mse)
print(f'MSE: {mse}, RMSE: {rmse}')
MSE: 14747779.651404256, RMSE: 3840.2837982894252
7. Conclusion and Future Work
This project demonstrates a complete end-to-end pipeline:
Data Preparation: Cleaned and preprocessed the dataset.
Exploratory Analysis: Uncovered key patterns such as seasonal trends, the influence of weather, and differences in weekday versus weekend behavior.
Forecasting: Employed Prophet for time series forecasting, and refined the model by addressing outliers.
Key Takeaways:
Bike rentals in London peak during the summer, with distinct patterns on weekdays versus weekends.
Weather and environmental conditions significantly influence rental volumes.
Addressing data anomalies can substantially enhance forecasting accuracy.
Future Enhancements:
Incorporate additional external factors (e.g., special events or public transportation changes) to further refine predictions.
Explore advanced machine learning models and ensemble methods for even greater accuracy.
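As one possible starting point for adding external factors, variables already present in the dataset could be passed to Prophet as extra regressors. The sketch below aggregates invented hourly rows to daily values and registers two regressors with `add_regressor`. The choice of temperature and holiday, and the way they are aggregated, are assumptions rather than tested modeling decisions:

```python
import pandas as pd

# Invented hourly rows; column names follow the dataset.
hourly = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01"] * 2 + ["2016-05-02"] * 2),
    "count": [100, 300, 150, 250],
    "temperature": [10.0, 14.0, 16.0, 20.0],
    "holiday": [0, 0, 1, 1],
})

# Daily frame in Prophet's format, plus one column per candidate regressor
daily = (
    hourly.groupby("date")
    .agg(
        y=("count", "sum"),                 # daily rental total
        temperature=("temperature", "mean"),  # average temperature that day
        holiday=("holiday", "max"),          # 1 if any hour is flagged as holiday
    )
    .reset_index()
    .rename(columns={"date": "ds"})
)
print(daily)

# Registering the regressors (skipped if Prophet is not installed)
try:
    from prophet import Prophet
    model = Prophet(yearly_seasonality=True)
    model.add_regressor("temperature")
    model.add_regressor("holiday")
except ImportError:
    model = None
```

Regressor values must also be supplied for every future date at prediction time, so weather-based regressors would require a weather forecast or an assumed scenario.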
Thank you so much for reading!