Estéfano Tuyama

Bike Rental Analysis and Forecasting Model¶


This project explores bike rental patterns in London and builds a predictive model to forecast future demand. By combining exploratory data analysis (EDA) with time-series forecasting techniques, we uncover seasonal trends and identify key factors influencing bike rentals.¶

Why?¶

Identifying consumer trends is an important step toward understanding a business and making informed decisions that maximize profit.

In [3]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("Bike_data.csv")

1. Data Acquisition and Cleaning¶

Overview¶

We begin with a dataset containing over 17,000 hourly records of bike rentals. The data includes variables such as date and time, temperature, thermal sensation, humidity, wind speed, weather conditions, holiday indicators, weekend flags, and season.¶

Data Cleaning¶

Handling Missing Values and Duplicates:¶

Because only 23 values are missing out of more than 17,000 records, we removed the affected rows rather than interpolating, which avoids introducing estimated values into the data. Duplicate records were also dropped.

Feature Refinement¶

The irrelevant column (an unnamed index) was removed, and the date_time field was converted to a proper datetime format for further time-based analysis.¶

In [4]:
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
df.drop(columns='Unnamed: 0', inplace=True)
df['date_time'] = pd.to_datetime(df['date_time'])
df.info()
Index: 17406 entries, 0 to 17428
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date_time          17406 non-null  datetime64[ns]
 1   count              17406 non-null  int64         
 2   temperature        17406 non-null  float64       
 3   thermal_sensation  17406 non-null  float64       
 4   humidity           17406 non-null  float64       
 5   wind_speed         17406 non-null  float64       
 6   weather            17406 non-null  object        
 7   holiday            17406 non-null  int64         
 8   weekend            17406 non-null  int64         
 9   season             17406 non-null  object
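
Before dropping rows, it can help to count how many are affected so that the cleaning step is easy to audit. The sketch below is only an illustration: `toy` is a small made-up frame, not the project's data, and the column names are borrowed from the dataset for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the raw rental data
toy = pd.DataFrame({
    "date_time": ["2020-01-01 00:00", "2020-01-01 01:00",
                  "2020-01-01 01:00", "2020-01-01 02:00"],
    "count": [5, 7, 7, 3],
    "temperature": [4.0, 3.5, 3.5, np.nan],
})

rows_with_missing = int(toy.isna().any(axis=1).sum())  # rows containing any NaN
duplicate_rows = int(toy.duplicated().sum())           # exact repeats of an earlier row

# Same two cleaning steps as in the notebook, without inplace mutation
cleaned = toy.dropna().drop_duplicates()
print(rows_with_missing, duplicate_rows, len(cleaned))
```

Logging these counts next to the final row count makes it clear how many rows each step removed.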

2. Exploratory Data Analysis (EDA)¶

Initial Insights:¶

  • Descriptive Statistics:

    The summary statistics reveal that some hours record zero rentals, while others see significant activity. Temperature and thermal sensation show similar patterns, while humidity exhibits the largest variation.

  • Distribution Analysis:

    Histograms for temperature, thermal sensation, humidity, and wind speed suggest:

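    • Temperature & Thermal Sensation: Both follow a near-normal distribution, so extreme conditions (very hot or very cold) are rare in the data.

    • Humidity: The distribution is skewed rather than symmetric. The histogram alone does not show how humidity relates to rentals; the correlation analysis in Section 3 finds a negative relationship (-0.46).

    • Wind Speed: The histogram shows how often each wind speed occurs. Whether stronger wind reduces rentals has to be checked against the rental counts rather than read off this plot.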
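
The shape of a histogram can also be summarized with a number. As a rough sketch on made-up values (not the project's data), pandas' `Series.skew()` returns a positive value when the long tail points toward large values and a negative value when it points toward small ones:

```python
import pandas as pd

# Hypothetical values; the long tail toward large values gives positive skew
right_tailed = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 9, 15])
left_tailed = right_tailed.max() - right_tailed  # mirror image of the same data

skew_right = right_tailed.skew()
skew_left = left_tailed.skew()
print(round(skew_right, 2), round(skew_left, 2))
```

Applied to the real columns, something like `df[['temperature', 'humidity', 'wind_speed']].skew()` would give the direction of each skew without relying on visual inspection.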

In [5]:
sns.set_theme()  # set the theme before creating the figure so it applies to these axes
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

sns.histplot(data=df, x='temperature', ax=axes[0, 0], bins=20)
axes[0, 0].set_title("Temperature distribution")

sns.histplot(data=df, x='thermal_sensation', ax=axes[0, 1], bins=20, color='green')
axes[0, 1].set_title("Thermal sensation distribution")

sns.histplot(data=df, x='humidity', ax=axes[1, 0], bins=20, color='purple')
axes[1, 0].set_title("Humidity distribution")

sns.histplot(data=df, x='wind_speed', ax=axes[1, 1], bins=20, color='pink')
axes[1, 1].set_title("Wind speed distribution")

sns.despine()
plt.tight_layout()
plt.show()

3. Correlation Analysis¶

To assess relationships between variables, a heatmap of the correlation matrix was generated. Although most correlations are moderate, two notable observations include:¶

  • Temperature: Positively correlated with bike rentals (0.39), suggesting that demand tends to be higher in warmer conditions.

  • Humidity: Negatively correlated (-0.46), indicating that overly humid days might deter rentals.

In [6]:
plt.figure(figsize=(8,4))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='Blues')
plt.show()
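
For readers unfamiliar with the numbers in the heatmap: by default `DataFrame.corr` computes Pearson's correlation coefficient. The sketch below recomputes it by hand on a small made-up sample (the values are hypothetical, not taken from the dataset) to show what the coefficient measures.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly observations, not taken from the project's dataset
toy = pd.DataFrame({
    "temperature": [5.0, 8.0, 12.0, 15.0, 20.0, 24.0],
    "count": [120, 150, 260, 300, 410, 380],
})

# Deviations of each variable from its own mean
x = toy["temperature"] - toy["temperature"].mean()
y = toy["count"] - toy["count"].mean()

# Pearson's r: co-variation divided by the product of the individual spreads
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["temperature"].corr(toy["count"])
print(round(r_manual, 4), round(r_pandas, 4))
```

The coefficient only captures linear association, so a value such as 0.39 indicates a tendency, not a cause.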

4. Impact of Weather and Season on Rentals¶

Weather Impact:¶

A bar plot of total rentals per weather category shows clear differences between conditions. Clear skies account for the highest total, while adverse weather categories show lower totals. Because these are totals, they also reflect how often each weather type occurs.

In [7]:
weather_df = df.groupby('weather')['count'].sum().reset_index().sort_values(by='count', ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(data=weather_df, x='count', y='weather', hue='weather')
plt.title("Bikes rented per weather category")
plt.xlabel("Bikes rented (in millions)")
plt.ylabel("Weather")
plt.show()
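
Totals per category mix two effects: how many bikes are rented in a given hour and how many hours of that weather occurred. One way to separate them, sketched here on made-up records (the `toy` frame and its values are hypothetical), is to report the mean per hour alongside the total:

```python
import pandas as pd

# Hypothetical hourly records: clear hours are simply more frequent here
toy = pd.DataFrame({
    "weather": ["Clear"] * 6 + ["Rain"] * 2,
    "count": [100, 110, 90, 100, 105, 95, 80, 120],
})

summary = (
    toy.groupby("weather")["count"]
    .agg(total="sum", hours="size", mean_per_hour="mean")
    .sort_values("total", ascending=False)
)
print(summary)
```

In this toy example the totals differ by a factor of three even though the hourly mean is the same for both categories, which is why the mean is the safer basis for claims about the effect of weather.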

Seasonal Trends:¶

Grouping by season shows that summer records substantially more rentals than the other seasons, in line with expectations given London's warmer weather at that time of year.

In [8]:
season_df = df.groupby('season')['count'].sum().reset_index().sort_values(by='count', ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(data=season_df, x='season', y='count', hue='season', hue_order=['Winter', 'Fall', 'Spring', 'Summer'], palette='coolwarm')
plt.title("Bike rentals per season")
plt.xlabel("Season")
plt.ylabel("Bike rentals (in millions)")
plt.show()

5. Time Series Analysis¶

Monthly Trends:¶

When analyzing bike rentals by month, a clear seasonal pattern emerges:¶

  • May to October: Rentals surge, peaking in July, likely due to warmer, more favorable conditions for cycling.
In [9]:
df['month'] = df['date_time'].dt.month  # month number (1-12)
df['hour'] = df['date_time'].dt.hour    # hour of day (0-23)

# Keep only the calendar date (timestamp at midnight) and rename the column
df['date_time'] = df['date_time'].dt.normalize()
df = df.rename(columns={'date_time': 'date'})

df_month = df.groupby('month')['count'].sum().reset_index().sort_values(by='month')
In [10]:
plt.figure(figsize=(8, 5))
sns.barplot(data=df_month, x='month', y='count', hue='month', legend=False)
plt.title('Bike rentals per month (in millions)')
plt.xlabel("Month")
plt.ylabel("Bikes rented (in millions)")
plt.show()

Weekday vs. Weekend Bike Rental Behavior:¶

The two plots below compare hourly bike rentals on weekdays and on weekends. The comparison is relevant because it can help a bike rental company plan its strategy and maximize profit.

  • Weekday vs. Weekend:

    • Weekdays: Peak rental times likely coincide with commute hours.

    • Weekends: Rentals are more evenly distributed, indicating leisure usage.

In [11]:
df_weekday = df[df['weekend'] == 0]
df_weekend = df[df['weekend'] == 1]

df_weekday_hour = df_weekday.groupby('hour')['count'].sum().reset_index()
df_weekend_hour = df_weekend.groupby('hour')['count'].sum().reset_index()

fig, axes = plt.subplots(1, 2, figsize=(18, 6))

sns.barplot(data=df_weekday_hour, x='hour', y='count', hue='hour', legend=False, ax=axes[0])
axes[0].set_title("Bike rentals per hour on WEEK DAYS")
axes[0].set_xlabel("Hour")
axes[0].set_ylabel("Bikes rented (in millions)")

sns.barplot(data=df_weekend_hour, x='hour', y='count', hue='hour', legend=False, ax=axes[1])
axes[1].set_title("Bike rentals per hour on WEEKENDS")
axes[1].set_xlabel("Hour")
axes[1].set_ylabel("Bikes rented (in millions)")

plt.show()
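
Weekday totals cover five days and weekend totals only two, so the raw bar heights are not directly comparable. A possible refinement, shown as a sketch on made-up numbers (the `toy` frame is hypothetical), is to convert each profile into shares of its own total so that only the shape of the day is compared:

```python
import pandas as pd

# Hypothetical hourly totals for three hours of the day
toy = pd.DataFrame({
    "weekend": [0, 0, 0, 1, 1, 1],
    "hour": [8, 13, 18, 8, 13, 18],
    "count": [400, 200, 400, 100, 300, 100],
})

totals = toy.groupby(["weekend", "hour"])["count"].sum()

# Divide each hour by its group's total so that each profile sums to 1
share = totals / totals.groupby(level="weekend").transform("sum")
print(share.unstack("weekend").round(2))
```

With shares, two commute peaks on weekdays and a single midday peak on weekends stand out regardless of how many days fall into each group.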

6. Forecasting Future Bike Rentals with Prophet¶

Model Setup:¶

For forecasting, we employed Facebook's Prophet, a robust tool for time series predictions. The dataset was first aggregated by date, and then split into training (80%) and testing (20%) sets.¶

Data Preparation:¶
In [12]:
df_prophet = df[['date', 'count']].rename(columns={'date':'ds', 'count':'y'})
df_prophet = df_prophet.groupby('ds')['y'].sum().reset_index()

# Chronological split: the first 80% of days (584 out of 730 total) for training
df_train = df_prophet.iloc[:584].copy()

# The remaining 20% for testing
df_test = df_prophet.iloc[584:].copy()
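
The cutoff can also be derived from the length of the data instead of being typed in by hand. The helper below is a sketch, not part of the original notebook: `chronological_split` is a hypothetical name, and the toy series only mimics the `ds`/`y` layout that Prophet expects.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split a date-sorted frame into earlier (train) and later (test) rows."""
    cutoff = int(len(frame) * train_fraction)
    return frame.iloc[:cutoff].copy(), frame.iloc[cutoff:].copy()

# Hypothetical daily series with 730 rows, mirroring the Prophet column names
toy = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=730, freq="D"),
    "y": range(730),
})

train, test = chronological_split(toy)
print(len(train), len(test))
```

Splitting by position rather than at random keeps every test day later than every training day, which is what a forecasting evaluation needs.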

Initial Forecasting and Evaluation:¶

A Prophet model with yearly seasonality was trained on the training set. The model forecasted 150 days into the future.¶

Performance:¶

The initial model yielded an RMSE of approximately 6,140. Because RMSE squares each error before averaging, large misses weigh more heavily, so this is best read as a typical daily error of roughly 6,100 rentals rather than a simple average miss.

In [13]:
import numpy as np
from prophet import Prophet

np.random.seed(4482)

model1 = Prophet(yearly_seasonality=True)

# Training the model
model1.fit(df_train)

# Creating a dataframe for future predictions
future = model1.make_future_dataframe(periods=150, freq='D')

# Making the prediction
prediction = model1.predict(future)
12:00:13 - cmdstanpy - INFO - Chain [1] start processing
12:00:13 - cmdstanpy - INFO - Chain [1] done processing
In [14]:
from sklearn.metrics import mean_squared_error

df_prediction = prediction[['ds', 'yhat']]
df_comparation = pd.merge(df_prediction, df_test, on='ds')

mse = mean_squared_error(df_comparation['y'], df_comparation['yhat'])
rmse = np.sqrt(mse)

print(f'MSE:{mse}, RMSE:{rmse}')
MSE:37755631.75189984, RMSE:6144.561152100273
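
To make the RMSE easier to interpret, it can be compared with the mean absolute error (MAE). The following sketch uses made-up numbers, not the project's forecasts, to show how a single large miss affects the two metrics differently:

```python
import numpy as np

# Hypothetical daily actuals and forecasts (not the project's results)
actual = np.array([100.0, 100.0, 100.0, 100.0])
forecast = np.array([100.0, 100.0, 100.0, 60.0])

errors = actual - forecast
mae = np.mean(np.abs(errors))         # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))  # squares errors first, so big misses dominate
print(mae, rmse)
```

Here one miss of 40 gives an MAE of 10 but an RMSE of 20. A large gap between the two on real data is a hint that a few days with unusual rental counts drive much of the error.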

Improving Model Accuracy by Treating Outliers:¶

Outliers were then addressed to improve forecast accuracy. After removing days with abnormal rental counts, the model was retrained:¶

Improved Performance:¶

The refined model's RMSE dropped to approximately 3,840. Part of this gain comes from the evaluation itself: the outlier days were removed from the test period as well, so the two RMSE values are not computed on the same set of days.

In [15]:
np.random.seed(4482)

model2 = Prophet()
model2.fit(df_prophet)
future = model2.make_future_dataframe(periods=0)
prediction = model2.predict(future)
12:00:14 - cmdstanpy - INFO - Chain [1] start processing
12:00:14 - cmdstanpy - INFO - Chain [1] done processing
In [16]:
df_no_outliers = df_prophet[(df_prophet['y'] > prediction['yhat_lower']) & (df_prophet['y'] < prediction['yhat_upper'])]
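
The filter above compares two frames row by row, which works as long as both share the same index and date order. A more defensive variant, sketched here on made-up data (the frames and values are hypothetical, and no Prophet model is fitted), aligns them explicitly on `ds` before comparing:

```python
import pandas as pd

# Hypothetical observations and interval bounds; in the project the bounds
# would come from the yhat_lower / yhat_upper columns of a Prophet forecast
observed = pd.DataFrame({
    "ds": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "y": [100, 500, 110],
})
bounds = pd.DataFrame({
    "ds": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-02"]),  # shuffled on purpose
    "yhat_lower": [80, 80, 85],
    "yhat_upper": [130, 120, 125],
})

# Join on the date so each observation meets its own interval
merged = observed.merge(bounds, on="ds", how="inner")
inside = (merged["y"] > merged["yhat_lower"]) & (merged["y"] < merged["yhat_upper"])
no_outliers = merged.loc[inside, ["ds", "y"]].reset_index(drop=True)
print(no_outliers)
```

The day with 500 rentals falls outside its interval and is dropped, even though the bounds frame is in a different row order.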

With the outliers removed, we split the data into training and testing sets again and refit the model.

In [17]:
df_train = pd.DataFrame()

df_train['ds'] = df_no_outliers['ds'][:505]
df_train['y'] = df_no_outliers['y'][:505]

df_test = pd.DataFrame()

df_test['ds'] = df_no_outliers['ds'][505:]
df_test['y'] = df_no_outliers['y'][505:]
In [18]:
np.random.seed(4482)

model3 = Prophet(yearly_seasonality=True)
model3.fit(df_train)
future = model3.make_future_dataframe(periods=365, freq='D')
prediction = model3.predict(future)
12:00:14 - cmdstanpy - INFO - Chain [1] start processing
12:00:14 - cmdstanpy - INFO - Chain [1] done processing

Data Visualization¶

Prophet's built-in plot method (which draws with Matplotlib) was used to visualize the model's forecast of future bike rentals. The red dots on the graph are the actual rental counts from the test period, which fall close to the model's prediction.

In [27]:
fig1 = model3.plot(prediction)
plt.title("Prophet Model prediction of future bike rentals")
plt.xlabel("Date")
plt.ylabel("Bikes rented")
plt.plot(df_test['ds'], df_test['y'], '.r')
plt.show()

Refined model’s RMSE¶

In [20]:
df_pred = prediction[['ds', 'yhat']]
df_compar = pd.merge(df_pred, df_test, on='ds')

mse = mean_squared_error(df_compar['y'], df_compar['yhat'])
rmse = np.sqrt(mse)

print(f'MSE: {mse}, RMSE: {rmse}')
MSE: 14747779.651404256, RMSE: 3840.2837982894252

7. Conclusion and Future Work¶

This project demonstrates a complete end-to-end pipeline:¶

  • Data Preparation: Cleaned and preprocessed the dataset.

  • Exploratory Analysis: Uncovered key patterns such as seasonal trends, the influence of weather, and differences in weekday versus weekend behavior.

  • Forecasting: Employed Prophet for time series forecasting, and refined the model by addressing outliers.

Key Takeaways:¶

  • Bike rentals in London peak during the summer, with distinct patterns on weekdays versus weekends.

  • Weather and environmental conditions significantly influence rental volumes.

  • Addressing data anomalies can substantially enhance forecasting accuracy.

Future Enhancements:¶

  • Incorporate additional external factors (e.g., special events or public transportation changes) to further refine predictions.

  • Explore advanced machine learning models and ensemble methods for even greater accuracy.

Thank you so much for reading!

To contact me: estefanotuyama@gmail.com