In this study, we examine vehicle sales from 2002 to 2023. This dataset reveals the trends and seasonal patterns of vehicle sales over the period, helping us better understand the dynamics of the automotive market and make informed forecasts for the coming years of the vehicle sales business.
First, we merged the "Year" and "Month" columns of the original dataset into a single "Date" column to serve as the index of the time series, and then converted the data frame to a tsibble so we could perform our analysis. Our study is based on the total number of new vehicles sold over the years.
library(readr)
library(fpp3)   # loads tsibble, feasts, fable, fabletools, dplyr, ggplot2
library(urca)   # for the ur.df() unit-root test used below

vehicle_sales <- read_csv("C:/Users/migue/OneDrive/Ambiente de Trabalho/Forecasting_Project/vehicles_sales_dataset.csv")

# Convert the month name to a number because format() was not recognizing
# the abbreviations of some months (February, April, etc.)
vehicle_sales$MonthNum <- match(tolower(vehicle_sales$Month), tolower(month.abb))

# Create the Date column from the numeric month
vehicle_sales$Date <- as.Date(paste(vehicle_sales$Year, vehicle_sales$MonthNum, "01", sep = "-"),
                              format = "%Y-%m-%d")

vehicle_sales <- vehicle_sales %>%
  mutate(date = yearmonth(Date)) %>%
  select(date, New, Used, `Total Sales New`, `Total Sales Used`) %>%
  as_tsibble(index = date)
We began by analyzing our dependent variable (`Total Sales New`) over time.
vehicle_sales %>%
  autoplot(`Total Sales New`) +
  labs(title = "New Vehicle Sales (2002–2023)", y = "Total Sales")
From 2002 to 2010 there is a noticeable decreasing trend, followed by an increase until 2023. Additionally, the variance is not constant and the data exhibit clear seasonality. Let's try some transformations, starting with the square root:
vehicle_sales %>%
  autoplot(sqrt(`Total Sales New`))
It's not stationary yet. Although the variance remained fairly constant under the square-root transformation, a trend in the total sales of new vehicles is still visible. Let's try a Box-Cox transformation:
# Estimate lambda with the Guerrero method
lambda <- vehicle_sales %>%
  features(`Total Sales New`, features = guerrero) %>%
  pull(lambda_guerrero)

vehicle_sales <- vehicle_sales %>%
  mutate(BoxCox_Sales = box_cox(`Total Sales New`, lambda))

autoplot(vehicle_sales, BoxCox_Sales) +
  labs(title = "Box-Cox Transformed Series")
The trend persists. So let's find an order of differencing that yields stationary data. We first take the square root of the "Total Sales New" column to stabilize the variance. Then, since the data are monthly, we apply a seasonal difference with a lag of 12 months to remove the annual seasonal pattern. Finally, we apply a first-order difference (lag = 1) to remove the trend.
vehicle_sales %>%
  autoplot(difference(difference(sqrt(`Total Sales New`), 12), 1)) +
  labs(title = "Transformed Series with Differencing")
Now the process looks stationary. Let's confirm with an augmented Dickey-Fuller test:
summary(ur.df(na.omit(difference(difference(sqrt(vehicle_sales$`Total Sales New`), 12), 1)),
              type = 'none', selectlags = 'AIC'))
##
## ###############################################
## # Augmented Dickey-Fuller Test Unit Root Test #
## ###############################################
##
## Test regression none
##
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8533.8 -867.0 65.2 895.5 7981.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -1.40810 0.09098 -15.477 < 2e-16 ***
## z.diff.lag 0.27136 0.06095 4.452 1.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1819 on 247 degrees of freedom
## Multiple R-squared: 0.5867, Adjusted R-squared: 0.5834
## F-statistic: 175.3 on 2 and 247 DF, p-value: < 2.2e-16
##
##
## Value of test-statistic is: -15.4773
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.58 -1.95 -1.62
We reject the null hypothesis of a unit root, since the test statistic (-15.48) is smaller than the critical values at every significance level (-2.58, -1.95, -1.62). This means no further differencing is needed: the series is stationary after these transformations.
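As an optional cross-check (not part of the original pipeline), the unit-root features from the feasts package can suggest the number of seasonal and first differences directly. A minimal sketch, assuming the `vehicle_sales` tsibble defined above:

```r
# Suggested number of seasonal differences for the sqrt-transformed series
vehicle_sales %>%
  features(sqrt(`Total Sales New`), unitroot_nsdiffs)

# After seasonal differencing, suggested number of first differences (KPSS-based)
vehicle_sales %>%
  features(difference(sqrt(`Total Sales New`), 12), unitroot_ndiffs)
```

If both features agree with D = 1 and d = 1, that corroborates the differencing order chosen above.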
vehicle_sales %>%
  gg_tsdisplay(difference(difference(sqrt(`Total Sales New`), 12), 1),
               plot_type = "partial", lag_max = 36)
After examining the ACF and PACF plots, we observed a seasonal MA(1) component (a clear spike at lag 12), as well as potentially significant spikes at lags 2, 7, and 9.
fit1 <- vehicle_sales %>%
  model(
    arima011 = ARIMA(sqrt(`Total Sales New`) ~ pdq(0, 1, 1) + PDQ(0, 1, 2)),
    arima012 = ARIMA(sqrt(`Total Sales New`) ~ pdq(0, 1, 2) + PDQ(0, 1, 2)),
    arima212 = ARIMA(sqrt(`Total Sales New`) ~ pdq(2, 1, 2) + PDQ(0, 1, 2)),
    arima112 = ARIMA(sqrt(`Total Sales New`) ~ pdq(1, 1, 2) + PDQ(0, 1, 2)),
    arima111 = ARIMA(sqrt(`Total Sales New`) ~ pdq(1, 1, 1) + PDQ(0, 1, 2)),
    arima211 = ARIMA(sqrt(`Total Sales New`) ~ pdq(2, 1, 1) + PDQ(0, 1, 2)),
    auto     = ARIMA(sqrt(`Total Sales New`))
  )
With this information, we fitted these candidate models and compared them using information criteria:
# Check the best models:
fit1 %>%
  glance() %>%
  arrange(AICc) %>%
  select(.model:BIC)
## # A tibble: 7 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima212 1913052. -2176. 4367. 4367. 4391.
## 2 arima112 1926857. -2178. 4367. 4368. 4388.
## 3 arima012 1972539. -2180. 4371. 4371. 4389.
## 4 arima111 1998870. -2182. 4375. 4375. 4392.
## 5 arima211 2002951. -2182. 4376. 4376. 4397.
## 6 arima011 2103118. -2189. 4387. 4387. 4401.
## 7 auto 2160009. -2291. 4597. 4597. 4622.
The results indicate that the best model was ARIMA(2,1,2)(0,1,2)[12], with ARIMA(1,1,2)(0,1,2)[12] second and ARIMA(0,1,2)(0,1,2)[12] third. The criterion with the greatest weight in choosing the best models was the corrected Akaike Information Criterion (AICc), because of the way it balances model fit (log_lik) against a penalty for complexity.
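To make the AICc comparison more interpretable, one optional step (not part of the original analysis) is to convert the AICc differences into Akaike weights, i.e. the relative evidence for each model within the candidate set. A sketch, assuming the `fit1` mable fitted above:

```r
fit1 %>%
  glance() %>%
  mutate(delta  = AICc - min(AICc),                          # AICc distance to the best model
         weight = exp(-delta / 2) / sum(exp(-delta / 2))) %>% # Akaike weight
  arrange(delta) %>%
  select(.model, AICc, delta, weight)
```

Models with weights close together (as arima212 and arima112 here) have similar empirical support, which is why we also inspect residuals and out-of-sample accuracy before choosing.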
To assess the adequacy of the candidate models, we analyze the residual diagnostics of the three best-performing models.
# Checking the residuals of the three best models
fit1 %>%
  select(arima212) %>%
  gg_tsresiduals(lag_max = 36)

fit1 %>%
  select(arima112) %>%
  gg_tsresiduals(lag_max = 36)

fit1 %>%
  select(arima012) %>%
  gg_tsresiduals(lag_max = 36)
We noticed that some lags of the residual ACF are significantly different from zero, and this happens in all three models. It suggests the models may not fully capture all the underlying patterns in the data: ideally, the residual ACF should be close to 0 at every lag. We proceed with a Ljung-Box test to verify statistically whether the residuals are uncorrelated (the null hypothesis):
fit1 %>%
  select(arima212) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 12, dof = 6)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima212 14.3 0.0263
fit1 %>%
  select(arima112) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 12, dof = 5)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima112 15.0 0.0366
fit1 %>%
  select(arima012) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 12, dof = 4)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima012 24.3 0.00202
Since the p-value of every model is below the commonly used 0.05 significance level, we reject the null hypothesis: there is significant autocorrelation left in the residuals up to lag 12. After trying several transformations of the data and exploring the space of SARIMA specifications, we conclude that, given the dataset's constraints and the analytical tools at our disposal, the chosen model represents the best compromise between complexity and predictive performance. While it is not perfect, the selection reflects a careful and thorough approach to time series analysis, and acknowledging the risk of p-hacking (repeatedly re-specifying the model until the diagnostics pass) keeps our methodology transparent and the reported results honest.
We first visualized 12-month-ahead forecasts from each of the three candidate models fitted on the full series:
fit1 %>%
  select(arima212) %>%
  forecast(h = 12) %>%
  autoplot(vehicle_sales)

fit1 %>%
  select(arima112) %>%
  forecast(h = 12) %>%
  autoplot(vehicle_sales)

fit1 %>%
  select(arima012) %>%
  forecast(h = 12) %>%
  autoplot(vehicle_sales)
The forecasts of the three models are very similar to each other. To assess their predictive ability and ensure they generalize well to unseen data, we split the dataset into a training set (2002–2021) and a test set (2022–2023).
# Divide the dataset into train (2002–2021) and test (2022–2023)
vehicle_sales_train <- vehicle_sales[1:(nrow(vehicle_sales) - 24), ]
vehicle_sales_test  <- vehicle_sales[(nrow(vehicle_sales) - 23):nrow(vehicle_sales), ]

# Refit the three candidate models on the training set only
vehicle_sales_fit <- vehicle_sales_train %>%
  model(
    arima212 = ARIMA(sqrt(`Total Sales New`) ~ pdq(2, 1, 2) + PDQ(0, 1, 2)),
    arima112 = ARIMA(sqrt(`Total Sales New`) ~ pdq(1, 1, 2) + PDQ(0, 1, 2)),
    arima012 = ARIMA(sqrt(`Total Sales New`) ~ pdq(0, 1, 2) + PDQ(0, 1, 2))
  )

# Forecast over the test period
vehicle_sales_fc <- vehicle_sales_fit %>% forecast(vehicle_sales_test)
vehicle_sales_fc %>%
  autoplot(bind_rows(vehicle_sales_train, vehicle_sales_test), level = NULL) +
  labs(y = "Total Sales", title = "Forecast: New Vehicle Sales") +
  guides(colour = guide_legend(title = "Model"))
Comparing the accuracy measures:
accuracy(vehicle_sales_fc, vehicle_sales)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima012 Test 55511360. 113859136. 99835228. 4.48 9.20 1.12 0.844 0.333
## 2 arima112 Test 62314989. 115652919. 99518967. 5.13 9.13 1.11 0.857 0.324
## 3 arima212 Test 51090406. 111706423. 98056592. 4.06 9.05 1.10 0.828 0.342
The ARIMA(2,1,2)(0,1,2)[12] model had the best score on every accuracy measure, so it is clearly the most suitable choice. This is our final model:
fit1 %>%
  select(arima212) %>%
  report()
## Series: Total Sales New
## Model: ARIMA(2,1,2)(0,1,2)[12]
## Transformation: sqrt(`Total Sales New`)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1 sma2
## -0.3882 0.2202 0.1049 -0.6845 -0.7830 -0.0691
## s.e. 0.1327 0.1407 0.1056 0.1174 0.0707 0.0678
##
## sigma^2 estimated as 1913052: log likelihood=-2176.34
## AIC=4366.67 AICc=4367.13 BIC=4391.35
In conclusion, this project's exploration of forecasting methods yielded valuable insights into the predictive modeling of new vehicle sales. Through experimentation with several data transformations and SARIMA configurations, we developed a set of models that balance complexity and prediction accuracy. While acknowledging the inherent limitations of any forecasting effort, we aimed for transparency and rigor throughout the analysis, avoiding the traps of p-hacking and checking the robustness of the reported results. Based on the forecasts of our best-performing model, we anticipate a gradual increase in new vehicle sales over the next year, subject to the typical fluctuations of the automotive market. Despite the inherent uncertainty, the chosen model provides useful guidance for decision-makers navigating this dynamic landscape, and the attention to methodological detail and emphasis on empirical validation should offer a solid foundation for strategic planning in the automotive industry.
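As a closing sketch (an optional extension, not part of the analysis above), the coming year's point forecasts and 80%/95% prediction intervals for the chosen model can be tabulated on the original sales scale with `hilo()`; fable back-transforms from the square-root scale automatically. This assumes the `fit1` mable defined earlier:

```r
fit1 %>%
  select(arima212) %>%
  forecast(h = 12) %>%          # 12 months ahead, back-transformed to the sales scale
  hilo(level = c(80, 95)) %>%   # extract prediction intervals from the forecast distribution
  select(date, .mean, `80%`, `95%`)
```

The widening intervals at longer horizons quantify the "inherent uncertainty" noted in the conclusion.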