Machine Learning Modeling
Statistical hypothesis testing: ANOVA
Here is one result of our ANOVA test.
Question: does personal consumption vary with the representative movie category of each month?
From the result, the p-value is 0.024, well below the significance level of 0.05 (let alone 0.10), so we reject the null hypothesis H0 and conclude that personal consumption does vary with the monthly representative movie category. It is therefore statistically justified to include the personal-consumption attribute in the machine learning model, where it should help build a better model.
ANOVA result list:

| Variable | P-value | Decision |
|---|---|---|
| Real_Disposable_Personal_Income | 0.032421 | Reject H0; put into model |
| Personal_Consumption_Expenditure | 0.024296 | Reject H0; put into model |
| Treasury_Constant_Maturity_Rate_10_Year | 0.063129 | Marginal: reject H0 only at the 0.10 level, but still put into model since the p-value is not high |
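For reference, a one-way ANOVA like this can be run with scipy.stats.f_oneway. This is a minimal sketch, assuming a hypothetical monthly table in a file named monthly_data.csv with a genre column and a Personal_Consumption_Expenditure column (the file and column names are illustrative, not taken from our actual pipeline):

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical monthly table: the representative movie genre per month
# plus the economic indicator (file and column names are illustrative).
df = pd.read_csv("monthly_data.csv")

# Group the monthly consumption values by representative genre.
groups = [
    g["Personal_Consumption_Expenditure"].values
    for _, g in df.groupby("genre")
]

# One-way ANOVA: H0 says all genre groups share the same mean consumption.
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.4f}, p = {p_value:.6f}")

# A p-value below the 0.05 significance level lets us reject H0
# and keep the variable as a model feature.
```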
Model building:
First Trial
More variables, less training data
Using 10-fold cross-validation to split the data, the final accuracy of each model on the held-out test folds is shown below (a sketch of the evaluation loop follows the table):
| Models | Accuracy |
|---|---|
| KNN | 0.21 |
| CART | 0.16 |
| Naïve Bayes | 0.14 |
| SVM | 0.14 |
| Random forest | 0.22 |
| Logistic Regression | 0.25 |
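As referenced above, here is a minimal sketch of the 10-fold evaluation, assuming the hypothetical monthly_data.csv from the ANOVA sketch with feature columns per_consumption, per_income, treasury_ma and a 16-class genre label (all names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical prepared dataset: economic features plus the genre label.
df = pd.read_csv("monthly_data.csv")
X = df[["per_consumption", "per_income", "treasury_ma"]].values
y = df["genre"].values

# The six candidate models compared in the table above.
models = {
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Random forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# 10-fold cross-validation: train on 9 folds, score on the held-out
# fold, and average the 10 accuracy scores per model.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.2f}")
```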
However, the accuracy is not as good as we expected, and we suspect the cause is a lack of training data. Random forest should generally achieve higher accuracy than the other models, yet here it does not reach its potential, which points to a data shortage. (Our data is at a monthly level, so the 2003-2011 window yields only 106 observations.)
Since our data comes from monthly economic sources, the only way to increase the sample size is to broaden the time horizon from 2003-2011 to 1980-2011. This comes at the sacrifice of some valuable variables whose history does not reach back to 1980, such as retail_food_sale and the stock-related change variables, since some of the relevant companies did not go public until 1995 or 2000.
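A minimal sketch of this trade-off in pandas, assuming the hypothetical monthly_data.csv is indexed by month (the short-history column names are illustrative):

```python
import pandas as pd

# Hypothetical master table indexed by month.
df = pd.read_csv("monthly_data.csv", parse_dates=["month"], index_col="month")

# First trial: every variable, but only the 2003-2011 window (~106 rows).
trial1 = df.loc["2003":"2011"]

# Second trial: drop the series whose history starts after 1980
# (illustrative names), then extend the window back to 1980 (~382 rows).
short_history = ["retail_food_sale", "stock_related_change"]
trial2 = df.loc["1980":"2011"].drop(columns=short_history)
```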
Second Trial
Fewer variables, more training data
After broadening the time range of the data, we increase the number of monthly observations from 106 to 382, roughly 3.6 times the original (an increase of about 260%).
| Models | Accuracy |
|---|---|
| KNN | 0.27 |
| CART | 0.24 |
| Naïve Bayes | 0.27 |
| SVM | 0.25 |
| Random forest | 0.30 |
| Logistic Regression | 0.27 |
Our random forest model's accuracy has risen to 0.30, a big improvement over the baseline accuracy of our 16-class classification model, which is 1/16 = 6.25% for a uniform random guess.
Now let's look at the results of the classification models.
For simplicity, we extract only the parameters of the LinearSVC and multi-class logistic regression models (a sketch of the extraction follows).
LogisticRegression:
[per_consumption, per_income, treasury_ma]
[ 6.78860341e-03, -3.75306464e-02, -1.29090439e-01],
[ 4.13449855e-03, 4.93000987e-02, 3.92859687e-01],
[ 8.20688441e-03, 5.03760721e-02, 1.36663719e-01],
[ 4.44501935e-03, -4.75412673e-03, -7.12468355e-02],
[-3.70924132e-03, -1.86979582e-02, -9.30464484e-01],
[-2.53319913e-02, -4.15040247e-04, 2.28821694e-02],
[-1.47740621e-02, -1.04342313e-02, 1.96687223e-01],
[ 5.98540638e-03, 9.87903021e-04, -3.18856167e-02],
[ 1.72448928e-02, 4.78285854e-03, 3.20696107e-02],
[ 7.38588017e-03, -1.18712487e-02, 4.47611059e-02],
[ 2.96358700e-04, -7.35685295e-03, 7.94433142e-02],
[-1.60591906e-02, -2.50480637e-03, 2.21374320e-01],
[ 2.43973745e-02, -2.02009747e-02, 1.00888489e-01],
[-1.22466181e-02, 1.12729158e-02, -8.13903349e-02],
[-5.56754146e-03, -4.11944937e-03, 2.94500025e-02],
[-1.19627331e-03, 1.16548686e-03, -1.30019313e-02]
LinearSVC:
[per_consumption, per_income, treasury_ma]
[ 0.02966774, -0.14471867, -0.1282059 ],
[-0.00527054, 0.18164961, 0.4827523 ],
[ 0.02486671, 0.19207659, 0.25187282],
[ 0.01620719, -0.01925257, -0.07194419],
[ 0.02076962, -0.05778987, -1.23964571],
[-0.10527764, -0.00393354, 0.03419578],
[-0.07094835, -0.04614407, 0.22302254],
[ 0.02035817, 0.00216132, -0.03231524],
[ 0.0618008 , 0.01550899, 0.03794849],
[ 0.02249799, -0.04897322, 0.05007795],
[-0.00676591, -0.03216929, 0.08877703],
[-0.07750835, -0.01590667, 0.25169932],
[ 0.08732982, -0.08230996, 0.11911246],
[-0.04832314, 0.04340531, -0.07246604],
[-0.02776916, -0.01860495, 0.03311367],
[-0.00873979, 0.00258575, -0.01250391]
genre:
['Action', 'Adventure', 'Comedy', 'Crime', 'Drama', 'Family',
'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance',
'Science Fiction', 'Thriller', 'War', 'Western']
Extracted Formula:
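Under the standard linear-model reading of the coefficient matrices above, the decision rule would take the following form (the intercept terms $b_g$ are not shown in the printouts, so their presence is an assumption):

$$\text{score}_g(x) = w_{g,1}\,x_{\text{per\_consumption}} + w_{g,2}\,x_{\text{per\_income}} + w_{g,3}\,x_{\text{treasury\_ma}} + b_g$$

$$\hat{g} = \arg\max_{g \,\in\, \{\text{Action},\,\ldots,\,\text{Western}\}} \text{score}_g(x)$$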
The results from Logistic Regression and LinearSVC are consistently similar.
When analyzing the parameters of the classification model, the weights are what matter: because every variable is expressed on the same numerical range (-100% to 100%), the coefficient magnitudes are directly comparable across variables, and a larger absolute weight means a stronger influence on that genre's score (a small ranking sketch follows).
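Since the features share a scale, one simple way to compare their overall influence is the mean absolute weight per column, reusing clf_lr and features from the extraction sketch above:

```python
import numpy as np

# Mean absolute coefficient per feature, across all 16 genre rows.
# Because the features share the same (-100%, 100%) range, a larger
# value indicates a stronger overall influence on the genre scores.
mean_abs = np.abs(clf_lr.coef_).mean(axis=0)
for name, weight in zip(features, mean_abs):
    print(f"{name}: {weight:.4f}")
```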