Association Rule Mining - Data preparation

Our topic is stocks, and stock analysis rarely comes with transaction data, so I took a different approach.

I sampled 20 people, each with their own choice of stocks, and treated each person's picks as one transaction.
I loaded these 20 transactions into R; once loaded, they look as follows.

##      items                   
## [1]  {Apple,NIO,Tesla}       
## [2]  {Doyu,Huya,NIO,Tesla}   
## [3]  {Apple,Doyu,Huya,Nvidia}
## [4]  {Doyu,Huya,Iqiyi}       
## [5]  {Huya,Iqiyi}            
## [6]  {Alibaba,Tesla,Xpev}    
## [7]  {Bilibili,Huya,Iqiyi}   
## [8]  {Bilibili,Huya,Iqiyi}   
## [9]  {Apple,Jd,Pdd,Tesla}    
## [10] {Apple,NIO,Tesla,Xpev}  
## [11] {JD,PDD}                
## [12] {Alibaba,NIO,Xpev}      
## [13] {NIO,Tsla,Xpev}         
## [14] {Amd,Apple,Nvdia}       
## [15] {Amd,Nvdia,Xpev}        
## [16] {Jd,NIO,Xpev}           
## [17] {Apple,Bilibili,Jd}     
## [18] {Bilibili,Doyu,Huya}    
## [19] {AMD,Apple,Tesla}       
## [20] {Nvidia,Tesla,Xpev}
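
One thing worth checking before mining: the listing above contains spelling variants of what look like the same stock (Jd/JD, Pdd/PDD, Tesla/Tsla, Nvidia/Nvdia, Amd/AMD), and each distinct label is treated as a separate item. Below is a minimal sketch of label normalization in base R. The `canonical` mapping and the `normalize_label` helper are hypothetical and not part of the original analysis, and the choice of target spellings is an assumption.

```r
# Labels exactly as they appear in the listing above, including spelling variants.
labels_seen <- c("Jd", "JD", "Pdd", "PDD", "Tesla", "Tsla",
                 "Nvidia", "Nvdia", "Amd", "AMD")

# Hypothetical canonical spellings, keyed by the lower-cased label (an assumption).
canonical <- c(jd = "JD", pdd = "PDD", tsla = "Tesla", tesla = "Tesla",
               nvdia = "Nvidia", nvidia = "Nvidia", amd = "AMD")

normalize_label <- function(x) {
  key <- tolower(trimws(x))
  out <- unname(canonical[key])
  ifelse(is.na(out), x, out)  # labels without a mapping are left unchanged
}

normalized <- normalize_label(labels_seen)
print(unique(normalized))
```

Applying a mapping like this before building the transactions would change the item counts and therefore the rules, so the tables in this report reflect the labels as originally spelled.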

Association Rule Mining - create rules and reorder

After loading the data into RStudio, I generate the association rules with the code below.

library(arules)  # attach arules so that sort() and inspect() work on rules objects later
stock_rule <- apriori(stocks, parameter = list(support = 0.1, confidence = 0.5, minlen = 2))

Running the code yields the following rules.

##      lhs                 rhs        support confidence coverage lift      count
## [1]  {Nvdia}          => {Amd}      0.10    1.0000000  0.10     10.000000 2    
## [2]  {Amd}            => {Nvdia}    0.10    1.0000000  0.10     10.000000 2    
## [3]  {Alibaba}        => {Xpev}     0.10    1.0000000  0.10      2.857143 2    
## [4]  {Jd}             => {Apple}    0.10    0.6666667  0.15      1.904762 2    
## [5]  {Iqiyi}          => {Bilibili} 0.10    0.5000000  0.20      2.500000 2    
## [6]  {Bilibili}       => {Iqiyi}    0.10    0.5000000  0.20      2.500000 2    
## [7]  {Iqiyi}          => {Huya}     0.20    1.0000000  0.20      2.857143 4    
## [8]  {Huya}           => {Iqiyi}    0.20    0.5714286  0.35      2.857143 4    
## [9]  {Bilibili}       => {Huya}     0.15    0.7500000  0.20      2.142857 3    
## [10] {Doyu}           => {Huya}     0.20    1.0000000  0.20      2.857143 4    
## [11] {Huya}           => {Doyu}     0.20    0.5714286  0.35      2.857143 4    
## [12] {NIO}            => {Xpev}     0.20    0.6666667  0.30      1.904762 4    
## [13] {Xpev}           => {NIO}      0.20    0.5714286  0.35      1.904762 4    
## [14] {NIO}            => {Tesla}    0.15    0.5000000  0.30      1.428571 3    
## [15] {Apple}          => {Tesla}    0.20    0.5714286  0.35      1.632653 4    
## [16] {Tesla}          => {Apple}    0.20    0.5714286  0.35      1.632653 4    
## [17] {Bilibili,Iqiyi} => {Huya}     0.10    1.0000000  0.10      2.857143 2    
## [18] {Huya,Iqiyi}     => {Bilibili} 0.10    0.5000000  0.20      2.500000 2    
## [19] {Bilibili,Huya}  => {Iqiyi}    0.10    0.6666667  0.15      3.333333 2    
## [20] {Apple,NIO}      => {Tesla}    0.10    1.0000000  0.10      2.857143 2    
## [21] {NIO,Tesla}      => {Apple}    0.10    0.6666667  0.15      1.904762 2    
## [22] {Apple,Tesla}    => {NIO}      0.10    0.5000000  0.20      1.666667 2
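
To make the columns of this table concrete, the following sketch recomputes them for a single rule directly from the 20 transactions, using only base R. The `baskets` list copies the listing from the data preparation section with labels as spelled there; `contains_all` and `rule_metrics` are hypothetical helper names, not arules functions.

```r
# The 20 transactions, copied from the listing (spelling variants kept as-is).
baskets <- list(
  c("Apple", "NIO", "Tesla"),        c("Doyu", "Huya", "NIO", "Tesla"),
  c("Apple", "Doyu", "Huya", "Nvidia"), c("Doyu", "Huya", "Iqiyi"),
  c("Huya", "Iqiyi"),                c("Alibaba", "Tesla", "Xpev"),
  c("Bilibili", "Huya", "Iqiyi"),    c("Bilibili", "Huya", "Iqiyi"),
  c("Apple", "Jd", "Pdd", "Tesla"),  c("Apple", "NIO", "Tesla", "Xpev"),
  c("JD", "PDD"),                    c("Alibaba", "NIO", "Xpev"),
  c("NIO", "Tsla", "Xpev"),          c("Amd", "Apple", "Nvdia"),
  c("Amd", "Nvdia", "Xpev"),         c("Jd", "NIO", "Xpev"),
  c("Apple", "Bilibili", "Jd"),      c("Bilibili", "Doyu", "Huya"),
  c("AMD", "Apple", "Tesla"),        c("Nvidia", "Tesla", "Xpev")
)

# TRUE for every basket that contains all of the given items.
contains_all <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

rule_metrics <- function(lhs, rhs) {
  n       <- length(baskets)
  has_lhs <- contains_all(lhs)
  has_rhs <- contains_all(rhs)
  both    <- has_lhs & has_rhs
  confidence <- sum(both) / sum(has_lhs)
  c(support    = sum(both) / n,             # share of baskets with lhs and rhs
    confidence = confidence,                # share of lhs baskets that also have rhs
    coverage   = sum(has_lhs) / n,          # share of baskets with lhs
    lift       = confidence / (sum(has_rhs) / n),
    count      = sum(both))
}

print(rule_metrics("Doyu", "Huya"))
```

For {Doyu} => {Huya} this gives a count of 4, support and coverage of 0.20, confidence of 1 and lift of about 2.857, matching row [10] of the table.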

Top 15 rules for confidence

##      lhs                 rhs     support confidence coverage lift      count
## [1]  {Nvdia}          => {Amd}   0.10    1.0000000  0.10     10.000000 2    
## [2]  {Amd}            => {Nvdia} 0.10    1.0000000  0.10     10.000000 2    
## [3]  {Alibaba}        => {Xpev}  0.10    1.0000000  0.10      2.857143 2    
## [4]  {Iqiyi}          => {Huya}  0.20    1.0000000  0.20      2.857143 4    
## [5]  {Doyu}           => {Huya}  0.20    1.0000000  0.20      2.857143 4    
## [6]  {Bilibili,Iqiyi} => {Huya}  0.10    1.0000000  0.10      2.857143 2    
## [7]  {Apple,NIO}      => {Tesla} 0.10    1.0000000  0.10      2.857143 2    
## [8]  {Bilibili}       => {Huya}  0.15    0.7500000  0.20      2.142857 3    
## [9]  {Jd}             => {Apple} 0.10    0.6666667  0.15      1.904762 2    
## [10] {NIO}            => {Xpev}  0.20    0.6666667  0.30      1.904762 4    
## [11] {Bilibili,Huya}  => {Iqiyi} 0.10    0.6666667  0.15      3.333333 2    
## [12] {NIO,Tesla}      => {Apple} 0.10    0.6666667  0.15      1.904762 2    
## [13] {Huya}           => {Iqiyi} 0.20    0.5714286  0.35      2.857143 4    
## [14] {Huya}           => {Doyu}  0.20    0.5714286  0.35      2.857143 4    
## [15] {Xpev}           => {NIO}   0.20    0.5714286  0.35      1.904762 4

The top 15 rules by confidence are fairly intuitive. A confidence of 1 means that, in this sample,
every transaction that contains A also contains B, so B is strongly associated with A.
For example, Nvidia and AMD are both CPU/GPU manufacturers, and in the rule {Nvdia} => {Amd} everyone
who holds one also holds the other, although this rule rests on only 2 transactions. Doyu and Huya are
another example: both are in the streaming industry, and everyone in the sample who holds Doyu also
holds Huya, so {Doyu} => {Huya} has a confidence of 1.

Below is the code for the top 15 rules by confidence.

# Sort the rules by confidence (highest first) and show the top 15; assumes arules is attached
sortedrulesk <- sort(stock_rule, by = "confidence", decreasing = TRUE)
inspect(sortedrulesk[1:15])

Top 15 rules for lift

##      lhs                 rhs        support confidence coverage lift      count
## [1]  {Nvdia}          => {Amd}      0.10    1.0000000  0.10     10.000000 2    
## [2]  {Amd}            => {Nvdia}    0.10    1.0000000  0.10     10.000000 2    
## [3]  {Bilibili,Huya}  => {Iqiyi}    0.10    0.6666667  0.15      3.333333 2    
## [4]  {Alibaba}        => {Xpev}     0.10    1.0000000  0.10      2.857143 2    
## [5]  {Iqiyi}          => {Huya}     0.20    1.0000000  0.20      2.857143 4    
## [6]  {Doyu}           => {Huya}     0.20    1.0000000  0.20      2.857143 4    
## [7]  {Bilibili,Iqiyi} => {Huya}     0.10    1.0000000  0.10      2.857143 2    
## [8]  {Apple,NIO}      => {Tesla}    0.10    1.0000000  0.10      2.857143 2    
## [9]  {Huya}           => {Iqiyi}    0.20    0.5714286  0.35      2.857143 4    
## [10] {Huya}           => {Doyu}     0.20    0.5714286  0.35      2.857143 4    
## [11] {Iqiyi}          => {Bilibili} 0.10    0.5000000  0.20      2.500000 2    
## [12] {Bilibili}       => {Iqiyi}    0.10    0.5000000  0.20      2.500000 2    
## [13] {Huya,Iqiyi}     => {Bilibili} 0.10    0.5000000  0.20      2.500000 2    
## [14] {Bilibili}       => {Huya}     0.15    0.7500000  0.20      2.142857 3    
## [15] {Jd}             => {Apple}    0.10    0.6666667  0.15      1.904762 2

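A lift greater than 1 means that A and B appear together more often than would be expected if they were
independent (a positive association); a lift equal to 1 means they are independent; a lift below 1 means
they appear together less often than expected (a negative association). In the top 15, the rules linking
Nvdia and Amd, and those linking Bilibili, Huya and Iqiyi, have very high lift, which means these stocks
are strongly positively associated in this sample.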
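
As a worked check, lift can be recomputed as the rule's confidence divided by the overall frequency of the right-hand item. The sketch below uses values copied from the tables above; `describe_lift` is a hypothetical helper, and the tolerance used to call a lift "equal to 1" is an assumption.

```r
# Confidence values come from the rules table; support_rhs is the share of the
# 20 people who hold the right-hand stock (Huya: 7, Amd: 2, Tesla: 7).
rules_demo <- data.frame(
  rule        = c("{Iqiyi} => {Huya}", "{Nvdia} => {Amd}", "{NIO} => {Tesla}"),
  confidence  = c(1.0, 1.0, 0.5),
  support_rhs = c(7 / 20, 2 / 20, 7 / 20),
  stringsAsFactors = FALSE
)
rules_demo$lift <- rules_demo$confidence / rules_demo$support_rhs

# Hypothetical helper: turn a lift value into a plain-language reading.
describe_lift <- function(lift, tol = 1e-8) {
  ifelse(abs(lift - 1) < tol, "independent",
         ifelse(lift > 1, "positively associated", "negatively associated"))
}
rules_demo$reading <- describe_lift(rules_demo$lift)

print(rules_demo)
```

The computed lifts (about 2.857, 10 and 1.429) agree with the lift column reported for these three rules.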

Below is the code for the top 15 rules by lift.

# Sort the rules by lift (highest first) and show the top 15; assumes arules is attached
sortedrulesl <- sort(stock_rule, by = "lift", decreasing = TRUE)
inspect(sortedrulesl[1:15])

Top 15 rules for support

##      lhs           rhs        support confidence coverage lift      count
## [1]  {Iqiyi}    => {Huya}     0.20    1.0000000  0.20      2.857143 4    
## [2]  {Huya}     => {Iqiyi}    0.20    0.5714286  0.35      2.857143 4    
## [3]  {Doyu}     => {Huya}     0.20    1.0000000  0.20      2.857143 4    
## [4]  {Huya}     => {Doyu}     0.20    0.5714286  0.35      2.857143 4    
## [5]  {NIO}      => {Xpev}     0.20    0.6666667  0.30      1.904762 4    
## [6]  {Xpev}     => {NIO}      0.20    0.5714286  0.35      1.904762 4    
## [7]  {Apple}    => {Tesla}    0.20    0.5714286  0.35      1.632653 4    
## [8]  {Tesla}    => {Apple}    0.20    0.5714286  0.35      1.632653 4    
## [9]  {Bilibili} => {Huya}     0.15    0.7500000  0.20      2.142857 3    
## [10] {NIO}      => {Tesla}    0.15    0.5000000  0.30      1.428571 3    
## [11] {Nvdia}    => {Amd}      0.10    1.0000000  0.10     10.000000 2    
## [12] {Amd}      => {Nvdia}    0.10    1.0000000  0.10     10.000000 2    
## [13] {Alibaba}  => {Xpev}     0.10    1.0000000  0.10      2.857143 2    
## [14] {Jd}       => {Apple}    0.10    0.6666667  0.15      1.904762 2    
## [15] {Iqiyi}    => {Bilibili} 0.10    0.5000000  0.20      2.500000 2

Support measures how often a rule occurs: the share of the 20 transactions that contain every item in
the rule. The higher the support, the more common that combination of stocks is in the sample group.
Here, combinations involving Iqiyi, Huya, Doyu and NIO are among the most widely held by these 20 people.

Below is the code for the top 15 rules by support.

# Sort the rules by support (highest first) and show the top 15; assumes arules is attached
sortedruless <- sort(stock_rule, by = "support", decreasing = TRUE)
inspect(sortedruless[1:15])

Visualization

First, I created a histogram of each stock's frequency so that we can compare it with the support values.

From this graph, we can see that the four most frequent stocks, each with 7 occurrences, also appear in
several of the highest-support rules. This suggests a tentative conclusion for ARM: the more frequent a
stock is, the higher the support of its rules tends to be. The item-frequency plot is also one of the
standard visualizations in ARM.
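
The link between frequency and support can be made precise: a rule's support can never exceed the frequency of the rarest item it contains. The sketch below checks this for a few rules and draws a simple frequency bar chart in base R. The counts were tallied from the transaction listing with labels as spelled there; `item_count` and `rules_demo` are illustrative names, and only a subset of items and rules is included.

```r
n <- 20  # number of people in the sample

# Item counts tallied from the transaction listing (the most frequent items only).
item_count <- c(Huya = 7, Apple = 7, Tesla = 7, Xpev = 7, NIO = 6,
                Doyu = 4, Iqiyi = 4, Bilibili = 4)
item_support <- item_count / n

# A few single-antecedent rules and their support, copied from the rules table.
rules_demo <- data.frame(
  lhs     = c("Iqiyi", "Doyu", "NIO", "Apple"),
  rhs     = c("Huya", "Huya", "Xpev", "Tesla"),
  support = c(0.20, 0.20, 0.20, 0.20),
  stringsAsFactors = FALSE
)

# Upper bound on each rule's support: the frequency of its rarer item.
rules_demo$bound <- unname(pmin(item_support[rules_demo$lhs],
                                item_support[rules_demo$rhs]))
print(rules_demo)

# Frequency bar chart, most frequent stocks first.
barplot(sort(item_count, decreasing = TRUE), las = 2,
        ylab = "Number of holders", main = "Stock frequency")
```

So frequent stocks can have high-support rules, but frequency alone does not guarantee them: Apple and Xpev are each held by 7 people, yet no rule between the two passes the 0.1 support threshold.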

Next, I made a scatter plot of support against confidence, using color to indicate lift.

Because we do not have many rules, the plot is relatively sparse, but it still illustrates how the rules are distributed.
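
A plot of this kind can be sketched in base R as follows. This is not the code behind the original figure; it uses six rules copied from the rules table, and the color ramp and the number of color bins are arbitrary choices.

```r
# Six rules from the table: Nvdia=>Amd, Iqiyi=>Huya, Bilibili=>Huya,
# NIO=>Xpev, NIO=>Tesla, Apple=>Tesla.
rules_demo <- data.frame(
  support    = c(0.10, 0.20, 0.15, 0.20, 0.15, 0.20),
  confidence = c(1.0000000, 1.0000000, 0.7500000, 0.6666667, 0.5000000, 0.5714286),
  lift       = c(10.000000, 2.857143, 2.142857, 1.904762, 1.428571, 1.632653)
)

# Map lift onto a 100-step color ramp (low lift = gold, high lift = red).
ramp  <- colorRampPalette(c("gold", "red"))(100)
shade <- ramp[cut(rules_demo$lift, breaks = 100, labels = FALSE)]

plot(rules_demo$support, rules_demo$confidence,
     pch = 19, cex = 1.5, col = shade,
     xlab = "Support", ylab = "Confidence",
     main = "Rules by support and confidence (color = lift)")
```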

Next, here is a networkD3 plot embedded in this website; feel free to play around with it.

From the networkD3 graph, we can see three networks. Nvidia and AMD are always bound together.
Huya appears to be the center of a cluster with Doyu, Iqiyi and Bilibili. The last cluster is centered on
Tesla and Apple, and NIO has strong rules around it. This suggests that there are three industries that
stockholders tend to buy within, and that the three clusters correspond to those industries.

Finally, I used igraph to plot the association rule mining result.

This graph gives essentially the same result as the networkD3 one, but it shows the direction of each
rule and presents the relationships statically. It also reveals a new group linking Xpev and Alibaba:
people who hold Alibaba tend to hold Xpev as well, perhaps because Alibaba holds shares in Xpev.
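
The cluster structure can also be checked numerically. The sketch below is an assumption about how such a graph might be built, not the original plotting code: it turns the 16 single-antecedent rules from the rules table into directed edges and counts the weakly connected components with igraph.

```r
library(igraph)

# Rules [1] to [16] from the rules table as directed edges lhs -> rhs.
edges <- data.frame(
  from = c("Nvdia", "Amd", "Alibaba", "Jd", "Iqiyi", "Bilibili", "Iqiyi", "Huya",
           "Bilibili", "Doyu", "Huya", "NIO", "Xpev", "NIO", "Apple", "Tesla"),
  to   = c("Amd", "Nvdia", "Xpev", "Apple", "Bilibili", "Iqiyi", "Huya", "Iqiyi",
           "Huya", "Huya", "Doyu", "Xpev", "NIO", "Tesla", "Tesla", "Apple"),
  lift = c(10, 10, 2.857143, 1.904762, 2.5, 2.5, 2.857143, 2.857143,
           2.142857, 2.857143, 2.857143, 1.904762, 1.904762, 1.428571,
           1.632653, 1.632653),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges, directed = TRUE)

# Weak connectivity ignores edge direction, so it matches the visual clusters.
comp <- components(g, mode = "weak")
print(comp$no)
print(split(names(comp$membership), comp$membership))

# Static directed plot; edge width scaled by lift (the scaling factor is arbitrary).
plot(g, edge.width = E(g)$lift / 2, edge.arrow.size = 0.4, vertex.label.cex = 0.8)
```

On these edges the graph has three weakly connected components: the chip pair, the streaming group around Huya, and a group around Tesla, Apple, NIO and Xpev. Alibaba and Jd each attach to that last group through a single one-directional rule.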

Conclusion

Talking about the stock market is a normal part of daily life. When somebody recommends a stock to you,
you are more likely to buy it than one you discovered yourself. My purpose in using ARM for this analysis
is to help build a recommendation system. When you buy something on Amazon, related items show up as
recommendations. I would like stock trading software to do the same, so I am using ARM to find which
stocks people tend to buy when they already hold related stocks.
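
As a rough illustration of how mined rules could drive such recommendations, the sketch below looks up the rules whose left-hand side is fully contained in a user's holdings and suggests the right-hand stocks, ranked by confidence and then lift. The `recommend` function and the ranking choice are hypothetical; the six rules are copied from the rules table.

```r
# Left-hand sides kept as a list so a rule can have more than one antecedent.
rule_lhs <- list("Doyu", "Iqiyi", "NIO", "Apple", c("Apple", "NIO"), "Bilibili")
rule_tab <- data.frame(
  rhs        = c("Huya", "Huya", "Xpev", "Tesla", "Tesla", "Huya"),
  confidence = c(1, 1, 0.6666667, 0.5714286, 1, 0.75),
  lift       = c(2.857143, 2.857143, 1.904762, 1.632653, 2.857143, 2.142857),
  stringsAsFactors = FALSE
)

recommend <- function(holdings, top_n = 3) {
  # A rule applies when every item on its left-hand side is already held.
  applies <- vapply(rule_lhs, function(l) all(l %in% holdings), logical(1))
  # Do not recommend stocks the user already holds.
  hits <- rule_tab[applies & !(rule_tab$rhs %in% holdings), ]
  hits <- hits[order(-hits$confidence, -hits$lift), ]
  hits <- hits[!duplicated(hits$rhs), ]  # keep the strongest rule per stock
  head(hits, top_n)
}

print(recommend(c("Apple", "NIO")))
```

For someone holding Apple and NIO, this sketch suggests Tesla first (confidence 1 from {Apple,NIO} => {Tesla}) and Xpev second (confidence about 0.667 from {NIO} => {Xpev}). With only 20 people in the sample, such suggestions are illustrative rather than reliable.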

Besides recommending stocks, my findings can also help fund managers identify people's risk preferences
and so better help customers build their own stock, fund, or fixed-income portfolios. Risk managers can
use this to avoid over-recommending or under-recommending: they simply need to pick stocks from funds
that match the customer's risk sensitivity.

My complete R code is attached below

MY R CODE