Data Clustering Using Python

Because this portfolio focuses on stock price analysis, and its main topic is the relationship between GDP and stock market indicators, it is hard to gather record data to cluster: such data is already grouped. Therefore, I picked 6 individual stocks from 3 different industries.

The first industry is streaming, and the stocks I picked are "DOYU" and "HUYA". The second is new energy cars, represented by "NIO" and "Li". The third is online video broadcasting, from which I picked "IQ" and "BILI".

First, I grabbed the most recent 2 weeks of stock prices and did a preliminary data cleaning. Next, I created a new column called perc, which holds the percentage change over the day. Then I combined all 6 stocks into one dataframe. As always, a picture is attached below.
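The preparation steps can be sketched roughly as follows. This is a minimal illustration, not the original code: the prices are made-up numbers, only two of the six tickers are shown, and perc is assumed to be the close-to-close percentage change (the original may compute it differently, for example open-to-close).

```python
import pandas as pd

# Hypothetical closing prices (made-up numbers, not real market data).
dates = pd.date_range("2021-01-04", periods=3)
prices = {
    "DOYU": pd.Series([10.0, 10.5, 10.29], index=dates),
    "HUYA": pd.Series([20.0, 19.0, 19.95], index=dates),
}

frames = []
for ticker, close in prices.items():
    df = pd.DataFrame({"close": close})
    df.index.name = "date"
    # Assumed definition of perc: close-to-close change in percent.
    df["perc"] = df["close"].pct_change() * 100
    # The first day has no previous close, so its perc is NaN; drop it.
    df = df.dropna(subset=["perc"])
    df["ticker"] = ticker
    frames.append(df)

# Combine all stocks into one long dataframe.
combined = pd.concat(frames).reset_index()

# One row per stock, one column per day: a convenient shape for clustering stocks.
perc_wide = combined.pivot(index="ticker", columns="date", values="perc")
print(perc_wide.round(2))
```

Pivoting to one row per stock is a design choice made here so that each stock becomes a single point described by its daily changes.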

The next step is to calculate the distance matrix using three methods: euclidean_distances, manhattan_distances and cosine_distances.
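Assuming these names refer to the functions in scikit-learn's sklearn.metrics.pairwise module, the three matrices could be computed along these lines. The feature matrix X below is a made-up stand-in for the real percentage-change data.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    cosine_distances,
    euclidean_distances,
    manhattan_distances,
)

# Hypothetical feature matrix: one row per stock, one column per day's
# percentage change (made-up values).
X = np.array([
    [1.0, 2.0, -1.0],
    [1.5, 1.8, -0.5],
    [-3.0, 0.5, 4.0],
])

# Each call returns a square matrix of pairwise distances between rows.
d_euclid = euclidean_distances(X)   # straight-line distance
d_manhat = manhattan_distances(X)   # sum of absolute differences
d_cosine = cosine_distances(X)      # 1 - cosine similarity

print(np.round(d_manhat, 2))
```

Cosine distance ignores the magnitude of the daily moves and compares only their direction, so it can group stocks differently from the other two.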

After that, I used the k-means method with k = 3 to cluster the data. After clustering, I made a pair-to-pair comparison to see whether the clustering result works as expected. It does: the stocks were successfully clustered according to their industries.
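A minimal sketch of this step, assuming scikit-learn's KMeans, is shown here. The feature values are made up and deliberately well separated so that the pairing is easy to see; real percentage changes will be noisier.

```python
import numpy as np
from sklearn.cluster import KMeans

tickers = ["DOYU", "HUYA", "NIO", "Li", "IQ", "BILI"]

# Hypothetical features (made-up values): one row per stock.
X = np.array([
    [2.0, 2.1, 1.9],
    [2.2, 1.9, 2.0],
    [-3.0, -2.8, -3.1],
    [-2.9, -3.1, -3.0],
    [0.1, 5.0, -4.0],
    [0.0, 5.2, -4.1],
])

# k = 3, one cluster per industry; random_state fixes the initialisation.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Pair-to-pair comparison: True where two stocks share a cluster.
same_cluster = labels[:, None] == labels[None, :]
for i in range(len(tickers)):
    for j in range(i + 1, len(tickers)):
        if same_cluster[i, j]:
            print(tickers[i], "and", tickers[j], "share a cluster")
```

The cluster numbers themselves are arbitrary, which is why the check compares pairs of stocks rather than label values.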

After that, I used hierarchical clustering on the same data, and here is the result.
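One way to run hierarchical clustering is with SciPy, sketched below. Ward linkage and the cut into three clusters are assumptions made for illustration, and the feature values are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

tickers = ["DOYU", "HUYA", "NIO", "Li", "IQ", "BILI"]

# Hypothetical features (made-up values): one row per stock.
X = np.array([
    [2.0, 2.1, 1.9],
    [2.2, 1.9, 2.0],
    [-3.0, -2.8, -3.1],
    [-2.9, -3.1, -3.0],
    [0.1, 5.0, -4.0],
    [0.0, 5.2, -4.1],
])

# Build the merge tree; Ward linkage is an assumed choice.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
hc_labels = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(tickers, hc_labels.tolist())))
```

Passing Z to scipy.cluster.hierarchy.dendrogram (with matplotlib installed) draws the tree as a plot.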

After the record data, I analyzed pure text data. The text data comes from comments, news, and reviews about Apple's stock.

The code above for the record data can be found at MY PYTHON CODE

The code for the text data can be found here: Text cluster code

Data Clustering Using RStudio

This part uses the same record data and text data, but goes into a more in-depth analysis.


First, I loaded the data into R, deleted the first column, and saved it for later use. The data looks like this:

Next, I computed distance matrices, including Manhattan and Euclidean. They look like this:

After that, I used three different methods to determine the best k for k-means clustering.

The first is the silhouette method. From the picture, we can see that the best k is three.

The next is the elbow method, which also shows that 3 is the best number of clusters.
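To show what these two checks measure, here is a small sketch in Python, the language of the first half of this page, rather than in R. It is not the R code used for the plots. The feature values are made up, and scikit-learn's KMeans and silhouette_score are assumed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical features (made-up values): six stocks forming three tight pairs.
X = np.array([
    [2.0, 2.1, 1.9],
    [2.2, 1.9, 2.0],
    [-3.0, -2.8, -3.1],
    [-2.9, -3.1, -3.0],
    [0.1, 5.0, -4.0],
    [0.0, 5.2, -4.1],
])

inertias = {}     # elbow: within-cluster sum of squares for each k
silhouettes = {}  # silhouette: average score, defined for 2 <= k <= n - 1

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    if k >= 2:
        silhouettes[k] = silhouette_score(X, km.labels_)

# Silhouette picks the k with the highest average score.
best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)

# Elbow: look for the k after which the inertia stops dropping sharply.
for k, value in inertias.items():
    print(k, round(value, 3))
```

The elbow is read off by eye from the inertia curve, while the silhouette score gives a single number per k that can be compared directly.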

The next is the gap statistic.

Taken together, the consensus is that 3 is the optimal number of clusters.

After k-means, I tried hierarchical clustering as well, and the result is illustrated below.

I also made a heat map showing the relationship between the rows.

After analyzing the record data, I moved on to the text data.

First, I loaded the data and built a document-term matrix.
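For readers unfamiliar with the term, a document-term matrix has one row per document and one column per word, with each cell holding a count. The sketch below shows the idea in Python with scikit-learn's CountVectorizer, not the R code used here; the three sentences are invented examples, not the real comments.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents (invented, not the real comments or news).
docs = [
    "apple stock rises after strong earnings",
    "analysts stay positive on apple stock",
    "investors worry about supply delays",
]

# Count word occurrences per document, dropping common English stop words.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each term to its column index in the matrix.
vocab = vectorizer.vocabulary_
apple_counts = dtm[:, vocab["apple"]]
print("rows (documents):", dtm.shape[0])
print("count of 'apple' per document:", apple_counts.tolist())
```

Summing each column of the matrix gives the overall term frequencies, which is the kind of input a word cloud is drawn from.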

Lastly, I made a word cloud showing people's sentiment toward Apple. From the word cloud, we can see that Tesla may be improving its battery capacity, and that the attitude is positive overall; therefore, the stock should have an increasing trend. And indeed, it has had an increasing trend over these days.

You can access my R code here: R cluster code