Data Cleaning

Cleaning data and feature generation

  • After gathering the data, my next step was to clean it and make any necessary adjustments. I started by making box plots to visualize the data set, beginning with my NASDAQ data. I have attached the picture below.

  • After making the box plot, I noticed many abnormally high volume values that look like outliers, as we can see in the leftmost box plot above. However, from domain knowledge, an unusually high volume of stock transactions usually has an important effect on the stock market. Therefore, I kept those values for further analysis.
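The idea of spotting the points a box plot draws as outliers, while still keeping them, can be sketched as follows. This is only an illustration under assumptions: it uses the common 1.5 × IQR whisker rule, the function name `flag_boxplot_outliers` is hypothetical, and the volume numbers are made up rather than taken from my data.

```python
import statistics
from typing import List, Sequence


def flag_boxplot_outliers(values: Sequence[float]) -> List[bool]:
    """Mark values outside the 1.5 * IQR whiskers; nothing is removed."""
    # Quartiles computed with the "inclusive" method (an assumed choice).
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical daily volumes (in millions); the last day is a spike.
volumes = [10, 12, 11, 13, 12, 11, 95]
flags = flag_boxplot_outliers(volumes)

# The flagged rows stay in the data set; the flag only labels them.
for volume, is_outlier in zip(volumes, flags):
    print(volume, "possible outlier" if is_outlier else "")
```

Flagging instead of deleting matches the decision above: the high-volume days remain available for later analysis.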

  • My next step was to handle NA values. After checking for them, I found that some rows were entirely missing and filled with NA. I couldn't simply remove those rows, because I need to find the relationship between the NASDAQ price and individual stocks. I also couldn't replace them with the mean or median, because stock prices are time sensitive: the mean or median of the market over 20 years cannot stand in for the price on a specific day, for example 2020-06-01. Therefore, I replaced each NA value with the previous value, which gives a smooth transition. After this initial NA detection and replacement, I had a complete data set, which I have attached below. The values inside the red box were initially NA, and I replaced them with the values in the blue box. There are many values like this; this is only one example.
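The replacement described above (carrying the previous value forward) can be sketched in a few lines. This is a minimal illustration, not my actual cleaning code: the function name `forward_fill` and the prices are hypothetical, and missing values are represented as `None`.

```python
from typing import List, Optional, Sequence


def forward_fill(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the most recent earlier value."""
    filled: List[Optional[float]] = []
    last_seen: Optional[float] = None  # stays None until a real value appears
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Hypothetical closing prices in date order; None marks a missing row.
close_prices = [101.5, None, None, 103.0, None, 102.2]
filled_prices = forward_fill(close_prices)
print(filled_prices)
```

The rows must be sorted by date for this to make sense, and a missing value at the very start stays missing because there is nothing earlier to copy. With a pandas data frame, the same idea is available as `DataFrame.ffill()`.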

  • After dealing with the NA values, my next step was to generate features. Based on my domain knowledge, some analysis is possible with the raw data alone, but it is not easy. Therefore, I did some data discretization and feature generation to support further analysis. First, I created a new column named "Trend", which shows whether the stock went up or down over a period. I compared the close price with the open price: if the close price is higher than the open price, the value is True, and False otherwise. Next, I calculated the percentage change between these two variables and named it "perc_change". I have attached them below.
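The two features described above can be sketched like this. It is an illustration under assumptions: the column names "Open" and "Close", the sample prices, and the function name `add_trend_features` are hypothetical, and the percentage change is assumed to be measured from open to close, relative to the open price.

```python
from typing import Dict, List


def add_trend_features(rows: List[Dict]) -> List[Dict]:
    """Return copies of the rows with 'Trend' and 'perc_change' added."""
    enriched = []
    for row in rows:
        open_price = row["Open"]
        close_price = row["Close"]
        new_row = dict(row)  # copy so the input rows are not modified
        # True when the stock closed above its open price, False otherwise.
        new_row["Trend"] = close_price > open_price
        # Assumed definition: (close - open) / open, expressed in percent.
        new_row["perc_change"] = (close_price - open_price) * 100 / open_price
        enriched.append(new_row)
    return enriched


# Hypothetical rows; real data would have one row per trading day.
sample = [
    {"Date": "2020-06-01", "Open": 100.0, "Close": 102.0},
    {"Date": "2020-06-02", "Open": 200.0, "Close": 190.0},
]
features = add_trend_features(sample)
for row in features:
    print(row["Date"], row["Trend"], round(row["perc_change"], 2))
```

A day where the close equals the open gets Trend False under this rule, since the close is not strictly higher.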

  • As required, I used RStudio to analyze some of my data and do some cleaning. I checked for NAs in my GDP data and did not find any missing values. I then generated two variables: trend, to see whether GDP increased, and perc change, to see by how much it increased. My R code is available here: http://haoming.georgetown.domains/rcodecleaning.html As always, I include a picture below of the data I cleaned using R.