AllFreePapers.com - All Free Papers and Essays for All Students

Data Validation Strategy

Author:   •  February 23, 2016  •  Essay  •  559 Words (3 Pages)  •  875 Views


Arun Sharma

Data Validation Strategies

Company: Verisk Innovation Analytics

Project: POLITICAL EVENTS LIKELIHOOD PREDICTION

Data for this project will be collected from articles published in magazines, so the first step is to select authors and magazines with a history of publishing articles on political situations. Once data collection is finalized, it is essential to check the sanity of the collected data. This step ensures the data meets our requirements: issues such as missing data, article length, and author validity will be corrected here. After the sanity check, we can move to data exploration to find any remaining inconsistencies. Removing common English words, words irrelevant to our analysis, and proper names will keep the data compact, retaining only what is useful for model building. Further exploratory data analysis (EDA) will then be carried out to treat outliers, verify that values fall within expected ranges, and normalize the data. After EDA, we will focus on drawing an appropriate sample from the data, and samples will be checked thoroughly for bias. Since the data may have a large number of features, it is essential to select the important ones: feature-selection techniques such as principal component analysis (PCA) and singular value decomposition (SVD) will be used to extract the most important features from the data set.
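The cleaning step described above, removing common English words and other tokens irrelevant to the analysis, could be sketched as below. This is a minimal illustration rather than the project's actual pipeline; the small stopword list and the `clean_article` helper are assumptions made for the example, and a real pipeline would use a fuller stopword list (e.g. from NLTK or spaCy).

```python
import re

# Tiny illustrative stopword list (assumed for this sketch only).
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of",
             "to", "in", "on", "and", "or", "for", "with", "that"}

def clean_article(text, extra_stopwords=()):
    """Lowercase the text, keep only word tokens, and drop common
    English words plus any project-specific irrelevant words
    (e.g. proper names)."""
    drop = STOPWORDS | set(extra_stopwords)
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in drop]

tokens = clean_article(
    "The election in the capital was a turning point for the coalition.",
    extra_stopwords=("capital",),
)
print(tokens)  # ['election', 'turning', 'point', 'coalition']
```

The same `extra_stopwords` hook can carry names and other domain-specific terms that should not influence the model.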

Company: IronPlanet

Project: PRICING RECOMMENDER SYSTEM

Data for this project is supplied by IronPlanet. The first step in data validation is a sanity check: since the data is maintained in-house by IronPlanet, it is essential to understand how IronPlanet collects this data and how it is stored. The data will be checked thoroughly for inconsistencies such as data-type mismatches, missing values, outliers, and demographic mismatches. Once the data is sanitized, exploratory data analysis will summarize its main characteristics, which will further help in removing any remaining inconsistencies and in normalizing the data. With normalized data in hand, we will employ sampling techniques to draw an appropriate sample. Since IronPlanet's data set is very large and has a high number of features, it is necessary to restrict the analysis to the most important ones: principal component analysis (PCA) and singular value decomposition (SVD) will be used to identify them. If there is high correlation among features, binning and combining of features will be done.
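As a sketch of the feature-selection step named in both projects, the following reduces a feature matrix to its top principal components via singular value decomposition of the centered data. The random matrix, the `reduce_features` helper, and the choice of three components are placeholders assumed for illustration, not IronPlanet's actual data or settings.

```python
import numpy as np

def reduce_features(X, k):
    """Project an (n_samples, n_features) matrix onto its top-k
    principal components, computed via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                 # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # scores on the first k components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))               # stand-in for the real feature matrix
X_reduced = reduce_features(X, k=3)
print(X_reduced.shape)                      # (100, 3)
```

Before reduction, a correlation check (e.g. `np.corrcoef(X, rowvar=False)`) can flag highly correlated feature pairs as candidates for the binning and combining mentioned above.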

...
