Data Mining

Author: Yunhan Zhao • September 26, 2015 • Case Study


The database I selected is the same one I used for Project 1, “Sales Prices of Houses in the City of Windsor”. It comes from:

Verbeek, Marno (2004), A Guide to Modern Econometrics, John Wiley and Sons, chapter 3, http://www.econ.kuleuven.ac.be/GME.

Journal of Applied Econometrics data archive: http://jae.wiley.com/jae/.

The database contains 546 observations, a cross-section collected in 1987 in the city of Windsor, Canada. It includes twelve variables:

price: sale price of a house

lotsize: the lot size of a property in square feet

bedrooms: number of bedrooms

bathrms: number of full bathrooms

stories: number of stories excluding basement

driveway: does the house have a driveway? (1 for yes, 0 for no)

recroom: does the house have a recreational room? (1 for yes, 0 for no)

fullbase: does the house have a full finished basement? (1 for yes, 0 for no)

gashw: does the house use gas for hot water heating? (1 for yes, 0 for no)

airco: does the house have central air conditioning? (1 for yes, 0 for no)

garagepl: number of garage places

prefarea: is the house located in the preferred neighborhood of the city? (1 for yes, 0 for no)

For the analysis, I split the dependent variable, price, into two categories based on its third quartile, 82,000. Houses priced at or above 82,000 are defined as high price (coded 1), and those below 82,000 as low price (coded 0). I chose the third quartile as the cutoff because usually only a few houses are sold at a high price.
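The coding of price into 0 and 1 can be sketched as follows. This is only an illustration under assumptions: it uses Python's standard statistics module, the function name label_by_third_quartile is hypothetical, the prices are a made-up toy list rather than the Windsor data, and the "inclusive" quartile method is one of several conventions, so other software may return a slightly different third quartile.

```python
import statistics

def label_by_third_quartile(prices):
    """Return (q3, labels) where a label is 1 if price >= Q3, else 0."""
    # With n=4, statistics.quantiles returns the three cut points [Q1, Q2, Q3].
    # method="inclusive" is an assumption; the source does not say how Q3 was computed.
    _, _, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    labels = [1 if p >= q3 else 0 for p in prices]
    return q3, labels

# Toy prices for illustration only (NOT the Windsor data).
toy_prices = [42000, 55000, 61000, 68000, 75000, 82000, 90000, 120000]
threshold, labels = label_by_third_quartile(toy_prices)
print(threshold, labels)
```

On the real data, the threshold would be the 82,000 reported above, and roughly a quarter of the houses would receive the label 1.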

KNN

Since KNN can only use quantitative variables, I created a separate data file named Housing windsor for KNN.csv that contains only price, lotsize, bedrooms, bathrms, stories, and garagepl. Of the 546 observations, I used 75% for training and 25% for testing. I tried K = 7, 9, 11, 13, and 15 to compare misclassification rates; for each K I ran the model five times and averaged the misclassification rate. The following table shows the results:

K=	First run	Second run	Third run	Fourth run	Fifth run	Average
7	15.33%	16.79%	21.90%	22.63%	14.60%	18.25%
9	15.33%	10.22%	21.90%	17.52%	12.41%	15.47%
11	16.79%	9.49%	10.21%	11.68%	13.14%	12.26%
13	18.98%	13.87%	14.60%	12.41%	13.87%	14.75%
15	15.33%	16.79%	8.03%	16.06%	16.06%	14.45%
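The procedure behind this table (a random 75/25 split, five runs per K, then the average) can be sketched in plain Python. The sketch rests on several assumptions that the text does not confirm: Euclidean distance, unscaled features, majority voting, and the same five random splits reused for every K. All function names are hypothetical, and the data are synthetic points generated for illustration, not the Windsor housing data.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training rows closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda row: math.dist(row[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def misclassification_rate(X, y, k, rng, train_frac=0.75):
    """One random train/test split; share of test rows predicted wrongly."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    wrong = sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test)
    return wrong / len(test)

def average_rate(X, y, k, runs=5, seed=0):
    """Average misclassification rate over several random splits."""
    rng = random.Random(seed)  # same seed for every k, so each k sees the same splits
    return sum(misclassification_rate(X, y, k, rng) for _ in range(runs)) / runs

# Synthetic two-feature data for illustration only (NOT the Windsor data).
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(80)]
X += [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(40)]
y = [0] * 80 + [1] * 40

results = {k: average_rate(X, y, k) for k in (7, 9, 11, 13, 15)}
best_k = min(results, key=results.get)
for k, rate in results.items():
    print(f"K={k}: {rate:.2%}")
print("best K:", best_k)
```

With 546 rows, the split in this sketch would give 409 training rows and 137 test rows. Because KNN relies on distances, a variable measured in large units such as lotsize can dominate the others unless the features are standardized first; whether that was done here is not stated.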

The best K for KNN is 11 nearest neighbors, since K=11 has the smallest average misclassification rate, 12.26%.
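As a quick arithmetic check, the averages and the choice of K can be recomputed from the five runs reported in the table. The variable names below are illustrative; the numbers are copied from the table.

```python
# Misclassification rates (in percent) for each K, copied from the table.
runs = {
    7:  [15.33, 16.79, 21.90, 22.63, 14.60],
    9:  [15.33, 10.22, 21.90, 17.52, 12.41],
    11: [16.79, 9.49, 10.21, 11.68, 13.14],
    13: [18.98, 13.87, 14.60, 12.41, 13.87],
    15: [15.33, 16.79, 8.03, 16.06, 16.06],
}

# Mean of the five runs for each K.
averages = {k: sum(rates) / len(rates) for k, rates in runs.items()}

# The K with the smallest average misclassification rate.
best_k = min(averages, key=averages.get)

for k, avg in averages.items():
    print(f"K={k}: {avg:.2f}%")
print("best K:", best_k)
```

The recomputed averages agree with the reported ones to within 0.01 percentage points. The gap between K=11 and the next best values is about two percentage points over only five runs each, so the ranking may be sensitive to the random splits.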

Tree

For the tree, I selected all the variables from the data, changed price into two categories, and named the file Housing windsor for Tree.csv.

...
