Business Intelligence Methods: cluster analysis
How to use the Weka software to get the output (I have done this part and uploaded the output, but you can redo it for a better view if you like):
1- Open the program
2- Click on Explorer
3- Open file
4- Files of type: choose CSV, and then choose the Excel data file
5- Click on Cluster => Choose => SimpleKMeans => then click on the bold word “SimpleKMeans” to set the number of clusters you want, then OK => Start. Now you can see the output of the clustering using SimpleKMeans.
6- For the other technique, use REPTree:
Click on Classify => Choose => trees => REPTree => Start. Now you can see the output of the REPTree decision tree.
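To give a feel for what the SimpleKMeans step does, here is a minimal sketch of the basic k-means idea in plain Python. It is not Weka's code, and it is not needed for the assignment because the output must come from Weka. The function name `kmeans`, the starting centers, and the four sample points are all made up for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic k-means. `centroids` holds the starting centers (an assumption;
    Weka picks its own starting centers)."""
    centroids = [list(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 1: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move each center to the mean of the points in its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # an empty cluster keeps its old center
        if new_centroids == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = new_centroids
    return centroids, clusters

# Made-up sample: (age, income) values already scaled to the range 0 to 1.
sample_points = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
final_centers, final_clusters = kmeans(sample_points, [[0.0, 0.0], [1.0, 1.0]])
```

With these made-up points, the two low values end up in one cluster and the two high values in the other. The centers play the same role as the cluster centroids that Weka lists in its output.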
Here is the link to download Weka:
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
Use version weka-3-6-10jre-x64.exe
========================================================================
Important notes:
- It must be written in very simple language, using only simple and basic English words. I don’t want it to read like professional writing. I want it to read like the writing of a first-year college student.
- I have uploaded the 3 Excel files for the data. They are all the same, and I used only one of them: the one that has the number 1 in its name.
- I have done the lab part using the Weka program. You are only allowed to use this program for the output. I got the output of the clustering using SimpleKMeans in Weka, and I also got the output of the other technique using REPTree in Weka. (I have uploaded them, each one in a different Word file.)
- You may find this helpful:
Male = 1
The normalization is f(x) = (x - min) / (max - min)
So you can use this worksheet to plug in your values for the clusters.
In general, though, the higher the number, the higher the income and age, or other variable. Now you can conver
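The normalization formula above, and the step of turning a normalized cluster value back into a real value, can be sketched in Python as follows. This is only an illustration: the function names and the age range of 18 to 68 are assumptions, not values taken from the data files.

```python
def normalize(x, lowest, highest):
    """Min-max normalization: f(x) = (x - min) / (max - min)."""
    return (x - lowest) / (highest - lowest)

def denormalize(y, lowest, highest):
    """Reverse of normalize: turn a 0-to-1 value back into the original unit."""
    return y * (highest - lowest) + lowest

# Assumed range for illustration only: ages from 18 to 68.
age_min, age_max = 18, 68
scaled_age = normalize(43, age_min, age_max)     # 0.5
real_age = denormalize(0.5, age_min, age_max)    # 43.0
```

The same two functions work for income or any other numeric variable, as long as the real minimum and maximum of that column are used.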
My completed old assignment:
Business Intelligence Methods: Classification: Decision Trees
Frame and Business Objective
The task is to use a J48 decision tree in Weka, adjusted for a maximum number of tree leaves. For comparison, Naive Bayes assumes that attributes are equally important and statistically independent given the class value. This independence assumption is often inaccurate, but the method still tends to work well in practice. Classification does not require accurate probability estimates, as long as the correct class receives the greatest probability. However, adding redundant attributes might cause problems, so it is important to use distinct attributes in the selection process. The business objective was to increase the number of subscribers opening bank accounts, which would in turn increase revenue.
Information Gaps
There were two information gaps that the researcher discovered. First, there were instances where the information was unknown. For example, some people did not have contacts, so the metric outcome was significantly affected. In some cases, the result displayed was "other" or "failure," which showed that the data presented was not fully reliable. Second, income presented an information gap. Account balances play an insignificant role in explaining why individuals open accounts. However, income matters because it indicates how likely a person is to open a bank account. According to the scenario presented, individuals with high income would be more likely to open bank accounts than low-income earners.
Measurements
The assignment used a range of measures to describe the data. First, 'age' was represented by a numeric attribute. Second, 'job' was a categorical attribute for the type of job. In this case, individuals were described as 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', and 'unknown'. Third, 'marital' showed their marital status. Like 'job', it was categorical, so individuals were 'single', 'married', or 'divorced'. Note that the term 'divorced' applied to both widowed and separated individuals because they were no longer living together, a factor that would likely contribute to opening a bank account.
Fourth, 'education' was a categorical attribute with 'unknown', 'secondary', 'primary', and 'tertiary' as options. Fifth, 'default' showed whether the client had credit in default, so the entry was binary: 'yes' or 'no'. Sixth, 'balance' was a numeric attribute that gave the average yearly balance in Euros. Seventh, 'housing' was a binary attribute that showed whether the client had a housing loan. Eighth, 'loan' was binary and showed whether the client had a personal loan.
The following attributes were related to the last contact in the current campaign. First, 'contact' was categorical and gave the contact communication type, that is, 'unknown', 'telephone', or 'cellular'. Second, 'day' was a numeric attribute that gave the last contact day of the month. Third, 'month' was a categorical attribute that gave the last contact month of the year, that is, 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', or 'dec'. Lastly, 'duration' was a numeric attribute for the last contact duration in seconds. Other attributes used in the analysis were as follows.
First, 'campaign' was the number of contacts performed during this campaign for this client, in numeric form. Second, 'pdays' was the number of days that had passed since the client was last contacted in a previous campaign, in numeric form. Third, 'previous' was the number of contacts performed before this campaign for this client, and it was numeric. Fourth, 'poutcome' was the outcome of the previous marketing campaign. It was categorical with options 'unknown', 'other', 'failure', or 'success'. Lastly, the output variable 'y' was binary and showed whether the client had subscribed to a term deposit or not.
Analytical Method
The analytic method used was J48. J48 is Weka's implementation of the C4.5 decision tree algorithm. It is a machine-learning model that predicts the dependent variable, that is, the target value, from a set of data. Its inner nodes test different attributes, and its leaf nodes give the classification. In this regard, the dependent variable was 'poutcome'. The independent variables were 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', and 'pdays'.
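As a rough picture of how a C4.5-style tree such as J48 decides which attribute to test at a node, the sketch below scores two categorical attributes with the gain ratio. It is a simplified illustration only, not Weka's code: it leaves out numeric attributes and pruning, and the four data rows are made up.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting `labels` by `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy that is left after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

# Made-up rows for illustration only (not from the assignment data).
poutcome = ["failure", "failure", "success", "success"]
housing = ["yes", "yes", "no", "no"]   # separates the two outcomes perfectly
loan = ["yes", "no", "yes", "no"]      # tells nothing about the outcome

housing_score = gain_ratio(housing, poutcome)
loan_score = gain_ratio(loan, poutcome)
```

In this made-up case 'housing' gets the higher score, so a tree built this way would test 'housing' first.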
System Implementation
The program was run as a single instance analysis using J48 and SimpleCart.
Presentation of Results
Individuals will open bank accounts based on the independent variables listed above. The four most important were job, age, education, and loan. First, people in management are more likely to open bank accounts than technicians, retired people, unemployed people, administrators, or people in blue-collar jobs. Second, individuals in their 30s and 40s were more likely to open accounts because of current responsibilities, the nature of their jobs, and the need to save for the future. Third, education was important to the choice. Those who had a degree from a tertiary institution were more likely to open bank accounts than those with secondary school education, who in turn were more likely to do so than primary school graduates. Finally, a loan was central to decision-making because most individuals with loans were unlikely to open an account for fear of incurring additional debt.

