-- Cross Industry Standard Process for Data Mining (CRISP-DM)
Step 1: Business Understanding: defining the context of the analysis
Step 2: Data Understanding: aligning data elements to the context
Step 3: Data Preparation: gathering and priming the data for analysis
Step 4: Modeling: conducting the analysis
Step 5: Evaluation: determining whether the analysis supports the goal of the business understanding
Step 6: Deployment: implementing the analysis
-- three primary elementts defining the development business process:
1: Base Development: annual giving and membership programs
2: Prospecting: identifying and prioritizing prospects for major giving
3: Major Giving: cultivating and stewarding major gifts
-- annual and mid-level giving is generally predictable, but major donor and corporate level giving is generally more volatile.
-- Correlation
-- Regression Analysis
-- Analytical metrics:
- Capacity Rating - measure of patron's ultimate giving ability over a five year period.
- Capacity Yield - five previous year's giving vs. the Capacity Rating, or how much of their capacity is being realized as gifts. If Capacity Rating is $1M and five year giving $100,000, Capacity Yield is 10%. 11% is a good Capacity Yield.
- Target Efficiency - how the target ask compares to the Capacity Rating.
- Target Accuracy - how the actual ask compares to the target ask.
- Close Yield - gift amount compared to ask amount. If ask is $1M and gift is $900,000, Close Yield is 90%.
-- example of Arts Single-Point Modeling system:
- Attendee: 1 point
- Subscriber: 1 point
- Arts background: 1 point
- Other arts giving: 1 point
- Donor: 1 point
- Gift in last five years: 1 point
- Five or more gifts: 1 point
- Managed: 1 point
- Researched: 1 point
- Visited: 1 point
- Attended event: 1 point
- Trustee: 1 point
- example of Arts Single-Point Modeling system:
- Attendee: 5 points
- Subscriber: 20 points
- Donor: 5 points
- 10 plus gifts: 10 points
- 5 to 9 gifts: 5 points
- Managed: 20 points
- Constituency or committee membership: 5 points per group/committee
- Total contacts x 3: points
- Current campaign prospect: 30 points
- 0-2 years since last gift: 10 points
- 2-5 years since last gift: 5 points
- 5-10 years since last gift: 2 points
- Arts background: 5 points
- Other arts giving: 5 points
-- Solicitation life cycle
1: Identification: feeding the pipeline with suspects
2: Qualification: qualifying suspects with data and through interaction
3: Cultivation and Solicitation: developing strong portfolios, developing cultivation strategies to align case and personal values, moving prosepcts with financial capacity closer to the organization, and soliciting prosepcts at the appropriate amount
4: Stewardship: stragegy to move prospects back into cultivation
-- annual giving is high-volume/low-dollar, and major giving is low-volume/high-dollar. CSO skews more low-volume/high-dollar
-- first year renewal - if you don't renew a patron within the first year, likelihood of a renewal drops considerable. Goal should first be to just secure a renewal and then once the giving patter is established, then move goal toward increased giving. For this reason, many organizations don't ask for upgrades in the first or second year.
-- RFM ANALYSIS - transactional recency, frequency, and amount
-- LYBUNT, SYBUNT, New
- LYBUNT - gave last year but not this
- SYBUNT (lapsed) - gave between 2 and 3 years ago, but not last year or this
- Long lapsed - intermediate category, gave between 2 and 5 years ago but not last year or this
- New donor - a new donor or someone who last gave more than 5 years ago
-- DATA MINING INTRO
- separate individuals from organizations since giving tends to be so different
-------------------
-- DATA ANALYSIS --
-------------------
- activate DataPak, use Data > DataTools > Histogram to create a histogram that counts the frequency of certain values. You set up a column of bins that are the boundaries of each group. In Excel histogram, the Input Range will be your array of values, and the Bin Range will be the column of bins. The spreadsheet will be set up like this:
contribution bins
20 0
11 1000
13.56 3500
13.42 10000
11 25000
20 50000
13.42 100000
...
The historgram will count each value as art of a bin and then give you the total of the bin counts:
0 Frequency
1000 20904
3500 1268
10000 717
25000 320
50000 132
100000 58
More 36
-- cross tabulation is comparing the values of two fields. For example, how many event attendees also have an email address?
--------------------------
-- CORRELATION IN EXCEL --
--------------------------
- highlight two or more columns with numeric data and it'll give you a cross tabulation of the correlation. If you're wanting to find a correlation between text data, you'll need to convert the text to numeric values (such as using an IF statement to assign a 1 to one category of values, a 2 to the next category, a 3 to the next, etc.
- correlation will check if there is a correlation between all of the columns. You'll get a result set which is a cross-tabulation showing the correlation between each variable, with a score given of -1 to 1. 1 means a perfect positive correlation (as one goes up, the other goes up in lockstep), and -1 means a perfect negative correlation (as one goes up, the other goes down in lockstep). 0 means there is no correlation. Any value in-between 0 and 1 shows a correlation of varying strength.
Most often you migh pay the most attention to correlation values above 0.1 with an acceptable significance. These are the strongest correlations. Correclation values above 0.05 are still noteworth, and mild correlations of 0.025 may be useful at times.
contribution Gift Pledge Received_amt
contribution 1
Gift 0.267457358 1
Pledge 0.961139925 -0.008940714 1
Received_amt 0.967004052 0.314316732 0.913472687 1
- correlation won't give you a P-Value, so you get that by using (in ANOVA table, Regrssion Line under Significance F. If P-Value is less than 0.05, then the correlation is statistically significant (it's usable). If P-Value is is less than 0.05, then the measurement is significant in more than 95% of the population. If the value is less that 0.01, then the measurement is significant in more that 99% of the population. You can also get the correlation co-efficient from the Regression process if you are comparing two columns - this number will be located in the first line, Regression Statistics > Multiple R.
- a popular way to get a quick score is to do a Correlation Ranking. To do this:
1) run a correlation to find the five to ten variables that have the strongest relationships
2) recode your inverse relationsihps, so 1 becomes a 0 and 0 becomes a 1. If there's an inverse correlation, you don't want those people, so flip the numbers so that you look at the people who don't have the attribute who is invesrse. Example: subscribing to a certain series has a negative correlation with donating, flip the rankings so you focus on people who don't subscribe to that certain series (and due to the negative correlation, are therefore more likely to give)
3) add the numbers to get your scores.
4) you can then run a cross tabulation between the donor variable and the correlation score. As the score increases, the precentage of donors increases by group.
5) you can then focus on people who have high scores but are not currently high level donors
Example: say you have five variables, and you score each 0 or 1. That means that each donor will end up with a combined score of 0 to 5 (5 being the most promising). If you cross tabulate these scores against who is actually a donor, you'll see how the scores actually behave. You can then try to focus on the people who have a high score but who are not giving.
DONOR
0 (not a donor) 1 (is a donor) TOTAL
CORRELATION SCORE 0 Count 389 120 509
% of Cor Score 76.4% 23.6% 100%
1 Count 10679 3673 14442
% of Cor Score 74.6% 25.4% 100%
2 Count 13641 12121 25762
% of Cor Score 52.0% 47.0% 100%
3 Count 3270 4212 7482
% of Cor Score 43.7% 56.3% 100%
4 Count 286 1319 1605
% of Cor Score 17.8% 82.2% 100%
5 Count 4 196 200
% of Cor Score 2.0% 98.0% 100%
-------------------
-- RFM ANALYSIS --
-
-------------------
- example of scoring
**RECENCY, max score = 20
Gave in last year 20
Gave 1-2 years ago 15
Gave 2-3 years ago 10
Gave 3-4 years ago 5
Gave 4-5 years ago 2
Gave 5+ years ago 1
No giving 0
**FREQUENCY, max score = 30
FREQUENCY % SCORE
Gift count / Last gift year - first gift year
100% 10 (to be multiplied by years-giving multiplier below)
90-99% 8
80-89% 6
70-79% 4
60-69% 2
< 60% 1
YEARS-GIVING MULTIPLIER
20+ years giving (Frequency % score) * 3.0 = overall frequency score
15-19 years giving (Frequency % score) * 2.5 = overall frequency score
10-14 years giving (Frequency % score) * 2.0 = overall frequency score
5-9 years giving (Frequency % score) * 1.5 = overall frequency score
< 5 years giving (Frequency % score) * 1.0 = overall frequency score
No giving 0 = overall frequency score
EXAMPLE
Donor has 20 gifts, first gift was 1995, last gift was 2018
Frequency % score = 20 / (2018-1995) = 0.86, or 86%, which is a Frequency % Score of 6
- donor has giving in last year, so that's a Year's-giving multiplier of 3.0.
- therefore FREQUENCY = 6 * 3.0 = 18
**MONETARY VALUE, max score = 50 (set gift amounts that make sense for your organization)
$5,000+ outright gift 25
$2,000-$4,999 outright gift 15
$1,000-$1,999 outright gift 10
$500-$999 outright gift 5
<$500 1
$20,000+ cumulative giving Additional 25
$10,000-$19,999 cumulative Additional 15
$5,000-$9,999 Additional 5
RFM Score, max score = 100
Sum of Recency, Frequency, and Monetary values
-----------------------
-- ATTACHMENT SCORE --
-
-----------------------
Donor: 5 points
Gift count
10 or more gifts: 10 points
5 to 9 gifts: 5 points
Managed: 20 points
Any board or group membership: 5 points per board or group
Total contacts x 3 = points
Years since last gift
Lowest through 2: 10 points
...through 5: 5 points
...through 10: 2 points
-------------------------
-- REGRESSION ANALYSIS --
-------------------------
- dependent variable is apparently the varable you're trying to model or predict. Examples of dependent variables:
Major giving prospect: Major Donors coded 1, all others 0
Planned giving prospect: existing PG donors 1, all other 0
Overall giving likelihood for never donor appeals: all donors 1, all others 0
Affinity to institution: constituents with survey responses indicating affinity 1, all others 0
Predicting donors likely to renew or upgrade gift: donors with consecutive increasing gifts 1, all others 0
-- types of models you can use that depend on the variables in your model.
NOMINAL - also called categorical or logistic, nominal variable values represent categories with no intrinsic ranking; for example, the department of the company at which an employee works. Examples of nominal variables include yes/no variables, channles, volunteers, gender, marital status, region, ZIP code, or religious affiliation.
ORDINAL - the values of ordinal variables represent categories with some intrinsic ranking; for example, levels of service satisfaction, from highly dissatisfied to highly satisfied. Examples of ordinal variables include attitude scores, preference ranking scores, "A", "B", "C" propensity rankings, and gift levels.
SCALE - a variable can be treated as a scale when its values are represented linerly, so that distance comparisons between values are appropriate. Examples of scale variables include total giving dollars, attachment scores, RFM scores, age in years, and capacity in dollars.
When analysts have binary nominal variables (1/0) as their dependent, they might choose to use decision trees or binary logistic regression. If they have many nominal categories in their dependent variable, they might use multinomial logistic regression or discriminant analysis. If analysts are trying to predict a ranking or a gift level, they might use ordinal regression. When they predict giving or large numeric scores, they would use linear regression. If they do not know the groups for a depent and want the software to discover it, they would use cluster analysis.
BINARY LOGISTIC REGRESSION
- tests a sample using a dependent variable and tests against the independent variables. In this case the dependent variable is binary, so it'll be a 1 or a 0. Then it calculates a probablility that the dependent variable is true based on each independent variable, and tests against members of the sample. Each tests succeeds (correctly identifies the dependent variable as either 1 or 0), or fails (incorrectly predicts the variable to be 0 when it is actually 1, or incorrectly predicts the variable is 1 when it is actually 0). You then remove the variables that fail in order to increase the overall accuracy of the model. Once the model is refined to be accurate, it can be applied to the rest of the population and you can make predictions.
To do this, you run the model for the population and convert the predicted probability percentage number into a 0-1000 score by multiplying it by 1000.
You can also create an oridinal score by breaking the groups into quintiles or deciles using the score you've just calculated.
You then cross tabulate your rankings against the dependent variable to see if it actually matches the population. For exampl,e you compare the scores against the donor indicator (1 for donor and 0 for non-donor) and see if there is an upward progression (the higher the score, the higher the percentage of donors in that group).
1) create your sample. For major giving models, try to make the sample half major donors and then half of a random population.
--------------------------
-- REGRESSION IN EXCEL --
-
--------------------------
- go to Data > Regression. It will ask you to fill in the Y Axis and the X Axis. The Y Axis is the dependent variable, and there can be only one column which must be numeric. The X Axis are the independent variables, and you can have more than one column here (multiple regression).
SUMMARY OUTPUT OF SINGLE REGRESSION
Regression Statistics
!!Multiple R!! 0.967004052
R Square 0.935096837
Adjusted R Square 0.935094067
Standard Error 4244.018571
Observations 23435
ANOVA
df SS MS F !!Significance F!!
Regression 1 6.08097E+12 6.08097E+12 337612.5792 0
Residual 23433 4.22068E+11 18011693.63
Total 23434 6.50304E+12
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -4.322165597 27.86638385 -0.15510321 0.876741287 -58.94209556 50.29776437 -58.94209556 50.29776437
Received_amt 1.129077031 0.001943186 581.0443866 0 1.12526826 1.132885801 1.12526826 1.132885801
SUMMARY OUTPUT OF MULTIPLE REGRESSION
SUMMARY OUTPUT
Regression Statistics
Multiple R 1
R Square 1
Adjusted R Square 1
Standard Error 4.66075E-09
Observations 23435
ANOVA
df SS MS F Significance F
Regression 3 6.50304E+12 2.16768E+12 9.97894E+28 0
Residual 23431 5.08981E-13 2.17226E-17
Total 23434 6.50304E+12
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -2.64853E-10 3.08759E-11 -8.577996747 1.02428E-17 -3.25372E-10 -2.04334E-10 -3.25372E-10 -2.04334E-10
Gift 1 1.08572E-14 9.21049E+13 0 1 1 1 1
Pledge 1 7.25712E-15 1.37796E+14 0 1 1 1 1
Received_amt 1.30996E-16 8.60064E-15 0.015231005 0.987848016 -1.67268E-14 1.69888E-14 -1.67268E-14 1.69888E-14
-------------------------------------------
-- EXAMPLE OF MULTIPLE LINEAR REGRESSION --
-------------------------------------------
- Y value is dependent variable
- X value is independent variable, can be more than one column.
- run regression, and look at P-value. Any value greater than 0.15 isn't predictive, and you should exclue. Any value less that 0.15 is predictive and you should keep.
-re-run the regression excluding any variables that have P-value less than 0.15
- predicitive formula is Y = Constant + B1 *(X1) + B2*(X2) + BnXn
EXPLANATION - Constant is found in the Intercept Coefficient of the regression output in Excel. B1 is the value of the independent variable you're trying to predict times that variable's coefficient from the regression output. If you have 3 independent variables that passed the P-Value test (less that 0.15), you'll multiply the target value of variable 1 times the coefficient for that variable, variable 2 times the coefficient for that variable, and variable 3 times the coefficient for that variable, and it will predict the value of the dependent variable.
Y = 35475.3 + (800*5.320968) + (1000*5.417138)