An introduction to the program is given in Mining Risk Patterns in Medical Data. The program may be applicable to other kinds of data too.
This program has been successfully applied to a number of real-world medical and health data sets: finding adverse drug reactions in large medical linkage data (see the poster A new method of "pharmaco-vigilance" for automatically identifying Adverse Drug Reactions (ADRs) in large populations, the paper Representing association classification rules mined from health data, and the paper Association rule discovery with unbalanced class distributions), finding risk patterns in emergency department data of Base Hospital of TWB, and finding risk patterns in health survey data.
Try it. You will find it is very effective.
This software works on Linux 9.0.
Save the file to an empty directory and then run the following commands:
(Please remove the extra .tar at the end of the downloaded file if there is one.)
tar -xzvf lirule.tar.gz
chmod +x lirule
You will now have three files in the current directory: hypothyroid.data, hypothyroid.names, and lirule. The last one, lirule, is the executable file.
Run the following command:
./lirule -f hypothyroid
You will get a hypothyroid.report file. (You will get three other files too, but ignore them for the time being.)
Open the hypothyroid.report file and you will find a number of risk and preventive patterns. One of them is used below as an example.
Pattern 28: Length = 3   /* Pattern number and length */
OR = 97.7236 (42.2630) RR = 20.3447   /* Odds Ratio (standard error) and Risk Ratio */
T3 = 0.75+   /* Pattern, attribute = value */
TT4 = 24.5-55.5   /* Conjunction of attribute-value pairs */
T4U = 0.935+   /* defines a cohort */
Cohort size = 35, Percentage = 1.11%   /* Population and percentage */
Contingency table
            | hypothyroid | negative |
pattern     |          28 |        7 |
non-pattern |         123 |     3005 |
Meaning: people with the pattern are about 20 times more likely to have hypothyroidism than people without it (RR = 20.3447).
This is an artificial data set and the results are somewhat exaggerated. In real-world data sets, finding patterns with RR > 3 is already an exciting result.
A good explanation of the odds ratio and relative risk is available at http://www.childrens-mercy.org/stats/journal/oddsratio.asp
Given a contingency table:
            | hypothyroid | negative |
pattern     |           a |        c |
non-pattern |           b |        d |
OR = (ad) / (bc)
Stderr = OR * sqrt (1/a + 1/b + 1/c + 1/d)
RR = (a(b+d)) / (b(a+c))
When b or c is 0, the odds ratio cannot be computed (division by zero), and the report shows OR = 10000.0000 instead.
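As a quick check of the formulas above, the following minimal Python sketch recomputes the measures for Pattern 28 from its contingency table. The function name risk_measures and the choice to return None for an undefined standard error or risk ratio are assumptions made for this illustration; they are not taken from lirule itself.

```python
import math


def risk_measures(a, b, c, d):
    """Return (OR, Stderr, RR) for a 2x2 contingency table.

    a: pattern,     focus class      c: pattern,     other class
    b: non-pattern, focus class      d: non-pattern, other class
    """
    if b == 0 or c == 0:
        # The report prints 10000.0000 in this case (see the text above).
        odds_ratio = 10000.0
    else:
        odds_ratio = (a * d) / (b * c)

    # Stderr = OR * sqrt(1/a + 1/b + 1/c + 1/d); undefined if any cell is 0.
    # Returning None here is an assumption of this sketch.
    if 0 in (a, b, c, d):
        std_err = None
    else:
        std_err = odds_ratio * math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    # RR = a(b+d) / (b(a+c)); None when the denominator is 0 (assumption).
    if b == 0 or (a + c) == 0:
        risk_ratio = None
    else:
        risk_ratio = (a * (b + d)) / (b * (a + c))

    return odds_ratio, std_err, risk_ratio


# Contingency table of Pattern 28 in hypothyroid.report
odds_ratio, std_err, risk_ratio = risk_measures(a=28, b=123, c=7, d=3005)
print(f"OR = {odds_ratio:.4f} ({std_err:.4f}) RR = {risk_ratio:.4f}")
```

For a = 28, b = 123, c = 7, d = 3005 the printed values agree with the OR, standard error, and RR shown for Pattern 28.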
You need to prepare your data in C4.5 data format, in the same way as hypothyroid.names and hypothyroid.data.
The program works only on two-class problems. Put the focus class (hypothyroid in the example) first in the .names file.
If you edit your names and data files with a Windows editor, use the command dos2unix filename to convert the files to UNIX format.
This program does not work on continuous attributes.
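Continuous attributes therefore have to be turned into discrete values before the data is passed to the program. The sketch below shows one possible way to do this in Python, producing interval labels that look like the ones in the sample report (for example 24.5-55.5 and 0.75+). The function names, the "<" label for the lowest interval, the inclusive lower bounds, and the cut points themselves are assumptions made for illustration; how the hypothyroid data was actually discretized is not stated here.

```python
import bisect


def interval_label(value, cut_points):
    """Map a number to the label of the interval it falls in.

    cut_points must be sorted in ascending order. A value equal to a
    cut point is placed in the interval that starts at that cut point
    (an assumption of this sketch).
    """
    i = bisect.bisect_right(cut_points, value)
    if i == 0:
        return f"<{cut_points[0]}"      # hypothetical label format
    if i == len(cut_points):
        return f"{cut_points[-1]}+"     # open-ended top interval
    return f"{cut_points[i - 1]}-{cut_points[i]}"


def discretize_row(row, cuts_by_column):
    """Replace numeric fields of one data row by interval labels.

    row: list of strings, class label last (C4.5 .data layout).
    cuts_by_column: {column index: sorted list of cut points}.
    Unknown values ("?") and columns without cut points are kept as-is.
    """
    out = []
    for i, field in enumerate(row):
        if i in cuts_by_column and field != "?":
            out.append(interval_label(float(field), cuts_by_column[i]))
        else:
            out.append(field)
    return out


# Illustrative cut points only: column 0 at 0.75, column 1 at 24.5 and 55.5
cuts = {0: [0.75], 1: [24.5, 55.5]}
print(",".join(discretize_row(["0.9", "30", "hypothyroid"], cuts)))
```

Each discrete label produced this way must also be listed as an allowed value of its attribute in the .names file.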
To change the default settings, run the program with options, for example:
./lirule -f hypothyroid -l 6 -s 0.1
The option -l sets the maximum length of patterns (default 4), and -s sets the minimum support in a class (default 0.05).
You may download a program coded by Khim Hong How to generate an HTML report.
perl rpt2html.pl hypothyroid
You will find two new HTML files: one for risk patterns and the other for preventive patterns.