Mining Risk and Preventive Patterns

Introduction

An introduction to the program is given in Mining Risk Patterns in Medical Data. You may find that it can be applied to other data too.

This program has been successfully applied to a number of real-world medical and health data sets: finding adverse drug reactions in large medical linkage data (see the poster A new method of “pharmaco-vigilance” for automatically identifying Adverse Drug Reactions (ADRs) in large populations, the paper Representing association classification rules mined from health data, and the paper Association rule discovery with unbalanced class distributions), finding risk patterns in emergency department data of the Base Hospital of TWB, and finding risk patterns in health survey data.

Try it. You will find it is very effective.

This software works on Linux 9.0.

Get program

Download here

Save the file to an empty directory and then run the following commands:

(Please remove the extra .tar at the end of the downloaded file if there is one.)

tar -xzvf lirule.tar.gz

chmod +x lirule

You will now have three files in the current directory: hypothyroid.data, hypothyroid.names, and lirule. lirule is the executable file.

Run program

Run the following command:

./lirule -f hypothyroid

You will get a hypothyroid.report file. (Actually, you will get three other files too, but ignore them for the time being.)

Interpret results

Open the hypothyroid.report file and you will find a number of risk and preventive patterns. I use one as an example to explain them.

Pattern 28:    Length = 3                                          /* Pattern number and length */
                    OR = 97.7236 (42.2630)     RR = 20.3447        /* Odds Ratio (standard error) and Risk Ratio */

                    T3 = 0.75+                                     /* Pattern: attribute = value */
                    TT4 = 24.5-55.5                                /* Conjunction of attribute-value pairs */
                    T4U = 0.935+                                   /* defines a cohort */

                    Cohort size = 35, Percentage = 1.11%           /* Cohort population and its percentage of all records */
                    Contingency table
                    Contingency table

	hypothyroid	negative
pattern	28	7
non-pattern	123	3005

Meaning: people with the pattern are about 20 times as likely to be in the hypothyroid class as people without it (RR = 20.3447).

This is an artificial data set and the results are a bit exaggerated. You will be very excited if you find patterns with RR > 3 in real-world data sets.

Find definitions

A good explanation of odds ratio and relative risk (risk ratio) is at http://www.childrens-mercy.org/stats/journal/oddsratio.asp

Given a contingency table:

	hypothyroid	negative
pattern	a	c
non-pattern	b	d

OR = (ad) / (bc)

Stderr = OR * sqrt (1/a + 1/b + 1/c + 1/d)

RR = (a(b+d)) / (b(a+c))

When b = 0 or c = 0, the OR formula divides by zero, so the report shows OR = 10000.0000 instead.
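As a check, the formulas above can be applied to the Pattern 28 contingency table (a = 28, c = 7, b = 123, d = 3005). The following is only a minimal sketch of the definitions, not the code used inside lirule; the function name risk_measures is my own, and it makes no attempt to handle zero cells.

```python
import math


def risk_measures(a, b, c, d):
    """Return (OR, Stderr, RR) for a 2x2 contingency table.

    a: pattern and focused class        c: pattern and other class
    b: non-pattern and focused class    d: non-pattern and other class
    Zero cells are not handled here (a ZeroDivisionError is raised).
    """
    odds_ratio = (a * d) / (b * c)
    stderr = odds_ratio * math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    risk_ratio = (a * (b + d)) / (b * (a + c))
    return odds_ratio, stderr, risk_ratio


# Counts taken from the Pattern 28 contingency table in the report.
or_, se, rr = risk_measures(a=28, b=123, c=7, d=3005)
print(f"OR = {or_:.4f} ({se:.4f})     RR = {rr:.4f}")
```

The printed values should agree, up to rounding, with the OR, the bracketed value, and the RR shown for Pattern 28 in the report.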

Work on your data

You need to prepare your data in C4.5 data format, in the same way as hypothyroid.names and hypothyroid.data.

The program only works on two-class problems. Put the focused class (hypothyroid in the example) first in the .names file.

If you edit your names and data files with a Windows editor, you need to use the command dos2unix filenames to convert the files to UNIX format.
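To give a rough idea of the layout, here is a small sketch that writes a toy pair of files in the usual C4.5 style: the .names file lists the class values on the first line and then one "attribute: values." line per attribute, and the .data file has one comma-separated record per line with the class last. The stem toy, the attribute names, and the records are all made up for illustration, and I have not checked every detail of what lirule accepts, so compare the result with hypothyroid.names and hypothyroid.data before relying on it.

```python
def names_text(classes, attributes):
    """Build the contents of a C4.5-style .names file.

    classes: class values, focused class first.
    attributes: dict mapping attribute name to its list of discrete values.
    """
    lines = [", ".join(classes) + ".", ""]
    for name, values in attributes.items():
        lines.append(f"{name}: {', '.join(values)}.")
    return "\n".join(lines) + "\n"


def data_text(rows):
    """Build the contents of a C4.5-style .data file (class value last)."""
    return "\n".join(",".join(row) for row in rows) + "\n"


# Hypothetical two-class problem; "sick" is the focused class, so it goes first.
classes = ["sick", "healthy"]
attributes = {"age_group": ["young", "old"], "smoker": ["yes", "no"]}
rows = [["old", "yes", "sick"], ["young", "no", "healthy"]]

if __name__ == "__main__":
    # newline="\n" forces UNIX line endings even on Windows.
    with open("toy.names", "w", newline="\n") as f:
        f.write(names_text(classes, attributes))
    with open("toy.data", "w", newline="\n") as f:
        f.write(data_text(rows))
```

If the files are accepted, the program would then be run as ./lirule -f toy.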

This program does not work on continuous attributes.
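The attribute values in the example report (such as TT4 = 24.5-55.5 and T3 = 0.75+) are interval labels, so one way to handle a numeric attribute is to turn it into such labels before writing the .data file. The sketch below shows the idea under several assumptions of mine: the cut points are hypothetical, a value equal to a cut point is put in the upper interval, and the label for the lowest interval ("24.5-") is a guess at a style, not something taken from the program.

```python
from bisect import bisect_right


def interval_label(value, cuts):
    """Map a numeric value to the label of the interval it falls in.

    cuts must be sorted in ascending order. A value equal to a cut point
    is assigned to the upper interval (an assumption of this sketch).
    """
    i = bisect_right(cuts, value)
    if i == 0:
        return f"{cuts[0]}-"            # below the first cut (label style assumed)
    if i == len(cuts):
        return f"{cuts[-1]}+"           # at or above the last cut
    return f"{cuts[i - 1]}-{cuts[i]}"   # between two neighbouring cuts


tt4_cuts = [24.5, 55.5]  # hypothetical cut points for illustration
for tt4 in (10.0, 40.0, 60.0):
    print(tt4, "->", interval_label(tt4, tt4_cuts))
```

Every label produced this way also has to be listed as a value of the attribute in the .names file.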

Change parameters

Run

./lirule -f hypothyroid -l 6 -s 0.1

-l sets the maximum length of patterns (default 4), and -s sets the minimum support in a class (default 0.05).
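To get a feel for the -s threshold, the sketch below computes the support of Pattern 28 within the hypothyroid class. It assumes that "support in a class" means the fraction of records of the focused class that match the pattern, that is a / (a + b) in the contingency table; this is my reading of the option, not something confirmed by the program, and the function name class_support is made up.

```python
def class_support(a, b):
    """Fraction of focused-class records that match the pattern.

    a: focused-class records with the pattern
    b: focused-class records without the pattern
    """
    return a / (a + b)


# Counts from the hypothyroid column of the Pattern 28 contingency table.
supp = class_support(a=28, b=123)
for threshold in (0.05, 0.1):
    print(f"support {supp:.4f} >= {threshold}: {supp >= threshold}")
```

Under this reading, Pattern 28 passes both the default threshold and the stricter -s 0.1 used above.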

Reports in HTML

You may download a program coded by Khim Hong How to generate an HTML report.

perl rpt2html.pl hypothyroid

You will find two new HTML files: one for risk patterns and the other for preventive patterns.
