REGRESSION ANALYSIS BASICS
Multiple regression relates the dependent variable Y to a number of independent variables, for example Y = A1 * X1 + A2 * X2 + ... + B.
Non-linear or polynomial regression provides relationships that involve powers, roots, or other non-linear functions, such as logarithms or exponentials.
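As an illustration of one non-linear case, an exponential relationship can be fitted with ordinary linear least squares after taking logarithms. The sketch below assumes the model Y = a * exp(b * X) and that all Y values are positive. The function name fit_exponential is hypothetical and is not taken from any package mentioned here.

```python
import math

def fit_exponential(xs, ys):
    """Fit Y = a * exp(b * X) by linear least squares on (X, ln Y).

    Assumes every Y is positive so that ln Y exists.
    """
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log_y = sum(log_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys))
    b = sxy / sxx                          # slope of the straight line in (X, ln Y)
    a = math.exp(mean_log_y - b * mean_x)  # intercept converted back from ln a
    return a, b
```

Note that this approach minimizes the errors in ln Y rather than in Y itself, so the result can differ slightly from a true non-linear least-squares fit.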
Excel and Lotus 1-2-3 offer some simple linear and non-linear regression models, but more sophisticated software is required for multiple regression. A good freeware package is Statcato (www.statcato.org). It is a Java-based program: download the Stats / Regression Package, unzip the files to a folder, and click "Statcato.jar".
The graph at right (courtesy Dick Woodhouse) shows four different lines.
The "Y-on-X" line is the one that will result from use of spreadsheet software. Y is the dependent axis (predicted variable) and X is the independent axis (the variable doing the predicting). The line minimizes the errors in the vertical direction (Y axis) using a least-squares solution.
The "X-on-Y" line reverses the roles of the two axes, minimizing the error in the horizontal direction (as the graph is drawn here).
The RMA line, the reduced major axis, assumes that neither axis depends on the other and is very nearly halfway between the first two lines. It minimizes the error at right angles to the line. The ER, or error ratio line, minimizes the error in both the X and Y directions. There is not usually much difference between the RMA and ER lines. All four lines intersect at the centroid of the data.
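The first three lines can be sketched in a few lines of code. This is a minimal illustration that assumes the common textbook definitions: the Y-on-X slope is Sxy / Sxx, the X-on-Y slope (expressed as Y against X) is Syy / Sxy, and the RMA slope is the square root of Syy / Sxx with the sign of Sxy. The function name fit_lines is hypothetical. The ER line is left out because it needs an assumed ratio of the X and Y errors.

```python
import math

def fit_lines(xs, ys):
    """Return {name: (slope, intercept)} for the Y-on-X, X-on-Y, and RMA lines.

    Every line is written as Y = slope * X + intercept.
    Requires some scatter in X and a non-zero covariance (sxy != 0).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slopes = {
        "y_on_x": sxy / sxx,  # minimizes vertical (Y) errors
        "x_on_y": syy / sxy,  # minimizes horizontal (X) errors
        # Geometric mean of the two slopes above, signed like the covariance.
        "rma": math.copysign(math.sqrt(syy / sxx), sxy),
    }
    # Each intercept is chosen so the line passes through the centroid.
    return {name: (m, mean_y - m * mean_x) for name, m in slopes.items()}
```

For positively correlated data the RMA slope always falls between the Y-on-X and X-on-Y slopes, and all three lines pass through the point (average X, average Y).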
The Reduced Major Axis regression line is the regression line that usually represents the most useful relationship between the X and Y axes. It assumes that both axes are equally error prone. An approximation to this line is halfway between the two independent regression lines.
Solve equation 6 for Y:
slope and intercept of equations 5 and 7:
The coefficient of determination is a measure of "best fit" and can be calculated as the data are entered and processed (as in a hand calculator). Other measures of fit require two passes through the data: the first to find the average X and average Y values, and the second to find the differences between each individual X and the average X, and between each individual Y and the average Y.
An alternate form of the above equation is:
Both equations give the same answer.
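The difference between a running calculation and a two-pass calculation can be sketched as follows. This assumes the usual definition of the coefficient of determination for a straight-line fit, R squared = Sxy^2 / (Sxx * Syy). Whether these two forms match the equations referred to above is an assumption, and both function names are hypothetical.

```python
def r_squared_running(points):
    """Single pass: only running sums are kept, as on a hand calculator."""
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in points:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    numerator = (n * sum_xy - sum_x * sum_y) ** 2
    denominator = (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    return numerator / denominator

def r_squared_two_pass(points):
    """Two passes: averages first, then deviations from the averages."""
    pts = list(points)
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy ** 2 / (sxx * syy)
```

The two functions agree to within rounding error. The running form can lose precision when the values are large compared with their spread, which is one reason to use the two-pass form when all the data are available.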
These data are used in the following statistical measures.
The b's are termed the "regression coefficients". Instead of fitting a line to data, we are now fitting a plane (for 2 independent variables) or a space (for 3 independent variables).
The estimation can still be done according to the principles of linear least squares. The algebraic formulae for the solution (i.e. finding all the b's) are UGLY. However, the matrix solution is elegant:
The matrix model is:
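A minimal sketch of a matrix solution is shown below. It assumes the standard least-squares "normal equations" form, (X'X) b = X'Y, where X is the data matrix with a leading column of ones for the intercept b0. The function names are hypothetical, and the linear system is solved by plain Gaussian elimination so that no extra software is needed.

```python
def solve_linear_system(a, rhs):
    """Solve a * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: independent variables may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def multiple_regression(rows, ys):
    """Return [b0, b1, ..., bk] for Y = b0 + b1*X1 + ... + bk*Xk."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

Each entry of rows holds the independent variables for one data point, for example (X1, X2), and ys holds the matching Y values. Forming X'X directly is simple but can be numerically poor when the independent variables are nearly collinear. Statistical packages generally use more robust decompositions.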
Statistical analysis of data, such as regression analysis or frequency distributions, can be described both graphically and mathematically. The math for very basic statistical analysis of petrophysical data is covered here.
The majority of crossplots are X - Y coordinate graphs, often called scatter plots. They are useful for showing the relationship between two measurements, for example, resistivity versus gamma ray readings. By making the symbol that is plotted vary in colour with a third parameter, for example the PE curve, we have a 3-D crossplot. In this case it shows the variation of lithology with changes in resistivity and gamma ray value.
Although not widely used, the shape of the characters used to plot each data point can be varied to represent a fourth variable, for example the frequency of occurrence of data at this location on the plot. These are 4-D plots, invented by the author in 1976.
Groupings of data may represent important petrophysical parameters, such as shale properties, water or hydrocarbon zone location, or mineralogy. The use of a particular crossplot is dictated by common sense rules. Some crossplots, especially those related to mineralogy, benefit from a background template showing the location of the pure mineral values observed in the laboratory.
Histograms of the distribution of log data are used for choosing petrophysical properties, as in the GR example at left. They are also used to help in normalizing log data between wells by suggesting the linear shift needed to match the distribution from a model or key well.
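A histogram and a simple normalizing shift can be sketched as below. The choice of the median as the point to match between wells is an assumption made for illustration only. In practice the shift is often picked by eye from the histograms. Both function names are hypothetical.

```python
import statistics

def histogram(values, bin_width, start=0.0):
    """Count values per bin; bin i covers [start + i*w, start + (i+1)*w)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        counts[index] = counts.get(index, 0) + 1
    return counts

def median_shift(well_values, key_well_values):
    """Constant to add to a well's log so its median matches the key well's."""
    return statistics.median(key_well_values) - statistics.median(well_values)
```

Adding the returned shift to every reading in the well moves its histogram sideways without changing its shape. A full normalization may also need a scale factor, which this sketch does not attempt.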
Regression analysis of log data, or core versus log data, is very commonly used to find relationships that predict or calibrate petrophysical results, as at the right. The equation of the best fit line can be used in user-defined equation sets in most computer or spreadsheet software.
The other common crossplots with core data are regressions of core porosity against sonic, density, neutron, or answer porosity, used to establish calibration