
Comparing structured data in Java

I've successfully implemented a Java program that uses two common data structures, a Tree and a Stack, along with an interface that lets a user enter a tree node ID and get information about that node in relation to its parent. The latest version of the source for this program is on my GitHub.

Background

I wrote this ad hoc program to study the evolution of gene flow across hundreds of organisms. It compares data in a file that consists of FeatureIDs, which are Strings (listed further down in the first column as "ATM-0000011", "ATM-0000012", and so on), and the scores associated with each feature's presence or absence at a particular node in the tree, which are doubles.

Here is what the data file looks like:

"FeatureID","112","115","120","119","124",...//this line has all tree node IDs
"ATM-0000011",2.213e-03,1.249e-03,7.8e-04,9.32e-04,1.472e-03,... //scores on these lines
"ATM-0000012",2.213e-03,1.249e-03,7.8e-04,9.32e-04,1.472e-03,...//correspond to node ID
"ATM-0000013",0.94,1.249e-03,7.8e-04,9.32e-04,1.472e-03,...//order in the first line
... //~30000 lines later
"ATM-0036186",0.94,0.96,0.97,0.95,0.95,...
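
As a minimal sketch of how one data line in this layout could be split into its FeatureID and its scores, assuming plain comma separation with quotes only around the FeatureID: the names `FeatureRowSketch`, `FeatureRow`, and `parseRow` are hypothetical and are not taken from tlacMain.java.

```java
class FeatureRowSketch {
    // One parsed data line: the FeatureID plus the scores in header (node ID) order.
    static final class FeatureRow {
        final String featureId;
        final double[] scores;

        FeatureRow(String featureId, double[] scores) {
            this.featureId = featureId;
            this.scores = scores;
        }
    }

    // Assumption: cells are separated by plain commas and only the FeatureID is quoted.
    static FeatureRow parseRow(String line) {
        String[] cells = line.split(",");
        String featureId = cells[0].replace("\"", "").trim();
        double[] scores = new double[cells.length - 1];
        for (int i = 1; i < cells.length; i++) {
            // Double.parseDouble accepts scientific notation such as 1.249e-03.
            scores[i - 1] = Double.parseDouble(cells[i].trim());
        }
        return new FeatureRow(featureId, scores);
    }

    public static void main(String[] args) {
        FeatureRow row = parseRow("\"ATM-0000013\",0.94,1.249e-03,7.8e-04");
        System.out.println(row.featureId + " has " + row.scores.length + " scores");
    }
}
```

Keeping the FeatureID next to its scores like this is one way to avoid losing the Strings when the numeric data is extracted.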

The Problem

Previously, it was enough to build a 2D array of doubles from the data file (the array excluded the first line of the file and the FeatureIDs, because they're Strings) and then use that array to build stacks of doubles. The stacks were built for parent and child nodes as determined by user input and the Tree.

The data in the parent and child stacks would then be popped at the same time (ensuring that the same FeatureIDs were being compared without actually having to include them in the data structure), and each pair of values was checked against a defined condition (i.e. both values >= 0.75). Iff the condition held, a counter was incremented. Once the comparisons were finished (the stacks were empty), the program returned the count(s).
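
The counting scheme described above can be sketched roughly as follows. This is only an illustration: it assumes `java.util.ArrayDeque` as the stack and the example condition of both values >= 0.75, and the names `StackCountSketch`, `countMatches`, and `stackOf` are hypothetical rather than taken from the actual program.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackCountSketch {
    // Example threshold from the text; the real condition may differ.
    static final double THRESHOLD = 0.75;

    // Pops both stacks in lockstep and counts the pairs where both scores pass.
    static int countMatches(Deque<Double> parent, Deque<Double> child) {
        int count = 0;
        while (!parent.isEmpty() && !child.isEmpty()) {
            double parentScore = parent.pop();
            double childScore = child.pop();
            if (parentScore >= THRESHOLD && childScore >= THRESHOLD) {
                count++;
            }
        }
        return count;
    }

    // Helper for the demo: pushes the scores in the order given.
    static Deque<Double> stackOf(double... scores) {
        Deque<Double> stack = new ArrayDeque<>();
        for (double score : scores) {
            stack.push(score);
        }
        return stack;
    }

    public static void main(String[] args) {
        Deque<Double> parent = stackOf(0.94, 0.002, 0.80);
        Deque<Double> child = stackOf(0.96, 0.90, 0.10);
        System.out.println(countMatches(parent, child)); // prints 1
    }
}
```

Because both stacks are filled in the same row order, each pop pairs the scores of one FeatureID even though the ID itself is never stored.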

What I want to do now, instead of just counting, is build a list (or lists) of the FeatureIDs that met the comparison criteria. So instead of returning a counter that says 4100 FeatureIDs met the criteria between node A and node B, I want a list of all 4100 FeatureID Strings that met the criteria when node A and node B were compared. I'm going to save that list to a file later, but that's not a concern here. This means that I'll probably have to abandon the 2D double array and double stack scheme, which had previously worked so well.

The Question

Given this problem, is there a clever fix where I could change the input data file, or something in my code (tlacMain.java), without adding much more data to the process? I just need ideas.

asked Nov 01 '22 06:11 by Alexander Sobin


1 Answer

I'm not quite sure if I understand your question correctly, but instead of incrementing a counter you could just add the currently compared FeatureID to an ArrayList and later write that to a file.

If you need a List for every comparison you could have something like HashMap<Comparison, ArrayList<String>>.
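
A rough sketch of the one-list-per-comparison idea, assuming a plain String label such as "112->115" stands in for the `Comparison` key (that type is not defined in the question). The names `ComparisonMapSketch` and `record` are hypothetical, and the node pairs in the labels are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ComparisonMapSketch {
    // Adds a matching FeatureID to the list for the given comparison,
    // creating that list the first time the comparison is seen.
    static void record(Map<String, List<String>> results, String comparison, String featureId) {
        results.computeIfAbsent(comparison, key -> new ArrayList<>()).add(featureId);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the comparisons in insertion order for reporting.
        Map<String, List<String>> results = new LinkedHashMap<>();
        record(results, "112->115", "ATM-0000013");
        record(results, "112->115", "ATM-0036186");
        record(results, "115->120", "ATM-0036186");
        results.forEach((comparison, ids) -> System.out.println(comparison + ": " + ids));
    }
}
```

Each list can later be written to its own file, one FeatureID per line.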

edit: I read your comment and tried to come up with a solution without changing too much:

        String[] firstLine = sc.nextLine().split(regex);
        //line is the line of input being read in thru the inputFile
        int line = 0;
        //array of doubles will hold the data to be put in the stacks
        double [][] theData = new double [28420][firstLine.length];
        while(sc.hasNext())
        {
            String lineIn = sc.nextLine();
            String[] lineInAsString = lineIn.split(regex);
            for(int i = 1; i < lineInAsString.length; i++)
            {
                theData[line][i] = Double.parseDouble(lineInAsString[i]);
            }
            line++;
        }

        sc.close();

        return theData;

In this part of your getFile() function, you read the CSV into a matrix of doubles. For each column i in the matrix we also need the corresponding featureID. To return both the matrix of doubles and a list of featureIDs, you need a container class.

class DataContainer {
    public double[][] matrix;
    public int[] featureIds;

    public DataContainer(double[][] matrix, int[] featureIds) {
        this.matrix = matrix;
        this.featureIds = featureIds;
    }
}

Now we can change the code above to return both.

    String[] firstLine = sc.nextLine().split(regex);
    // array of ids, one per cell of the header line
    int[] featureIds = new int[firstLine.length];

    // index 0 is the "FeatureID" header cell, so start at 1
    for(int i = 1; i < firstLine.length; i++)
    {
        featureIds[i] = Integer.parseInt(firstLine[i]);
    }

    // ... same stuff as before (fills theData)

    return new DataContainer(theData, featureIds);

In your main function you can now extract both structures. So instead of

double newMatrix[][] = getFile(args);

you can write

DataContainer data = getFile(args);
double[][] newMatrix = data.matrix;
int[] featureIds = data.featureIds;

You can now use the featureIds array to match it up with your matrix columns in your calculations. Instead of incrementing an int inside addedInternal, you can create an ArrayList<Integer> and add(id) for every match. Then return the ArrayList, so you can use it for reporting outside of that function.

ArrayList<Integer> addedFeatureIds = addedInternal(parentStackOne, childStackOne, featureIdStack);
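
A hedged sketch of what such a function could look like when a third stack of IDs is popped in lockstep with the two score stacks. Note the assumptions: the FeatureIDs in the question's sample file are Strings like "ATM-0000011" in the first column of each row, so this version uses `Deque<String>` and returns `List<String>` instead of `ArrayList<Integer>`; the 0.75 condition is the example from the question; and the body of `addedInternal` is hypothetical, not the code in tlacMain.java.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class AddedInternalSketch {
    // Pops scores and IDs together, so each ID stays paired with its two scores,
    // and collects the IDs whose parent and child scores both pass the threshold.
    static List<String> addedInternal(Deque<Double> parent, Deque<Double> child,
                                      Deque<String> featureIds) {
        List<String> matched = new ArrayList<>();
        while (!parent.isEmpty() && !child.isEmpty() && !featureIds.isEmpty()) {
            double parentScore = parent.pop();
            double childScore = child.pop();
            String id = featureIds.pop();
            if (parentScore >= 0.75 && childScore >= 0.75) {
                matched.add(id);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        Deque<Double> parent = new ArrayDeque<>();
        Deque<Double> child = new ArrayDeque<>();
        Deque<String> ids = new ArrayDeque<>();
        // Push the three stacks in the same row order.
        parent.push(0.002); child.push(0.001); ids.push("ATM-0000011");
        parent.push(0.94);  child.push(0.96);  ids.push("ATM-0036186");
        System.out.println(addedInternal(parent, child, ids)); // prints [ATM-0036186]
    }
}
```

The returned list has the same size as the old counter value, so the count is still available via `size()`.
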
answered Nov 13 '22 07:11 by felixbr