I would like to create an incidence matrix.
I have a file with 3 columns, like:
id x y
A 22 2
B 4 21
C 21 360
D 26 2
E 22 58
F 2 347
And I want a matrix like (without col and row names):
2 4 21 22 26 58 347 360
A 1 0 0 1 0 0 0 0
B 0 1 1 0 0 0 0 0
C 0 0 1 0 0 0 0 1
D 1 0 0 0 1 0 0 0
E 0 0 0 1 0 1 0 0
F 1 0 0 0 0 0 1 0
I have started the code like:
haps = readdlm("File.txt",header=true)
hap1_2 = map(Int64,haps[1][:,2:end])
ID = (haps[1][:,1])
dic1 = Dict()
for (i in 1:21)
dic1[ID[i]] = hap1_2[i,:]
end
X=[zeros(21,22)]; #the original file has 21 rows and 22 columns
X1 = hcat(ID,X)
The problem now is that I don't know how to fill the matrix with 1s in the specific columns as in the example above.
I'm also not sure if I'm on the right way.
Any suggestion that could help me??
Thanks!
NamedArrays
is a neat package which allows naming both rows and columns and seems to fit the bill for this problem. Suppose the data is in data.csv
, here is one method to go about it (install NamedArrays
with Pkg.add("NamedArrays")
):
data,header = readcsv("data.csv",header=true);
# get the column names by looking at unique values in columns
cols = unique(vec([(header[j+1],data[i,j+1]) for i in 1:size(data,1),j=1:2]))
# row names from ID column
rows = data[:,1]
using NamedArrays
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
# now stamp in the 1s in the right places
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],(header[c],data[r,c])] = 1 ; end
Now we have (note I transposed narr
for better printout):
julia> narr'
10x6 NamedArray{Int64,2}:
attr ╲ id │ A B C D E F
──────────┼─────────────────
("x",22) │ 1 0 0 0 1 0
("x",4) │ 0 1 0 0 0 0
("x",21) │ 0 0 1 0 0 0
("x",26) │ 0 0 0 1 0 0
("x",2) │ 0 0 0 0 0 1
("y",2) │ 1 0 0 1 0 0
("y",21) │ 0 1 0 0 0 0
("y",360) │ 0 0 1 0 0 0
("y",58) │ 0 0 0 0 1 0
("y",347) │ 0 0 0 0 0 1
But, if DataFrames
are necessary, similar tricks should apply.
---------- UPDATE ----------
In case the column of a value should be ignored i.e. x=2 and y=2 should both set a 1 on column for value 2, then the code becomes:
using NamedArrays
data,header = readcsv("data.csv",header=true);
rows = data[:,1]
cols = map(string,sort(unique(vec(data[:,2:end]))))
narr = NamedArray(zeros(Int,length(rows),length(cols)),(rows,cols),("id","attr"));
for r=1:size(data,1),c=2:size(data,2) narr[data[r,1],string(data[r,c])] = 1 ; end
giving:
julia> narr
6x8 NamedArray{Int64,2}:
id ╲ attr │ 2 4 21 22 26 58 347 360
──────────┼───────────────────────────────────────
A │ 1 0 0 1 0 0 0 0
B │ 0 1 1 0 0 0 0 0
C │ 0 0 1 0 0 0 0 1
D │ 1 0 0 0 1 0 0 0
E │ 0 0 0 1 0 1 0 0
F │ 1 0 0 0 0 0 1 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With