I have a bit trouble with some data stored in a text file on hand for regression analysis using Python.
The data are stored in the format that look like this:
2104,3,399900 1600,3,329900 2400,3,369000 ....
I need to do some analysis like finding mean by this: (2104+1600+...)/number of data
I think the appropriate steps is to store the data into array. But I have no idea how to store it. I think of two ways to do so. The first one is to set 3 array that stores like
a=[2104 1600 2400 ...] b=[3 3 3 ...] c=[399900 329900 36000 ...]
The second way is to store in
a=[2104 3 399900], b=[1600 3 329900] and so on.
Which one is better?
Also, how to write code that allows the data can be stored into array? I think of like this:
with open("file.txt", "r") as ins:
array = []
elt.strip(',."\'?!*:') for line in ins:
array.append(line)
Is that correct?
Use the fs. readFileSync() method to read a text file into an array in JavaScript, e.g. const contents = readFileSync(filename, 'utf-8'). split('\n') . The method will return the contents of the file, which we can split on each newline character to get an array of strings.
You can read a text file using the open() and readlines() methods. To read a text file into a list, use the split() method. This method splits strings into a list at a certain character. In the example above, we split a string into a list based on the position of a comma and a space (“, ”).
To convert String to array in Python, use String. split() method. The String . split() method splits the String from the delimiter and returns the splitter elements as individual list items.
You could use :
with open('data.txt') as data:
substrings = data.read().split()
values = [map(int, substring.split(',')) for substring in substrings]
average = sum([a for a, b, c in values]) / float(len(values))
print average
With this data.txt
, :
2104,3,399900 1600,3,329900 2400,3,369000
2105,3,399900 1601,3,329900 2401,3,369000
It outputs :
2035.16666667
Using pandas and numpy you can get the data into an array as follows:
In [37]: data = "2104,3,399900 1600,3,329900 2400,3,369000"
In [38]: d = pd.read_csv(StringIO.StringIO(data), sep=',| ', header=None, index_col=None, engine="python")
In [39]: d.values.reshape(3, d.shape[1]/3)
Out[39]:
array([[ 2104, 3, 399900],
[ 1600, 3, 329900],
[ 2400, 3, 369000]])
Instead of having multiple arrays a
, b
, c
... you could store your data as an array of arrays (a 2 dimensional array). For example:
[[2104,3,399900],
[1600,3,329900],
[2400,3,369000]...]
This way you don't have to deal with dynamically naming your arrays. How you store your data, i.e. 3 * array of length n or n * array of length 3 is up to you. I would prefer the second way. To read the data into your array you should then use the split()
function, which will split your input into an array. So in your case:
with open("file.txt", "r") as ins:
tmp = ins.read().split(" ")
array = [i.split(",") for i in tmp]
>>> array
[['2104', '3', '399900'], ['1600', '3', '329900'], ['2400', '3', '369000']]
Edit: To find the mean e.g. for the first element in each list you could do the following:
arraymean = sum([int(i[0]) for i in array]) / len(array)
Where the 0
in i[0]
specifies the first element in each list. Note that this code uses list comprehension, which you can learn more about in this post if you want to.
Also this code stores the values in the array as strings, hence the cast to int in the part to get the mean. If you want to store the data as int
directly just edit the part in the file reading section:
array = [[int(j) for j in i.split(",")] for i in tmp]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With