I have the following text:
Cluster 7: {4, 15, 21, 28, 33, 35, 43, 47, 53, 57, 59, 66,
69, 70, 74, 86, 87, 88, 90, 114, 136, 148, 201,
202, 212, 220, 227, 250, 252, 253, 259, 262, 267,
270, 282, 296, 318, 319, 323, 326, 341}
Cluster 8: {9, 10, 11, 20, 39, 55, 79, 101, 108, 143, 149,
221, 279, 284, 285, 286, 287, 327, 333, 334, 335,
336}
Cluster 9: {3, 64, 83, 93, 150, 153, 264, 269, 320, 321, 322}
Cluster 10: {94, 123, 147}
I want to extract, cluster by cluster, the numbers in each set.
I have tried using regex without much luck.
I have tried:
regex="(Cluster \d+): \{((\d+)[,\}][\n ]+)+|(?:(\d+),[\n ])"
But the groups don't match what I expect.
I would like an output as:
["Cluster 7", '4', '15', '21', '28', '33', '35', '43', '47', '53', '57', '59', '66', '69', '70', '74', '86', '87', '88', '90', '114', '136', '148', '201', '202', '212', '220', '227', '250', '252', '253', '259', '262', '267', '270', '282', '296', '318', '319', '323', '326', '341', "Cluster 8", '9', '10', '11', '20', '39', '55', '79', '101', '108', '143', '149', '221', '279', '284', '285', '286', '287', '327', '333', '334', '335', '336', "Cluster 9", '3', '64', '83', '93', '150', '153', '264', '269', '320', '321', '322', "Cluster 10", "94", "123", "147"]
Or maybe this is not the best approach to do this.
Thanks
I would not use regex for this. Your text is valid YAML
and can be loaded directly with an order-preserving YAML loader such as oyaml.
import oyaml as yaml # pip install oyaml
data = yaml.safe_load(text)  # plain yaml.load() requires an explicit Loader on PyYAML 6+
To unpack that dict to the desired "flat" structure, it's just a list comprehension:
[x for (k, v) in data.items() for x in (k, *v)]
Note: I'm the author of oyaml.
You can create a more generic regex:
import re
s = '\nCluster 7: {4, 15, 21, 28, 33, 35, 43, 47, 53, 57, 59, 66,\n 69, 70, 74, 86, 87, 88, 90, 114, 136, 148, 201,\n 202, 212, 220, 227, 250, 252, 253, 259, 262, 267,\n 270, 282, 296, 318, 319, 323, 326, 341}\nCluster 8: {9, 10, 11, 20, 39, 55, 79, 101, 108, 143, 149,\n 221, 279, 284, 285, 286, 287, 327, 333, 334, 335,\n 336}\nCluster 9: {3, 64, 83, 93, 150, 153, 264, 269, 320, 321, 322}\nCluster 10: {94, 123, 147}\n'
data = re.findall(r'Cluster \d+|\d+', s)  # raw string so \d is not treated as a string escape
Output:
['Cluster 7', '4', '15', '21', '28', '33', '35', '43', '47', '53', '57', '59', '66', '69', '70', '74', '86', '87', '88', '90', '114', '136', '148', '201', '202', '212', '220', '227', '250', '252', '253', '259', '262', '267', '270', '282', '296', '318', '319', '323', '326', '341', 'Cluster 8', '9', '10', '11', '20', '39', '55', '79', '101', '108', '143', '149', '221', '279', '284', '285', '286', '287', '327', '333', '334', '335', '336', 'Cluster 9', '3', '64', '83', '93', '150', '153', '264', '269', '320', '321', '322', 'Cluster 10', '94', '123', '147']
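If you also want the numbers grouped per cluster, here is a minimal sketch that builds on the same idea. It assumes every set is closed by a "}" and contains only digits, commas and whitespace. The names parse_clusters and flatten are made up for illustration, and the sample text is a shortened version of the input from the question. It also sidesteps the problem in the original pattern: in Python's re module a repeated capture group only keeps its last match, so the numbers are pulled out with a second findall instead.

```python
import re

# Shortened sample of the text from the question (one set wraps onto a second line).
text = """Cluster 7: {4, 15, 21,
 28, 33}
Cluster 8: {9, 10, 11}
Cluster 10: {94, 123, 147}
"""


def parse_clusters(text):
    """Map each cluster label to the list of integers inside its braces."""
    clusters = {}
    # [^}]* also matches newlines, so a set wrapped over several lines is captured whole.
    for match in re.finditer(r"(Cluster \d+): \{([^}]*)\}", text):
        label, body = match.groups()
        clusters[label] = [int(n) for n in re.findall(r"\d+", body)]
    return clusters


def flatten(clusters):
    """Flatten to the list shape asked for in the question (numbers as strings)."""
    return [item
            for label, nums in clusters.items()
            for item in (label, *map(str, nums))]


clusters = parse_clusters(text)
flat = flatten(clusters)
print(clusters)
print(flat)
```

Keeping the dict as the intermediate result means you can look up a single cluster directly, and only flatten when you really need the single list.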