R strsplit using Regex

I want to use R to split some chat messages. Here is an example:

example <- "[29.01.18, 23:33] Alice: Ist das hier ein Chatverlauf?\n[29.01.18, 23:45] Bob: Ja ist es!\n[29.01.18, 23:45] Bob: Der ist dazu da die funktionsweise des Parsers zu demonstrieren\n[29.01.18, 23:46] Alice: ‎PTT-20180129-WA0025.opus (Datei angehängt)\n[29.01.18, 23:46] Bob: Ah, er kann also auch erkennen ob Voicemails gesendet wurden!\n[29.01.18, 23:46] Bob: Das ist praktisch!\n[29.01.18, 23:47] Bob: Oder?\n[29.01.18, 23:47] Alice: ja |Emoji_Grinning_Face_With_Smiling_Eyes| \n[29.01.18, 23:47] Alice: und Emojis gehen auch!\n[29.01.18, 23:47] Bob: Was ist mit normalen Smilies?\n[29.01.18, 23:49] Alice: ‎Keine Ahnung, lass uns das doch mal ausprobieren\n[29.01.18, 23:50] Bob: Alles klar :) :D\n[29.01.18, 23:51] Alice: Scheint zu funktionieren!:P\n[29.01.18, 23:51] Bob: Meinst du, dass URLS auch erkannt werden?\n[29.01.18, 23:52] Bob: ‎Schick doch mal eine zum ausprobieren!\n[29.01.18, 23:53] Alice: https://github.com/JuKo007\n[29.01.18, 23:58] Alice: ‎Scheint zu funktionieren!\n[29.01.18, 23:59] Alice: Sehr schön!\n[30.01.18, 00:00] Alice: Damit sollten sich WhatsApp Verläufe besser quantifizieren lassen!\n[30.01.18, 00:02] Bob: ‎Alles klar, los gehts  |Emoji_Relieved_Face| \n"

Basically, I want to split the string right in front of the bracketed date-time stamp. Here is what I tried so far:

  # Cutting the textblock into individual messages
  chat <- strsplit(example,"(?=\\[\\d\\d.\\d\\d.\\d\\d, \\d\\d:\\d\\d\\])",perl=TRUE)
  chat <- unlist(chat)

The weird thing is that, in the output, the split seems to occur after the opening square bracket, not in front of it:

 [1] "["                                                                                           
 [2] "29.01.18, 23:33] Alice: Ist das hier ein Chatverlauf?\n"                                     
 [3] "["                                                                                           
 [4] "29.01.18, 23:45] Bob: Ja ist es!\n"                                                          
 [5] "["                                                                                           
 [6] "29.01.18, 23:45] Bob: Der ist dazu da die funktionsweise des Parsers zu demonstrieren\n"     
 [7] "["                                                                                           
 [8] "29.01.18, 23:46] Alice: ‎PTT-20180129-WA0025.opus (Datei angehängt)\n"                        
 [9] "["                                                                                           
[10] "29.01.18, 23:46] Bob: Ah, er kann also auch erkennen ob Voicemails gesendet wurden!\n"       
[11] "["                                                                                           
[12] "29.01.18, 23:46] Bob: Das ist praktisch!\n"                                                  
[13] "["                                                                                           
[14] "29.01.18, 23:47] Bob: Oder?\n"                                                               
[15] "["                                                                                           
[16] "29.01.18, 23:47] Alice: ja |Emoji_Grinning_Face_With_Smiling_Eyes| \n"                       
[17] "["                                                                                           
[18] "29.01.18, 23:47] Alice: und Emojis gehen auch!\n"                                            
[19] "["                                                                                           
[20] "29.01.18, 23:47] Bob: Was ist mit normalen Smilies?\n"                                       
[21] "["                                                                                           
[22] "29.01.18, 23:49] Alice: ‎Keine Ahnung, lass uns das doch mal ausprobieren\n"                  
[23] "["                                                                                           
[24] "29.01.18, 23:50] Bob: Alles klar :) :D\n"                                                    
[25] "["                                                                                           
[26] "29.01.18, 23:51] Alice: Scheint zu funktionieren!:P\n"                                       
[27] "["                                                                                           
[28] "29.01.18, 23:51] Bob: Meinst du, dass URLS auch erkannt werden?\n"                           
[29] "["                                                                                           
[30] "29.01.18, 23:52] Bob: ‎Schick doch mal eine zum ausprobieren!\n"                              
[31] "["                                                                                           
[32] "29.01.18, 23:53] Alice: https://github.com/JuKo007\n"                                        
[33] "["                                                                                           
[34] "29.01.18, 23:58] Alice: ‎Scheint zu funktionieren!\n"                                         
[35] "["                                                                                           
[36] "29.01.18, 23:59] Alice: Sehr schön!\n"                                                       
[37] "["                                                                                           
[38] "30.01.18, 00:00] Alice: Damit sollten sich WhatsApp Verläufe besser quantifizieren lassen!\n"
[39] "["                                                                                           
[40] "30.01.18, 00:02] Bob: ‎Alles klar, los gehts  |Emoji_Relieved_Face| \n"

When I test the regex pattern online or use it in Python, it works just as intended, so it seems to me that this is a feature of the strsplit function. Any recommendations on how to change my R code to make this work are very welcome! I know it would be easy to just paste this output back together to get my desired output, but I would really like to understand what's going on with strsplit and do it properly instead of patching it back together. What I want is:

 [1] "[29.01.18, 23:33] Alice: Ist das hier ein Chatverlauf?\n"                                                                                                                           
 [2] "[29.01.18, 23:45] Bob: Ja ist es!\n"                                                                                                                                                  
 [3] "[29.01.18, 23:45] Bob: Der ist dazu da die funktionsweise des Parsers zu demonstrieren\n"                                                                                         
 [4] "[29.01.18, 23:46] Alice: ‎PTT-20180129-WA0025.opus (Datei angehängt)\n"                                                                                                      
 [5] "[29.01.18, 23:46] Bob: Ah, er kann also auch erkennen ob Voicemails gesendet wurden!\n"
 [6] "[29.01.18, 23:46] Bob: Das ist praktisch!\n"
 [7] "[29.01.18, 23:47] Bob: Oder?\n"
 [8] "[29.01.18, 23:47] Alice: ja |Emoji_Grinning_Face_With_Smiling_Eyes| \n"
 [9] "[29.01.18, 23:47] Alice: und Emojis gehen auch!\n"
[10] "[29.01.18, 23:47] Bob: Was ist mit normalen Smilies?\n"                                                                                                                         
[11] "[29.01.18, 23:49] Alice: ‎Keine Ahnung, lass uns das doch mal ausprobieren\n"                                                                                                    
[12] "[29.01.18, 23:50] Bob: Alles klar :) :D\n"                                                                                                                                       
[13] "[29.01.18, 23:51] Alice: Scheint zu funktionieren!:P\n"                                                                                                                        
[14] "[29.01.18, 23:51] Bob: Meinst du, dass URLS auch erkannt werden?\n"
[15] "[29.01.18, 23:52] Bob: ‎Schick doch mal eine zum ausprobieren!\n"                                                                                                                       
[16] "[29.01.18, 23:53] Alice: https://github.com/JuKo007\n"                                                                                                                                  
[17] "[29.01.18, 23:58] Alice: ‎Scheint zu funktionieren!\n"                                                                                                                                  
[18] "[29.01.18, 23:59] Alice: Sehr schön!\n"                                                                                                                                                
[19] "[30.01.18, 00:00] Alice: Damit sollten sich WhatsApp Verläufe besser quantifizieren lassen!\n"                                                                                           
[20] "[30.01.18, 00:02] Bob: ‎Alles klar, los gehts  |Emoji_Relieved_Face| \n"

asked Jul 12 '19 15:07

Ju Ko

2 Answers

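You could add a negative lookahead, (?!^), which asserts that the current position is not the start of the string. strsplit searches the remaining text again after each split, and when the pattern matches the empty string at the very start of that remaining text, it splits off a single character. That is where the lone "[" elements come from.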

Your updated line might look like:

chat <- strsplit(example,"(?!^)(?=\\[\\d\\d.\\d\\d.\\d\\d, \\d\\d:\\d\\d\\])",perl=TRUE)

Result

 [1] "[29.01.18, 23:33] Alice: Ist das hier ein Chatverlauf?\n"                                     
 [2] "[29.01.18, 23:45] Bob: Ja ist es!\n"                                                          
 [3] "[29.01.18, 23:45] Bob: Der ist dazu da die funktionsweise des Parsers zu demonstrieren\n"     
 [4] "[29.01.18, 23:46] Alice: ‎PTT-20180129-WA0025.opus (Datei angehängt)\n"                        
 [5] "[29.01.18, 23:46] Bob: Ah, er kann also auch erkennen ob Voicemails gesendet wurden!\n"       
 [6] "[29.01.18, 23:46] Bob: Das ist praktisch!\n"                                                  
 [7] "[29.01.18, 23:47] Bob: Oder?\n"                                                               
 [8] "[29.01.18, 23:47] Alice: ja |Emoji_Grinning_Face_With_Smiling_Eyes| \n"                       
 [9] "[29.01.18, 23:47] Alice: und Emojis gehen auch!\n"                                            
[10] "[29.01.18, 23:47] Bob: Was ist mit normalen Smilies?\n"                                       
[11] "[29.01.18, 23:49] Alice: ‎Keine Ahnung, lass uns das doch mal ausprobieren\n"                  
[12] "[29.01.18, 23:50] Bob: Alles klar :) :D\n"                                                    
[13] "[29.01.18, 23:51] Alice: Scheint zu funktionieren!:P\n"                                       
[14] "[29.01.18, 23:51] Bob: Meinst du, dass URLS auch erkannt werden?\n"                           
[15] "[29.01.18, 23:52] Bob: ‎Schick doch mal eine zum ausprobieren!\n"                              
[16] "[29.01.18, 23:53] Alice: https://github.com/JuKo007\n"                                        
[17] "[29.01.18, 23:58] Alice: ‎Scheint zu funktionieren!\n"                                         
[18] "[29.01.18, 23:59] Alice: Sehr schön!\n"                                                       
[19] "[30.01.18, 00:00] Alice: Damit sollten sich WhatsApp Verläufe besser quantifizieren lassen!\n"
[20] "[30.01.18, 00:02] Bob: ‎Alles klar, los gehts  |Emoji_Relieved_Face| \n"
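
A minimal sketch of the mechanism, using a shortened, made-up string (x is a hypothetical stand-in for the chat export, not the original data). It assumes only base R and compares the lookahead-only pattern with the guarded one:

```r
# Hypothetical two-message sample, not the original chat data
x <- "[a] one\n[b] two\n"

# Lookahead only: the empty match at the start of the remaining text
# makes strsplit split off a single character, the "["
lookahead_only <- strsplit(x, "(?=\\[)", perl = TRUE)[[1]]
print(lookahead_only)

# Adding (?!^) rules out a match at the start, so each piece keeps its "["
with_guard <- strsplit(x, "(?!^)(?=\\[)", perl = TRUE)[[1]]
print(with_guard)
```

The first call should reproduce the pattern from the question (a lone "[" before every message), while the second should return the two complete messages.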

answered Sep 18 '22 22:09

The fourth bird

You can use stringi and extract the messages instead of splitting, by slightly modifying the end of your pattern (i.e., matching everything up to the next [). You could include more of your pattern to guard against false matches, but this should get you started. Good luck!

library(stringi)

stri_extract_all(example, regex = "\\[\\d\\d.\\d\\d.\\d\\d, \\d\\d:\\d\\d\\][^\\[]*")
[[1]]
 [1] "[29.01.18, 23:33] Alice: Ist das hier ein Chatverlauf?\n"                                     
 [2] "[29.01.18, 23:45] Bob: Ja ist es!\n"                                                          
 [3] "[29.01.18, 23:45] Bob: Der ist dazu da die funktionsweise des Parsers zu demonstrieren\n"     
 [4] "[29.01.18, 23:46] Alice: \016PTT-20180129-WA0025.opus (Datei angehängt)\n"                    
 [5] "[29.01.18, 23:46] Bob: Ah, er kann also auch erkennen ob Voicemails gesendet wurden!\n"       
 [6] "[29.01.18, 23:46] Bob: Das ist praktisch!\n"                                                  
 [7] "[29.01.18, 23:47] Bob: Oder?\n"                                                               
 [8] "[29.01.18, 23:47] Alice: ja |Emoji_Grinning_Face_With_Smiling_Eyes| \n"                       
 [9] "[29.01.18, 23:47] Alice: und Emojis gehen auch!\n"                                            
[10] "[29.01.18, 23:47] Bob: Was ist mit normalen Smilies?\n"                                       
[11] "[29.01.18, 23:49] Alice: \016Keine Ahnung, lass uns das doch mal ausprobieren\n"              
[12] "[29.01.18, 23:50] Bob: Alles klar :) :D\n"                                                    
[13] "[29.01.18, 23:51] Alice: Scheint zu funktionieren!:P\n"                                       
[14] "[29.01.18, 23:51] Bob: Meinst du, dass URLS auch erkannt werden?\n"                           
[15] "[29.01.18, 23:52] Bob: \016Schick doch mal eine zum ausprobieren!\n"                          
[16] "[29.01.18, 23:53] Alice: https://github.com/JuKo007\n"                                        
[17] "[29.01.18, 23:58] Alice: \016Scheint zu funktionieren!\n"                                     
[18] "[29.01.18, 23:59] Alice: Sehr schön!\n"                                                       
[19] "[30.01.18, 00:00] Alice: Damit sollten sich WhatsApp Verläufe besser quantifizieren lassen!\n"
[20] "[30.01.18, 00:02] Bob: \016Alles klar, los gehts  |Emoji_Relieved_Face| \n"
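
The same extract-instead-of-split idea can also be sketched without stringi, using base R's gregexpr and regmatches. This is only an illustration on a shortened, hypothetical sample (x is not the original data), and the dots in the date are escaped here so that they match only literal dots:

```r
# Hypothetical two-message sample, not the original chat data
x <- "[29.01.18, 23:33] Alice: Hallo?\n[29.01.18, 23:45] Bob: Ja ist es!\n"

# Bracketed timestamp, then everything up to (not including) the next "["
pattern <- "\\[\\d\\d\\.\\d\\d\\.\\d\\d, \\d\\d:\\d\\d\\][^\\[]*"

# gregexpr locates every match; regmatches pulls the matched text out
messages <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(messages)
```

Because [^\\[]* stops at any "[", a message whose text itself contains a square bracket would be cut off at that point, with either the stringi or the base R variant.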

answered Sep 19 '22 22:09

Andrew
