Hoping someone can help me understand why errant \n characters are showing up in a vector of strings that I'm creating in R.
Trying to import and clean up a very wide data file that's in fixed width format (http://www.state.nj.us/education/schools/achievement/2012/njask6/, 'Text file for data runs'). Followed the UCLA tutorial on using read.fwf and this excellent SO question to give the columns names after import.
Because the file is really wide, the column headers are LONG - all together, just under 29,800 characters. I'm passing them in as a simple vector of strings:
column_names <- c(...)
I'll spare you the ugly dump here but I dropped the whole thing on pastebin.
Was cleaning up and transforming some of the variables for analysis when I noticed that some of my subsets were returning 0 rows. After puzzling over it (did I misspell something?) I realized that somehow a bunch of '\n' newline characters had been introduced into my column headers.
If I loop over the column_names vector that I created
for (i in seq_along(column_names)) {
  print(column_names[i])
}
I see the first newline character in the middle of the 81st line -
SPECIAL\nEDUCATION SCIENCE Number Enrolled Science
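Rather than scanning the printed output by eye, the affected entries can be located by searching the vector for the newline character directly. This is a minimal sketch; the three-element vector is a made-up stand-in for the real column_names, and the offset calculation is only approximate because it ignores the quotes and commas in the original command.

```r
# Hypothetical stand-in for the real (much longer) column_names vector
column_names <- c("County Name",
                  "SPECIAL\nEDUCATION SCIENCE Number Enrolled Science",
                  "School Name")

# Indices of the elements that contain a newline character
bad <- grep("\n", column_names, fixed = TRUE)

# Approximate character offset at which each element starts,
# ignoring quotes and separators in the original command
start_offset <- cumsum(c(0, nchar(column_names)[-length(column_names)])) + 1

bad
start_offset[bad]
```

If the offsets of the affected elements cluster around multiples of roughly 4000, that would be consistent with the pattern described in point 3 below.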
Avenues that I tried to resolve this:
1) Is it something about my environment? I'm using the regular script editor in R, and my lines do wrap - but the breaks on my screen don't match the placement of the \n characters, which to me suggests that it's not the R script editor.
2) Is there a GUI setting? Did some searching, but couldn't find anything.
3) Is there a pattern? Seems like the newline characters get inserted about every 4000 characters. Did some reading on R/S primitives to try to figure out if this had something to do with basic R data structures, but was pretty quickly in over my head.
I tried breaking up the long string into shorter chunks, and then subsequently combining them, and that seemed to solve the problem.
column_names.1 <- c(...)
column_names.2 <- c(...)
column_names_combined <- c(column_names.1, column_names.2)
So I have an immediate workaround, but would love to know what's actually going on here.
Some of the posts about problems with character vectors suggested running a memory profile:
memory.profile()
NULL        symbol      pairlist    closure     environment promise
1           9572        220717      4734        1379        5764
language    special     builtin     char        logical     integer
63932       165         1550        18935       10302       30428
double      complex     character   ...         any         list
2039        1           60058       0           0           20059
expression  bytecode    externalptr weakref     raw         S4
1           16553       725         150         151         1162
I'm running R 2.15.1 (64-bit) on Windows 7 (Enterprise, SP 1, 8 GB RAM). Thanks!
In programming languages such as C, Java, and Perl, the newline character is written as '\n', which is an escape sequence.
It is a character in a string that represents a line break: after this character, a new line starts. There are two basic newline characters. LF (character: \n, Unicode: U+000A, ASCII: 10, hex: 0x0a) is the familiar '\n' line feed. CR (character: \r, Unicode: U+000D, ASCII: 13, hex: 0x0d) is the carriage return.
The C programming language provides the escape sequences '\n' (newline) and '\r' (carriage return). However, these are not required to be equivalent to the ASCII LF and CR control characters.
In Windows and DOS, the line break code is two characters: a carriage return followed by a line feed (CR/LF). In the Unix/Linux/Mac world, the code is just the line feed character (LF).
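In R, these escape sequences behave the same way inside string literals. The sketch below shows that each is a single character and how a stray one could be replaced; the header string is a hypothetical example, and whether a space is the right replacement depends on what the original header contained.

```r
# "\n" and "\r" are each a single character in an R string
lf <- "\n"
cr <- "\r"
nchar(lf)
utf8ToInt(lf)  # code point of line feed
utf8ToInt(cr)  # code point of carriage return

# Hypothetical header containing a stray newline
header <- "SPECIAL\nEDUCATION SCIENCE"

# Replace any run of CR or LF characters with a single space
cleaned <- gsub("[\r\n]+", " ", header)
cleaned
```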
I doubt this is a bug. Instead, it looks like you're running into a known limitation of the console. As it says in Section 1.8 - R commands, case sensitivity, etc. of An Introduction to R:
Command lines entered at the console are limited[3] to about 4095 bytes (not characters).
[3] some of the consoles will not allow you to enter more, and amongst those which do some will silently discard the excess and some will use it as the start of the next line.
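To check whether a given command is long enough to run into this limit, its length can be measured in bytes rather than characters. This is a rough sketch; the generated command text is a made-up stand-in for the real c(...) call, and 4095 is the approximate figure quoted above.

```r
# Hypothetical command text standing in for the real, much longer c(...) call
cmd <- paste0("column_names <- c(",
              paste(sprintf('"Column %d"', 1:500), collapse = ", "),
              ")")

# Length in bytes, the unit in which the console limit is stated
n_bytes <- nchar(cmd, type = "bytes")
n_bytes
n_bytes > 4095

# Bytes and characters can differ for non-ASCII text (here, the euro sign)
euro <- "\u20ac"
nchar(euro, type = "chars")
nchar(euro, type = "bytes")
```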
Either put the command in a file and source it, or break the code into multiple lines by inserting your own newlines at appropriate points (between commas). For example:
column_names <-
  c("County Code/DFG/Aggregation Code", "District Code", "School Code",
    "County Name", "District Name", "School Name", "DFG", "Special Needs",
    "TOTAL POPULATION TOTAL POPULATION Number Enrolled LAL", ...)
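A variation on the file-based option is to keep the headers in a plain text file, one per line, and read them with readLines(), so the long vector never passes through the console at all. This is only a sketch: the temporary file and its three lines are assumptions for illustration, not the real header list.

```r
# Hypothetical file holding one column header per line
header_file <- tempfile(fileext = ".txt")
writeLines(c("County Code/DFG/Aggregation Code",
             "District Code",
             "School Code"),
           header_file)

# Read the headers back; each line of the file becomes one element
column_names <- readLines(header_file)

# Confirm that no element picked up a newline character
has_newline <- any(grepl("\n", column_names, fixed = TRUE))
has_newline

length(column_names)
```

The resulting vector can then be assigned to the imported data frame in the same way as the hand-typed one.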