
Why and where are \n newline characters getting introduced to c()?

Tags: text, r

Hoping someone can help me understand why errant \n characters are showing up in a vector of strings that I'm creating in R.

Trying to import and clean up a very wide data file that's in fixed width format (http://www.state.nj.us/education/schools/achievement/2012/njask6/, 'Text file for data runs'). Followed the UCLA tutorial on using read.fwf and this excellent SO question to give the columns names after import.

Because the file is really wide, the column headers are LONG - all together, just under 29,800 characters. I'm passing them in as a simple vector of strings:

column_names <- c(...)

I'll spare you the ugly dump here but I dropped the whole thing on pastebin.

Was cleaning up and transforming some of the variables for analysis when I noticed that some of my subsets were returning 0 rows. After puzzling over it (did I misspell something?) I realized that somehow a bunch of '\n' newline characters had been introduced into my column headers.

If I loop over the column_names vector that I created

for (i in seq_along(column_names)) {
  print(column_names[i])
}

I see the first newline character in the middle of the 81st line -

SPECIAL\nEDUCATION SCIENCE Number Enrolled Science
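
Rather than reading through the printed output, the affected elements can be located directly. The following is a minimal sketch, assuming a short hypothetical stand-in for the real column_names vector (the three strings below are illustrative, not the full header list):

```r
# Hypothetical stand-in for the real column_names vector
column_names <- c("District Code",
                  "SPECIAL\nEDUCATION SCIENCE Number Enrolled Science",
                  "School Name")

# Indices of the elements that contain a literal newline
bad <- grep("\n", column_names, fixed = TRUE)
print(bad)

# Character position of the first newline within each affected element
newline_pos <- as.integer(regexpr("\n", column_names[bad], fixed = TRUE))
print(newline_pos)
```

Using fixed = TRUE matches the newline as a plain character instead of a regular expression, which keeps the check simple.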

Avenues that I tried to resolve this:

1) Is it something about my environment? I'm using the regular script editor in R, and my lines do wrap - but the breaks on my screen don't match the placement of the \n characters, which to me suggests that it's not the R script editor.

2) Is there a GUI setting? Did some searching, but couldn't find anything.

3) Is there a pattern? Seems like the newline characters get inserted about every 4000 characters. Did some reading on R/S primitives to try to figure out if this had something to do with basic R data structures, but was pretty quickly in over my head.

I tried breaking up the long string into shorter chunks, and then subsequently combining them, and that seemed to solve the problem.

column_names.1 <- c(...)
column_names.2 <- c(...)
column_names_combined <- c(column_names.1, column_names.2)

so I have an immediate workaround, but would love to know what's actually going on here.

Some of the posts that had to do with problems with character vectors suggested that I run a memory profile:

  memory.profile()
        NULL      symbol    pairlist     closure environment     promise 
           1        9572      220717        4734        1379        5764 
    language     special     builtin        char     logical     integer 
       63932         165        1550       18935       10302       30428 
      double     complex   character         ...         any        list 
        2039           1       60058           0           0       20059 
  expression    bytecode externalptr     weakref         raw          S4 
           1       16553         725         150         151        1162 

I'm running R 2.15.1 (64-bit) on Windows 7 (Enterprise, SP 1, 8 gigs RAM). Thanks!

asked Oct 24 '12 22:10 by Andrew



1 Answer

I doubt this is a bug. Instead, it looks like you're running into a known limitation of the console. As it says in Section 1.8 - R commands, case sensitivity, etc. of An Introduction to R:

Command lines entered at the console are limited[3] to about 4095 bytes (not characters).

[3] some of the consoles will not allow you to enter more, and amongst those which do some will silently discard the excess and some will use it as the start of the next line.

Either put the command in a file and source it, or break the code into multiple lines by inserting your own newlines at appropriate points (between commas). For example:

column_names <-
  c("County Code/DFG/Aggregation Code", "District Code", "School Code",
    "County Name", "District Name", "School Name", "DFG", "Special Needs",
    "TOTAL POPULATION TOTAL POPULATION Number Enrolled LAL", ...)
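
The first option, putting the command in a file and sourcing it, might look like the sketch below. The temporary file, the shortened header list, and the console_limit variable are assumptions made for illustration; in practice the script would be your own .R file. The sketch also checks the byte length of each line in the script, since the quoted limit is stated in bytes.

```r
# Hypothetical script file standing in for your own .R file
script <- tempfile(fileext = ".R")
writeLines(c(
  "column_names <-",
  "  c(\"County Code/DFG/Aggregation Code\", \"District Code\", \"School Code\",",
  "    \"County Name\", \"District Name\", \"School Name\")"
), script)

# Approximate console limit quoted from An Introduction to R (assumed value)
console_limit <- 4095

# Byte length of every line in the script, and any lines at or over the limit
line_bytes <- nchar(readLines(script), type = "bytes")
too_long <- which(line_bytes >= console_limit)

# Evaluate the file instead of pasting its contents into the console
source(script)

# TRUE would mean a newline ended up inside at least one column name
has_newline <- any(grepl("\n", column_names, fixed = TRUE))
print(has_newline)
```

Lines flagged in too_long would be the ones to break up by hand if the code is ever pasted into the console again.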

answered Nov 16 '22 00:11 by Joshua Ulrich