I'm trying to read a plain text file that contains names like this: "CASTAÑEDA"
The code is basically like this:
file, err := os.Open("C:/Files/file.txt")
defer file.Close()
if err != nil {
log.Fatal(err)
}
scanner := bufio.NewScanner(file)
for scanner.Scan() {
fmt.Println(scanner.Text())
}
Then, when "CASTAÑEDA" is read it prints "CASTA�EDA"
There's any way to handle that characters when reading with bufio?
Thanks.
Your file is, most propably, non UTF-8. Because of that (go expects all strings to be UTF-8) your console output looks mangled. I would advise usage of the packages golang.org/x/text/encoding/charmap
and golang.org/x/text/transform
in your case, to convert the file's data to UTF-8. As I might presume, looking at your file path, you are on Windows. So your character encoding might be Windows1252
(if you have edited it e.g. with notepad.exe).
Try something like this:
package main
import (
"bufio"
"fmt"
"log"
"os"
"golang.org/x/text/encoding/charmap"
"golang.org/x/text/transform"
)
func main() {
file, err := os.Open("C:/temp/file.txt")
defer file.Close()
if err != nil {
log.Fatal(err)
}
dec := transform.NewReader(file, charmap.Windows1252.NewDecoder()) <- insert your enconding here
scanner := bufio.NewScanner(dec)
for scanner.Scan() {
fmt.Println(scanner.Text())
}
}
You can find more encodings in the package golang.org/x/text/encoding/charmap
, that you can insert into my example to your liking.
The issue you're encountering is that your input is likely not UTF-8 (which is what bufio and most of the Go language/stdlib expect). Instead, your input probably uses some extended-ASCII codepage, which is why the unaccented characters are passing through cleanly (UTF-8 is also a superset of 7-bit ASCII), but that the 'Ñ' is not passed through intact.
In this situation, the bit-representation of the accented character is not valid UTF-8, so the unicode replacement character (U+FFFD) is being produced. You've got a few options:
os.Stdout.Write(scanner.Bytes()); fmt.Println();
This might avoid the bytes being interpreted as UTF-8 beyond newline splitting. Writing the bytes directly to os.Stdout
will further avoid any (mis)interpretation of the contents.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With