
Reading Unicode characters from a CSV file

I have a CSV file that contains English words followed by their Hindi translations. I am trying to read the file and do some further processing with it. The file looks like this:

English,,Hindi,,,  
,,,,,  
Cat,,बिल्ली,,,  
Rat,,चूहा,,,  
abandon,,छोड़ देना,त्याग देना,लापरवाही की स्वतन्त्रता,जाने देना  

I am trying to read the csv file line by line and display what has been written. The code snippet (Java) is as follows:

            // Step 2. Read the CSV file and get the string.
            FileInputStream fis = null;
            BufferedReader br = null;
            try {
                fis = new FileInputStream(new File(csvFile));
            } catch (FileNotFoundException e1) {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }

            boolean startSeen = false; // becomes true once the "English" header line is seen
            if(fis != null) {
                try {
                    br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
                } catch (UnsupportedEncodingException e2) {
                    // TODO Auto-generated catch block
                    e2.printStackTrace();
                    System.out.print("Unsupported encoding");
                }
                String line = null;
                if(br != null) {
                    try {
                        while((line = br.readLine()) != null) {
                            if(line.contains("English") == true) {
                                startSeen = true;
                            }

                            if((startSeen == true) && (line != null)) {
                                StringBuffer sbuf = new StringBuffer();
                                //Step 3. Parse the line.
                                sbuf.append(line);
                                System.out.println(sbuf.toString());
                            }
                        }
                    } catch (IOException e1) {
                        // TODO Auto-generated catch block
                        e1.printStackTrace();
                    }
                }
            }

However, the following output is what I get:

English,,Hindi,,,
,,,,,
Cat,,??????,,,
Rat,,????,,,
abandon,,???? ????,????? ????,???????? ?? ???????????,???? ????  

My Java is not that great, and although I have gone through a number of posts on SO, I need more help figuring out the exact cause of this problem.

Sriram asked Jan 16 '13 06:01


2 Answers

For reading a text file it is better to use a character-oriented reader, e.g. java.util.Scanner, directly instead of a raw FileInputStream. As for the encoding, first make sure that the text file you want to read is actually saved as UTF-8. I also noticed that on my system I had to save my Java source file as UTF-8 as well for the Hindi characters to be shown properly.

However, I want to suggest a simpler way to read the CSV file, as follows:

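// Pass the charset explicitly; without it Scanner uses the platform default encoding.
Scanner scan = new Scanner(new File(csvFile), "UTF-8");
while (scan.hasNextLine()) {
   System.out.println(scan.nextLine());
}
scan.close();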

see the output
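
As a complement to the Scanner version, here is a minimal sketch of the same read using java.nio with the charset stated explicitly and the reader closed automatically. The class name CsvUtf8Reader, the method readLines and the fallback file name words.csv are made up for illustration and are not from the question. It only addresses decoding the file; it does not change what the console is able to display.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this example.
class CsvUtf8Reader {

    // Reads all lines of the file, decoding its bytes explicitly as UTF-8.
    public static List<String> readLines(Path csvPath) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader =
                 Files.newBufferedReader(csvPath, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "words.csv" is an assumed default path; pass the real one as an argument.
        Path csvPath = Paths.get(args.length > 0 ? args[0] : "words.csv");
        for (String line : readLines(csvPath)) {
            System.out.println(line);
        }
    }
}
```

If the file turns out not to be valid UTF-8, a reader created this way should fail with an IOException rather than silently produce wrong characters, which can help separate a file problem from a console problem.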

Jon Kartago Lamida answered Oct 03 '22 06:10


I think your console cannot display Hindi characters. To test, try printing a literal string:

System.out.println("Cat,,बिल्ली,,,");
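
If that test also prints question marks, one possible explanation is that System.out encodes text with the platform default charset and substitutes '?' for characters that charset cannot represent. The sketch below, under that assumption, prints the default charset and then writes through a PrintStream that encodes as UTF-8. The class name Utf8ConsoleDemo and the method utf8Stream are hypothetical. The terminal itself must still be set to UTF-8 and have a font with Devanagari glyphs, so this may not be sufficient on every system.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical demo class, named only for this example.
class Utf8ConsoleDemo {

    // Wraps a stream so that text is encoded as UTF-8 instead of the platform default.
    public static PrintStream utf8Stream(OutputStream out)
            throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8"); // true = auto-flush on println
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Shows which charset the JVM picked up from the environment.
        System.out.println("Default charset: " + Charset.defaultCharset());

        PrintStream out = utf8Stream(System.out);
        // Unicode escapes spell the Hindi word for "cat", so the result does not
        // depend on the encoding the source file was saved in.
        out.println("Cat,,\u092c\u093f\u0932\u094d\u0932\u0940,,,");
    }
}
```

If the text looks right through the UTF-8 stream but not through plain System.out, the file reading in the question is probably fine and only the output encoding needs to change.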

Evgeniy Dorofeev answered Oct 03 '22 06:10