Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences as a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone.

Java: Parsing CSV files

10.04.2012

As I mentioned in a previous post, I recently moved a bunch of Neo4j data loading code from Ruby to Java, and as part of that process I needed to parse some CSV files.

In Ruby I was using FasterCSV, which became the standard CSV library as of Ruby 1.9, but it had been a while since I'd had to parse CSV files in Java so I wasn't sure which library to use.

I needed a library that could parse a comma separated file where there might be commas in the values of some of the fields. I think that's fairly standard behaviour in any CSV library, but my googling led me to OpenCSV.

It can be downloaded from here and so far seems to do the job!

This is an example of how I’m using it:

import au.com.bytecode.opencsv.CSVReader;

import java.io.FileReader;
import java.util.List;

String filePath = "/Users/mneedham/data/awesome-csv-file.csv";
CSVReader reader = new CSVReader(new FileReader(filePath), ',');

// readAll loads the whole file into memory in one go
List<String[]> csvEntries = reader.readAll();

for (String[] row : csvEntries) {
    System.out.println("field 1: " + row[0]);
}

reader.close();
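For example, given a made-up input line such as:

"Needham, Mark",author,2012

the quoted first field is returned as a single value, so the loop above would print field 1: Needham, Mark.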

There are more use cases described on the home page.

Published at DZone with permission of Mark Needham, author and DZone MVB.

Comments

Lance Semmens replied on Fri, 2012/10/05 - 3:24am

As a side note, OpenCSV supports an iterator-style reading pattern, which can be handy with large files where you don't want to read the entire file into memory at once.

String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    // process one row at a time
}

Jean Said replied on Fri, 2012/10/05 - 9:44am

I beg to differ.

CSV parsing is in no way generic, and I think it's quite hard to build a generic library for it. Why? Because CSV is not a file format; it's a multitude of file formats with no clear standard. Even the basic functionality is in no way standard. I mean, if I use ';' as the field separator, do I accept trailing characters as a field? Do I ignore them? Which charset(s) do I accept? Do I use heuristics to detect charsets when there is no BOM? Then, which charsets are available for use? Unless you have a clear and flawlessly followed standard, CSV is a nightmare.

In my company, we were forced to create our own smart and configurable CSV parser, as no lib out there was up to our needs. Of course we could have used some shaky generic lib, but we didn't, and for sadly valid reasons.

CSV files are just as much a problem as "text files". The label means basically nothing. Saying our input is CSV is almost as precise as saying our input is a binary file with records in it.

Even if it looks simple, don't assume you'll get a silver bullet lib to parse it...
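That said, a few of these decisions can at least be made explicit rather than left to defaults. A minimal sketch, assuming the opencsv 2.x API used in the article, with the separator, quote character, and charset all stated up front (the file name is made up for illustration):

import au.com.bytecode.opencsv.CSVReader;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Spell out the choices the comment above calls out: separator,
// quote character and charset are explicit, not assumed.
CSVReader reader = new CSVReader(
        new InputStreamReader(new FileInputStream("data.csv"), StandardCharsets.UTF_8),
        ';',   // field separator
        '"');  // quote character
List<String[]> rows = reader.readAll();
reader.close();

This doesn't answer the deeper questions (BOM sniffing, heuristic charset detection), but it does remove the silent assumptions.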

Evagelos Sinapidis replied on Fri, 2012/10/05 - 7:43am

I use SuperCSV.
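A minimal sketch of reading rows with Super CSV's CsvListReader, assuming its standard preferences (the file name is made up for illustration):

import java.io.FileReader;
import java.util.List;

import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

// Reads one record at a time as a list of column values.
CsvListReader reader = new CsvListReader(
        new FileReader("data.csv"), CsvPreference.STANDARD_PREFERENCE);
List<String> row;
while ((row = reader.read()) != null) {
    System.out.println("field 1: " + row.get(0));
}
reader.close();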

Cristofer Weber replied on Fri, 2012/10/05 - 10:27am

There's also a project for batch importing CSV files into Neo4j:

https://github.com/maxdemarzi/batch-import

malakooti replied on Fri, 2012/10/05 - 12:26pm

Parsing of CSV can be done in a generic manner using an array that holds the definition of each field. In fact, the best way is to create a CSV parser that:

1- Throws a new document event when the file is opened.

2- Reads a line and splits it into an array of fields.

3- For each field, gets the field name from the field's definition and throws a new element event with the value derived from the split fields (see the sketch below).

In fact, you can extend this concept to reading fixed-length files, and you can treat all legacy data files as modern (XML) files. This also aligns you with the new XML specification, which has provision for transferring XML data in a non-XML format.
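A minimal sketch of that event-driven approach; the CsvEventListener interface and all the names in it are hypothetical, invented here for illustration, and the naive split does not handle quoted separators:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

interface CsvEventListener {
    void newDocument();                          // fired when the file is opened
    void newElement(String name, String value);  // fired once per field
}

class EventDrivenCsvParser {
    private final String[] fieldNames;  // the definition of each field, in order

    EventDrivenCsvParser(String[] fieldNames) {
        this.fieldNames = fieldNames;
    }

    void parse(String filePath, CsvEventListener listener) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            listener.newDocument();
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split; a real parser would honour quoting.
                String[] fields = line.split(",", -1);
                for (int i = 0; i < fields.length && i < fieldNames.length; i++) {
                    listener.newElement(fieldNames[i], fields[i]);
                }
            }
        }
    }
}

The same shape works for fixed-length records: only the splitting step changes.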

Greg Brown replied on Fri, 2012/10/05 - 2:01pm

CSV is so underrated. It is ideal for transferring tabular data (as long as you know what the column data types are), since it is so much less verbose than XML or JSON.

Seb Cha replied on Fri, 2012/10/05 - 2:24pm

I agree with Jean Said

CSV is not a standard and it's not backed by the W3C. How do I declare a null field? How do I declare a blank field? How do I escape special characters? How do I know the character encoding?

XML will only be verbose if you use elements instead of attributes.

Greg Brown replied on Fri, 2012/10/05 - 6:58pm in response to: Seb Cha

CSV is not a standard

It may not be a "formal" standard but it is a de facto standard.

How do I declare a null field? How do I declare a blank field? How do I escape special characters? How do I know the character encoding?

These questions may be open to interpretation but they are certainly not difficult to solve.

XML will only be verbose if you use elements instead of attributes.

It's true that XML will be less verbose with attributes, but it will still be far more verbose than CSV. In XML, the element and attribute names must be repeated for every record, in addition to the field values. In CSV, only the field values are required.
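For example, the same two made-up records in each format:

name,age,city
Alice,34,London
Bob,28,Paris

<records>
    <record name="Alice" age="34" city="London"/>
    <record name="Bob" age="28" city="Paris"/>
</records>

The names name, age and city appear once in the CSV header but are repeated for every XML record.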

Don't get me wrong - I like XML and JSON, too. Each technology has its pros and cons - just choose the right tool for the job.

matt inger replied on Fri, 2012/10/05 - 8:48pm

There's also a handy StrTokenizer class in commons-lang3.

StrTokenizer tokenizer = StrTokenizer.getCSVInstance();
String line;
while ((line = reader.readLine()) != null) {
    tokenizer.reset(line);
    String[] toks = tokenizer.getTokenArray();
    // do something with toks
}

The tokenizer itself is quite configurable, allowing you to tell it what constitutes whitespace, what to do with empty tokens, what the delimiters are, etc. In most cases, though, just calling "getCSVInstance()" or "getTSVInstance()" is sufficient. It also implements the ListIterator interface.

The one downside is that it's line-oriented, so it's up to the programmer to feed it a single record at a time (using a LineNumberReader or something of that nature), as in the snippet above.

Full disclosure: I am one of the authors of this class.

Lund Wolfe replied on Sat, 2012/10/06 - 2:49pm

OpenCSV is an excellent choice. If there is an interoperable data standard, it is probably CSV or some text equivalent. That's usually what you see when there is an import/export option from a proprietary format. CSV is often messy because it originated from Excel, and you may get trailing commas for fake columns and trailing fake rows. There are lots of Excel users out there, and your expectations should be pretty low.

You do need to know your data, and clean it up and validate it accordingly. That's necessary for any conversion. This is one of the biggest opportunities for improvement when moving data from one information system to another.

Mahmoud Ben Hassine replied on Sat, 2012/11/17 - 1:53am

Hi,

There's also a project for CSV batch processing:

http://benas.github.com/cb4j/

It is POJO based, uses a declarative approach to data validation, and has many other features: batch stats, JMX monitoring, etc.
