Saturday, February 21, 2009

How to read and parse flat files in Java

Parsing files and their formats can be painful in any programming language. One open source project, Flatworm, aims to make reading, parsing, and writing flat files much easier in Java. You define the file format in XML, and Flatworm breaks the records out into Java beans for you. It can read large files, formats with multiple-line records, and a wide range of other flat file formats.

Simple Example

Let's look at a simple example where we need to parse out a flat file into Java objects for processing. In our example we will need to parse client data using the file format below.

Name      Start  End  Length  Type
Type      1      2    2       Char
First     3      27   25      Char
Middle    28     52   25      Char
Last      53     77   25      Char
Acct. ID  78     92   15      Char

Below is the sample flat file we will be parsing.
CDJOHN                     MARK                     DOE                      111111111111111
CDPAUL                     RICHARD                  STEPHENS                 222222222222222
CDRINGO                    JACK                     ERICSON                  333333333333333
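To make the layout concrete before bringing in Flatworm, here is a rough sketch of slicing one record by hand with plain Java. It is not part of Flatworm; the class name FixedWidthSketch and the helper names are made up for illustration. It assumes the Start and End columns in the table are 1-based and inclusive, which is what the lengths imply.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FixedWidthSketch {

    // Slice one field using the 1-based, inclusive Start/End columns from the table.
    // String.substring takes a 0-based start and an exclusive end, so only start shifts by one.
    static String field(String line, int start, int end) {
        return line.substring(start - 1, end).trim();
    }

    // Hypothetical helper: break one client record into named fields.
    static Map<String, String> parseClient(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("type", field(line, 1, 2));
        fields.put("first", field(line, 3, 27));
        fields.put("middle", field(line, 28, 52));
        fields.put("last", field(line, 53, 77));
        fields.put("accountId", field(line, 78, 92));
        return fields;
    }

    public static void main(String[] args) {
        // Build a 92-character record padded the same way as the sample file.
        String line = String.format("%-2s%-25s%-25s%-25s%-15s",
                "CD", "JOHN", "MARK", "DOE", "111111111111111");
        // prints {type=CD, first=JOHN, middle=MARK, last=DOE, accountId=111111111111111}
        System.out.println(parseClient(line));
    }
}
```

Hand-written offsets like these are easy to get wrong and have to be repeated for every record type, which is the work the XML descriptor is meant to take over.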
Now that we know the file format of our flat file and we have some sample data to parse we'll need to create an XML document describing our file format for Flatworm.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE file-format SYSTEM "http://www.blackbear.com/dtds/flatworm-data-description_1_0.dtd">
<file-format>
    <converter name="char" class="com.blackbear.flatworm.converters.CoreConverters" method="convertChar" return-type="java.lang.String"/>
    <record name="clientData">
        <record-ident>
            <field-ident field-start="0" field-length="2">
                <match-string>CD</match-string>
            </field-ident>
        </record-ident>
        <record-definition>
            <bean name="client" class="org.javaconfessions.sample.Client"/>
            <line>
                <record-element length="2"/>
                <record-element length="25" beanref="client.firstName" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="25" beanref="client.middleName" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="25" beanref="client.lastName" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="15" beanref="client.accountId" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
            </line>
        </record-definition>
    </record>
</file-format>
Closer Look at the Descriptor File

<file-format> - This tag is required and serves as the root node of our descriptor file.

<converter> - This tag is used to declare new converters to be used by the flatworm parser.

Looking further at the XML descriptor file, you can see that describing our file format for Flatworm is fairly simple. The record tag begins the description of our client data records. Within it, the record-ident tag tells Flatworm how to identify this type of record in a flat file. Most flat file formats contain several record types, such as headers, footers, details, batch headers, and batch footers, and this mechanism lets Flatworm parse all of them from the same file. The field-ident tag gives the specifics: the field-start and field-length attributes select the part of the line to test, and the match-string tag holds the text that marks a line as a clientData record. In the descriptor above, clientData records are the ones that start with the characters CD.
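The identification idea can be sketched in a few lines of plain Java. This only illustrates the concept and is not Flatworm's actual implementation; the class RecordIdentSketch and its methods are hypothetical, and it assumes field-start is a 0-based offset, as the value 0 in the descriptor suggests.

```java
public class RecordIdentSketch {

    // Hypothetical helper: true when the text starting at fieldStart and running
    // for fieldLength characters equals matchString. Lines too short to contain
    // the identifier never match.
    static boolean matches(String line, int fieldStart, int fieldLength, String matchString) {
        if (line.length() < fieldStart + fieldLength) {
            return false;
        }
        return line.substring(fieldStart, fieldStart + fieldLength).equals(matchString);
    }

    // Mirrors the single field-ident in the descriptor: offset 0, length 2, text "CD".
    static String recordType(String line) {
        if (matches(line, 0, 2, "CD")) {
            return "clientData";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(recordType("CDJOHN"));   // prints clientData
        System.out.println(recordType("GHMIKE"));   // prints unknown
    }
}
```

A file with headers and footers would simply add more checks of the same shape, one per record type.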

The next section of the record description is the record-definition tag. This is where we map each record element to a bean property. The section starts with a bean definition that tells Flatworm which Java class to use when parsing this record type. The record-element tags set up where each field in the record is located, its data type, and which bean property receives it during parsing. The first record-element has no beanref, so those two characters (the record type) are read but not mapped to the bean.
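To give a feel for what a beanref such as client.firstName means, here is a small sketch that resolves the bean name, finds the matching property, and calls its setter using the JDK's java.beans introspection. It is an illustration under my own assumptions, not Flatworm's code; BeanRefSketch, setByBeanRef, and the cut-down Client class are invented for the example.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.HashMap;
import java.util.Map;

public class BeanRefSketch {

    // Cut-down stand-in for the Client bean, with a single property.
    public static class Client {
        private String firstName;
        public String getFirstName() { return firstName; }
        public void setFirstName(String pFirstName) { firstName = pFirstName; }
    }

    // Hypothetical helper: resolve "beanName.property" and invoke the property's setter.
    static void setByBeanRef(Map<String, Object> beans, String beanRef, Object value) {
        int dot = beanRef.indexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("Expected beanName.property but got " + beanRef);
        }
        Object bean = beans.get(beanRef.substring(0, dot));
        if (bean == null) {
            throw new IllegalArgumentException("No bean registered for " + beanRef);
        }
        String property = beanRef.substring(dot + 1);
        try {
            for (PropertyDescriptor pd
                    : Introspector.getBeanInfo(bean.getClass()).getPropertyDescriptors()) {
                if (pd.getName().equals(property) && pd.getWriteMethod() != null) {
                    pd.getWriteMethod().invoke(bean, value);
                    return;
                }
            }
        } catch (IntrospectionException | ReflectiveOperationException ex) {
            throw new IllegalStateException("Could not set " + beanRef, ex);
        }
        throw new IllegalArgumentException("No writable property " + property);
    }

    public static void main(String[] args) {
        Map<String, Object> beans = new HashMap<>();
        Client client = new Client();
        beans.put("client", client);   // same name as the bean tag in the descriptor
        setByBeanRef(beans, "client.firstName", "JOHN");
        System.out.println(client.getFirstName());   // prints JOHN
    }
}
```

This is also why the bean needs a public setter for every property named in a beanref.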

Here is the source code for my Client bean.
package org.javaconfessions.sample;

public class Client {

    private String firstName;
    private String middleName;
    private String lastName;
    private String accountId;

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String pFirstName) {
        firstName = pFirstName;
    }

    public String getMiddleName() {
        return middleName;
    }

    public void setMiddleName(String pMiddleName) {
        middleName = pMiddleName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String pLastName) {
        lastName = pLastName;
    }

    public String getAccountId() {
        return accountId;
    }

    public void setAccountId(String pAccountId) {
        accountId = pAccountId;
    }

    @Override
    public String toString() {
        return "First Name: " + firstName + "\nMiddleName: " + middleName
                + "\nLastName: " + lastName + "\nAccount ID: " + accountId
                + "\n";
    }

}
Now that our data model code is set up, the next step is to write the code that populates our Java bean with the parsed data. Below is a simple parsing class that parses the flat file using Flatworm and prints each record to the console.
package org.javaconfessions.sample;

import com.blackbear.flatworm.ConfigurationReader;
import com.blackbear.flatworm.FileFormat;
import com.blackbear.flatworm.MatchedRecord;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ClientDataParser {

    public static void main(String[] args) {
        // args[0] is the XML descriptor, args[1] is the flat file to parse
        if (args.length < 2) {
            System.err.println("Usage: ClientDataParser <format.xml> <datafile.txt>");
            return;
        }
        ConfigurationReader parser = new ConfigurationReader();
        // try-with-resources closes the reader (and the underlying stream) when parsing ends
        try (BufferedReader bufIn = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[1])))) {
            FileFormat ff = parser.loadConfigurationFile(args[0]);
            MatchedRecord results;
            // keep reading until getNextRecord returns null
            while ((results = ff.getNextRecord(bufIn)) != null) {
                // "client" is the bean name declared in the descriptor
                System.out.println(results.getBean("client"));
            }
        } catch (Exception ex) {
            Logger.getLogger(ClientDataParser.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

}
After compiling, run java org.javaconfessions.sample.ClientDataParser /path/to/format.xml /path/to/datafile.txt with Flatworm and its dependencies on the classpath, and you should get the following output:
First Name: JOHN
MiddleName: MARK
LastName: DOE
Account ID: 111111111111111

First Name: PAUL
MiddleName: RICHARD
LastName: STEPHENS
Account ID: 222222222222222

First Name: RINGO
MiddleName: JACK
LastName: ERICSON
Account ID: 333333333333333
Summary
Now you have an idea of how Flatworm simplifies parsing flat files. Next, learn how to write flat files using Flatworm.

11 comments:

Anonymous said...

Nice post, Michael. Can you tell us how to use Flatworm to write data files? Please post as early as possible.

Anonymous said...

Hi, please tell us how to use Flatworm to write data files.

Michael said...

Thanks. Here is a post on writing with Flatworm. It is a simple example, but it should get you started.
http://javaconfessions.com/2009/04/writing-flat-files-in-java-with.html

haridi said...

I get an IllegalArgumentException: No bean specified:

java.lang.IllegalArgumentException: No bean specified
at org.apache.commons.beanutils.PropertyUtilsBean.setNestedProperty(PropertyUtilsBean.java:1596)
at org.apache.commons.beanutils.PropertyUtilsBean.setProperty(PropertyUtilsBean.java:1677)
at org.apache.commons.beanutils.PropertyUtils.setProperty(PropertyUtils.java:559)
at com.blackbear.flatworm.PropertyUtilsMappingStrategy.mapBean(PropertyUtilsMappingStrategy.java:47)
at com.blackbear.flatworm.Line.mapField(Line.java:216)
at com.blackbear.flatworm.Line.parseInput(Line.java:155)
at com.blackbear.flatworm.Record.parseRecord(Record.java:233)

Michael said...

haridi,

Are you using the code from above or something you wrote? If it's something you wrote, please post it so I can take a look.

Daniel said...

Great Post Michael,
I was wondering how to compose the XML file so that any records that don't match "CD" will be ignored.
In other words if I add :
GHMIKE JORDAN SMITH 5555555555555
This line should be ignored, and not return a FlatwormInvalidRecordException

Thanks.

Michael said...

At the bottom of the file you can add another record without the identifying information that will serve as your default. If nothing matches the other records, this one will be used. You don't even have to parse it if you don't want to. Here's an example:
<record name="ignoreRecord">
<record-definition>
<line/>
</record-definition>
</record>

Babuni said...

Hi, I have a huge flat file, so I want to parse it in batches, maybe with a batch size of 10,000 lines. Can you suggest how to do this?

Thanks in advance,
HD

David said...

Hi,
is it possible to parse a file like /path/to/datafile.txt (the example) but with no carriage returns delimiting the lines?
Can you post the related format.xml file?

Thanks

Michael said...

David,

You can use the delimit attribute on the line tag under the record-definition tag to change the record delimiter. For example:

<record-definition>
   ...
   <line delimit=";">
   ...
   </line>
</record-definition>

Anonymous said...

Is it possible

1) to update just one record in a flat file ?
2) Append to the flat file without writing whole file all over again?

© 2010 Confessions of a Java Programmer, All Rights Reserved