Friday, March 23, 2012

problem reading tab delimited flat file source with missing tabs

I have a tab delimited flat file with say 60 columns. All columns can have null values. The file contains a blank tab for nulls.

Now, if a row has null value for last 10 columns, ideally, there should be 10 blank tabs padded to the end of that row. But, this is not the case with this file. There is line break after 50 columns of that row, in the above example.

Extrapolating this behavior, most of the rows have null values for last 'x' columns where 'x' is variable for each row.

When I try to customize a flat file connection for this file, SSIS is not padding up the missing tabs. In the example above, it should insert null in the last 10 columns for that row. Instead, and weirdly, it is continuing to read next row. So, column 1 of next row is inserted into 51 st columns of the above row.

Even Excel 2000 can handle this same file easily with the same delimiters (row - LF and column - tab).

Any ideas?The flat file parser parses data per-column using current column delimiters, and then advancing to the next one. The row delimiter is considered as nothing more then a column delimiter of the last column.

We considered adding what you are asking for; always check if we get the row delimiter in addition to checking for the current column delimiter, and if the row delimiter comes before expected column delimiter pad the remaining columns of that row.

We decided not to implement this by default for performance reasons. It is likely we will have it in the next version of the product, but only as an option that will need to be explicitly selected.

HTH.

Thanks,|||Thanks Bob for clarifying that this feature was dropped.

However, I am not disappointed with the performance of DTS 2000 for such a file with 350,000 rows. It would be nice to see this feature back... even if it needs explicit selection. I can bet it will be better than writing my own buggy ;-) parser routine.

Now, will this solution work? :

Step1: Create a DTS 2000 package that can read the file properly and output it in either a txt format with vertical bar delimiter OR in SQL Server 2005 database table.

Step2: Migrate this package to SQL Server 2005 using Migration Wizard.

Step 3: In SSIS designer, use 'Execute Package' control flow task to execute the above package.
Any inputs?|||For other people having same problem, here is the script you can use to fix the missing tabs using a transformation "Script Component":

Please note that this approach will work only if first row has column names.
Feel free to mail me if you need more insight on this.

Thanks,
nitesh.ambastha@.csfb.com
-

Imports System

Imports System.Text

Imports System.Data

Imports System.Math

Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain

Inherits UserComponent

Private columnCount As Int32 = 0

; First row contains column names.

Private dataRow As Boolean = False

Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)

Dim columns As String() = Nothing

Dim outputRow As String = Nothing

Dim buffer As StringBuilder = New StringBuilder()

buffer.Append(Row.Column0.Substring(0, Row.Column0.Length - 1))

columns = Row.Column0.Split(New [Char]() {Chr(9)})

If Not dataRow Then

columnCount = columns.Length

End If

Dim N As Integer

If columns.Length < columnCount Then

For N = 0 To columnCount - columns.Length Step 1

buffer.Append(Chr(9))

Next N

End If

dataRow = True

buffer.Append(Chr(13))

Row.columnOut = buffer.ToString()

End Sub

End Class

|||This is great

I'm having a similar problem with a file with pipe delimited columns. I think I understand your script (I'm getting up to speed with programming). to help me understand it better what part of the code would you chang in my case. Also the column that is sometimes missing is the the last column.

I know there are other ways to get clean data i.e make sure the data extract is correct before I receive it but this looks like a good exercise for later..

Thanks for your help|||

has anybody modified this script to handle to opposite problem which is also not backwards compatible.. ie we have some rows with extra delimiters on the end which are getting included into the data for the last column..

it would be something like this

if columns.Length > columnCount Then


End IF

WHERE YOU have to pull off X number of charcters off of the buffer where x

= columns.length - ColumnCount (ie pull of x number of commas off the end of the buffer before dumping out)

just need to figure out how to do that part.

thanks

|||

Just set the last column x number length, and change error output to ignore truncation for that th end data column - so you are just truncating the extra stuff. make sense?

knowthyself, Is your script a source or transform component? We too have discovered this flaw, which is going to cause painstaking hours tweaking just so we don't get extra row {crlf} in the focused row transform! Are you handling/filtering bad rows similarly like the ff_source allows? if so, how do you do it?

|||

here is an alternate solution using a stored procedure to fix rows it handles both missing and extra columns and also text deliminited fields..

set ANSI_NULLS ON
set QUOTED_IDENTIFIER ON
go

ALTER PROCEDURE [dbo].[Fix_input_line](

@.columns int,
@.inline varchar(8000),
@.terminator char(1),
@.textdelimiter char(1),
@.debug int,
@.outputline varchar(8000) output)

AS

BEGIN


declare @.outline varchar(8000);
declare @.current varchar(8000);
declare @.done int;
declare @.len int;
Declare @.last int;
Declare @.currentloc int;
Declare @.patindex varchar(3);
Declare @.word varchar(1000);
Declare @.colcount int;
Declare @.end int;
Declare @.totallinelen int;

set @.patindex = '%' + @.terminator + '%';

set @.current=@.inline;
set @.done = 0;
set @.last = 0;
set @.colcount = 0;
set @.totallinelen = len(@.inline);
set @.outline = '';

set @.currentloc = Patindex(@.patindex,@.current);
if (@.currentloc = 0)
set @.last = 1;


while ((@.currentloc != 0) or (@.done != 1))
begin

set @.len = len(@.current);

if (@.last = 1)
begin
set @.currentloc = @.len;
set @.end = @.len;
end
else
begin
set @.end = @.currentloc - 1;
end

set @.word = Substring(@.current,1,@.end);
if (@.debug = 1) print 'got column = ' + @.word

if (@.textdelimiter is not null and @.textdelimiter != '')
begin
if (@.debug = 1) print 'comparing ' + substring(@.word,@.end,@.end) + ' to text deliminiter'

if (left(@.word,1) = @.textdelimiter) and (substring(@.word,@.end,@.end) != @.textdelimiter)
begin
declare @.currentalt varchar(8000);
declare @.char char(1);
declare @.endnotfound int;
-- found a text delimited column keep going
if (@.debug = 1) print ' found text deliminted string skipping...';
set @.currentalt = Substring(@.current,@.currentloc+1,@.len-1)
set @.char = Substring(@.current,@.currentloc,1)
set @.endnotfound = 0;
while (@.char != @.textdelimiter)
begin

if (@.debug = 1) print 'in while loop searching for ' + @.textdelimiter + ' cc = ' + @.char;

set @.currentloc = @.currentloc + 1;
if (@.currentloc = @.totallinelen + 1)
begin
set @.endnotfound = 1;
break;
end
set @.char = Substring(@.current,@.currentloc,1)
end
if (@.debug = 1) print 'found text ending delimintor';
if (@.endnotfound = 1)
begin
print 'ERROR: in input file NO text terminator found';
set @.outline = @.inline;
return;
end

set @.currentloc = @.currentloc + 1;
set @.char = Substring(@.current,@.currentloc,1)

if ((@.char != @.terminator) and (@.currentloc != @.len -1))
begin
print 'ERROR: in input file no terminator after text deliminitor';
set @.outline = @.inline;
return;
end

-- now reset data to go on
set @.len = len(@.current);
if (@.last = 1)
begin
set @.currentloc = @.len;
set @.end = @.len;
end
else
begin
set @.end = @.currentloc - 1;
end

set @.word = Substring(@.current,1,@.end);
end -- found text del. string
end -- have text delimited operations


set @.colcount = @.colcount + 1;
if (@.debug = 1) print 'doing column checks colcount = ' + cast(@.colcount as varchar(15)) + ' columns = ' + cast(@.columns as varchar(15));
if (@.colcount < @.columns)
set @.outline = @.outline + @.word + @.terminator;
if (@.colcount = @.columns)
set @.outline = @.outline + @.word;

else
if (@.last = 1)
begin
while (@.colcount < @.columns-1)
begin
set @.outline = @.outline + @.terminator;
set @.colcount = @.colcount + 1;
end
end

if (@.last != 1)
begin
set @.current= Substring(@.current,@.currentloc+1,@.len-1);
set @.currentloc = Patindex(@.patindex,@.current);
if (@.currentloc = 0)
begin
set @.last = 1;
end
end
else
begin
set @.done = 1;
set @.currentloc = 0;
end
if (@.debug = 1)
begin
print 'at end of loop outline so far = ' + @.outline;
print ''
end

end -- while loop

select @.outputline = @.outline;
end

|||

Truly Ragged comma delimited formats are very common in the telecommunication world where gigabytes of data in a day are spit out in this format.

It's inefficient in the real time world to pad the rows with commas or whatnot if a line can contain only one column of data or 200.

the row delimiter used to be read before the column delimiter in the DTS world... it seems logical. I do understand that someone simply tried to generalize, but this shouldn't have gone out the door without an option to threat a row of data as a row.

Having said that, is there any way to extend the connection manager/ data source to make the row delimiter to be read first? Any fix available, etc? We're currently stuck using SSIS for this, and I do not want to be releasing hacks such as the ones listed below into production.

But surly if this can not be achieved in a clean way, the managements might believe the techs that SQL2005 is still a beta product and won't be production value until a service pack or two. And thus can be convinced to go with what the techs prefer in the first place Oracle. Although this might sound extreame for a simple import, it's a flamming sword in the fight against letting MS technologies into a unix dominated world.

|||

Sadly I must agree with loki55, it is truly disheartening to see this not available in the main product. It is becoming quite a task now to scope out this new workaround, and our organization and new prospects are looking outside of SSIS for this particular feature in new projects forward. Is there anyone at MS (Kirk?) that can help shed some light on the reality of this matter, is it going to be incorporated into a service pack, or are there any customizations (outside of parsing, then re-parsing) that aren't going to be so tedious to implement?

|||Same problem...

DID the MS FIX it yet?|||

no ms didn't fix it here is our final solution that is better than the sp..

since we normally call dts from a batch file any way I wrote a c program that fixes column issues

and call it first before calling the dts

here is the file

C:\J:\dbd\fixi>fixinput
FixInput (c) 2006 larry kahn - icentrix.com (kahn@.lgk.com)

syntax: inputfile outputfile columns terminator delimiter debug

Note to pass a quote or command put it in quotes ie: fixinput infilename.csv out
filename.csv 10 "," """ 1
to pass a tab character as the terminator pass T ie: fixinput infilename.tx
t outfilename.txt "T" "" 1

here is an example

C:\OPar>cat icxauth.bat
fixinput ICXAUTH.txt FIXEDICXAUTH.txt 17 "|" """ 0
dtexec /F F:\OPar\ICXAUTH.dtsx
C:\OPar>

it will pad extra columns as necessary or delete if needed.

here is the file..

<a href="http://stage1.icentrix.com/fixinput.exe">fixinput</a>

|||Instead of transformation - I wrote a Script Task:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain
Public Sub Main()
Dim rowCount As Int32 = 0, columnCount As Int32 = 0
Dim columns As String(), currentRow As String
Dim sourceFile As String = CStr(Dts.Variables("MySourceFile").Value)
Dim destFile As String = Dts.Connections("MyDestinationFile").ConnectionString
Dim reader As System.IO.StreamReader = New System.IO.StreamReader(sourceFile)
Dim writer As System.IO.StreamWriter = New System.IO.StreamWriter(destFile, False)
Do
currentRow = reader.ReadLine()
writer.Write(currentRow)
columns = currentRow.Split(New [Char]() {Chr(9)})
If rowCount = 0 Then
columnCount = columns.Length
End If

Dim N As Integer
If columns.Length < columnCount Then
For N = 1 To columnCount - columns.Length Step 1
writer.Write(Chr(9))
Next N
End If

writer.Write(Chr(13))
writer.Write(Chr(10))
rowCount += 1
Loop Until reader.Peek = -1
reader.Close()
writer.Flush()
writer.Close()
reader = Nothing
writer = Nothing

Dts.TaskResult = Dts.Results.Success
End Sub
End Class|||

when will the next version be available? It's very helpful to us. Thanks,

|||After reading all the posts I agree that this was a feature in DTS that unfortunately was not carried forward into SSIS. Too bad. When using DTS we used csv instead of .xls because DTS didn't handle xls as well. Now things have seemed to reverse itself.

No comments:

Post a Comment