Thursday, December 14, 2017

Working with current and previous rows in text files

An issue came up today where the vendor was sending a text file, but the text for some of the lines was on multiple lines, rather than a single line.  After researching, I found out that I needed to compare a line of text and based on certain criteria append it to the prior line.  I couldn't just strip certain end of line markers because it was mixed.

I am using a dummy test file in this example, but here is how everything played out:

  • We get a text file from a vendor and expected it to look similar to this:

 

  • Instead, the file looked like this: 

 

In this dummy file, the field "likes" represents a string field the users have and use for a variety of comments.  They will use hard returns, commas, and double quotes freely - all of which are used as possible delimiters.

The automated process that consumes this file uses an SSIS package which looks for a LF as the end of line character. 

The issues with this file which prevented it from being imported properly were as follows:

  • Line 4 and 5: the LF at the end of line 4 prevents line 5 from being included with line 4.  Therefore, line 5 needs to be appended to line 4.
  • Lines 8, 9 and 10: the LF at the end of line 8 prevents line 9 from being included with line 8; in addition, the CRLF in some import processes may or may not include line 10 with line 9 (and subsequently line 8).

For this purpose, I am able to manipulate the data enough to replace a LF or CRLF that is part of the current line easily.  However, that wont solve combining lines as necessary, for some, not all of the rows in the file.  Therefore, I had to have a solution which compared each line and appended it to the previous line or not, depending on certain criteria.

This may not be the most elegant solution, but in the short timeframe I had (isn't someone always breathing down our necks?), I came up with this:

  • Parse the file
  • Ignore the header line
  • Find the first tab in each line and get whatever is before it in a string variable
    • If the line didn't have a tab - then it needs to be appended to the previous line
    • If the value before the tab is not numeric - then it needs to be appended to the previous line
    • If the value before the tab is numeric (the first field is ID and is always numeric) - then it is a good line and doesn't need anything done to it
    • (Keep in mind... if the value before the tab is numeric, but it should be appended to the previous line, this wont work for that line.  I realize that the probability of that happening is low and therefore, will be looking for other ways of handling this scenario when I have more time.)

I decided it was easier to write the output to a new file, rather than update the existing file.  That way, I didn't have to re-read the previous line and manipulate it, I could just save the value in a variable, add the current line to the previous line when needed, and write it to a new file.

There are many posts online that talk about how Get-Content is slow for larger files, so I decided to use the .NET StreamReader instead.  Here is the code I came up with - once all was said and done.  Yes, I will be adding more checks and balances in it, and more logging, but this works and does what I need it to.

Click here for the powershell script.  This will provide the current version, even when its updated.  To see prior versions, please use the downloads page from the main menu.

 

 

Archives

Powered by mod LCA