Parsing textfiles with LINQ (or LINQ-to-TextReader)

Reading and parsing files is not a difficult task with the .NET Framework. The System.IO namespace has several good classes to help with it.

We can use streams and readers to get the data from the file.

When using a StreamReader, we state our intent in an imperative way. We write loops to iterate over all the lines, maybe after skipping a header row, complete with logic to detect when we have reached the end of the file.
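As a point of comparison, here is a minimal sketch of that imperative style. The class and method names (ImperativeParsing, ReadRows) are made up for illustration, and the tab-separated layout with one header row is an assumption borrowed from the examples further down.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ImperativeParsing
{
    // Hypothetical helper: reads tab-separated rows and skips the header row.
    public static List<string[]> ReadRows(TextReader reader)
    {
        var rows = new List<string[]>();

        // Consume the header row; ReadLine returns null when the input is empty.
        if (reader.ReadLine() == null)
        {
            return rows;
        }

        string line;
        // ReadLine returns null once the end of the input has been reached.
        while ((line = reader.ReadLine()) != null)
        {
            rows.Add(line.Split('\t'));
        }

        return rows;
    }
}
```

The header handling, the end-of-file check and the splitting are all spelled out by hand, which is the bookkeeping the rest of this post tries to get rid of.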

LINQ shows us alternative ways to write code, introducing a more declarative coding paradigm. To use LINQ over the lines of a file, we can read all the lines in the file into a collection, and use LINQ over that collection. There’s some overhead to this: we need to read the entire file up front and fit the entire file in memory at once.
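A rough sketch of that read-everything-first approach could look like this. File.ReadAllLines is a real framework method; the EagerParsing class, the ReadLastNames method and the column layout are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class EagerParsing
{
    // Hypothetical helper: returns the second column of every data row.
    public static List<string> ReadLastNames(string path)
    {
        // ReadAllLines loads every line into memory before the query starts.
        string[] lines = File.ReadAllLines(path);

        var query = from line in lines.Skip(1) // skip header row
                    let columns = line.Split('\t')
                    select columns[1];

        return query.ToList();
    }
}
```

The query itself is nicely declarative, but the whole file sits in the lines array for as long as the query runs.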

Can we combine the best of both worlds? Can we use LINQ over a file? No, there are no types dealing with files or streams in the .NET framework that support LINQ yet. But let’s give it a try anyway:

Let’s LINQ-enable the TextReader class.

We can use LINQ over any type that implements IEnumerable<T>, so I’ve written an AsEnumerable() extension method for the TextReader class:

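using System.Collections.Generic;
using System.IO;

public static class LinqToTextReader
{
	public static IEnumerable<string> AsEnumerable(this TextReader reader)
	{
		string line;
		//ReadLine returns null at the end of the input
		while ((line = reader.ReadLine()) != null)
		{
			yield return line;
		}
	}
}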

This will let us use LINQ statements like:

using (var reader = new StreamReader(@"testdata.txt"))
{
	var query = from line in reader.AsEnumerable().Skip(1) //skip header row
				let columns = line.Split('\t')
				select new {
					Firstname = columns[0],
					Lastname = columns[1],
					Age = columns[2]
				};
 
	//...do something with the results of query...
}

This is already simpler than looping over the reader.
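One thing to keep in mind: the query is evaluated lazily, so lines are only read from the reader while the results are being enumerated. That is why the results have to be consumed inside the using block; as far as I know, enumerating after the reader has been disposed ends in an ObjectDisposedException. A minimal sketch, assuming a hypothetical ReadLastNames helper and repeating AsEnumerable so the block compiles on its own:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LinqToTextReader
{
    // Repeated from above so that this block is self-contained.
    public static IEnumerable<string> AsEnumerable(this TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

public static class DeferredExecutionSketch
{
    // Hypothetical helper: takes ownership of the reader and disposes it.
    public static List<string> ReadLastNames(TextReader source)
    {
        using (source)
        {
            var query = from line in source.AsEnumerable().Skip(1) // skip header row
                        let columns = line.Split('\t')
                        select columns[1];

            // ToList enumerates the query now, while the reader is still open.
            return query.ToList();
        }
    }
}
```

Returning the query itself instead of the list would hand the caller a sequence that can no longer be read, because the reader is already closed by then.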

Most of the time I’ll need to split the line into pieces as in the example above, so I created another extension method in the LinqToTextReader class that returns the columns of each line in a delimited text file:

public static IEnumerable<string[]> GetSplittedLines(this TextReader reader, params char[] separators)
{
	foreach (var line in reader.AsEnumerable())
	{
		yield return line.Split(separators);
	}
}

(I just picked one of the overloads of string.Split; a full-fledged LinqToTextReader class could implement the other overloads as well.)
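As an illustration of what one of those extra overloads might look like, here is a sketch that forwards to string.Split(char[], StringSplitOptions). The class name LinqToTextReaderOptions is made up for this example; in a real implementation the method would simply live next to the others.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LinqToTextReaderOptions
{
    // Sketch of an additional overload; not part of the original class.
    public static IEnumerable<string[]> GetSplittedLines(
        this TextReader reader, char[] separators, StringSplitOptions options)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Forwards to the matching string.Split overload.
            yield return line.Split(separators, options);
        }
    }
}
```

Passing StringSplitOptions.RemoveEmptyEntries drops empty columns, which changes the column indexes, so it is only useful when empty fields carry no meaning.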

UPDATE: You’ll find a better and more robust solution in the “The Trouble With Delimited” post.

Now the same query as above can be written like this:

using (var reader = new StreamReader(@"testdata.txt"))
{
	var query = from columns in reader.GetSplittedLines('\t').Skip(1) //skip header row
				select new {
					Firstname = columns[0],
					Lastname = columns[1],
					Age = columns[2]
				};
 
	//...do something with the results of query...
}

6 Replies to “Parsing textfiles with LINQ (or LINQ-to-TextReader)”

  1. this looks great. too bad it doesn’t work. this is one of those annoying posts that leaves out just enough info so that if you couldn’t have done it in the first place anyway you won’t be able to use it.
    where does the reader.GetLines() method come from?
    how does the array “line” get the Split method? let’s just keep it a secret.

    1. Thanks for the response, and sorry for my typo. While working on my post, I changed the name of the GetLines method to AsEnumerable. The GetLines you found must have slipped through…
      Anyway, it’s fixed now, and your second question should be easy to answer now too: line is a string, and strings have a Split method on them…

  2. hi, thanks for that…
    sorry for being rude, was a bit frustrated.
    Expanding on this idea..
    say my split is a bit more complex:

    "Salutation","First, Name","Middle,, Name","Last,,, Name"

    the file is delimited by commas and fields have quotes if there is data. if there happens to be a comma in the quotes we DON’T want to split.

    so i was thinking regex like this:

    public static IEnumerable<string> AsEnumerable(this TextReader reader)
    {
    string expr = "(?<=\"),(?=\")";
    string line;
    while ((line = reader.ReadLine()) != null)
    {
    yield return Regex.Split(line, expr);
    }
    }

    doesn’t work though.. can’t implicitly convert string[] to string.

  3. DUH.. it goes here:

    string expr = "(?<=\"),(?=\")";
    using (var reader = new StreamReader(@"c:\rejects.txt"))
    {
    var query = from line in reader.AsEnumerable()
    //let columns = line.Split(',')
    let columns = Regex.Split(line, expr)
    select new
    {
    Firstname = columns[0],
    Lastname = columns[1],
    Age = columns[2]
    };
    }
