Title: | Fast Access to Large ASCII Files |
---|---|
Description: | Methods for fast access to large ASCII files. Currently the following file formats are supported: comma separated format (CSV) and fixed width format. It is assumed that the files are too large to fit into memory, although the package can also be used to efficiently access files that do fit into memory. Methods are provided to access and process files blockwise. Furthermore, an opened file can be accessed as one would an ordinary data.frame. The LaF vignette gives an overview of the functionality provided. |
Authors: | Jan van der Laan [aut, cre] |
Maintainer: | Jan van der Laan <[email protected]> |
License: | GPL-3 |
Version: | 0.8.6 |
Built: | 2025-02-11 05:58:34 UTC |
Source: | https://github.com/djvanderlaan/laf |
Once a connection to a file has been opened as a "laf" object, this object can be indexed roughly as one would a data.frame.
## S4 method for signature 'laf'
x[i, j, drop]

## S4 method for signature 'laf_column'
x[i, j, drop]
x |
an object of type laf or laf_column. |
i |
a logical or numeric vector with indices: the rows that should be selected. |
j |
a numeric vector with the columns to select. |
drop |
a logical indicating whether or not to convert the result to a vector when only one column is selected, as when indexing a data.frame. |
Selecting columns from an laf object works as it does for a data.frame.
## S4 method for signature 'laf'
x[[i]]

## S4 method for signature 'laf'
x$name
x |
an object of type laf. |
i |
index of column to select. This should be a numeric or character vector. |
name |
the name of the column to select. |
Returns an object of type laf_column. This object behaves almost the same as an laf object, except that it is no longer necessary (or possible) to specify which column should be used for functions that require this.
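As an illustration of the point above, the following minimal sketch selects a column by name and by position and then computes its mean without a columns argument. The file contents and the column names a and b are made up for this example.

```r
library(LaF)

# Small illustrative CSV file without a header line (contents are made up)
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(a = 1:10, b = seq(0.5, 5, by = 0.5))
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = FALSE,
            sep = ",")

laf <- laf_open_csv(tmpcsv, column_types = c("integer", "double"),
                    column_names = c("a", "b"))

# Select the first column by name and by position
col_a <- laf$a
col_1 <- laf[[1]]

# For a laf_column object no columns argument is needed ...
mean_col <- colmean(col_a)
# ... whereas the laf object needs to be told which column to use
mean_laf <- colmean(laf, columns = 1)

# Cleanup
file.remove(tmpcsv)
```

Both calls should yield the same mean, since they refer to the same column of the same file.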
Sets the file pointer to the beginning of the file. The next call to next_block returns the first lines of the file. This method is usually used in combination with next_block.
begin(x, ...)

## S4 method for signature 'laf'
begin(x, ...)
x |
an object that supports the begin method. |
... |
passed to other methods. |
Close the connection to the Large File
## S4 method for signature 'laf'
close(con, ...)
con |
a laf object. |
... |
unused. |
Methods for calculating simple statistics of columns of a file: mean, sum, standard deviation, range (min and max), and number of missing values.
colsum(x, ...)

## S4 method for signature 'laf'
colsum(x, columns, na.rm = TRUE, ...)

## S4 method for signature 'laf_column'
colsum(x, na.rm = TRUE, ...)

colmean(x, ...)

## S4 method for signature 'laf'
colmean(x, columns, na.rm = TRUE, ...)

## S4 method for signature 'laf_column'
colmean(x, na.rm = TRUE, ...)

colfreq(x, ...)

## S4 method for signature 'laf'
colfreq(x, columns, useNA = c("ifany", "always", "no"), ...)

## S4 method for signature 'laf_column'
colfreq(x, na.rm = TRUE, ...)

colrange(x, ...)

## S4 method for signature 'laf'
colrange(x, columns, na.rm = TRUE, ...)

## S4 method for signature 'laf_column'
colrange(x, na.rm = TRUE, ...)

colnmissing(x, ...)

## S4 method for signature 'laf'
colnmissing(x, columns, na.rm = TRUE, ...)

## S4 method for signature 'laf_column'
colnmissing(x, na.rm = TRUE, ...)
x |
an object of type laf or laf_column. |
... |
Currently ignored. |
columns |
a numeric vector with the columns for which the statistics should be calculated. |
na.rm |
whether or not to ignore missing values. By default missing values are ignored. |
useNA |
method with which to treat missing values: "ifany" adds a field containing the number of missing values if there are any; "always" will always add a field with the number of missing values, even when there are none; "no" will never add a field containing the number of missing values. |
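A minimal sketch of how these statistics might be requested for several columns at once. The file contents and column names are made up for this example, and the results are converted with as.numeric only to make them easy to compare.

```r
library(LaF)

# Small illustrative CSV file without a header line (contents are made up)
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(a = 1:10, b = seq(0.5, 5, by = 0.5))
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = FALSE,
            sep = ",")

laf <- laf_open_csv(tmpcsv, column_types = c("integer", "double"),
                    column_names = c("a", "b"))

# Sums and means of both columns, computed by scanning the file
sums  <- colsum(laf, columns = c(1, 2))
means <- colmean(laf, columns = c(1, 2))

# Cleanup
file.remove(tmpcsv)
```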
Get the current line in the file
current_line(x)

## S4 method for signature 'laf'
current_line(x)
x |
an object that supports the current_line method. |
Returns the next line that will be read by next_block.
Automatically detect data models for CSV-files. Opening of files using the data models can be done using laf_open.
detect_dm_csv(
  filename,
  sep = ",",
  dec = ".",
  header = FALSE,
  nrows = 1000,
  nlines = NULL,
  sample = FALSE,
  stringsAsFactors = TRUE,
  factor_fraction = 0.4,
  ...
)
filename |
character containing the filename of the csv-file. |
sep |
character vector containing the separator used in the file. |
dec |
the character used for decimal points. |
header |
does the first line in the file contain the column names. |
nrows |
the number of lines that should be read in to detect the column types. The more lines the more likely that the correct types are detected. |
nlines |
(only needed when the sample option is used) the expected number of lines in the file. If not specified the number of lines in the file is first calculated. |
sample |
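by default the first nrows lines are read in to detect the column types; when TRUE, lines are sampled from the file instead. |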
stringsAsFactors |
passed on to |
factor_fraction |
the fraction of unique strings in a column below which the column is converted to a factor/categorical. For more information see details. |
... |
additional arguments are passed on to |
The argument factor_fraction determines the fraction of unique strings below which the column is converted to factor/categorical. If all columns need to be converted to character, a value larger than one can be used. A value smaller than zero will ensure that all columns will be converted to categorical. Note that LaF stores the levels of a categorical in memory. Therefore, categorical columns with a very large number of (almost) unique levels can cause memory problems.
read_dm returns a data model which can be used by laf_open. The data model can be written to file using write_dm.
See write_dm to write the data model to file. The data models can be used to open a file using laf_open.
# Create temporary filename
tmpcsv <- tempfile(fileext="csv")

# Generate test data
ntest <- 10
column_types <- c("integer", "integer", "double", "string")
testdata <- data.frame(
    a = 1:ntest,
    b = sample(1:2, ntest, replace=TRUE),
    c = round(runif(ntest), 13),
    d = sample(c("jan", "pier", "tjores", "corneel"), ntest, replace=TRUE),
    stringsAsFactors = FALSE
)

# Write test data to csv file
write.table(testdata, file=tmpcsv, row.names=FALSE, col.names=TRUE, sep=',')

# Detect data model
model <- detect_dm_csv(tmpcsv, header=TRUE)

# Create LaF-object
laf <- laf_open(model)

# Cleanup
file.remove(tmpcsv)
Determine number of lines in a text file
determine_nlines(filename)
filename |
character containing the filename of the file of which the lines are to be counted. |
The routine counts the number of line endings. If the last line does not end in a line ending, but does contain characters, this line is also counted.
The file size is not limited by the amount of memory in the computer.
Returns the number of lines in the file.
See readLines to read in all lines of a text file; get_lines and sample_lines can be used to read in specified or random lines.
# Create temporary filename
tmpcsv <- tempfile(fileext="csv")

# Generate file
writeLines(letters[1:20], con=tmpcsv)

# Count the lines
determine_nlines(tmpcsv)

# Cleanup
file.remove(tmpcsv)
Read in specified lines from a text file
get_lines(filename, line_numbers)
filename |
character containing the filename of the file from which the lines should be read. |
line_numbers |
A vector containing the lines that should be read. |
Line numbers larger than the number of lines in the file are ignored. Missing values are returned for these.
Returns a character vector with the specified lines.
See readLines to read in all lines of a text file; sample_lines can be used to read in random lines.
# Create temporary filename
tmpcsv <- tempfile(fileext="csv")
writeLines(letters[1:20], con=tmpcsv)
get_lines(tmpcsv, c(1, 10))

# Cleanup
file.remove(tmpcsv)
Sets the current line to the line number specified. The next call to next_block will return the data on the specified line in the first row. The number of the current line can be obtained using current_line.
goto(x, i, ...)

## S4 method for signature 'laf,numeric'
goto(x, i, ...)
x |
an object that supports the goto method. |
i |
the line number. |
... |
additional parameters passed to other methods. |
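The behaviour described above can be sketched as follows: after jumping to line 3, the next block starts at that line. The file contents and column names are made up for this example.

```r
library(LaF)

# Small illustrative CSV file without a header line (contents are made up)
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(a = 1:10, b = seq(0.5, 5, by = 0.5))
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = FALSE,
            sep = ",")

laf <- laf_open_csv(tmpcsv, column_types = c("integer", "double"),
                    column_names = c("a", "b"))

# Jump to the third line of the file ...
goto(laf, 3)

# ... so that the next block starts with the data on that line
block <- next_block(laf, nrows = 2)

# Cleanup
file.remove(tmpcsv)
```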
Representation of a column in a Large File object. This class itself is a subclass of the class laf. In principle all methods that can be used with a laf object can also be used with a laf_column object, except that the column or columns arguments of these methods are not needed.
Objects of this class are usually created by using the $ operator on laf objects.
Uses a data model to create a connection to a file. The data model contains all the information needed to open the file (column types, column widths, etc.).
laf_open(model, ...)
model |
a data model, such as one returned by read_dm or detect_dm_csv. |
... |
additional arguments can be used to overwrite the values specified
by the data model. These are listed in the argument documentation for
|
Depending on the field ‘type’, laf_open uses laf_open_csv or laf_open_fwf to open the file. The data model should contain all information needed by these routines to open the file.
Object of type laf. Values can be extracted from this object using indexing, and methods such as read_lines and next_block.
See read_dm and detect_dm_csv for ways of creating data models.
# Create some temporary files
tmpcsv <- tempfile(fileext="csv")
tmp2csv <- tempfile(fileext="csv")
tmpyaml <- tempfile(fileext="yaml")

# Generate test data
ntest <- 10
column_types <- c("integer", "integer", "double", "string")
testdata <- data.frame(
    a = 1:ntest,
    b = sample(1:2, ntest, replace=TRUE),
    c = round(runif(ntest), 13),
    d = sample(c("jan", "pier", "tjores", "corneel"), ntest, replace=TRUE)
)

# Write test data to csv file
write.table(testdata, file=tmpcsv, row.names=FALSE, col.names=FALSE, sep=',')

# Create LaF-object
laf <- laf_open_csv(tmpcsv, column_types=column_types)

# Write data model to file
write_dm(laf, tmpyaml)

# Read data model and open file
laf <- laf_open(read_dm(tmpyaml))

# Write test data to second csv file
write.table(testdata, file=tmp2csv, row.names=FALSE, col.names=FALSE, sep=',')

# Read data model and open second file, demonstrating the use of the optional
# arguments to laf_open
laf2 <- laf_open(read_dm(tmpyaml), filename=tmp2csv)

# Cleanup
file.remove(tmpcsv)
file.remove(tmp2csv)
file.remove(tmpyaml)
A connection to the file filename is created. Column types have to be specified. These are not determined automatically as for example read.csv does. This has been done to increase speed.
laf_open_csv(
  filename,
  column_types,
  column_names = paste("V", seq_len(length(column_types)), sep = ""),
  sep = ",",
  dec = ".",
  trim = FALSE,
  skip = 0,
  ignore_failed_conversion = FALSE
)
filename |
character containing the filename of the CSV-file |
column_types |
character vector containing the types of data in each of the columns. Valid types are: double, integer, categorical and string. |
column_names |
optional character vector containing the names of the columns. |
sep |
optional character specifying the field separator used in the file. |
dec |
optional character specifying the decimal mark. |
trim |
optional logical specifying whether or not white space at the end of factor levels or character strings should be trimmed. |
skip |
optional numeric specifying the number of lines at the beginning of the file that should be skipped. |
ignore_failed_conversion |
ignore (set to |
After the connection is created, data can be extracted using indexing (as in a normal data.frame), or methods such as read_lines and next_block can be used to read in blocks. For processing the file in blocks the convenience function process_blocks can be used.
The CSV-file should not contain headers. Use the skip option to skip any headers.
In case of an incomplete line (a line with fewer columns than it should have): when the line is completely empty, the reader stops at that point and considers that as the end of the file. In other cases a warning is issued and the remaining columns are considered empty. For character columns this results in an empty string, for numeric columns in an NA.
Object of type laf. Values can be extracted from this object using indexing, and methods such as read_lines and next_block.
See read.csv for conventional access of CSV files, and detect_dm_csv to automatically determine the column types.
# Create temporary filename
tmpcsv <- tempfile(fileext="csv")

# Generate test data
ntest <- 10
column_types <- c("integer", "integer", "double", "string")
testdata <- data.frame(
    a = 1:ntest,
    b = sample(1:2, ntest, replace=TRUE),
    c = round(runif(ntest), 13),
    d = sample(c("jan", "pier", "tjores", "corneel"), ntest, replace=TRUE)
)

# Write test data to csv file
write.table(testdata, file=tmpcsv, row.names=FALSE, col.names=FALSE, sep=',')

# Create LaF-object
laf <- laf_open_csv(tmpcsv, column_types=column_types)

# Read from file using indexing
first_column <- laf[ , 1]
first_row <- laf[1, ]

# Read from file using blockwise operators
begin(laf)
first_block <- next_block(laf, nrows=2)
second_block <- next_block(laf, nrows=2)

# Cleanup
file.remove(tmpcsv)
A connection to the file filename is created. Column types have to be specified. These are not determined automatically as for example read.fwf does. This has been done to increase speed.
laf_open_fwf(
  filename,
  column_types,
  column_widths,
  column_names = paste("V", seq_len(length(column_types)), sep = ""),
  dec = ".",
  trim = TRUE,
  ignore_failed_conversion = FALSE
)
filename |
character containing the filename of the fixed width file. |
column_types |
character vector containing the types of data in each of the columns. Valid types are: double, integer, categorical and string. |
column_widths |
numeric vector containing the width, in number of characters, of each of the columns. |
column_names |
optional character vector containing the names of the columns. |
dec |
optional character specifying the decimal mark. |
trim |
optional logical specifying whether or not whitespace at the end of factor levels or character strings should be trimmed. |
ignore_failed_conversion |
ignore (set to |
After the connection is created data can be extracted using indexing (as in a normal data.frame) or methods such as read_lines and next_block can be used to read in blocks. For processing the file in blocks the (faster) convenience function process_blocks can be used.
Only use ignore_failed_conversion when you are sure that the column specification is correct. Otherwise, this option can hide an incorrect specification.
Object of type laf. Values can be extracted from this object using indexing, and methods such as read_lines and next_block.
See read.fwf for conventional access of fixed width files.
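A minimal sketch of opening a fixed width file. The file contents, column widths and column names below are made up for this example; each line is exactly 2 + 1 + 5 = 8 characters wide, matching column_widths.

```r
library(LaF)

# Small illustrative fixed width file (contents are made up):
# id (2 characters), gender (1 character), x (5 characters)
tmpfwf <- tempfile(fileext = ".dat")
writeLines(c(" 1M 1.45",
             " 2F12.00",
             " 3M -0.5"), con = tmpfwf)

laf <- laf_open_fwf(tmpfwf,
                    column_types  = c("integer", "string", "double"),
                    column_widths = c(2, 1, 5),
                    column_names  = c("id", "gender", "x"))

# Read (at most) the first ten lines as a data.frame
rows <- next_block(laf, nrows = 10)

# Cleanup
file.remove(tmpfwf)
```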
A Large File object. This is a reference to a dataset on disk. The data itself is not read into memory (yet). This can be done by the methods for blockwise processing or by indexing the object as a data.frame. The code has been optimised for fast access.
Objects can be created by opening a file using one of the methods laf_open_csv or laf_open_fwf. These create a reference to either a CSV file or a fixed width file. The data in these files can be accessed either with blockwise operations, using the methods begin, next_block and goto, or by indexing the laf object as you would a data.frame. In the following example a CSV file is opened and its first column (of type integer) is read into memory:
laf <- laf_open_csv("file.csv", column_types=c("integer", "double"))
data <- laf[ , 1]
Get and change the levels of the column in a Large File object
## S4 method for signature 'laf'
levels(x)

## S4 replacement method for signature 'laf'
levels(x) <- value

## S4 method for signature 'laf_column'
levels(x)

## S4 replacement method for signature 'laf_column'
levels(x) <- value
x |
a laf or laf_column object. |
value |
a list with the levels for each column. |
Get and set the names of the columns in a Large File object
## S4 method for signature 'laf'
names(x)

## S4 replacement method for signature 'laf'
names(x) <- value
x |
a laf object. |
value |
a character vector with the new column names |
Get the number of columns in a Large File object
## S4 method for signature 'laf'
ncol(x)
x |
a laf object. |
Read the next block of data from a file.
next_block(x, ...)

## S4 method for signature 'laf'
next_block(x, columns = 1:ncol(x), nrows = 5000, ...)

## S4 method for signature 'laf_column'
next_block(x, nrows = 5000, ...)
x |
an object that supports the next_block method. |
... |
passed to other methods. |
columns |
an integer vector with the columns that should be read in. |
nrows |
the (maximum) number of rows to read in one block. |
Reads the next block of lines from a file. The method returns a data.frame.
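A minimal sketch of scanning a complete file with next_block. It assumes that, once the end of the file is reached, next_block returns a data.frame with zero rows (as described for process_blocks); the file contents and column names are made up for this example.

```r
library(LaF)

# Small illustrative CSV file without a header line (contents are made up)
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(a = 1:10, b = seq(0.5, 5, by = 0.5))
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = FALSE,
            sep = ",")

laf <- laf_open_csv(tmpcsv, column_types = c("integer", "double"),
                    column_names = c("a", "b"))

# Start at the beginning of the file and read blocks of (at most) 4 rows
begin(laf)
nblocks <- 0
nread   <- 0
repeat {
  block <- next_block(laf, nrows = 4)
  # Assumption: a zero-row data.frame signals the end of the file
  if (nrow(block) == 0) break
  nblocks <- nblocks + 1
  nread   <- nread + nrow(block)
}

# Cleanup
file.remove(tmpcsv)
```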
Get the number of rows in a Large File object
## S4 method for signature 'laf'
nrow(x)
x |
a laf object. |
Reads the specified file block by block and feeds each block to the specified function.
process_blocks(x, fun, ...)

## S4 method for signature 'laf'
process_blocks(
  x,
  fun,
  columns = 1:ncol(x),
  nrows = 5000,
  allow_interupt = FALSE,
  progress = FALSE,
  ...
)
x |
an object that supports the process_blocks method. |
fun |
a function to apply to each block (see details). |
... |
additional parameters are passed on to fun. |
columns |
an integer vector with the columns that should be read in. |
nrows |
the (maximum) number of rows to read in one block |
allow_interupt |
when TRUE the function |
progress |
show a progress bar. Note that this triggers a calculation
of the number of lines in the file which for CSV files can take some time.
When numeric |
The function should accept as the first argument the next block of data. When the end of the file is reached this is an empty (zero row) data.frame. As the second argument the function should accept the output of the previous call to the function. The first time the function is called the second argument has the value NULL.
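The calling convention described above can be sketched with a function that accumulates the sum of a column. The helper name sum_first_column, the file contents and the column names are made up for this example; it is assumed that process_blocks returns the output of the last call to the function.

```r
library(LaF)

# Small illustrative CSV file without a header line (contents are made up)
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(a = 1:10, b = seq(0.5, 5, by = 0.5))
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = FALSE,
            sep = ",")

laf <- laf_open_csv(tmpcsv, column_types = c("integer", "double"),
                    column_names = c("a", "b"))

# Hypothetical helper: 'block' is the next block of data, 'result' the output
# of the previous call (NULL on the first call)
sum_first_column <- function(block, result) {
  if (is.null(result)) result <- 0
  # For the final, zero-row block sum() adds 0, so no special case is needed
  result + sum(block[[1]])
}

# Process the first column in blocks of (at most) 3 rows
total <- process_blocks(laf, sum_first_column, columns = 1, nrows = 3)

# Cleanup
file.remove(tmpcsv)
```

Keeping the running result in the second argument means that only one block at a time has to be held in memory.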
Using these routines data models can be written and read. These data models can be used to create LaF objects without the need to specify all arguments (column names, column types etc.). Opening of files using the data models can be done using laf_open.
read_dm(modelfile, ...)

write_dm(model, modelfile)
modelfile |
character containing the filename of the file the model is to be written to/read from. |
... |
additional arguments are added to the data model or, when they are also present in the file are used to overwrite the values specified in the file. |
model |
a data model or an object of type laf. |
A data model is a list containing information on which open routine should be used (e.g. laf_open_csv or laf_open_fwf), and the arguments needed for these routines. Required elements are ‘type’, which can (currently) be ‘csv’ or ‘fwf’, and ‘columns’, which should be a data.frame containing at least the columns ‘name’ and ‘type’, and for fwf ‘width’. These columns correspond to the arguments column_names, column_types and column_widths respectively. Other arguments of the laf_open_* routines can be specified as additional elements of the list.
write_dm can also be used to write a data model that is created from an object of type laf. This is probably one of the easiest ways to create a data model.
The data model is stored in a text file in YAML format which is a format in which data structures can be stored in a readable and editable format.
read_dm returns a data model which can be used by laf_open.
See detect_dm_csv for a routine which can automatically create a data model from a CSV-file. The data models can be used to open a file using laf_open.
# Create some temporary files
tmpcsv <- tempfile(fileext="csv")
tmp2csv <- tempfile(fileext="csv")
tmpyaml <- tempfile(fileext="yaml")

# Generate test data
ntest <- 10
column_types <- c("integer", "integer", "double", "string")
testdata <- data.frame(
    a = 1:ntest,
    b = sample(1:2, ntest, replace=TRUE),
    c = round(runif(ntest), 13),
    d = sample(c("jan", "pier", "tjores", "corneel"), ntest, replace=TRUE)
)

# Write test data to csv file
write.table(testdata, file=tmpcsv, row.names=FALSE, col.names=FALSE, sep=',')

# Create LaF-object
laf <- laf_open_csv(tmpcsv, column_types=column_types)

# Write data model to stdout() (screen)
write_dm(laf, stdout())

# Write data model to file
write_dm(laf, tmpyaml)

# Read data model and open file
laf2 <- laf_open(read_dm(tmpyaml))

# Write test data to second csv file
write.table(testdata, file=tmp2csv, row.names=FALSE, col.names=FALSE, sep=',')

# Read data model and open second file, demonstrating the use of the optional
# arguments to read_dm
laf2 <- laf_open(read_dm(tmpyaml, filename=tmp2csv))

# Cleanup
file.remove(tmpcsv)
file.remove(tmp2csv)
file.remove(tmpyaml)
Read in Blaise data models
read_dm_blaise(filename, datafilename = NA, encoding = "latin1")
filename |
the filename of the file containing the data model. |
datafilename |
the filename of the data file to which the data model belongs. |
encoding |
the encoding used in the file. See |
The function reads the data model from file and returns a list that can be used by laf_open to open the file for reading. Only a subset of the most common features found in Blaise files is supported. If part of the data model cannot be parsed, a warning is given.
Returns a data model (which is a list containing all the relevant information to open a file using laf_open). When the file contains more than one data model, a list of data models is returned and a warning is issued.
See write_dm to write the data model to file. The data models can be used to open a file using laf_open.
# Create some temporary files
tmpdat <- tempfile(fileext="dat")
tmpbla <- tempfile(fileext="bla")

# Generate test data (each line is 2 + 1 + 5 + 10 = 18 characters wide)
lines <- c(
    " 1M 1.45Rotterdam ",
    " 2F12.00Amsterdam ",
    " 3  .22 Berlin    ",
    "  M22   Paris     ",
    " 4F12345London    ",
    " 5M     Copenhagen",
    " 6M-12.1          ",
    " 7F   -1Oslo      ")
writeLines(lines, con=tmpdat)

# Create a file containing the data model
writeLines(c(
    "DATAMODEL test",
    "FIELDS",
    "  id     : INTEGER[2]",
    "  gender : STRING[1]",
    "  x      : REAL[5] {comment}",
    "  city   : STRING[10]",
    "ENDMODEL"), con=tmpbla)

model <- read_dm_blaise(tmpbla, datafilename=tmpdat)
laf <- laf_open(model)

# Cleanup
file.remove(tmpbla)
file.remove(tmpdat)
Reads the specified lines and columns from the data file.
read_lines(x, ...)

## S4 method for signature 'laf'
read_lines(x, rows, columns = 1:ncol(x), ...)

## S4 method for signature 'laf_column'
read_lines(x, rows, columns = 1:ncol(x), ...)
x |
an object that supports the read_lines method. |
... |
passed on to other methods. |
rows |
a numeric vector with the rows that should be read from the file. |
columns |
an integer vector with the columns that should be read in. |
Note that when scanning through the complete file next_block is much faster. Also note that random file access can be slow (and is always much slower than sequential file access), especially for certain file types such as comma separated. Reading is generally faster when the lines that should be read are sorted.
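A minimal sketch of reading a few specific lines. The row numbers are given in sorted order, in line with the remark above; the file contents and column names are made up for this example.

```r
library(LaF)

# Small illustrative CSV file without a header line (contents are made up)
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(a = 1:10, b = seq(0.5, 5, by = 0.5))
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = FALSE,
            sep = ",")

laf <- laf_open_csv(tmpcsv, column_types = c("integer", "double"),
                    column_names = c("a", "b"))

# Read rows 2, 5 and 9 of the first column only
selected <- read_lines(laf, rows = c(2, 5, 9), columns = 1)

# Cleanup
file.remove(tmpcsv)
```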
Read in random lines from a text file
sample_lines(filename, n, nlines = NULL)
filename |
character containing the filename of the file from which the lines should be read. |
n |
The number of lines that should be sampled from the file. |
nlines |
The total number of lines in the file. If not specified or
|
When nlines is not specified, the total number of lines is first determined. This can take quite some time. Therefore, specifying the number of lines can give a significant speed-up. It can also be used to sample lines from the first nlines lines by specifying a value for nlines that is smaller than the number of lines in the file.
Returns a character vector with the sampled lines.
See readLines to read in all lines of a text file; get_lines can be used to read in specified lines.
# Create temporary filename
tmpcsv <- tempfile(fileext="csv")
writeLines(letters[1:20], con=tmpcsv)
sample_lines(tmpcsv, 10)

# Cleanup
file.remove(tmpcsv)
Print the Large File object to screen
Print a column of a Large File object to screen
## S4 method for signature 'laf'
show(object)

## S4 method for signature 'laf_column'
show(object)
object |
the object to print to screen. |