Discussion:
[R] grep won't work finding one column
Kate Ignatius
10 years ago
I'm having an issue with grep:

I have numerous columns that end with .at... when I use grep like so:

df[,grep(".at",colnames(df))]

it works fine. When I have one column that ends with .at, it does not
work. Why is that? As this is a loop with a varying number of columns
ending in .at, I would like some code that would work with 1 to n
columns.

Is there something more optimal than grep?

Thanks!
John McKown
10 years ago
Post by Kate Ignatius
df[,grep(".at",colnames(df))]
it works fine. When I have one column that ends with .at, it does not
work. Why is that? As this is loop with varying number of columns
ending in .at I would like some code that would work with 1 to n
number of columns.
Is there something more optimal than grep?
Thanks!
I can't answer your direct question. But do you realize that your code
does not match your words? The grep shown does not _only_ match columns
whose names end with the characters '.at'. It matches all column names
which contain any character followed by the characters "at". To do the
match with only columns whose names end with the characters ".at", you
need: grep("\.at$",colnames(df)).

You might want to post an example which fails. Just to be complete, be
sure to use the dput() function so that it is easy for members of the
group to cut'n'paste to get your data into our own R workspace.
--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown
Kate Ignatius
10 years ago
For example,

DF will usually have numerous columns with sample1.at sample1.dp
sample1.fg sample2.at sample2.dp sample2.fg and so on....

I'm running this code in R as part of a shell script which runs over
several different file sizes so sometimes it will come across a file
with one sample in it: i.e. sample1: when the R code runs through this
file... trying to grep out the "sample1.at" column does not work and
it will halt and stop.

Here is some sample data... say I want to get out the AT_ only column....


Sample_1 AT_1
A/A RR
G/G AA
T/T AA
G/A RA
G/G RR
C/C AA
C/C AA
C/T RA
A/A AA
T/G RA

it will have a problem grepping out this single column.

On Tue, Oct 14, 2014 at 10:38 AM, John McKown
...
John McKown
10 years ago
AT and at are not the same. If you want a case-insensitive comparison
for the characters "at", you need to add "ignore.case=TRUE". E.g.:

df[,grep("at",colnames(df),ignore.case=TRUE)]

That should match the column name you gave. (Note the pattern is "at",
not ".at": the leading period requires some character before the "at",
and "AT_1" has none.) Which does not match your
initial description which said "ending with .at". That has an embedded
AT. So I am still a bit confused about your needs.
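
To check how these patterns behave against the sample header posted above (Sample_1, AT_1), here is a small sketch. The vector name nms and the anchored pattern "^AT_" are illustrative assumptions, not something from the original script.

```r
# Column names taken from the sample data posted above.
nms <- c("Sample_1", "AT_1")

# Case-sensitive ".at": there is no lowercase "at" in either name.
m1 <- grep(".at", nms)

# ".at" with ignore.case = TRUE: the "." still needs one character
# before "at", and "AT_1" starts with "AT", so nothing matches.
m2 <- grep(".at", nms, ignore.case = TRUE)

# Plain "at" with ignore.case = TRUE matches "AT_1".
m3 <- grep("at", nms, ignore.case = TRUE)

# An anchored, case-sensitive alternative for names that begin with "AT_".
m4 <- grep("^AT_", nms)

print(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4))
```

The first two calls return integer(0); the last two return 2, the position of "AT_1".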
...
Ivan Calandra
10 years ago
Shouldn't it be
grep("\\.at$",colnames(df))
with double back slash?

Ivan

--
Ivan Calandra
University of Reims Champagne-Ardenne
GEGENA? - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
ivan.calandra at univ-reims.fr
https://www.researchgate.net/profile/Ivan_Calandra
...
John McKown
10 years ago
You're right. I don't use regexps in R very much. In most other
languages, a single \ is needed. The R parser is different and I
forgot. Thanks for the heads up.
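
A short sketch of the escaping point, for anyone following along. The column names in nms are made up for illustration; endsWith() is a base R function (R 3.3.0 or later) offered here as a regex-free alternative, not something proposed in the thread.

```r
# In R source, "\\" inside a string literal is ONE backslash character,
# so the regex engine receives \.at$ (five characters).
# Writing "\.at$" with a single backslash is a parse error in R.
pat <- "\\.at$"
len <- nchar(pat)
cat(pat, "\n")

# Hypothetical column names, chosen to show the difference.
nms <- c("sample1.at", "sample1.dp", "format", "sample1.at.bak")

loose  <- grepl(".at", nms)     # "." is a wildcard: also hits "format"
strict <- grepl(pat, nms)       # literal ".at" at the end of the name only
plain  <- endsWith(nms, ".at")  # same result as strict, without a regex

print(data.frame(nms, loose, strict, plain))
```

With these names, the loose pattern matches three of the four (including "format"), while the anchored pattern and endsWith() match only "sample1.at".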

On Tue, Oct 14, 2014 at 10:01 AM, Ivan Calandra
...
Jeff Newmiller
10 years ago
Your question is missing a reproducible example, and you don't say how it does not work, so we cannot tell what is going on.

Two things do come to mind, though.

A) Data frame subsets with only one column by default return a vector, which is a different type of object than a single-column data frame. You would need to read ?"[.data.frame" about the "drop" argument if you wanted to consistently get a data frame from this expression.

B) The period is a wildcard in regular expressions. If you expect to limit your search to literal ".at" at the end of the name then you should use the search pattern "\\.at$" instead (the first slash allows the second one to be stored by R in the string, and the second one is the only one seen by grep, which it reads as making the period not act like a wildcard). You really should read about regular expressions before using them. There are many tutorials on the web about this topic.
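
Point A can be seen in a few lines. This is only a sketch with a made-up three-column data frame (d1, idx, v and k are illustrative names). Since the failing step was never posted, the idea that later code trips over ncol() or colnames() returning NULL is an assumption about what "does not work" means here.

```r
# Made-up data frame with a single ".at" column.
d1 <- data.frame(sample1.at = c(1, 2, 3),
                 sample1.dp = c(4, 5, 6),
                 sample1.fg = c(7, 8, 9))

idx <- grep("\\.at$", colnames(d1))

# Default behaviour: a single selected column is dropped to a plain vector.
v <- d1[, idx]
print(is.data.frame(v))  # FALSE
print(ncol(v))           # NULL, which can break later code expecting a data frame

# With drop = FALSE the result stays a one-column data frame.
k <- d1[, idx, drop = FALSE]
print(is.data.frame(k))  # TRUE
print(ncol(k))           # 1
```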

---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
...
Kate Ignatius
10 years ago
In the sense - it does not work. it works when there are 50 samples
in the file, but it does not work when there is one.

The usual headings are: sample1.at sample1.dp
sample1.fg sample2.at sample2.dp sample2.fg.... and so on to a max of
sample50.at sample50.dp sample50.fg

using this greps out all the .at columns perfectly:

df[,grep(".at",colnames(df))]

When I come across a file when there is one sample:

sample1.at sample1.dp sample1.fg

Using this:

df[,grep(".at",colnames(df))]

returns nothing.

Oh - AT/at was just an example... that's not my problem...



On Tue, Oct 14, 2014 at 10:57 AM, Jeff Newmiller
...
Rolf Turner
10 years ago
Post by Kate Ignatius
In the sense - it does not work. it works when there are 50 samples
in the file, but it does not work when there is one.
The usual headings are: sample1.at sample1.dp
sample1.fg sample2.at sample2.dp sample2.fg.... and so on to a max of
sample50.at sample50.dp sample50.fg
df[,grep(".at",colnames(df))]
sample1.at sample1.dp sample1.fg
df[,grep(".at",colnames(df))]
returns nothing.
Oh - AT/at was just an example... thats not my problem...
You are being (deliberately?) obtuse.

It's *all* your problem. You have to be precise when working with
computers and when providing examples. Don't build examples with
confusing red herrings.

Your assertion that "df[,grep(".at",colnames(df))] returns nothing" is
simply ***INCORRECT***. It works just fine. See the (tidy, completely
reproducible) example in the attached file "kate.txt".

Note that, with a single ".at" column in your data frame, what is
returned is ***NOT*** a data frame but rather a vector. If you want a
(one-column) data frame you need to use "drop=FALSE" in your
subscripting call.
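
For the "1 to n columns" requirement in the original question, one possible shape is a small wrapper that always returns a data frame. This is a sketch only: at_columns is a hypothetical helper name, and the three test data frames are made up.

```r
# Hypothetical helper: returns a data frame whether 0, 1 or many
# column names end in the literal string ".at".
at_columns <- function(dat, pattern = "\\.at$") {
  dat[, grep(pattern, colnames(dat)), drop = FALSE]
}

one  <- data.frame(sample1.at = 1:3, sample1.dp = 4:6, sample1.fg = 7:9)
many <- data.frame(sample1.at = 1:3, sample1.dp = 4:6,
                   sample2.at = 7:9, sample2.dp = 10:12)
none <- data.frame(sample1.dp = 1:3, sample1.fg = 4:6)

# The result is a data frame in every case, so ncol() is always defined.
print(ncol(at_columns(one)))   # 1
print(ncol(at_columns(many)))  # 2
print(ncol(at_columns(none)))  # 0
```

Because the return type no longer depends on how many columns match, downstream code in a loop over files does not need a special case for single-sample input.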

You need to study up on R and learn how it works (read the Introduction
to R) and stop going off half-cocked.

cheers,

Rolf Turner

P.S. It is a ***bad*** idea to use "df" as the name of a data frame.
The string "df" is the name of a *function* in base R (it is the
probability density function for the F distribution). Although R is
clever enough to distinguish functions from data objects in *most*
circumstances, at the very least confusion could arise.
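
A brief sketch of the kind of confusion meant here. The names f, dens and msg are illustrative; the exact wording of the error message may vary between R versions, so it is printed rather than assumed.

```r
# stats::df is the density function of the F distribution.
f <- stats::df
print(is.function(f))  # TRUE

# Density of F(2, 3) evaluated at x = 1.
dens <- f(1, df1 = 2, df2 = 3)
print(dens)

# If a script reaches df[, ...] before any data frame named df has been
# created, R finds the function instead, and subsetting a function fails.
msg <- tryCatch(f[, 1], error = function(e) conditionMessage(e))
print(msg)
```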

R. T.
--
Rolf Turner
Technical Editor ANZJS
-------------- next part --------------
#
# Check it out.
#

# Data frame with one ".at" column.
d1 <- as.data.frame(matrix(1,ncol=3,nrow=10))
n1 <- c("sample1.at","sample1.dp","sample1.fg")
names(d1) <- n1

# Data frame with many ".at" columns.
d2 <- as.data.frame(matrix(1,ncol=50,nrow=10))
set.seed(42)
n2 <- paste("sample",1:50,sample(c(".at",".dp",".fg"),50,TRUE),sep="")
names(d2) <- n2

# Extract the ".at" columns.
print(d1[,grep(".at",colnames(d1))])
print(d2[,grep(".at",colnames(d2))])