[R] grep won't work finding one column

Discussion:

Kate Ignatius

10 years ago

I'm having an issue with grep:

I have numerous columns that end with .at... when I use grep like so:

df[,grep(".at",colnames(df))]

it works fine. When I have one column that ends with .at, it does not
work. Why is that? As this is a loop with a varying number of columns
ending in .at, I would like some code that would work with 1 to n
columns.

Is there something more optimal than grep?

Thanks!

John McKown

10 years ago

I can't answer your direct question. But do you realize that your code
does not match your words? The grep shown does not _only_ match columns
whose names end with the characters '.at'. It matches all column names
which contain any character followed by the characters "at". To match
only columns whose names end with the characters ".at", you
need: grep("\.at$",colnames(df)).

You might want to post an example which fails. Just to be complete, be
sure to use the dput() function so that it is easy for members of the
group to cut'n'paste to get your data into our own R workspace.
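
As a rough illustration of the dput() suggestion, the sketch below builds a small made-up data frame (the name dat and its values are hypothetical, not the poster's data) and shows that the text dput() prints can be evaluated to rebuild the object, which is what makes it convenient to paste into a message.

```r
# Hypothetical two-column data frame standing in for the real data.
dat <- data.frame(sample1.at = c("RR", "AA"),
                  sample1.dp = c(12L, 30L),
                  stringsAsFactors = FALSE)

# dput() prints an R expression that recreates the object.
dput(dat)

# Round trip: capture that text and evaluate it to rebuild the data frame.
txt <- paste(capture.output(dput(dat)), collapse = "\n")
rebuilt <- eval(parse(text = txt))

# Compare the rebuilt object with the original.
print(all.equal(dat, rebuilt))
```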

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha! <><
John McKown

Kate Ignatius

10 years ago

For example,

DF will usually have numerous columns with sample1.at sample1.dp
sample1.fg sample2.at sample2.dp sample2.fg and so on....

I'm running this code in R as part of a shell script which runs over
several different file sizes so sometimes it will come across a file
with one sample in it: i.e. sample1: when the R code runs through this
file... trying to grep out the "sample1.at" column does not work and
it will halt and stop.

Here is some sample data... say I want to get out the AT_ only column....

Sample_1 AT_1
A/A RR
G/G AA
T/T AA
G/A RA
G/G RR
C/C AA
C/C AA
C/T RA
A/A AA
T/G RA

it will have a problem grepping out this single column.

On Tue, Oct 14, 2014 at 10:38 AM, John McKown

...

John McKown

10 years ago

AT and at are not the same. If you want a case-insensitive comparison
for the characters "at", you need to add "ignore.case=TRUE". E.g.:

df[,grep(".at",colnames(df),ignore.case=TRUE)]

That should match the column name you gave. Which does not match your
initial description which said "ending with .at". That has an embedded
AT. So I am still a bit confused about your needs.
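
A quick check with the two column names from the sample data suggests a caveat: the leading "." still has to match some character, so ".at" does not match "AT_1" even with ignore.case=TRUE, because nothing precedes the "AT". The variable names below are illustrative only.

```r
# Column names taken from the sample data posted in the thread.
cols <- c("Sample_1", "AT_1")

# "." must match one character before "at"; "AT_1" has none, so no match.
hit_dot <- grep(".at", cols, ignore.case = TRUE)

# Dropping the "." matches "AT_1" once case is ignored.
hit_plain <- grep("at", cols, ignore.case = TRUE)

# Anchoring on the literal prefix is stricter and needs no ignore.case.
hit_prefix <- grep("^AT_", cols)

print(hit_dot)
print(hit_plain)
print(hit_prefix)
```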

...

Ivan Calandra

10 years ago

Shouldn't it be
grep("\\.at$",colnames(df))
with double back slash?

Ivan

--
Ivan Calandra
University of Reims Champagne-Ardenne
GEGENA? - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
ivan.calandra at univ-reims.fr
https://www.researchgate.net/profile/Ivan_Calandra

...

John McKown

10 years ago

You're right. I don't use regexps in R very much. In most other
languages, a single \ is needed. The R parser is different and I
forgot. Thanks for the heads up.

On Tue, Oct 14, 2014 at 10:01 AM, Ivan Calandra

...

Jeff Newmiller

10 years ago

Your question is missing a reproducible example, and you don't say how it does not work, so we cannot tell what is going on.

Two things do come to mind, though.

A) Data frame subsets with only one column by default return a vector, which is a different type of object than a single-column data frame. You would need to read ?"[.data.frame" about the "drop" argument if you wanted to consistently get a data frame from this expression.

B) The period is a wildcard in regular expressions. If you expect to limit your search to literal ".at" at the end of the name then you should use the search pattern "\\.at$" instead (the first slash allows the second one to be stored by R in the string, and the second one is the only one seen by grep, which it reads as making the period not act like a wildcard). You really should read about regular expressions before using them. There are many tutorials on the web about this topic.
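
To make point B concrete, here is a small sketch comparing the unescaped and escaped patterns. The column names are invented for illustration ("format" and "data.at.old" are not from the poster's files).

```r
# Hypothetical column names chosen to expose the difference.
cols <- c("sample1.at", "sample1.dp", "format", "data.at.old")

# Unescaped: "." is a wildcard, so any character followed by "at" matches.
loose <- grep(".at", cols, value = TRUE)

# Escaped and anchored: a literal ".at" at the end of the name only.
strict <- grep("\\.at$", cols, value = TRUE)

# fixed = TRUE treats the pattern as plain text (no anchoring available).
literal <- grep(".at", cols, fixed = TRUE, value = TRUE)

print(loose)
print(strict)
print(literal)

# The R string "\\.at$" holds a single backslash: five characters in total.
print(nchar("\\.at$"))
```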

---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

...

Kate Ignatius

10 years ago

In the sense - it does not work. it works when there are 50 samples
in the file, but it does not work when there is one.

The usual headings are: sample1.at sample1.dp
sample1.fg sample2.at sample2.dp sample2.fg.... and so on to a max of
sample50.at sample50.dp sample50.fg

using this greps out all the .at columns perfectly:

df[,grep(".at",colnames(df))]

When I come across a file when there is one sample:

sample1.at sample1.dp sample1.fg

Using this:

df[,grep(".at",colnames(df))]

returns nothing.

Oh - AT/at was just an example... that's not my problem...

On Tue, Oct 14, 2014 at 10:57 AM, Jeff Newmiller

...

Rolf Turner

10 years ago

Post by Kate Ignatius
In the sense - it does not work. it works when there are 50 samples
in the file, but it does not work when there is one.
The usual headings are: sample1.at sample1.dp
sample1.fg sample2.at sample2.dp sample2.fg.... and so on to a max of
sample50.at sample50.dp sample50.fg
df[,grep(".at",colnames(df))]
sample1.at sample1.dp sample1.fg
df[,grep(".at",colnames(df))]
returns nothing.
Oh - AT/at was just an example... thats not my problem...

You are being (deliberately?) obtuse.

It's *all* your problem. You have to be precise when working with
computers and when providing examples. Don't build examples with
confusing red herrings.

Your assertion that "df[,grep(".at",colnames(df))] returns nothing" is
simply ***INCORRECT***. It works just fine. See the (tidy, completely
reproducible) example in the attached file "kate.txt".

Note that, with a single ".at" column in your data frame, what is
returned is ***NOT*** a data frame but rather a vector. If you want a
(one-column) data frame you need to use "drop=FALSE" in your
subscripting call.
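
A minimal sketch of the drop=FALSE behaviour, assuming a three-column data frame like the single-sample file described in the thread (the object name d and its values are made up):

```r
# Hypothetical single-sample data frame with one ".at" column.
d <- data.frame(sample1.at = 1:3, sample1.dp = 4:6, sample1.fg = 7:9)
idx <- grep("\\.at$", colnames(d))

# Default: a single selected column is dropped to a plain vector.
as_vector <- d[, idx]

# drop = FALSE keeps a one-column data frame, whatever length idx has.
as_frame <- d[, idx, drop = FALSE]

# List-style indexing (no comma) also returns a data frame.
as_frame2 <- d[idx]

print(is.data.frame(as_vector))
print(is.data.frame(as_frame))
print(colnames(as_frame))
```

Code that later calls ncol() or colnames() on the result would behave the same for 1 or n matching columns with either of the last two forms.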

You need to study up on R and learn how it works (read the Introduction
to R) and stop going off half-cocked.

cheers,

Rolf Turner

P.S. It is a ***bad*** idea to use "df" as the name of a data frame.
The string "df" is the name of a *function* in base R (it is the
probability density function for the F distribution). Although R is
clever enough to distinguish functions from data objects in *most*
circumstances, at the very least confusion could arise.
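
A small sketch of the kind of confusion meant here: when no data frame called df exists, the name still resolves to the density function, so subsetting fails with an error about a function rather than a missing object. The example refers to stats::df explicitly so it does not depend on what else is in the workspace.

```r
# The base function that shares the name: the F distribution density.
print(is.function(stats::df))

# It can be called like any other density function.
dens <- stats::df(1, df1 = 2, df2 = 3)
print(dens)

# Subsetting the function, as happens if a data frame named df was never
# created, raises an error; capture its message instead of stopping.
msg <- tryCatch(stats::df[, 1], error = function(e) conditionMessage(e))
print(msg)
```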

R. T.
--
Rolf Turner
Technical Editor ANZJS
-------------- next part --------------
#
# Check it out.
#

# Data frame with one ".at" column.
d1 <- as.data.frame(matrix(1,ncol=3,nrow=10))
n1 <- c("sample1.at","sample1.dp","sample1.fg")
names(d1) <- n1

# Data frame with many ".at" columns.
d2 <- as.data.frame(matrix(1,ncol=50,nrow=10))
set.seed(42)
n2 <- paste("sample",1:50,sample(c(".at",".dp",".fg"),50,TRUE),sep="")
names(d2) <- n2

# Extract the ".at" columns.
print(d1[,grep(".at",colnames(d1))])
print(d2[,grep(".at",colnames(d2))])