python - Return the first matching value/column name in a new dataframe

import pandas as pd
import numpy as np
# lowercase 'h' is the hourly alias; recent pandas deprecates the uppercase 'H'
rng = pd.date_range('1/1/2011', periods=6, freq='h')
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                   'B': [0, 1, 2, 3, 4, 5],
                   'C': [0, 1, 2, 3, 4, 5],
                   'D': [0, 1, 2, 3, 4, 5],
                   'E': [1, 2, 3, 3, 7, 6],
                   'F': [1, 1, 3, 3, 7, 6],
                   'G': [0, 0, 1, 0, 0, 0]},
                  index=rng)

A simple dataframe to help explain my question:

df


                    A   B   C   D   E   F   G
2011-01-01 00:00:00 0   0   0   0   1   1   0
2011-01-01 01:00:00 1   1   1   1   2   1   0
2011-01-01 02:00:00 2   2   2   2   3   3   1
2011-01-01 03:00:00 3   3   3   3   3   3   0
2011-01-01 04:00:00 4   4   4   4   7   7   0
2011-01-01 05:00:00 5   5   5   5   6   6   0

When I filter for values greater than or equal to 2, I get the following output:

df[df >= 2]

                     A  B   C   D   E   F   G
2011-01-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN
2011-01-01 01:00:00 NaN NaN NaN NaN 2.0 NaN NaN
2011-01-01 02:00:00 2.0 2.0 2.0 2.0 3.0 3.0 NaN
2011-01-01 03:00:00 3.0 3.0 3.0 3.0 3.0 3.0 NaN
2011-01-01 04:00:00 4.0 4.0 4.0 4.0 7.0 7.0 NaN
2011-01-01 05:00:00 5.0 5.0 5.0 5.0 6.0 6.0 NaN
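The NaN cells and the float values in this output follow from how indexing with a boolean dataframe works: cells where the condition is False are replaced with NaN, and that forces integer columns to float. A minimal sketch on a small hypothetical frame (the name `small` and its data are assumptions, not part of the question), suggesting that `small[small >= 2]` behaves like `small.where(small >= 2)`:

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration.
small = pd.DataFrame({"A": [0, 2, 5], "E": [1, 3, 0]})

# Boolean-frame indexing keeps the shape and blanks non-matching cells.
masked = small[small >= 2]

# Compare against the explicit where() spelling.
same = masked.equals(small.where(small >= 2))

print(masked)
print(same)
print(masked.dtypes.tolist())
```

Because the shape is preserved, the remaining problem is only to locate the first non-NaN cell in each row.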

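For each row, I want to know which column is the first to hold a matching value (working from left to right). For example, in the row for 2011-01-01 01:00:00 it is column E, and its value is 2.0.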


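Desired output:

What I want is a new dataframe with the first matching value in a column named "Value", plus another column named "From Col" that holds the name of the column the value came from.

If no match is found, output the last column (G in this case). Thanks for your help.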

                       "Value" "From Col"   
    2011-01-01 00:00:00    NaN  G
    2011-01-01 01:00:00    2    E
    2011-01-01 02:00:00    2    A
    2011-01-01 03:00:00    3    A
    2011-01-01 04:00:00    4    A
    2011-01-01 05:00:00    5    A

Best answer

Try this:

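def get_first_valid(ser):
    # Returns [first non-NaN value, its label] for one row of the masked frame.
    if len(ser) == 0:
        return pd.Series([np.nan, np.nan])

    mask = pd.isnull(ser.values)
    i = mask.argmin()  # position of the first non-NaN entry (0 if all are NaN)
    if mask[i]:
        # Every entry is NaN: no match, so report the last label.
        return pd.Series([np.nan, ser.index[-1]])
    else:
        # .iloc makes the lookup positional, since the index holds column labels.
        return pd.Series([ser.iloc[i], ser.index[i]])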


In [113]: df[df >= 2].apply(get_first_valid, axis=1)
Out[113]:
                       0  1
2011-01-01 00:00:00  NaN  G
2011-01-01 01:00:00  2.0  E
2011-01-01 02:00:00  2.0  A
2011-01-01 03:00:00  3.0  A
2011-01-01 04:00:00  4.0  A
2011-01-01 05:00:00  5.0  A
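The result above has columns named 0 and 1, while the question asks for "Value" and "From Col". One possible way to get those names is to label the Series that the function returns. This is a sketch rather than part of the original answer: `get_first_valid_named` and `NAMES` are hypothetical names, and the index is built from explicit start and end timestamps to sidestep frequency aliases.

```python
import numpy as np
import pandas as pd

NAMES = ["Value", "From Col"]  # column names requested in the question

def get_first_valid_named(ser):
    """Hypothetical variant of get_first_valid that labels its two outputs."""
    if len(ser) == 0:
        return pd.Series([np.nan, np.nan], index=NAMES)
    mask = pd.isnull(ser.values)
    i = mask.argmin()
    if mask[i]:
        # No match in this row: fall back to the last column label.
        return pd.Series([np.nan, ser.index[-1]], index=NAMES)
    return pd.Series([ser.iloc[i], ser.index[i]], index=NAMES)

# Same data as the question; six evenly spaced hourly timestamps.
rng = pd.date_range("2011-01-01 00:00", "2011-01-01 05:00", periods=6)
df = pd.DataFrame({"A": [0, 1, 2, 3, 4, 5],
                   "B": [0, 1, 2, 3, 4, 5],
                   "C": [0, 1, 2, 3, 4, 5],
                   "D": [0, 1, 2, 3, 4, 5],
                   "E": [1, 2, 3, 3, 7, 6],
                   "F": [1, 1, 3, 3, 7, 6],
                   "G": [0, 0, 1, 0, 0, 0]},
                  index=rng)

result = df[df >= 2].apply(get_first_valid_named, axis=1)
print(result)
```

Assigning `result.columns = NAMES` after the original call should work just as well; labelling inside the function simply keeps the names next to the logic that produces them.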

Or:

In [114]: df[df >= 2].T.apply(get_first_valid).T
Out[114]:
                       0  1
2011-01-01 00:00:00  NaN  G
2011-01-01 01:00:00    2  E
2011-01-01 02:00:00    2  A
2011-01-01 03:00:00    3  A
2011-01-01 04:00:00    4  A
2011-01-01 05:00:00    5  A

PS: I took the source code of the Series.first_valid_index() function and made a dirty hack out of it...
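For larger frames, a row-wise apply can be slow. A possible vectorized alternative, which is not the method used in this answer, is to take `idxmax` over the boolean mask for the column name and a backward fill along the columns for the value. The sketch below assumes the same data as the question and the requested output names:

```python
import pandas as pd

rng = pd.date_range("2011-01-01 00:00", "2011-01-01 05:00", periods=6)
df = pd.DataFrame({"A": [0, 1, 2, 3, 4, 5],
                   "B": [0, 1, 2, 3, 4, 5],
                   "C": [0, 1, 2, 3, 4, 5],
                   "D": [0, 1, 2, 3, 4, 5],
                   "E": [1, 2, 3, 3, 7, 6],
                   "F": [1, 1, 3, 3, 7, 6],
                   "G": [0, 0, 1, 0, 0, 0]},
                  index=rng)

mask = df >= 2
has_match = mask.any(axis=1)

# idxmax returns the first column holding True; rows without any True
# would report the first column, so they are redirected to the last one.
from_col = mask.idxmax(axis=1).where(has_match, df.columns[-1])

# Back-filling along the columns pulls the first non-NaN value into the
# leftmost column; rows with no match stay NaN.
value = df.where(mask).bfill(axis=1).iloc[:, 0]

result = pd.DataFrame({"Value": value, "From Col": from_col})
print(result)
```

The trade-off is readability: the apply-based function states the rule directly, while this version relies on knowing how `idxmax` treats ties.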

Explanation of how get_first_valid finds the position:

In [221]: ser = pd.Series([np.nan, np.nan, 5, 7, np.nan])

In [222]: ser
Out[222]:
0    NaN
1    NaN
2    5.0
3    7.0
4    NaN
dtype: float64

In [223]: mask = pd.isnull(ser.values)

In [224]: mask
Out[224]: array([ True,  True, False, False,  True], dtype=bool)

In [225]: i = mask.argmin()

In [226]: i
Out[226]: 2

In [227]: ser.index[i]
Out[227]: 2

In [228]: ser[i]
Out[228]: 5.0
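The walkthrough covers a Series that contains at least one valid value. The `if mask[i]` check in the function handles the other case: when every entry is NaN, the mask is all True, `argmin()` still returns position 0, and `mask[0]` being True is what signals that nothing was found. A small sketch of that edge case (the name `all_nan` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# A Series with no valid values at all.
all_nan = pd.Series([np.nan, np.nan, np.nan])

mask = pd.isnull(all_nan.values)
i = mask.argmin()  # argmin reports the first position of the minimum

# With an all-True mask the minimum is True itself, found at position 0.
print(i)
print(mask[i])
```

Without that check, the function would report the first entry as a match even though it is NaN.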