I have an sframe
as such:
+---------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+---------+------+-------------------------------+-----------+------------------+
| IATE-14 | ro | Agenție de aprovizionare | fullForm | 3 |
| IATE-84 | bg | компетенции на държави чле... | fullForm | 3 |
| IATE-84 | cs | příslušnost členských stát... | fullForm | 3 |
| IATE-84 | da | medlemsstatskompetence | fullForm | 3 |
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-84 | el | αρμοδιότητα των κρατών μελ... | fullForm | 3 |
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-84 | es | competencias de los Estado... | fullForm | 3 |
| IATE-84 | et | liikmesriikide pädevus | fullForm | 3 |
| IATE-84 | fi | jäsenvaltioiden toimivalta | fullForm | 3 |
| IATE-84 | fr | compétence des États membres | fullForm | 3 |
| IATE-84 | ga | inniúlacht na mBallstát | fullForm | 3 |
| IATE-84 | hu | tagállami hatáskör | fullForm | 3 |
| IATE-84 | it | competenza degli Stati membri | fullForm | 3 |
| IATE-84 | lt | valstybių narių kompetencija | fullForm | 2 |
| IATE-84 | lv | dalībvalstu kompetence | fullForm | 3 |
| IATE-84 | nl | bevoegdheid van de lidstaten | fullForm | 3 |
| IATE-84 | pl | kompetencje państw członko... | fullForm | 3 |
| IATE-84 | pt | competência dos Estados-Me... | fullForm | 3 |
| IATE-84 | ro | competența statelor membre... | fullForm | 3 |
+---------+------+-------------------------------+-----------+------------------+
I need to extract all the rows where lang == 'de' or lang == 'en'
but the rows I extract with lang == 'en'
needs to have a corresponding lang == 'de'
such that they share the same term_id
.
I have been doing it as such with graphlab
and sframe
:
sf = gl.SFrame.read_csv('iate.csv', delimiter='\t', quote_char='\0', column_type_hints=[str,str,unicode,str,int])
de = sf[sf['lang'] == 'de']
de_termids = de['term_id']
and de.print_rows(10)
:
+------------+------+-------------------------------+-----------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+-----------+------------------+
| IATE-84 | de | Zuständigkeit der Mitglied... | fullForm | 3 |
| IATE-290 | de | Schutz der öffentlichen Ge... | fullForm | 3 |
| IATE-662 | de | mengenmäßigen Ausfuhrbesch... | fullForm | 3 |
| IATE-801 | de | Eintragungshindernisse | fullForm | 2 |
| IATE-1326 | de | Sonderregelung für Reisebü... | fullForm | 4 |
| IATE-1702 | de | Erwerbslose | fullForm | 4 |
| IATE-2818 | de | Verwaltungsvorschriften | fullForm | 3 |
| IATE-21139 | de | frisches Obst und Gemüse | fullForm | 3 |
| IATE-21563 | de | chemische Erzeugnisse zur ... | fullForm | 3 |
| IATE-21564 | de | Mineralsäuren | fullForm | 3 |
+------------+------+-------------------------------+-----------+------------------+
And then:
en = sf[sf['lang'] == 'en']
en.print_rows(10)
[out]:
+------------+------+-------------------------------+--------------+------------------+
| term_id | lang | term_str | term_type | reliability_code |
+------------+------+-------------------------------+--------------+------------------+
| IATE-84 | en | competence of the Member S... | fullForm | 3 |
| IATE-254 | en | award of public works cont... | fullForm | 3 |
| IATE-290 | en | public health protection | fullForm | 3 |
| IATE-662 | en | quantitative restriction o... | fullForm | 3 |
| IATE-801 | en | grounds for refusal | fullForm | 2 |
| IATE-1299 | en | CEP | abbreviation | 3 |
| IATE-1326 | en | special scheme for travel ... | fullForm | 3 |
| IATE-2818 | en | regulations | fullForm | 3 |
| IATE-7128 | en | company name | fullForm | 2 |
| IATE-21139 | en | fresh fruits and vegetables | fullForm | 3 |
+------------+------+-------------------------------+--------------+------------------+
I have tried:
en_de = en[en['term_id'] in de_termids]
But I'm getting the syntax wrong, giving me this error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-12-9656091794b8> in <module>()
1 en = sf[sf['lang'] == 'en']
----> 2 en_de = en[en['term_id'] in de_termids]
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __contains__(self, item)
691
692 """
--> 693 return (self == item).any()
694
695 def contains(self, item):
/usr/local/lib/python2.7/dist-packages/graphlab/data_structures/sarray.pyc in __eq__(self, other)
973 return SArray(_proxy = self.__proxy__.vector_operator(other.__proxy__, '=='))
974 else:
--> 975 return SArray(_proxy = self.__proxy__.left_scalar_operator(other, '=='))
976
977
/usr/local/lib/python2.7/dist-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Array size mismatch
How should I filter the sframe such that I get rows with en
and de
and corresponding term_id
?
The resulting dataframe should look something like this:
+---------+-----------------+-------------+
| term_id | term_str_en | term_str_de |
+---------+-------------------------------+
| IATE-999 | something | etwas |
...
+---------+-----------------+-------------+
How do I do the same with pandas
?
Suppose you already have two data frames with filtered data for both languages: df_en
and df_de
. Then you can merge
them:
new_df = pd.merge(df_en[['term_id','term_str']], df_de[['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
The method inner
takes care of skipping all unmatched rows.
You can find more options to merge
in pandas docs and refs
Edit
The same result without creating two data frames (df
is the original data frame with all entries, possibly containing other languages too):
new_df = pd.merge(df.loc[df['lang']=='en',['term_id','term_str']], df.loc[df['lang']=='de',['term_id','term_str']], how = 'inner', on ='term_id', suffixes = ('_en', '_de'))
Since you've only gotten answers for pandas, here's how to do it in SFrame, given de_termids
and en
in your code samples:
en.filter_by(de_termids, 'term_id')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With